Trends of ICASSP 2022

Before we start
● ICASSP?
○ De facto top conference for audio/speech (alongside Interspeech)
● 2022 ICASSP Stats
○ 3967 papers submitted, 1785 (45.0%) accepted
○ https://github.com/lixin4ever/Conference-Acceptance-Rate
○ ~100 papers skimmed
● Slides
○ General topics & Service-related topics
○ Can’t go through everything; will try to finish quickly

Contents
1. General Trend
1. 2022 SotA
2. Contrastive / Self-supervised
3. Security
4. Post-COVID Teleconferencing
5. Applications
2. Topics related to our tasks
1. Multilingualism / Cross-lingualism
2. Keyword Spotting
3. Few-shot / Low-shot
4. Audio Augmentation
5. Federated Learning

General Models
● Wav2vec (Facebook)
○ Raw audio + 1D Convs + BERT + CPC-style contrastive loss (similar to MLM, feature distractors)
● Wav2vec 2.0 (Facebook)
○ Raw audio + 1D Convs + BERT + Codebooks + Contrastive loss (quantized distractors)
● HuBERT (Facebook)
○ Raw audio + 1D Convs + BERT + MFCC-based Clustering + Cluster Ensembles
● BigSSL/CAP12 (Google)
○ Spectrogram + 1D Convs + Conformers + Contrastive loss (quantized distractors)
● Data2vec (Facebook)
○ Raw audio + 1D Convs + BERT + Student-teacher (EMA) prediction
● WavLM (Microsoft)
○ HuBERT + Speech denoising + Gated relative position bias
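
The contrastive objective with distractors mentioned for wav2vec 2.0 and BigSSL can be sketched roughly as follows. This is a minimal pure-Python illustration, assuming cosine similarity and a temperature `kappa`; the function names and toy vectors are hypothetical and not taken from any released code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(context, target, distractors, kappa=0.1):
    """InfoNCE-style loss: pick the true target out of the distractors.

    `kappa` is a temperature; its value here is an assumption.
    """
    logits = [cosine(context, target) / kappa]
    logits += [cosine(context, d) / kappa for d in distractors]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]

# Toy example: the context vector is close to the true target in `good`,
# and close to a distractor in `bad`.
context = [1.0, 0.0]
good = contrastive_loss(context, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
bad = contrastive_loss(context, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
```

The loss is small when the context vector is most similar to the true target and large when a distractor is more similar.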

GLUE-like Benchmarks
● SUPERB (NTU)
○ Speech processing Universal PERformance Benchmark https://superbbenchmark.org/
○ Recognition: Phoneme Recognition, Automatic Speech Recognition
○ Detection: Keyword Spotting, Query by Example Spoken Term Detection
○ Semantics: Intent Classification, Slot Filling, Speech Translation
○ Speaker: Speaker Identification, Automatic Speaker Verification, Speaker Diarization
○ Paralinguistics: Emotion Recognition
○ Generation: Speech enhancement, Speech Separation
● NOSS (Google)
○ NOn-Semantic Speech Benchmark
https://ai.googleblog.com/2020/06/improving-speech-representations-and.html
○ Speaker identification, Language identification, Command, Emotion, Dementia/healthy
● HARES (Deepmind)
○ Holistic audio representation evaluation suite
○ Environment: Audio tagging, Animal/Scene classification
○ Speech: Keyword Spotting, Intent Classification, Language identification, Speaker identification
○ Music: Instrument Identification, Pitch estimation, Music tagging

1.2 Contrastive / Self-supervised

Towards Learning Universal Audio Representations
(DeepMind)
● HARES: new GLUE-like benchmark
● SlowFast (from video) + NFNet (from vision) seems to work very well.
○ SlowFast: two branches with larger/smaller kernel widths
○ NFNet: Normalizer-Free ResNets
● CPC (contrastive learning) works quite well.
https://www.notion.so/hpcnt/Towards-Learning-Universal-Audio-Representations-ed8774b85de143c097175b3646cd84e1

Universal paralinguistic speech representations using
self-supervised conformers (Google)
● Contrastive learning (of w2v2) on Conformers
○ Future works already conducted on distilling this model (TRILLsson)
○ https://ai.googleblog.com/2022/03/trillsson-small-universal-speech.html
● Closely following previous work BigSSL (Google)
○ Trained with speech-heavy youtube videos
○ Their conclusion: SSL + Large Models are especially helpful for small datasets
● Best performance wasn’t from the final layer’s feature vector (same conclusion as BigSSL).
→ CAP12 (12th layer feature outputs)
https://www.notion.so/hpcnt/Universal-paralinguistic-speech-representations-using-self-supervised-conformers-d621d75b95eb4369ab34cc5237603393

A Noise-Robust Self-supervised Pre-training Model Based
Speech Representation Learning for Automatic Speech
Recognition
Making the w2v feature encoder robust to added noise via a contrastive loss
https://www.notion.so/hpcnt/A-Noise-Robust-Self-supervised-Pre-training-Model-Based-Speech-Representation-Learning-for-Automatic-537ec0ccbd874303840b582db90a3a9d

Wav2vec-switch: Contrastive learning from original-noisy
speech pairs for robust speech recognition (Microsoft)
● Makes the contextualized representation robust to noise
(Figure label: original w2v2 loss)

DistilHuBERT: Speech representation learning by layer-wise
distillation of hidden-unit BERT (NTU)

Improving Self-Supervised Learning for Speech Recognition
with Intermediate Layer Supervision (Microsoft)
● The common practice in SSL is to compute the self-supervised loss on the top layer, as in wav2vec 2.0 and HuBERT.
● However, the lower layers of such pre-trained models are shown to have a low correlation with phonetic information.
● In this work, we propose to apply intermediate layer supervision to encourage lower layers to learn content knowledge → apply the same HuBERT loss to lower layers as well
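
Intermediate layer supervision can be pictured as adding the same prediction loss at a few chosen layers. The sketch below is a toy illustration only: the layer indices, the plain sum over layers, and the name `ils_loss` are assumptions, and a simple cross-entropy over cluster targets stands in for the real HuBERT loss.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy of one frame's logits against an integer cluster target."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

def ils_loss(layer_logits, targets, supervised_layers):
    """Sum the frame-averaged loss over every supervised layer.

    layer_logits[k][t] is the logit vector of layer k at frame t.
    Summing with equal weights is an assumption of this sketch.
    """
    total = 0.0
    for k in supervised_layers:
        frame_losses = [cross_entropy(lg, y)
                        for lg, y in zip(layer_logits[k], targets)]
        total += sum(frame_losses) / len(frame_losses)
    return total

# Hypothetical logits for two frames at layers 4 and 12.
layer_logits = {
    4: [[2.0, 0.0], [0.0, 2.0]],
    12: [[3.0, 0.0], [0.0, 3.0]],
}
targets = [0, 1]
top_only = ils_loss(layer_logits, targets, [12])      # common practice
with_ils = ils_loss(layer_logits, targets, [4, 12])   # plus a lower layer
```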

Exploring Heterogeneous Characteristics of Layers in ASR
Models for More Efficient Training (Google)
● Based on “Are All Layers Created Equal?” (Bengio)
○ Fix intermediate layers’ weights to some other weights
○ Re-initialization: Come back to initial values
○ Re-randomization: Get random weights
● While ambient layers were present in all model sizes, we observed that larger models had more ambient layers, i.e., they were overparameterized.
● During early rounds, the ambient layers were more spread throughout the model; only later did the separation become more distinct.
● Group normalization (GN) was more robust (against re-randomization) than batch normalization (BN).
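
The two probing operations above (re-initialization and re-randomization) can be sketched on a toy model stored as a dict of flat weight lists. The layer names, the Gaussian used for random weights, and `probe_layer` itself are hypothetical; a real study would re-evaluate WER after each reset, and a layer whose reset barely changes the metric would presumably be the kind the slide calls ambient.

```python
import random

def probe_layer(trained, initial, layer, mode, rng):
    """Return a copy of `trained` with a single layer reset.

    mode="re-init":   put the layer back to its values at initialization.
    mode="re-random": replace the layer with freshly sampled weights
                      (standard normal here, which is an assumption).
    """
    probed = {name: list(w) for name, w in trained.items()}
    if mode == "re-init":
        probed[layer] = list(initial[layer])
    elif mode == "re-random":
        probed[layer] = [rng.gauss(0.0, 1.0) for _ in trained[layer]]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return probed

rng = random.Random(0)
initial = {"enc.0": [0.1, -0.2], "enc.1": [0.0, 0.3]}
trained = {"enc.0": [0.8, -0.5], "enc.1": [0.4, 0.9]}

reinit = probe_layer(trained, initial, "enc.1", "re-init", rng)
rerand = probe_layer(trained, initial, "enc.1", "re-random", rng)
```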

Investigation of Robustness of Hubert Features from
Different Layers to Domain, Accent and Language Variations
● Our experiments indicate that as domain, accent, bandwidth and language deviate from the source domain, the relative improvement decreases.
● The last layer of HuBERT is very specific to the dataset on which it is trained. The second-to-last layer seems to be better when there are domain and accent differences.
● Middle layers are more suited when data is from a different language.

Don't speak too fast: The impact of data bias on
self-supervised speech models (NTU)
● Uses the SUPERB benchmark to study how the gender, content, and speech speed of pre-training datasets affect performance
● Gender → Adding a few minority-class samples mitigates the performance drop
● Content → Models were insensitive to perplexity
● Speech speed → Faster speech is worse

Speech anonymization (Emmanuel Vincent)
● Speech information
○ Verbal content (identifiers, private info, etc)
○ Speaker (identity, gender, age, ethnic origin, etc)
○ Nonverbal content (emotion, health, etc)
○ Acoustic environment (acoustics, other speakers, etc)
● Risks
○ User profiling, user identification, voice cloning,
information leakage
● Methods
○ Embedded systems, Cryptography, Obfuscation, Anonymization, Federated Learning, etc
○ Simple modifications (ex. Pitch shifts) utterly fail for knowledgeable attackers
● Current speech anonymization challenge != legal definition
○ It seems that many big companies don’t anonymize speech (collected from various sources)
○ Task: (1) ASR (2) Emotion recognition

Preserving Trajectory Privacy in Driving Data Release
● What comes with the innovative services provided by intelligent transport
systems (ITS) are potential privacy attacks.
● For example, in traffic monitoring systems, individual users send anonymized personal location traces continuously to aid in traffic state estimation.
● However, an adversary may link an anonymous GPS trace to a particular person
provided additional knowledge of the person’s residence or working location.
● This cannot be achieved by data encryption or hiding the driver identity. We
resort to the notion of inference privacy that sanitizes raw data to limit the
amount of contained private information.

Audio Deepfake Detection 2022: the First Audio Deep
Synthesis Detection Challenge
● http://addchallenge.cn/
● Low-quality fake audio detection: focuses on dealing with bona fide and fully fake utterances with various real-world noises etc.
○ Fully generated utterances
● Partially fake audio detection: distinguish the partially fake audio from the real
○ Generated by manipulating the genuine utterances
● Audio fake game: Solve both an audio generation task and an audio fake
detection task

Aasist: Audio anti-spoofing using integrated
spectro-temporal graph attention networks (Naver)
● Spoofing detection solutions can be an important consideration when
automatic speaker verification systems are deployed in real-world applications.
● Two major scenarios:
○ Logical access (LA): spoofing attacks mounted with voice conversion and TTS
○ Physical access (PA): bona fide utterances are captured and then replayed
● Recent studies show that discriminative information (i.e., spoofing artefacts)
can reside in specific temporal and spectral intervals

Characterizing the adversarial vulnerability of speech
self-supervised learning (Helen Meng)
● Speech processing Universal PERformance Benchmark (SUPERB)
○ Upstream model (self-supervised models) + Downstream models (directly use features, e.g., fine-tuning)
● Adversarial Attacks
○ Limited-knowledge adversaries: Attackers can access the internals of the target model
(parameters and gradients). But they do not know which downstream task will be conducted.
○ Zero-knowledge adversaries: Target model is unavailable to the attackers. In such a case, the
substitute model is used for approximating gradients for adversarial sample generation.
○ XAB listening test: check if humans can distinguish adversarial samples
● Results: Attacks are effective, humans cannot easily distinguish.

Adversarial Sample Detection for Speaker Verification by
Neural Vocoders (Tencent)
● Automatic speaker verification (ASV), one of the most important technologies for
biometric identification, has been widely adopted in security-critical
applications.
● However, ASV is seriously vulnerable to recently emerged adversarial attacks,
yet effective countermeasures against them are limited.

Source Mixing and Separation Robust Audio Steganography
(Sony)
● Audio steganography is the science of
concealing secret messages inside a host
audio called a carrier in such a way that the
concealment is unnoticeable to human ears.
● Recently, deep neural networks (DNNs) have
been used as a steganographic function for
hiding data inside images to achieve high
capacity.
● The network learns to conceal a hidden
message inside the carrier without manually
specifying a particular redundancy to exploit.
PixInWav: Residual Steganography for
Hiding Pixels in Audio

Exploiting language model for efficient linguistic steganalysis
● Linguistic steganography (LS)
○ Natural language is actually quite suitable for steganography.
○ The advantage is that LS can be easily concealed by the huge number of social activities.
○ (1) modification based and (2) generation based
○ Latter allows more data to be embedded
● Steganalysis = to detect whether there is secret data embedded in the media
● Significant difference between automatically generated stego texts and carrier
texts in terms of the conditional probability distribution of individual words.

Acoustic Echo Cancellation
● Acoustic echo refers to the phenomenon that occurs when a microphone picks
up the far-end signal that is played by a loudspeaker.
● This phenomenon can cause a slight annoyance or a significant breakdown in a communication system.
● ICASSP 2022 AEC Challenge by Microsoft
● Various scenarios
○ Long- or varying delays
○ Strong speaker/mic distortions
○ Stationary/non-stationary noise
○ Glitches (due to high CPU usages)
○ etc.

Deep Noise Suppression
● Audio calls in the presence of background noise are significantly degraded in terms of the quality/intelligibility of the perceived speech.
● ICASSP 2022 Deep Noise Suppression Challenge by Microsoft

Multi-Channel Multi-Party Meeting Transcription
● Speaker Diarization
○ Partitioning an input audio stream into homogeneous segments according to the speaker identity, i.e., "who spoke when?"
● Multi-speaker ASR
○ Hard to do overlapped speech recognition due to the interfering speakers or background noise
● ICASSP 2022 M2MeT Challenge by Alibaba

VarArray: Array-geometry-agnostic continuous speech
separation (Microsoft)
● Continuous speech separation using a microphone array was shown to be
promising in dealing with the speech overlap problem.
● Signals highly depend on the position of the microphones.
● In meetings, we can assume only two or fewer speakers to be active for the
majority of the meeting time.

Multimodal Systems
● Audio-Visual Object Classification For Human-Robot Collaboration
● Multimodal Information Based Speech Processing
● Machine Translation for Spoken and Written Language
● Image and Video Understanding
● Multimodal Signal Processing, Analysis, and Synthesis
● Audio Security and Multi-Modal Systems
● Multi-modal Analysis and Synthesis
● Multimodal Data Fusion and Processing
● Multimodal Analysis in Audio Applications

Emotion Recognition
● Speech emotion recognition using self-supervised features
○ A modular End-to-End SER system based on an Upstream + Downstream architecture paradigm,
which allows easy use/integration of a large variety of self-supervised features.
● Memobert: Pre-training model with prompt-based learning for multimodal
emotion recognition
○ learns multimodal joint representations through self-supervised learning
○ prompt-based method that reformulates emotion classification as a masked text prediction
● Multimodal Emotion Recognition with Surgical and Fabric Masks
○ investigate how muffled speech and occluded facial expressions change the prediction of emotions

Speech as a Disease Biomarker
● Fraug: A Frame Rate Based Data Augmentation Method for Depression Detection
from Speech Signals
○ Among others, the speech signal is an important biomarker of our mental state and can be collected
remotely, in a non-invasive manner with no expert supervision.
○ Recently, speech-based automatic diagnosis of depression has gained significant momentum.
● Exploring Dementia Detection from Speech: Cross Corpus Analysis
○ Population aging is responsible for an increase of new Alzheimer’s disease (AD) cases, and creates the
need for scalable, cost-effective methods that are able to detect early stage AD.
○ Speech and language biomarkers are strong indicators of dementia, and provide a low-cost and
widespread alternative for the assessment of cognitive states.
● The Second Dicova Challenge: Dataset and Performance Analysis for Diagnosis of
Covid-19 Using Acoustics
○ Dataset of audio recordings consisting of breathing, cough and speech signals
○ Providing a point-of-care, rapid, easy to use, and cost-effective tool to help contain COVID-19 spread.

Voice Conversion
● Robust disentangled variational speech representation learning for zero-shot
voice conversion (Tencent)
○ Feeding an arbitrary speaker embedding and content embeddings to the VAE decoder
● Controllable Speech Representation Learning Via Voice Conversion and AIC
Loss (Adobe)
○ Its disentangled components (content, pitch, speaker identity, and energy) can be controlled
independently to alter the synthesis result.
● An Investigation of Streaming Non-Autoregressive sequence-to-sequence
Voice Conversion
● Voice Filter: Few-shot text-to-speech speaker adaptation using voice
conversion as a post-processing module (Amazon)
○ It uses voice conversion (VC) as a post-processing module appended to a pre-existing
high-quality TTS system, framing the few-shot TTS problem as a VC task.

Music Applications
● HiFi-SVC: Fast High Fidelity Cross-Domain Singing Voice Conversion
● Music Enhancement via Image Translation and Vocoding (Adobe)
● Source Separation By Steering Pretrained Music Models
● MELONS: generating melody with long-term structure using transformers and
structure graph
● Genre-Conditioned Long-Term 3D Dance Generation Driven by Music
● Deep Performer: Score-to-Audio Music Performance Synthesis (Dolby)
● SleepGAN: Towards Personalized Sleep Therapy Music (Nokia)
● Modeling beats and downbeats with a time-frequency Transformer (ByteDance)

Quantum Machine Learning
● Languages: Google Cirq / Microsoft Q# / IBM Qiskit
● Services: Google Quantum AI / Azure Quantum / IBM Quantum
● The dawn of quantum natural language processing
○ We successfully train a quantum-enhanced Long Short-Term Memory network to perform the
parts-of-speech tagging task via numerical simulations.
○ Practical applications are more likely to be a hybrid of classical and quantum operations. This
hybrid approach is not too different from what has been done in the past decade with GPUs.
○ The main idea behind Quantum Machine Learning (QML) is to replace parts of a neural network
(e.g. linear layers) with a quantum counterpart.
● Quantum federated learning with quantum data
○ Hybrid models fall short when dealing with the highly complex purely quantum data.
○ Thus, purely quantum ML models that can address these challenges were developed, such as
quantum neural networks (QNNs).
○ However, due to the fragile nature of the carriers of quantum data, i.e., qubits, there is a natural
need for distributed learning solutions such as federated learning (FL).

Machine Learning is All You Need
● Audio Representations
○ Learnable Wavelet Packet Transform for Data-Adapted Spectrograms
● Encodings
○ A Low-Parametric Model for Bit-Rate Estimation of VVC Residual Coding
○ Low-Complexity Multi-Model CNN in-Loop Filter for AVS3
● Digital Signal Processing
○ Learning Structured Sparsity For Time-Frequency Reconstruction
○ Learning Approach For Fast Approximate Matrix Factorizations
● Communication Systems
○ Adaptive Wireless Power Allocation with Graph Neural Networks
○ Deep Joint Source-Channel Coding for Wireless Image Transmission with Adaptive Rate Control
● Beamforming
○ Deep learning for location based beamforming with NLOS channels
○ Phase-Only Reconfigurable Sparse Array Beamforming Using Deep Learning

2. Topics related to our tasks

Multilingualism / Cross-lingualism

Joint Unsupervised and Supervised Training for Multilingual
ASR (Google)
● Most existing methods adopt a 2-stage scheme where the self-supervised loss is optimized in the first pretraining stage, and the standard supervised fine-tuning resumes in the second stage.
● In this paper, we propose an end-to-end (E2E) Joint Unsupervised and Supervised Training (JUST) method to combine the supervised loss and the self-supervised contrastive and masked language modeling (MLM) losses.
● Spectrogram + Quantizer (wav2vec 2.0) + RNN-T
● Wins over XLSR-53!

Pseudo-Labeling for Massively Multilingual Speech
Recognition (Facebook)
● Prev works (from Facebook, similar
authors)
○ Iterative Pseudo-Labeling for Speech Recognition
(IPL) → LM + beam search to generate pseudo
labels
○ slimIPL: Language-model-free iterative
pseudo-labeling (slimIPL) → Use self-predictions
● Utilizing unlabeled data is helpful, even
with trivial methods.

Multilingual Text-To-Speech Training Using Cross Language
Voice Conversion And Self-Supervised Learning Of Speech
Representations (Facebook)
● It’s hard to find speakers who have
native proficiency in several
languages.
● Using HifiGAN-like model to
augment data (Synthetic generation
of target speaker speaking different
language)

A Configurable Multilingual Model is All You Need to
Recognize All Languages (Microsoft)
● Configurable multilingual model
(CMM) to recognize speech from
any combination of languages
based on a multi-hot LID vector
selected by users
● Language-specific vocabulary
strategy (making vocab smaller)
● Language-specific transformer
cell (one per language)

Zero-Shot Cross-Lingual Transfer Using Multi-Stream
Encoder and Efficient Speaker Representation (Tencent)
● Extract speaker embedding features that are
independent of both content information and
language identity.
● Multi-stream = Input text sequences are fed
into N-stream text encoders in parallel
● Zero-shot cross-lingual transfer strategy = also fine-tune with target-language data + language-balanced sampling strategy

Tackling data scarcity in speech translation using zero-shot
multilingual machine translation techniques
● To tackle data scarcity, it is useful to make use of ASR and MT data for
end-to-end ST models. We explore techniques from zero-shot multilingual text
translation and apply them to speech side.
● Use tokens & augmentation
methods to make the model
decide output language based
on language tokens.

Multi-Lingual Multi-Task Speech Emotion Recognition Using
wav2vec 2.0
● Multi-task learning to increase emotion recognition performance
● Additional tasks
○ Gender Prediction (Ge)
○ Language Prediction (La)
○ F0 mean and standard deviation regression task (F0-me, F0-st)
○ Energy mean and standard deviation regression task (En-me, En-st)
○ Voice ratio regression task (Vr)
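
Multi-task training of this kind is commonly implemented as a weighted sum of the main loss and the auxiliary losses. The sketch below assumes that form; the weights and loss values are made up for illustration and do not come from the paper.

```python
def multitask_loss(emotion_loss, aux_losses, aux_weights):
    """Main emotion loss plus weighted auxiliary losses.

    Tasks without an entry in `aux_weights` contribute nothing.
    """
    total = emotion_loss
    for task, loss in aux_losses.items():
        total += aux_weights.get(task, 0.0) * loss
    return total

# Hypothetical per-batch losses, keyed by the slide's task abbreviations.
aux_losses = {"Ge": 0.7, "La": 0.4, "F0-me": 1.2, "Vr": 0.3}
aux_weights = {"Ge": 0.1, "La": 0.1, "F0-me": 0.05, "Vr": 0.05}
total = multitask_loss(1.5, aux_losses, aux_weights)
```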

ADIMA: Abuse Detection In Multilingual Audio
● ADIMA, a novel, linguistically diverse, ethically sourced, expert-annotated and well-balanced multilingual abuse detection audio dataset comprising 11,775 audio samples in 10 Indic languages spanning 65 hours and spoken by 6,446 unique users.

SERAB: A multi-lingual benchmark for speech emotion
recognition
● Speech Emotion Recognition Adaptation Benchmark (SERAB), a framework for
evaluating the performance and generalization capacity of different approaches
for utterance-level SER.

Why still keyword spotting?
● For many voice-enabled platforms, queries follow a highly Zipfian distribution. On the Comcast X1 entertainment system, for example, the top-20 commands constitute around 30% of the traffic.
● Using an ASR system is excessive for targeting phonetically distinct commands
with a small vocabulary.
● Audio-only based wake word spotting (WWS), a special case of KWS, is
challenging under noisy conditions due to the environmental interference.

Temporal early exiting for streaming speech commands
recognition (Comcast)
● Adds extra prediction heads and stops inference midway based on the entropy of their predictions.
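
Entropy-based early exiting can be sketched as follows. The assumption here is that a prediction head emits class logits at successive time steps and that inference stops at the first step whose softmax entropy falls below a threshold; the threshold values and logits are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def early_exit_predict(head_logits, threshold):
    """Return (exit step, predicted class).

    Stops at the first step whose prediction entropy is below
    `threshold`; otherwise falls back to the last step.
    """
    if not head_logits:
        raise ValueError("need at least one prediction head output")
    for step, logits in enumerate(head_logits):
        probs = softmax(logits)
        if entropy(probs) < threshold:
            break
    return step, probs.index(max(probs))

# Logits become more confident as more audio is consumed.
heads = [[0.1, 0.0, 0.1], [2.0, 0.5, 0.1], [6.0, 0.2, 0.1]]
strict = early_exit_predict(heads, threshold=0.3)
loose = early_exit_predict(heads, threshold=0.9)
```

A looser threshold exits earlier and saves computation, at the cost of acting on a less confident prediction.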

A Study of Designing Compact Audio-Visual Wake Word
Spotting System Based on Iterative Fine-Tuning in Neural
Network Pruning
● Audio-visual keyword
spotting
● Using both is helpful

Text Adaptive Detection for Customizable Keyword Spotting
● Novel text adaptive detection framework to directly formulate KWS as a detection rather than a classification problem
● Text prompt is used as input, i.e., customizable wake words

Joint Ego-Noise Suppression and Keyword Spotting on
Sweeping Robots (Alibaba)
● a novel approach for joint ego-noise (self-created noise) suppression and
keyword detection
● Small footprint keyword spotting (KWS) on sweeping robot, i.e., the
conversation triggering module of the audio interface
● A circular microphone array of M = 6 → Multiple minimum variance
distortionless response (MVDR) beamformers
● If the keyword is present, noise adaptation will be slowed down to prevent
keyword speech being cancelled.

Unified Speculation, Detection, and Verification Keyword
Spotting (Alexa)
● Speculation → early decision (giving a head start, reduce system latency)
● Detection → keyword trigger task, more accurate decision
● Verification → verifies previous decision (correct mistakes)
● The proposed latency-aware max-pooling loss can control the latency-accuracy trade-off effectively.

An Adapter Based Pre-Training for Efficient and Scalable
Self-Supervised Speech Representation Learning (Huawei)
Apply adapters (B) to original w2v2 (A) to combat language forgetting.
https://www.notion.so/hpcnt/An-Adapter-Based-Pre-Training-for-Efficient-and-Scalable-Self-Supervised-Speech-Representation-Learn-0046747a578d4899b914e520959e01e8

Efficient Adapter Transfer of Self-Supervised Speech
Models for Automatic Speech Recognition (Huawei)
● Fine-tune on ASR task
● Apply adapters

Large-scale ASR Domain Adaptation by Self- and
Semi-supervised Learning (Google)
● Joint training with both RNN-T &
Self-supervised loss (wav2vec 2.0)
● Confidence Estimation Module (CEM)
→ To filter out low confidence samples in
pseudo-labels for Noisy student training
○ binary cross entropy between the estimated
confidence p and the binary target sequence c
● It utilizes Wav2vec2.0 loss on the causal
encoder, so there is no transition gap from
non-causal to causal.
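
The confidence-estimation step can be sketched as a binary cross-entropy between estimated confidences p and binary correctness targets c, followed by a filter on the pseudo-labels. Aggregating token confidences by their mean and the 0.8 threshold are assumptions made for this sketch.

```python
import math

def binary_cross_entropy(confidences, targets, eps=1e-7):
    """Mean BCE between estimated confidences p and binary targets c."""
    total = 0.0
    for p, c in zip(confidences, targets):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= c * math.log(p) + (1 - c) * math.log(1.0 - p)
    return total / len(confidences)

def filter_pseudo_labels(utterances, min_confidence):
    """Keep pseudo-labeled utterances whose mean token confidence is high.

    `utterances` is a list of (text, per-token confidences) pairs.
    """
    kept = []
    for text, confs in utterances:
        if sum(confs) / len(confs) >= min_confidence:
            kept.append(text)
    return kept

loss = binary_cross_entropy([0.9, 0.2], [1, 0])
pseudo = [("play some music", [0.95, 0.9, 0.97]),
          ("plain sum muse sick", [0.4, 0.3, 0.5, 0.2])]
kept = filter_pseudo_labels(pseudo, min_confidence=0.8)
```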

Learning Domain-Invariant Transformation for Speaker
Verification
● Meta-learning to generate domain-invariant embeddings without pre-training
and fine-tuning
● Use both metric loss & classification loss together

Magic dust for cross-lingual adaptation of monolingual
wav2vec-2.0
● Monolingual wav2vec-2.0 is a good few-shot ASR learner in several languages
○ English → 8 Target languages
○ Performance up to 86% compared to XLSR
○ ASR Fine-Tuning on English hurts other languages
● Monolingual wav2vec2 model pre-trained on a high-resource language using moderately-sized unlabeled data and small-sized labeled data in the target language yields performance similar to XLSR
● Dropout Uncertainty-Driven Self-Training (DUST)
○ Leverages unlabeled data by pseudo-labeling (semi-supervised)
○ Student from a previous round becomes the teacher for the next round
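
The filtering idea behind dropout-uncertainty-driven self-training can be sketched as follows: decode once without dropout, decode several times with dropout, and keep the pseudo-label only if the hypotheses agree. Normalizing the word-level edit distance by the reference length and the threshold `tau` are assumptions of this sketch, and the hypotheses are made-up strings rather than real decoder output.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def dust_accept(reference, dropout_hyps, tau):
    """Accept a pseudo-label if every dropout hypothesis stays close to it."""
    ref = reference.split()
    worst = max(edit_distance(ref, hyp.split()) / max(len(ref), 1)
                for hyp in dropout_hyps)
    return worst <= tau

stable = dust_accept("turn the light on",
                     ["turn the light on", "turn the lights on"], tau=0.3)
unstable = dust_accept("turn the light on",
                       ["turn the light on", "burn a night on"], tau=0.3)
```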

Filteraugment: An acoustic environmental data
augmentation method
● FilterAugment mimics acoustic filters by applying different weights to frequency bands, thereby enabling the model to extract relevant information from a wider frequency region.
● An improved version of frequency masking, which masks information in random frequency bands.
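
A step-style version of this idea can be sketched on a log-scale spectrogram stored as a list of frequency rows. The number of bands, the gain range in dB, and the function name are assumptions; adding a gain in dB corresponds to multiplying by a weight in the linear domain.

```python
import random

def filter_augment(spec_db, n_bands=3, db_range=(-6.0, 6.0), rng=random):
    """Apply one random gain (in dB) to each random frequency band.

    spec_db[f][t] is the dB value of frequency bin f at frame t.
    Requires more frequency bins than bands.
    """
    n_freq = len(spec_db)
    # Random, sorted band boundaries strictly inside the frequency axis.
    cuts = sorted(rng.sample(range(1, n_freq), n_bands - 1))
    edges = [0] + cuts + [n_freq]
    out = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        gain = rng.uniform(*db_range)
        for f in range(lo, hi):
            out.append([v + gain for v in spec_db[f]])
    return out

rng = random.Random(0)
spec = [[0.0] * 4 for _ in range(8)]  # 8 frequency bins x 4 frames
augmented = filter_augment(spec, n_bands=3, rng=rng)
```

Unlike frequency masking, no band is zeroed out; every band is kept but reweighted.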

Auditory-Based Data Augmentation for end-to-end
Automatic Speech Recognition
● Spectral smearing smooths the speech spectrum and suppresses details by broadening the bandwidths of the auditory filters.
● Loudness recruitment compresses the amplitudes of different frequency bands, simulating a damaged ear.

Intermix: An Interference-Based Data Augmentation and
Regularization Technique for Automatic Deep Sound
Classification
● Prev work: BC learning
○ Taking sound energy into account
● Prev work: SpeechMix
○ Similar to manifold mixup,
mix intermediate representations
● This work: InterMix
○ Also apply phase shifts to inputs
& use it when mixing

Robust Speaker Verification Using Population-Based Data
Augmentation
● A population-based searching strategy for optimizing the augmentation
parameters.
● Instead of finding a fixed set of hyper-parameters, PBA learns a scheduler for
setting the hyper-parameters.
● List of augmentation used
○ Reverberation: Convolve with room impulse response (RIR)
○ Music: Randomly selected music from MUSAN is added
○ Noise: Noise from MUSAN is added
○ Babble: Babble noise is added
○ Frequency masking
○ Time masking

Various augmentations
● LPC Augment: an LPC-based ASR Data Augmentation Algorithm for Low and
Zero-Resource Children's Dialects
○ The data augmentation procedure consists of perturbing the formant peaks of the Linear
predictive coding (LPC) spectrum during LPC analysis and reconstruction.
○ Compared with SpecAugment & speed perturbation; did not show a clear advantage.
● ImportantAug: A Data Augmentation Agent for Speech
○ Adding noise to unimportant regions of the speech and not to important regions.
○ Importance is predicted for each utterance by a data augmentation agent that is trained to
maximize the amount of noise it adds while minimizing its impact on recognition performance.
● Fraug: A Frame Rate Based Data Augmentation Method for Depression
Detection from Speech Signals
○ Changing the frame-width and the frame-shift parameters during the feature extraction process

Task-specific Augmentations
● Cross-speaker style transfer for text-to-speech using data augmentation
○ Cross-speaker style transfer for TTS using data augmentation via voice conversion
● Spatial mixup: Directional loudness modification as data augmentation for
sound event localization and detection
○ application of parametric spatial audio effects for data augmentation, which modifies the
directional properties of a multi-channel spatial audio signal encoded in the ambisonics domain.
● Spatial Data Augmentation with Simulated Room Impulse Responses for Sound
Event Localization and Detection
○ Augments spatial characteristics using simulated room impulse responses (RIRs). Simulated RIRs are convolved with the source signals to obtain an augmented multi-channel training dataset.
● Distribution augmentation for low-resource expressive text-to-speech
○ Data augmentation through word permutations & Constituency parse based tree substitutions

Federated learning challenges and opportunities: An
outlook (Amazon Alexa)
● Finding the lower limit of the number of communication rounds
○ Many local updates (for communication efficiency) can still converge to a desirable model.
○ Overly aggressive local updates will harm the performance due to the data heterogeneity
● Constraints
○ Memory constraint (each on-device model needs to be small in size)
○ Computation constraint (devices may perform only a limited number of gradient updates)
● Personalized FL
○ Conventional FL trains one model, personalized FL maintains a collection of client-specific models
○ Will reduce test errors beyond what is possible with a single global model.
● Challenges of Lifelong FL
○ Online updates with single-pass data
○ Coupling of model training and data generation.
● Challenges on data
○ Data polarity (collected data does not represent the whole data distribution)
○ Data dependency (data are collected from time series with inevitable dependency)

Learnings from Federated Learning in the Real world (Alexa)
● Skewness: “heavy devices” with large amounts of data while there are many
“light users” with only a handful of data points.
● Non-uniform device selection, which uses the number of input points per device, outperforms uniform sampling in FL.
● We compare one-shot FL (uses the full range of data, single training) with continual FL (avoids storing data, multiple training rounds). We show that continual FL outperforms the one-shot strategy in some settings, and is overall most beneficial for heavy devices.
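
Non-uniform device selection can be sketched as sampling devices with probability proportional to how much data they hold. Proportional weighting and sampling with replacement are assumptions of this sketch; the slide only says the number of input points per device is used.

```python
import random

def select_devices(data_counts, k, rng, uniform=False):
    """Sample k devices (with replacement) for one training round.

    data_counts maps device id -> number of local data points.
    """
    devices = list(data_counts)
    weights = None if uniform else [data_counts[d] for d in devices]
    return rng.choices(devices, weights=weights, k=k)

rng = random.Random(0)
data_counts = {"heavy": 1000, "light_a": 1, "light_b": 1}
picks = select_devices(data_counts, k=200, rng=rng)
uniform_picks = select_devices(data_counts, k=200, rng=rng, uniform=True)
```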

Enabling on-device training of speech recognition models
with federated dropout (Google)
● Communication/computation costs are strongly correlated with the size of the
model being trained. We propose using federated dropout to reduce the size of
client models while training a full-size model server-side.
● Furthermore, we find that federated dropout lets smaller sub-models achieve lower WER, making it easier to dynamically adjust the model size.
● We use a realistic setting for federated training
of ASR models, wherein a well trained server-side
model is adapted to a new domain with FL on
edge devices.
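
The core mechanics of federated dropout can be sketched as follows: the server picks a random subset of hidden units, sends only those rows to a client, and writes the updated rows back into the full model. This toy version handles a single weight matrix and a single client, and the "+ 1.0" update is a stand-in for real local training; all names are hypothetical.

```python
import random

def extract_submodel(weights, keep_rows):
    """Copy only the rows (hidden units) the client will train."""
    return [list(weights[r]) for r in keep_rows]

def merge_submodel(weights, keep_rows, updated):
    """Write the client's updated rows back into the full-size model."""
    merged = [list(row) for row in weights]
    for r, new_row in zip(keep_rows, updated):
        merged[r] = list(new_row)
    return merged

rng = random.Random(0)
server = [[0.0] * 3 for _ in range(4)]       # 4 hidden units, 3 inputs each
keep = sorted(rng.sample(range(4), 2))        # client trains half the units

client = extract_submodel(server, keep)       # smaller model sent to device
client = [[w + 1.0 for w in row] for row in client]  # placeholder local update
server = merge_submodel(server, keep, client)
```

The client only ever stores and communicates half of the rows, while the server keeps the full-size model.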

Federated Self-supervised Learning
● Federated Self-Training for Data-Efficient Audio Recognition (Philips Research)
○ Self-training approach to exploit large-scale on-device unlabeled data to improve the
generalization of audio recognition models
○ Generate pseudo labels & train with softened labels
● Federated Self-Supervised Learning for Acoustic Event Classification (Amazon)
○ Applying FL to improve acoustic event classification (AEC) performance while no customer data
can be directly uploaded to the server
○ No pseudo labels (Common in AEC)
○ Solve the task of predicting the future
audio frame via feature representation
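
The "softened labels" step of the self-training approach in the first paper above can be sketched with a temperature-scaled softmax, in the style of knowledge distillation. Using a temperature for the softening is an assumption, as are the logits and the temperature value.

```python
import math

def soften(logits, temperature):
    """Temperature-scaled softmax; a higher temperature gives softer labels."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def soft_cross_entropy(student_logits, soft_targets):
    """Cross-entropy of the student's prediction against soft pseudo-labels."""
    m = max(student_logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in student_logits))
    return -sum(t * (l - log_z) for t, l in zip(soft_targets, student_logits))

teacher_logits = [4.0, 1.0, 0.0]          # prediction on unlabeled on-device audio
sharp = soften(teacher_logits, temperature=1.0)
soft = soften(teacher_logits, temperature=4.0)
loss = soft_cross_entropy([3.0, 1.5, 0.5], soft)
```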
