Towards Learning Universal Audio Representations
● HARES: New GLUE-like benchmark
● SlowFast (from video) + NFNet
(from vision) appears to work well.
○ SlowFast: Two branches with
bigger/smaller kernel width
○ NFNet: Normalizer-Free ResNets
● CPC (contrastive learning) works
Universal paralinguistic speech representations using
self-supervised conformers (Google)
● Contrastive learning (of w2v2) on Conformers
○ Follow-up work has already distilled this model (TRILLsson)
● Closely following previous work BigSSL (Google)
○ Trained with speech-heavy YouTube videos
○ Their conclusion: SSL + Large Models are especially helpful for small datasets
● Best performance wasn’t from the
final layer’s feature vector.
(Same conclusion from BigSSL)
→ CAP12 (12th layer feature outputs)
A Noise-Robust Self-supervised Pre-training Model Based
Speech Representation Learning for Automatic Speech
Making the w2v feature encoder robust to
additional noise via contrastive loss
Wav2vec-switch: Contrastive learning from original-noisy
speech pairs for robust speech recognition (Microsoft)
● Contextualized representation being robust to noise
DistilHuBERT: Speech representation learning by layer-wise
distillation of hidden-unit BERT (NTU)
Improving Self-Supervised Learning for Speech Recognition
with Intermediate Layer Supervision (Microsoft)
● The common practice of SSL is to compute the
self-supervised loss on the top layer, such as
wav2vec 2.0 and HuBERT.
● However, the lower layers of such a pre-trained
model are shown to have a low correlation with
● In this work, we propose to apply intermediate
layer supervision to encourage lower layers to
learn content knowledge → apply the same
HuBERT loss to lower layers
Exploring Heterogeneous Characteristics of Layers in ASR
Models for More Efficient Training (Google)
● Based on “Are All Layers Created Equal?” (Bengio)
○ Replace an intermediate layer’s weights with other values
○ Re-initialization: restore the layer’s initial values
○ Re-randomization: draw fresh random weights
● While ambient layers were present in all model sizes, we observed that larger
models had more ambient layers, i.e., overparameterized models.
● During early rounds, the ambient layers were more
spread throughout the model; only later
did the separation become more distinct.
● Group normalization (GN) was more robust to re-randomization than batch normalization (BN).
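The layer-reset probe described above can be sketched as follows. This is a minimal illustration, not the authors' code: the model is a hypothetical dict mapping layer names to weight lists, and `reset_layer`, the Gaussian scale of 0.1, and the toy values are all assumptions.

```python
import random

def reset_layer(trained, initial, layer, mode, rng=None, scale=0.1):
    """Return a copy of `trained` with one layer's weights replaced.

    mode="reinit": restore the values the layer had at initialization.
    mode="rerandom": draw fresh random values (hypothetical Gaussian init).
    """
    rng = rng or random.Random(0)
    probed = {name: list(w) for name, w in trained.items()}
    if mode == "reinit":
        probed[layer] = list(initial[layer])
    elif mode == "rerandom":
        probed[layer] = [rng.gauss(0.0, scale) for _ in trained[layer]]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return probed

# Toy checkpoints: weights at initialization and after training.
initial = {"layer0": [0.01, -0.02], "layer1": [0.03, 0.00]}
trained = {"layer0": [0.80, -0.50], "layer1": [0.04, 0.01]}

reinit = reset_layer(trained, initial, "layer1", "reinit")
rerandom = reset_layer(trained, initial, "layer1", "rerandom")
```

Evaluating the probed model and comparing it with the trained one then indicates how much the model depends on that layer; a layer whose reset barely changes the result would be an ambient layer in the slide's terminology.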
Investigation of Robustness of Hubert Features from
Different Layers to Domain, Accent and Language Variations
● Our experiments indicate that as domain, accent,
bandwidth and language deviate from the source domain,
the relative improvement decreases.
● The last layer of HuBERT is very specific to the dataset on
which it is trained. The second-to-last layer seems to be better
when there are domain and accent differences.
● Middle layers are more suited when data is from a different
Don't speak too fast: The impact of data bias on
self-supervised speech models (NTU)
● Use the SUPERB benchmark to vary gender, content, and speech speed of
● Gender → Adding a few minority-class samples mitigates the performance drop
● Content → The model was insensitive to perplexity
● Speech speed → Faster speech is worse
Speech anonymization (Emmanuel Vincent)
● Speech information
○ Verbal content (identifiers, private info, etc)
○ Speaker (identity, gender, age, ethnic origin, etc)
○ Nonverbal content (emotion, health, etc)
○ Acoustic environment (acoustics, other speakers, etc)
○ User profiling, user identification, voice cloning,
○ Embedded systems, Cryptography, Obfuscation, Anonymization, Federated Learning, etc
○ Simple modifications (e.g. pitch shifts) utterly fail against knowledgeable attackers
● Current speech anonymization challenge != Legal defn.
○ It seems that many big companies don’t anonymize speech (collected from various sources)
○ Task: (1) ASR (2) Emotion recognition
Preserving Trajectory Privacy in Driving Data Release
● What comes with the innovative services provided by intelligent transport
systems (ITS) are potential privacy attacks.
● For example, in traffic monitoring systems, individual users send anonymized
personal location traces continuously to aid in traffic state estimation.
● However, an adversary may link an anonymous GPS trace to a particular person
provided additional knowledge of the person’s residence or working location.
● This cannot be achieved by data encryption or hiding the driver identity. We
resort to the notion of inference privacy that sanitizes raw data to limit the
amount of contained private information.
Audio Deepfake Detection 2022: the First Audio Deep
Synthesis Detection Challenge
● Low-quality fake audio detection: focuses on dealing with bona fide and fully
fake utterances with various real-world noises etc
○ Fully generated utterances
● Partially fake audio detection: distinguish the partially fake audio from the real
○ Generated by manipulating the genuine utterances
● Audio fake game: Solve both an audio generation task and an audio fake
Aasist: Audio anti-spoofing using integrated
spectro-temporal graph attention networks (Naver)
● Spoofing detection solutions can be an important consideration when
automatic speaker verification systems are deployed in real-world applications.
● Two major scenarios:
○ Logical access (LA): spoofing attacks mounted with voice conversion and TTS
○ Physical access (PA): bona fide utterances are captured and then replayed
● Recent studies show that discriminative information (i.e., spoofing artefacts)
can reside in specific temporal and spectral intervals
Characterizing the adversarial vulnerability of speech
self-supervised learning (Helen Meng)
● Speech processing Universal PERformance Benchmark (SUPERB)
○ Upstream model (self-supervised models) + Downstream models (directly uses features, ex.
● Adversarial Attacks
○ Limited-knowledge adversaries: Attackers can access the internals of the target model
(parameters and gradients). But they do not know which downstream task will be conducted.
○ Zero-knowledge adversaries: Target model is unavailable to the attackers. In such a case, the
substitute model is used for approximating gradients for adversarial sample generation.
○ XAB listening test: check if humans can distinguish adversarial samples
● Results: Attacks are effective, humans cannot easily distinguish.
Adversarial Sample Detection for Speaker Verification by
Neural Vocoders (Tencent)
● Automatic speaker verification (ASV), one of the most important technologies for
biometric identification, has been widely adopted in security-critical
● However, ASV is seriously vulnerable to recently emerged adversarial attacks,
yet effective countermeasures against them are limited.
Source Mixing and Separation Robust Audio Steganography
● Audio steganography is the science of
concealing secret messages inside a host
audio called a carrier in such a way that the
concealment is unnoticeable to human ears.
● Recently, deep neural networks (DNNs) have
been used as a steganographic function for
hiding data inside images to achieve high
● The network learns to conceal a hidden
message inside the carrier without manually
specifying a particular redundancy to exploit.
PixInWav: Residual Steganography for
Hiding Pixels in Audio
Exploiting language model for efficient linguistic steganalysis
● Linguistic steganography (LS)
○ Natural language is actually quite suitable for steganography.
○ The advantage is that LS can be easily concealed by the huge number of social activities.
○ (1) modification based and (2) generation based
○ Latter allows more data to be embedded
● Steganalysis = to detect whether there is secret data embedded in the media
● Significant difference between automatically generated stego texts and carrier
texts in terms of the conditional probability distribution of individual words.
Acoustic Echo Cancellation
● Acoustic echo refers to the phenomenon that occurs when a microphone picks
up the far-end signal that is played by a loudspeaker.
● This phenomenon can cause a slight annoyance or a significant breakdown in a
● ICASSP 2022 AEC Challenge by Microsoft
● Various scenarios
○ Long- or varying delays
○ Strong speaker/mic distortions
○ Stationary/non-stationary noise
○ Glitches (due to high CPU usage)
Deep Noise Suppression
● Audio calls in the presence of
background noises get
significantly degraded in terms
of quality/intelligibility of the
● ICASSP 2022 Deep Noise
Suppression Challenge by
Multi-Channel Multi-Party Meeting Transcription
● Speaker Diarization
○ Partitioning an input audio stream into homogeneous segments according to the speaker
identity, i.e. “who spoke when?”
● Multi-speaker ASR
○ Hard to do overlapped speech recognition due to the interfering speakers or background noise
● ICASSP 2022 M2MeT Challenge by Alibaba
VarArray: Array-geometry-agnostic continuous speech
● Continuous speech separation using a microphone array was shown to be
promising in dealing with the speech overlap problem.
● Signals highly depend on the position of the microphones.
● In meetings, we can assume only two or fewer speakers to be active for the
majority of the meeting time.
● Audio-Visual Object Classification For Human-Robot Collaboration
● Multimodal Information Based Speech Processing
● Machine Translation for Spoken and Written Language
● Image and Video Understanding
● Multimodal Signal Processing, Analysis, and Synthesis
● Audio Security and Multi-Modal Systems
● Multi-modal Analysis and Synthesis
● Multimodal Data Fusion and Processing
● Multimodal Analysis in Audio Applications
● Speech emotion recognition using self-supervised features
○ A modular End-to-End SER system based on an Upstream + Downstream architecture paradigm,
which allows easy use/integration of a large variety of self-supervised features.
● Memobert: Pre-training model with prompt-based learning for multimodal
○ learns multimodal joint representations through self-supervised learning
○ prompt-based method that reformulates emotion classification as a masked text prediction
● Multimodal Emotion Recognition with Surgical and Fabric Masks
○ investigate how muffled speech and occluded facial expressions change the prediction of
Speech as a Disease Biomarker
● Fraug: A Frame Rate Based Data Augmentation Method for Depression Detection
from Speech Signals
○ Among others, the speech signal is an important biomarker of our mental state and can be collected
remotely, in a non-invasive manner with no expert supervision.
○ Recently, speech-based automatic diagnosis of depression has gained significant momentum.
● Exploring Dementia Detection from Speech: Cross Corpus Analysis
○ Population aging is responsible for an increase of new Alzheimer’s disease (AD) cases, and creates the
need for scalable, cost-effective methods that are able to detect early stage AD.
○ Speech and language biomarkers are strong indicators of dementia, and provide a low-cost and
widespread alternative for the assessment of cognitive states.
● The Second Dicova Challenge: Dataset and Performance Analysis for Diagnosis of
Covid-19 Using Acoustics
○ Dataset of audio recordings consisting of breathing, cough and speech signals
○ Providing a point-of-care, rapid, easy to use, and cost-effective tool to help contain COVID-19 spread.
● Robust disentangled variational speech representation learning for zero-shot
voice conversion (Tencent)
○ Feeding an arbitrary speaker embedding and content embeddings to the VAE decoder
● Controllable Speech Representation Learning Via Voice Conversion and AIC
○ Its disentangled components (content, pitch, speaker identity, and energy) can be controlled
independently to alter the synthesis result.
● An Investigation of Streaming Non-Autoregressive sequence-to-sequence
● Voice Filter: Few-shot text-to-speech speaker adaptation using voice
conversion as a post-processing module (Amazon)
○ It uses voice conversion (VC) as a post-processing module appended to a pre-existing
high-quality TTS system, framing the few-shot TTS problem as a VC task.
● HiFi-SVC: Fast High Fidelity Cross-Domain Singing Voice Conversion
● Music Enhancement via Image Translation and Vocoding (Adobe)
● Source Separation By Steering Pretrained Music Models
● MELONS: generating melody with long-term structure using transformers and
● Genre-Conditioned Long-Term 3D Dance Generation Driven by Music
● Deep Performer: Score-to-Audio Music Performance Synthesis (Dolby)
● SleepGAN: Towards Personalized Sleep Therapy Music (Nokia)
● Modeling beats and downbeats with a time-frequency Transformer (ByteDance)
Quantum Machine Learning
● Languages: Google Cirq / Microsoft Q# / IBM Qiskit
● Services: Google Quantum AI / Azure Quantum / IBM Quantum
● The dawn of quantum natural language processing
○ We successfully train a quantum-enhanced Long Short-Term Memory network to perform the
parts-of-speech tagging task via numerical simulations.
○ Practical applications are more likely to be a hybrid of classical and quantum operations. This
hybrid approach is not too different from what has been done in the past decade with GPUs.
○ The main idea behind Quantum Machine Learning (QML) is to replace parts of a neural network
(e.g. linear layers) with a quantum counterpart.
● Quantum federated learning with quantum data
○ Hybrid models fall short when dealing with the highly complex purely quantum data.
○ Thus, purely quantum ML models that can address these challenges were developed, such as
quantum neural networks (QNNs).
○ However, due to the fragile nature of the carriers of quantum data, i.e., qubits, there is a natural
need for distributed learning solutions such as federated learning (FL).
Machine Learning is All You Need
● Audio Representations
○ Learnable Wavelet Packet Transform for Data-Adapted Spectrograms
○ A Low-Parametric Model for Bit-Rate Estimation of VVC Residual Coding
○ Low-Complexity Multi-Model CNN in-Loop Filter for AVS3
● Digital Signal Processing
○ Learning Structured Sparsity For Time-Frequency Reconstruction
○ Learning Approach For Fast Approximate Matrix Factorizations
● Communication Systems
○ Adaptive Wireless Power Allocation with Graph Neural Networks
○ Deep Joint Source-Channel Coding for Wireless Image Transmission with Adaptive Rate Control
○ Deep learning for location based beamforming with NLOS channels
○ Phase-Only Reconfigurable Sparse Array Beamforming Using Deep Learning
Joint Unsupervised and Supervised Training for Multilingual
● Most existing methods adopt a 2-stage scheme where
the self-supervised loss is optimized in the first
pretraining stage, and the standard supervised
fine-tuning resumes in the second stage.
● In this paper, we propose an end-to-end (E2E) Joint
Unsupervised and Supervised Training (JUST) method
to combine the supervised loss and the
self-supervised contrastive and masked language
modeling (MLM) losses.
● Spectrogram + Quantizer (wav2vec 2.0) + RNN-T
● Wins over XLSR-53!
Pseudo-Labeling for Massively Multilingual Speech
● Prev works (from Facebook, similar
○ Iterative Pseudo-Labeling for Speech Recognition
(IPL) → LM + beam search to generate pseudo
○ slimIPL: Language-model-free iterative
pseudo-labeling (slimIPL) → Use self-predictions
● Utilizing unlabeled data is helpful, even
with trivial methods.
Multilingual Text-To-Speech Training Using Cross Language
Voice Conversion And Self-Supervised Learning Of Speech
● It’s hard to find speakers who have
native proficiency in several
● Using a HiFi-GAN-like model to
augment data (Synthetic generation
of target speaker speaking different
A Configurable Multilingual Model is All You Need to
Recognize All Languages (Microsoft)
● Configurable multilingual model
(CMM) to recognize speech from
any combination of languages
based on a multi-hot LID vector
selected by users
● Language-specific vocabulary
strategy (making vocab smaller)
● Language-specific transformer
cell (one per language)
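A multi-hot LID vector and its effect on the vocabulary can be sketched as below. This is only an illustration under assumptions: the language inventory `SUPPORTED`, the helper names, the toy vocabularies, and the choice to take the union of per-language vocabularies are all hypothetical and not taken from the paper.

```python
SUPPORTED = ["en", "de", "es", "fr", "it"]   # hypothetical language inventory

def multi_hot_lid(selected, supported=SUPPORTED):
    """Encode a user's language selection as a multi-hot vector."""
    unknown = set(selected) - set(supported)
    if unknown:
        raise ValueError(f"unsupported languages: {sorted(unknown)}")
    return [1 if lang in selected else 0 for lang in supported]

def active_vocab(lid, vocab_by_lang, supported=SUPPORTED):
    """Union of the per-language vocabularies switched on by `lid`."""
    tokens = set()
    for flag, lang in zip(lid, supported):
        if flag:
            tokens |= vocab_by_lang[lang]
    return tokens

lid = multi_hot_lid({"en", "es"})
vocab = active_vocab(lid, {
    "en": {"the", "a"}, "de": {"der", "die"}, "es": {"el", "la"},
    "fr": {"le", "la"}, "it": {"il", "la"},
})
```

Restricting the output tokens to the selected languages is one way the vocabulary could become smaller than a single shared multilingual vocabulary.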
Zero-Shot Cross-Lingual Transfer Using Multi-Stream
Encoder and Efficient Speaker Representation (Tencent)
● Extract speaker embedding features that are
independent of both content information and
● Multi-stream = Input text sequences are fed
into N-stream text encoders in parallel
● zero-shot cross-lingual transfer strategy =
fine-tune also with target-lingual data +
language-balanced sampling strategy
Tackling data scarcity in speech translation using zero-shot
multilingual machine translation techniques
● To tackle data scarcity, it is useful to make use of ASR and MT data for
end-to-end ST models. We explore techniques from zero-shot multilingual text
translation and apply them to speech side.
● Use tokens & augmentation
methods to make the model
decide output language based
on language tokens.
Multi-Lingual Multi-Task Speech Emotion Recognition Using
● Multi-task learning to increase emotion recognition performance
● Additional tasks
○ Gender Prediction (Ge)
○ Language Prediction (La)
○ F0 mean and standard deviation regression task (F0-me, F0-st)
○ Energy mean and standard deviation regression task (En-me, En-st)
○ Voice ratio regression task (Vr)
ADIMA: Abuse Detection In Multilingual Audio
● ADIMA, a novel, linguistically diverse, ethically sourced, expert annotated and
well-balanced multilingual abuse detection audio dataset comprising 11,775
audio samples in 10 Indic languages spanning 65 hours and spoken by 6,446
SERAB: A multi-lingual benchmark for speech emotion
● Speech Emotion Recognition Adaptation Benchmark (SERAB), a framework for
evaluating the performance and generalization capacity of different approaches
for utterance-level SER.
Why still keyword spotting?
● For many voice-enabled platforms, queries follow a highly Zipfian distribution.
On the Comcast X1 entertainment system, for example, the top-20 commands
constitute around 30% of the traffic.
● Using an ASR system is excessive for targeting phonetically distinct commands
with a small vocabulary.
● Audio-only based wake word spotting (WWS), a special case of KWS, is
challenging under noisy conditions due to the environmental interference.
Temporal early exiting for streaming speech commands
● Add extra prediction heads and stop inference midway based on entropy.
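Entropy-based early exiting can be sketched as below. This is a minimal illustration under assumptions, not the paper's implementation: the helper names, the threshold value, and the toy logits are hypothetical, and whether the heads sit along network depth or along time is left open.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def early_exit(head_logits, threshold):
    """Return (exit_index, predicted_class) for the first confident head.

    head_logits: one logit vector per prediction head, in evaluation order.
    Falls back to the last head if no head is confident enough.
    """
    for i, logits in enumerate(head_logits):
        probs = softmax(logits)
        is_last = i == len(head_logits) - 1
        if entropy(probs) < threshold or is_last:
            return i, probs.index(max(probs))
    raise ValueError("no prediction heads given")

heads = [
    [0.1, 0.0, 0.2],   # early head: nearly uniform, high entropy
    [4.0, 0.0, 0.0],   # later head: confident in class 0
    [6.0, 0.0, 0.0],   # final head (not reached with threshold 0.5)
]
exit_index, label = early_exit(heads, threshold=0.5)
```

A lower threshold makes the model exit later and spend more computation; a higher one exits earlier at the risk of less certain predictions.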
A Study of Designing Compact Audio-Visual Wake Word
Spotting System Based on Iterative Fine-Tuning in Neural
● Audio-visual keyword
● Using both is helpful
Text Adaptive Detection for Customizable Keyword Spotting
● Novel text adaptive detection
framework to directly formulate
KWS as a detection rather than a
● Text prompt is used as input, i.e.,
customizable wake words
Joint Ego-Noise Suppression and Keyword Spotting on
Sweeping Robots (Alibaba)
● a novel approach for joint ego-noise (self-created noise) suppression and
● Small footprint keyword spotting (KWS) on sweeping robot, i.e., the
conversation triggering module of the audio interface
● A circular microphone array of M = 6 → Multiple minimum variance
distortionless response (MVDR) beamformers
● If the keyword is present, noise adaptation will be slowed down to prevent
keyword speech being cancelled.
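For reference, the standard MVDR beamformer weights can be written as below. The symbols are not from the slide and are assumptions: $\mathbf{R}_n$ is the noise spatial covariance matrix over the $M = 6$ microphones, and $\mathbf{d}$ is the steering vector toward the target direction. The constraint on the right is the "distortionless" part; the minimized quantity is the output noise power $\mathbf{w}^{H}\mathbf{R}_n\mathbf{w}$.

```latex
\mathbf{w}_{\mathrm{MVDR}}
  = \frac{\mathbf{R}_n^{-1}\,\mathbf{d}}{\mathbf{d}^{H}\,\mathbf{R}_n^{-1}\,\mathbf{d}},
\qquad
\mathbf{w}_{\mathrm{MVDR}}^{H}\,\mathbf{d} = 1
```

Slowing the adaptation of $\mathbf{R}_n$ while a keyword is present keeps keyword energy out of the noise estimate, which is consistent with the last bullet above.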
Unified Speculation, Detection, and Verification Keyword
● Speculation → early decision (giving a head start, reduce system latency)
● Detection → keyword trigger task, more accurate decision
● Verification → verifies previous decision (correct mistakes)
● The proposed latency-aware max-pooling loss can control latency accuracy
An Adapter Based Pre-Training for Efficient and Scalable
Self-Supervised Speech Representation Learning (Huawei)
Apply adapters (B) to original w2v2 (A) to combat language forgetting.
Efficient Adapter Transfer of Self-Supervised Speech
Models for Automatic Speech Recognition (Huawei)
● Fine-tune on ASR task
● Apply adapters
Large-scale ASR Domain Adaptation by Self-and
Semi-supervised Learning (Google)
● Joint training with both RNN-T &
Self-supervised loss (wav2vec 2.0)
● Confidence Estimation Module (CEM)
→ To filter out low-confidence samples in
pseudo-labels for Noisy Student training
○ binary cross entropy between the estimated
confidence p and the binary target sequence c
● It utilizes Wav2vec2.0 loss on the causal
encoder, so there is no transition gap from
non-causal to causal.
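The confidence loss and the filtering step can be sketched as follows. This is a hedged illustration only: `binary_cross_entropy` follows the standard definition, while `keep_confident`, the mean-over-tokens aggregation, the 0.8 threshold, and the toy data are assumptions rather than details from the paper.

```python
import math

def binary_cross_entropy(p, c, eps=1e-7):
    """Mean BCE between estimated confidences p and binary targets c."""
    total = 0.0
    for p_i, c_i in zip(p, c):
        p_i = min(max(p_i, eps), 1.0 - eps)  # avoid log(0)
        total -= c_i * math.log(p_i) + (1 - c_i) * math.log(1.0 - p_i)
    return total / len(p)

def keep_confident(utterances, min_confidence):
    """Keep pseudo-labeled utterances whose mean token confidence is high."""
    return [
        u for u in utterances
        if sum(u["confidence"]) / len(u["confidence"]) >= min_confidence
    ]

# c_i = 1 if the i-th hypothesized token is correct, else 0 (toy values).
p = [0.9, 0.8, 0.3]
c = [1, 1, 0]
loss = binary_cross_entropy(p, c)

pool = [
    {"text": "turn on the light", "confidence": [0.9, 0.95, 0.9, 0.85]},
    {"text": "turn of the night", "confidence": [0.9, 0.4, 0.9, 0.3]},
]
selected = keep_confident(pool, min_confidence=0.8)
```

Only the utterances in `selected` would then be passed on as pseudo-labeled training data for the student model.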
Learning Domain-Invariant Transformation for Speaker
● Meta-learning to generate domain-invariant embeddings without pre-training
● Use both metric loss & classification loss together
Magic dust for cross-lingual adaptation of monolingual
● Monolingual wav2vec-2.0 is a good few-shot ASR learner in several languages
○ English → 8 Target languages
○ Performance up to 86% compared to XLSR
○ ASR Fine-Tuning on English hurts other languages
● Monolingual wav2vec2 model pre-trained on a high-resource language using
moderately-sized unlabeled data and small-sized labeled data in the target
language yields similar performance to XLSR
● Dropout Uncertainty-Driven Self-Training (DUST)
○ Leverages unlabeled data by pseudo-labeling (semi-supervised)
○ Student from a previous round becomes the teacher for the next round
Filteraugment: An acoustic environmental data
● FilterAugment mimics acoustic filters by applying different weights on
frequency bands, thereby enabling the model to extract relevant information from
a wider frequency region.
● Improved version of frequency
masking which masks information
on random frequency bands.
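The contrast between frequency masking and band-wise weighting can be sketched as below. This is a minimal illustration under assumptions, not the reference FilterAugment code: the function names, the number of bands, the gain range in dB, and the step-shaped (piecewise-constant) filter are hypothetical choices.

```python
import random

def frequency_mask(spec, start, width, fill=0.0):
    """Frequency masking: overwrite `width` bins from `start` with `fill`."""
    return [
        [fill] * len(row) if start <= f < start + width else list(row)
        for f, row in enumerate(spec)
    ]

def filter_augment(spec, n_bands, db_range, rng):
    """Step-type filter: add one random gain (in dB) per frequency band.

    `spec` is a log-scaled spectrogram indexed as spec[freq][time], so a
    gain in dB is an additive offset.
    """
    n_freq = len(spec)
    # Random interior band boundaries, plus the two edges.
    cuts = sorted(rng.sample(range(1, n_freq), n_bands - 1))
    edges = [0] + cuts + [n_freq]
    out = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        gain = rng.uniform(*db_range)
        out.extend([v + gain for v in row] for row in spec[lo:hi])
    return out

rng = random.Random(0)
spec = [[float(f)] * 4 for f in range(8)]   # 8 frequency bins, 4 frames
masked = frequency_mask(spec, start=2, width=3)
filtered = filter_augment(spec, n_bands=3, db_range=(-6.0, 6.0), rng=rng)
```

Masking discards the selected band entirely, whereas the band-wise gains keep every bin but change its relative level.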
Auditory-Based Data Augmentation for end-to-end
Automatic Speech Recognition
● Spectral smearing smooths the
speech spectrum and suppresses
details by broadening the
bandwidths of the auditory filters.
● Loudness recruitment compresses
amplitudes of different frequency
bands, simulates damaged ear.
Intermix: An Interference-Based Data Augmentation and
Regularization Technique for Automatic Deep Sound
● Prev work: BC learning
○ Taking sound energy into account
● Prev work: SpeechMix
○ Similar to manifold mixup,
mix intermediate representations
● This work: InterMix
○ Also apply phase shifts to inputs
& use it when mixing
Robust Speaker Verification Using Population-Based Data
● A population-based searching strategy for optimizing the augmentation
● Instead of finding a fixed set of hyper-parameters, PBA learns a scheduler for
setting the hyper-parameters.
● List of augmentations used
○ Reverberation: Convolve with room impulse response (RIR)
○ Music: Music from a randomly selected MUSAN
○ Noise: Noise from MUSAN is added
○ Babble: Babble noise is added
○ Frequency masking
○ Time masking
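Three of the listed augmentations can be sketched in a few lines. This is an illustrative sketch only: the function names, the toy signals, and the fixed mixing gain are assumptions, and real pipelines would draw RIRs and noise from recorded corpora and pick gains from a target SNR.

```python
def convolve(signal, rir):
    """Full linear convolution of a signal with a room impulse response."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for n, s in enumerate(signal):
        for k, h in enumerate(rir):
            out[n + k] += s * h
    return out

def add_noise(signal, noise, gain):
    """Additive noise/music/babble: mix a scaled interfering signal in."""
    return [s + gain * noise[i % len(noise)] for i, s in enumerate(signal)]

def time_mask(frames, start, width, fill=0.0):
    """Time masking: overwrite `width` consecutive frames with `fill`."""
    return [fill if start <= t < start + width else v
            for t, v in enumerate(frames)]

speech = [1.0, 0.2, 0.0, 0.5]
rir = [1.0, 0.5]               # toy RIR: direct path plus one echo
reverberant = convolve(speech, rir)
noisy = add_noise(speech, noise=[0.1, -0.1], gain=0.5)
masked = time_mask(speech, start=1, width=2)
```

A learned schedule, as in the population-based approach above, would then decide how strongly and how often each of these operations is applied during training.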
● LPC Augment: an LPC-based ASR Data Augmentation Algorithm for Low and
Zero-Resource Children's Dialects
○ The data augmentation procedure consists of perturbing the formant peaks of the Linear
predictive coding (LPC) spectrum during LPC analysis and reconstruction.
○ Compared with SpecAugment & speed perturbation. Did not show an absolute advantage.
● ImportantAug: A Data Augmentation Agent for Speech
○ Adding noise to unimportant regions of the speech and not to important regions.
○ Importance is predicted for each utterance by a data augmentation agent that is trained to
maximize the amount of noise it adds while minimizing its impact on recognition performance.
● Fraug: A Frame Rate Based Data Augmentation Method for Depression
Detection from Speech Signals
○ Changing the frame-width and the frame-shift parameters during the feature extraction process
● Cross-speaker style transfer for text-to-speech using data augmentation
○ Cross-speaker style transfer for TTS using data augmentation via voice conversion
● Spatial mixup: Directional loudness modification as data augmentation for
sound event localization and detection
○ application of parametric spatial audio effects for data augmentation, which modifies the
directional properties of a multi-channel spatial audio signal encoded in the ambisonics domain.
● Spatial Data Augmentation with Simulated Room Impulse Responses for Sound
Event Localization and Detection
○ Augments spatial characteristics using simulated room impulse responses (RIR). Simulated RIRs
are convolved with the source signals to obtain an augmented multi-channel training dataset.
● Distribution augmentation for low-resource expressive text-to-speech
○ Data augmentation through word permutations & Constituency parse based tree substitutions
Federated learning challenges and opportunities: An
outlook (Amazon Alexa)
● Finding the lower limit of the number of communication rounds
○ Many local updates (for communication efficiency) can still converge to a desirable model.
○ Overly aggressive local updates will harm the performance due to the data heterogeneity
○ Memory constraint (each on-device model needs to be small in size)
○ Computation constraint (devices may perform only a limited number of gradient updates)
● Personalized FL
○ Conventional FL trains one model, personalized FL maintains a collection of client-specific
○ Will reduce test errors beyond what is possible with a single global model.
● Challenges of Lifelong FL
○ Online updates with single-pass data
○ Coupling of model training and data generation.
● Challenges on data
○ Data polarity (collected data does not represent the whole data distribution)
○ Data dependency (data are collected from time series with inevitable dependency)
Learnings from Federated Learning in the Real world (Alexa)
● Skewness: “heavy devices” with large amounts of data while there are many
“light users” with only a handful of data points.
● Non-uniform device selection, which uses the number of input points per
device, outperforms uniform sampling in FL.
● We compare one-shot FL (uses the full range of data, single training) with continual
FL (avoids storing data, multiple training rounds). We show that continual FL
outperforms the one-shot strategy in some settings, and is overall most
beneficial for heavy devices.
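Non-uniform device selection can be sketched as weighted sampling. This is a minimal sketch under assumptions: weighting each device in direct proportion to its number of data points is one plausible reading of the slide, and the device names and counts are made up.

```python
import random

def select_devices(data_counts, k, rng):
    """Sample k devices with probability proportional to their data count.

    Sampling is with replacement for brevity; a real system would
    likely de-duplicate or sample without replacement.
    """
    devices = list(data_counts)
    weights = [data_counts[d] for d in devices]
    return rng.choices(devices, weights=weights, k=k)

# Skewed fleet: one heavy device and several light ones (toy counts).
data_counts = {"heavy": 900, "light_a": 40, "light_b": 30, "light_c": 30}
picked = select_devices(data_counts, k=1000, rng=random.Random(0))
heavy_share = picked.count("heavy") / len(picked)
```

With uniform sampling the heavy device would be picked about a quarter of the time here; with data-proportional weights it dominates the selection.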
Enabling on-device training of speech recognition models
with federated dropout (Google)
● Communication/computation costs are strongly correlated with the size of the
model being trained. We propose using federated dropout to reduce the size of
client models while training a full-size model server-side.
● Furthermore, we find that federated dropout yields
smaller sub-models with lower WER, making it
easier to dynamically adjust the model size.
● We use a realistic setting for federated training
of ASR models, wherein a well-trained server-side
model is adapted to a new domain with FL on
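The idea of training small client sub-models while keeping a full-size server model can be sketched as below. This is a toy illustration under assumptions, not the paper's method: the model is a single weight matrix whose rows stand for hidden units, `make_submodel` and `merge_updates` are hypothetical helpers, and local training is replaced by a fixed nudge.

```python
import random

def make_submodel(server_w, keep_fraction, rng):
    """Pick a random subset of hidden units (rows) to send to a client."""
    n_keep = max(1, int(len(server_w) * keep_fraction))
    rows = sorted(rng.sample(range(len(server_w)), n_keep))
    return rows, [list(server_w[r]) for r in rows]

def merge_updates(server_w, client_results):
    """Average each row over the clients that actually trained it."""
    merged = [list(row) for row in server_w]
    for r in range(len(server_w)):
        updates = [w[rows.index(r)] for rows, w in client_results if r in rows]
        if updates:
            merged[r] = [sum(col) / len(updates) for col in zip(*updates)]
    return merged

rng = random.Random(0)
server_w = [[0.0, 0.0] for _ in range(4)]      # 4 hidden units, 2 inputs

results = []
for _ in range(3):                               # three simulated clients
    rows, sub_w = make_submodel(server_w, keep_fraction=0.5, rng=rng)
    # Stand-in for local training: nudge every received weight by +1.
    sub_w = [[v + 1.0 for v in row] for row in sub_w]
    results.append((rows, sub_w))

server_w = merge_updates(server_w, results)
```

Each client only ever holds and transmits half of the rows, which is where the communication and computation savings would come from.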
Federated Self-supervised Learning
● Federated Self-Training for Data-Efficient Audio Recognition (Philips Research)
○ Self-training approach to exploit large-scale on-device unlabeled data to improve the
generalization of audio recognition models
○ Generate pseudo labels & train with softened labels
● Federated Self-Supervised Learning for Acoustic Event Classification (Amazon)
○ Applying FL to improve acoustic event classification (AEC) performance while no customer data
can be directly uploaded to the server
○ No pseudo labels (Common in AEC)
○ Solve the task of predicting the future
audio frame via feature representation
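For the federated self-training entry above, generating pseudo labels with softened targets can be sketched as follows. This is a hedged sketch: softening via a temperature-scaled softmax is an assumption about how the labels are softened, and the function names, temperature, and logits are hypothetical.

```python
import math

def soften(logits, temperature):
    """Temperature-scaled softmax: higher temperature gives softer labels."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def pseudo_label(logits, temperature=2.0):
    """Return the hard pseudo label and its softened distribution."""
    soft = soften(logits, temperature)
    hard = soft.index(max(soft))
    return hard, soft

# Model output on one unlabeled on-device clip (toy logits).
hard, soft = pseudo_label([2.0, 0.0, 0.0])
sharp = soften([2.0, 0.0, 0.0], temperature=1.0)
```

Training on `soft` instead of the one-hot `hard` label penalizes a wrong pseudo label less severely, which is the usual motivation for softened targets.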