Kwanghee Choi
Before we start
● ICASSP
○ De facto conference for audio/speech (along with Interspeech)
● 2022 ICASSP Stats
○ 3967 papers submitted, 1785 (45.0%) accepted
○ About 100 papers skimmed
● Slides
○ General topics & Service-related topics
○ Can’t go through everything, will try to finish quickly
1. General Trend
1. 2022 SotA
2. Contrastive / Self-supervised
3. Security
4. Post-COVID Teleconferencing
5. Applications
2. Topics related to our tasks
1. Multilingualism / Cross-lingualism
2. Keyword Spotting
3. Few-shot / Low-shot
4. Audio Augmentation
5. Federated Learning
I. General Trend
1.1 2022 SotA
General Models
● Wav2vec (Facebook)
○ Raw audio + 1D Convs + BERT + CPC-style contrastive loss (Similar to MLM, feature distractors)
● Wav2vec 2.0 (Facebook)
○ Raw audio + 1D Convs + BERT + Codebooks + Contrastive loss (quantized distractors)
● HuBERT (Facebook)
○ Raw audio + 1D Convs + BERT + MFCC-based Clustering + Cluster Ensembles
● BigSSL/CAP12 (Google)
○ Spectrogram + 1D Convs + Conformers + Contrastive loss (quantized distractors)
● Data2vec (Facebook)
○ Raw audio + 1D Convs + BERT + Student-teacher (EMA) prediction
● WavLM (Microsoft)
○ HuBERT + Speech denoising + Gated relative position bias
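The "contrastive loss with distractors" that several of these models share can be sketched as follows. This is a minimal NumPy illustration, not any model's actual implementation: the function name, the cosine similarity, and the temperature of 0.1 are assumptions chosen for the example.

```python
import numpy as np

def contrastive_loss(context, target, distractors, temperature=0.1):
    """InfoNCE-style loss: identify the true target among K distractors.

    context:     (D,)   context vector at a masked position
    target:      (D,)   true latent for that position (e.g. a quantized code)
    distractors: (K, D) latents sampled from other positions
    All names and the temperature value are illustrative assumptions.
    """
    # Stack candidates so that the true target sits at index 0.
    candidates = np.vstack([target[None, :], distractors])
    # Cosine similarity between the context vector and every candidate.
    sims = candidates @ context / (
        np.linalg.norm(candidates, axis=1) * np.linalg.norm(context) + 1e-8
    )
    logits = sims / temperature
    logits = logits - logits.max()  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    # Negative log-probability assigned to the true target.
    return float(-log_probs[0])

rng = np.random.default_rng(0)
true_latent = rng.standard_normal(16)
other_latents = rng.standard_normal((10, 16))
# A context vector aligned with the target gives a lower loss than one pointing away.
print(contrastive_loss(true_latent, true_latent, other_latents))
print(contrastive_loss(-true_latent, true_latent, other_latents))
```

The loss is small when the context vector is closer to the true latent than to any distractor, which is the training signal described in the bullets above.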
GLUE-like Benchmarks
● SUPERB
○ Speech processing Universal PERformance Benchmark
○ Recognition: Phoneme Recognition, Automatic Speech Recognition
○ Detection: Keyword Spotting, Query by Example Spoken Term Detection
○ Semantics: Intent Classification, Slot Filling, Speech Translation
○ Speaker: Speaker Identification, Automatic Speaker Verification, Speaker Diarization
○ Paralinguistics: Emotion Recognition
○ Generation: Speech enhancement, Speech Separation
● NOSS (Google)
○ NOn-Semantic Speech Benchmark
○ Speaker identification, Language identification, Command, Emotion, Dementia/healthy
● HARES (Deepmind)
○ Holistic audio representation evaluation suite
○ Environment: Audio tagging, Animal/Scene classification
○ Speech: Keyword Spotting, Intention Classification, Language identification, Speaker identification
○ Music: Instrument Identification, Pitch estimation, Music tagging
1.2 Contrastive / Self-supervised
Towards Learning Universal Audio Representations
● HARES: New GLUE-like benchmark
● SlowFast (from video) + NFNet (from vision) seem to work well.
○ SlowFast: Two branches with bigger/smaller kernel widths
○ NFNet: Normalizer-Free ResNets
● CPC (contrastive learning) works quite well.
Universal paralinguistic speech representations using
self-supervised conformers (Google)
● Contrastive learning (of w2v2) on Conformers
○ Follow-up work on distilling this model already exists (TRILLsson)
● Closely follows previous work, BigSSL (Google)
○ Trained on speech-heavy YouTube videos
○ Their conclusion: SSL + large models are especially helpful for small datasets
● Best performance did not come from the final layer's feature vector (same conclusion as BigSSL)
→ CAP12 (12th layer feature outputs)
A Noise-Robust Self-supervised Pre-training Model Based
Speech Representation Learning for Automatic Speech Recognition
Making the w2v feature encoder robust to
additional noise via contrastive loss
Wav2vec-switch: Contrastive learning from original-noisy
speech pairs for robust speech recognition (Microsoft)
● Contextualized representation being robust to noise
w2v2 loss
DistilHuBERT: Speech representation learning by layer-wise
distillation of hidden-unit BERT (NTU)
Improving Self-Supervised Learning for Speech Recognition
with Intermediate Layer Supervision (Microsoft)
● The common practice in SSL is to compute the self-supervised loss on the top layer, as in wav2vec 2.0 and HuBERT.
● However, the lower layers of such a pre-trained model are shown to have a low correlation with phonetic information.
● In this work, we propose to apply intermediate layer supervision to encourage lower layers to learn content knowledge → apply the same HuBERT loss to the lower layers as well
Exploring Heterogeneous Characteristics of Layers in ASR
Models for More Efficient Training (Google)
● Based on “Are All Layers Created Equal?” (Bengio)
○ Fix intermediate layers’ weights to some other weights
○ Re-initialization: Come back to initial values
○ Re-randomization: Get random weights
● While ambient layers were present in all model sizes, we observed that larger models, i.e., overparameterized models, had more ambient layers.
● During early rounds, the ambient layers were more spread throughout the model; only later did the separation become more distinct.
● Group normalization (GN) was more robust against re-randomization than batch normalization (BN).
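The two layer probes (re-initialization and re-randomization) can be sketched on a toy stack of weight matrices. This is only an illustration under assumptions: the helper names, the Gaussian initializer, and the toy "training" step are hypothetical, and the real study evaluates ASR metrics after each swap.

```python
import numpy as np

def init_layers(sizes, rng, scale=0.1):
    """Create one weight matrix per layer (hypothetical Gaussian initializer)."""
    return [rng.standard_normal((m, n)) * scale
            for m, n in zip(sizes[:-1], sizes[1:])]

def swap_layer(trained, layer_idx, mode, initial, rng, scale=0.1):
    """Return a copy of the trained weights with one layer replaced.

    mode="re-init":   put back the layer's own values from initialization
    mode="re-random": draw fresh random values for the layer
    """
    probed = [w.copy() for w in trained]
    if mode == "re-init":
        probed[layer_idx] = initial[layer_idx].copy()
    elif mode == "re-random":
        probed[layer_idx] = rng.standard_normal(trained[layer_idx].shape) * scale
    else:
        raise ValueError(f"unknown mode: {mode}")
    return probed

rng = np.random.default_rng(0)
initial = init_layers([4, 8, 3], rng)
trained = [w + 1.0 for w in initial]  # stand-in for real training
for idx in range(len(trained)):
    for mode in ("re-init", "re-random"):
        probed = swap_layer(trained, idx, mode, initial, rng)
        # In a real experiment: evaluate `probed` and compare with `trained`.
        # A layer whose swap barely changes the metric would count as ambient.
```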
Investigation of Robustness of Hubert Features from
Different Layers to Domain, Accent and Language Variations
● Our experiments indicate that as domain, accent, bandwidth and language deviate from the source domain, the relative improvement decreases.
● The last layer of HuBERT is very specific to the dataset on which it is trained. The second-to-last layer seems to be better when there are domain and accent differences.
● Middle layers are more suited when data is from a different
Don't speak too fast: The impact of data bias on
self-supervised speech models (NTU)
● Use the SUPERB benchmark to study how gender, content, and speech speed in the pre-training data affect performance
● Gender → Adding a few minority-class samples mitigates the performance drop
● Content → The model was insensitive to perplexity
● Speech speed → Faster speech is worse
Speech anonymization (Emmanuel Vincent)
● Speech information
○ Verbal content (identifiers, private info, etc)
○ Speaker (identity, gender, age, ethnic origin, etc)
○ Nonverbal content (emotion, health, etc)
○ Acoustic environment (acoustics, other speakers, etc)
● Risks
○ User profiling, user identification, voice cloning,
information leakage
● Methods
○ Embedded systems, Cryptography, Obfuscation, Anonymization, Federated Learning, etc
○ Simple modifications (ex. Pitch shifts) utterly fail for knowledgeable attackers
● Current speech anonymization challenge != legal definition
○ It seems that many big companies don't anonymize speech (collected from various sources)
○ Task: (1) ASR (2) Emotion recognition
Preserving Trajectory Privacy in Driving Data Release
● What comes with the innovative services provided by intelligent transport
systems (ITS) are potential privacy attacks.
● For example, in traffic monitoring systems, individual users send anonymized
personal location traces continuously to aid in traffic state estimation.
● However, an adversary may link an anonymous GPS trace to a particular person
provided additional knowledge of the person’s residence or working location.
● This cannot be achieved by data encryption or hiding the driver identity. We
resort to the notion of inference privacy that sanitizes raw data to limit the
amount of contained private information.
Audio Deepfake Detection 2022: the First Audio Deep
Synthesis Detection Challenge
● Low-quality fake audio detection: focuses on dealing with bona fide and fully
fake utterances with various real-world noises etc
○ Fully generated utterances
● Partially fake audio detection: distinguish the partially fake audio from the real
○ Generated by manipulating the genuine utterances
● Audio fake game: Solve both an audio generation task and an audio fake
detection task
Aasist: Audio anti-spoofing using integrated
spectro-temporal graph attention networks (Naver)
● Spoofing detection solutions can be an important consideration when
automatic speaker verification systems are deployed in real-world applications.
● Two major scenarios:
○ Logical access (LA): spoofing attacks mounted with voice conversion and TTS
○ Physical access (PA): bona fide utterances are captured and then replayed
● Recent studies show that discriminative information (i.e., spoofing artefacts)
can reside in specific temporal and spectral intervals
Characterizing the adversarial vulnerability of speech
self-supervised learning (Helen Meng)
● Speech processing Universal PERformance Benchmark (SUPERB)
○ Upstream model (self-supervised models) + Downstream models (directly uses features, ex.
● Adversarial Attacks
○ Limited-knowledge adversaries: Attackers can access the internals of the target model
(parameters and gradients). But they do not know which downstream task will be conducted.
○ Zero-knowledge adversaries: Target model is unavailable to the attackers. In such a case, the
substitute model is used for approximating gradients for adversarial sample generation.
○ XAB listening test: check if humans can distinguish adversarial samples
● Results: Attacks are effective, humans cannot easily distinguish.
Adversarial Sample Detection for Speaker Verification by
Neural Vocoders (Tencent)
● Automatic speaker verification (ASV), one of the most important technologies for
biometric identification, has been widely adopted in security-critical
● However, ASV is seriously vulnerable to recently emerged adversarial attacks,
yet effective countermeasures against them are limited.
Source Mixing and Separation Robust Audio Steganography
● Audio steganography is the science of
concealing secret messages inside a host
audio called a carrier in such a way that the
concealment is unnoticeable to human ears.
● Recently, deep neural networks (DNNs) have
been used as a steganographic function for
hiding data inside images to achieve high
● The network learns to conceal a hidden
message inside the carrier without manually
specifying a particular redundancy to exploit.
PixInWav: Residual Steganography for
Hiding Pixels in Audio
Exploiting language model for efficient linguistic steganalysis
● Linguistic steganography (LS)
○ Natural language is actually quite suitable for steganography.
○ The advantage is that LS can be easily concealed by the huge number of social activities.
○ (1) modification based and (2) generation based
○ Latter allows more data to be embedded
● Steganalysis = to detect whether there is secret data embedded in the media
● Significant difference between automatically generated stego texts and carrier
texts in terms of the conditional probability distribution of individual words.
Post-COVID Teleconferencing
Acoustic Echo Cancellation
● Acoustic echo refers to the phenomenon that occurs when a microphone picks
up the far-end signal that is played by a loudspeaker.
● This phenomenon can cause a slight annoyance or a significant breakdown in a
communication system.
● ICASSP 2022 AEC Challenge by Microsoft
● Various scenarios
○ Long- or varying delays
○ Strong speaker/mic distortions
○ Stationary/non-stationary noise
○ Glitches (due to high CPU usage)
○ etc.
Deep Noise Suppression
● Audio calls in the presence of
background noises get
significantly degraded in terms
of quality/intelligibility of the
perceived speech.
● ICASSP 2022 Deep Noise
Suppression Challenge by
Multi-Channel Multi-Party Meeting Transcription
● Speaker Diarization
○ Partitioning an input audio stream into homogeneous segments according to the speaker
identity, i.e. "who spoke when?"
● Multi-speaker ASR
○ Hard to do overlapped speech recognition due to the interfering speakers or background noise
● ICASSP 2022 M2MeT Challenge by Alibaba
VarArray: Array-geometry-agnostic continuous speech
separation (Microsoft)
● Continuous speech separation using a microphone array was shown to be
promising in dealing with the speech overlap problem.
● Signals highly depend on the position of the microphones.
● In meetings, we can assume only two or fewer speakers to be active for the
majority of the meeting time.
Multimodal Systems
● Audio-Visual Object Classification For Human-Robot Collaboration
● Multimodal Information Based Speech Processing
● Machine Translation for Spoken and Written Language
● Image and Video Understanding
● Multimodal Signal Processing, Analysis, and Synthesis
● Audio Security and Multi-Modal Systems
● Multi-modal Analysis and Synthesis
● Multimodal Data Fusion and Processing
● Multimodal Analysis in Audio Applications
Emotion Recognition
● Speech emotion recognition using self-supervised features
○ A modular End-to-End SER system based on an Upstream + Downstream architecture paradigm,
which allows easy use/integration of a large variety of self-supervised features.
● Memobert: Pre-training model with prompt-based learning for multimodal
emotion recognition
○ learns multimodal joint representations through self-supervised learning
○ prompt-based method that reformulates emotion classification as a masked text prediction
● Multimodal Emotion Recognition with Surgical and Fabric Masks
○ investigate how muffled speech and occluded facial expressions change the prediction of
Speech as a Disease Biomarker
● Fraug: A Frame Rate Based Data Augmentation Method for Depression Detection
from Speech Signals
○ Among others, the speech signal is an important biomarker of our mental state and can be collected
remotely, in a non-invasive manner with no expert supervision.
○ Recently, speech-based automatic diagnosis of depression has gained significant momentum.
● Exploring Dementia Detection from Speech: Cross Corpus Analysis
○ Population aging is responsible for an increase of new Alzheimer’s disease (AD) cases, and creates the
need for scalable, cost-effective methods that are able to detect early stage AD.
○ Speech and language biomarkers are strong indicators of dementia, and provide a low-cost and
widespread alternative for the assessment of cognitive states.
● The Second Dicova Challenge: Dataset and Performance Analysis for Diagnosis of
Covid-19 Using Acoustics
○ Dataset of audio recordings consisting of breathing, cough and speech signals
○ Providing a point-of-care, rapid, easy to use, and cost-effective tool to help contain COVID-19 spread.
Voice Conversion
● Robust disentangled variational speech representation learning for zero-shot
voice conversion (Tencent)
○ Feeding an arbitrary speaker embedding and content embeddings to the VAE decoder
● Controllable Speech Representation Learning Via Voice Conversion and AIC
Loss (Adobe)
○ Its disentangled components (content, pitch, speaker identity, and energy) can be controlled
independently to alter the synthesis result.
● An Investigation of Streaming Non-Autoregressive sequence-to-sequence
Voice Conversion
● Voice Filter: Few-shot text-to-speech speaker adaptation using voice
conversion as a post-processing module (Amazon)
○ It uses voice conversion (VC) as a post-processing module appended to a pre-existing
high-quality TTS system, framing the few-shot TTS problem as a VC task.
Music Applications
● HiFi-SVC: Fast High Fidelity Cross-Domain Singing Voice Conversion
● Music Enhancement via Image Translation and Vocoding (Adobe)
● Source Separation By Steering Pretrained Music Models
● MELONS: generating melody with long-term structure using transformers and
structure graph
● Genre-Conditioned Long-Term 3D Dance Generation Driven by Music
● Deep Performer: Score-to-Audio Music Performance Synthesis (Dolby)
● SleepGAN: Towards Personalized Sleep Therapy Music (Nokia)
● Modeling beats and downbeats with a time-frequency Transformer (ByteDance)
Quantum Machine Learning
● Languages: Google Cirq / Microsoft Q# / IBM Qiskit
● Services: Google Quantum AI / Azure Quantum / IBM Quantum
● The dawn of quantum natural language processing
○ We successfully train a quantum-enhanced Long Short-Term Memory network to perform the
parts-of-speech tagging task via numerical simulations.
○ Practical applications are more likely to be a hybrid of classical and quantum operations. This
hybrid approach is not too different from what has been done in the past decade with GPUs.
○ The main idea behind Quantum Machine Learning (QML) is to replace parts of a neural network
(e.g. linear layers) with a quantum counterpart.
● Quantum federated learning with quantum data
○ Hybrid models fall short when dealing with the highly complex purely quantum data.
○ Thus, purely quantum ML models that can address these challenges were developed, such as
quantum neural networks (QNNs).
○ However, due to the fragile nature of the carriers of quantum data, i.e., qubits, there is a natural
need for distributed learning solutions such as federated learning (FL).
Machine Learning is All You Need
● Audio Representations
○ Learnable Wavelet Packet Transform for Data-Adapted Spectrograms
● Encodings
○ A Low-Parametric Model for Bit-Rate Estimation of VVC Residual Coding
○ Low-Complexity Multi-Model CNN in-Loop Filter for AVS3
● Digital Signal Processing
○ Learning Structured Sparsity For Time-Frequency Reconstruction
○ Learning Approach For Fast Approximate Matrix Factorizations
● Communication Systems
○ Adaptive Wireless Power Allocation with Graph Neural Networks
○ Deep Joint Source-Channel Coding for Wireless Image Transmission with Adaptive Rate Control
● Beamforming
○ Deep learning for location based beamforming with NLOS channels
○ Phase-Only Reconfigurable Sparse Array Beamforming Using Deep Learning
II. Topics related to our tasks
Multilingualism / Cross-lingualism
Joint Unsupervised and Supervised Training for Multilingual
ASR (Google)
● Most existing methods adopt a 2-stage scheme where
the self-supervised loss is optimized in the first
pretraining stage, and the standard supervised
fine-tuning resumes in the second stage.
● In this paper, we propose an end-to-end (E2E) Joint
Unsupervised and Supervised Training (JUST) method
to combine the supervised loss and the
self-supervised contrastive and masked language
modeling (MLM) losses.
● Spectrogram + Quantizer (wav2vec 2.0) + RNN-T
● Wins over XLSR-53!
Pseudo-Labeling for Massively Multilingual Speech
Recognition (Facebook)
● Prev works (from Facebook, similar
○ Iterative Pseudo-Labeling for Speech Recognition
(IPL) → LM + beam search to generate pseudo
○ slimIPL: Language-model-free iterative
pseudo-labeling (slimIPL) → Use self-predictions
● Utilizing unlabeled data is helpful, even
with trivial methods.
Multilingual Text-To-Speech Training Using Cross Language
Voice Conversion And Self-Supervised Learning Of Speech
Representations (Facebook)
● It’s hard to find speakers who have
native proficiency in several
● Using HifiGAN-like model to
augment data (Synthetic generation
of target speaker speaking different
A Configurable Multilingual Model is All You Need to
Recognize All Languages (Microsoft)
● Configurable multilingual model
(CMM) to recognize speech from
any combination of languages
based on a multi-hot LID vector
selected by users
● Language-specific vocabulary
strategy (making vocab smaller)
● Language-specific transformer
cell (one per language)
Zero-Shot Cross-Lingual Transfer Using Multi-Stream
Encoder and Efficient Speaker Representation (Tencent)
● Extract speaker embedding features that are
independent of both content information and
language identity.
● Multi-stream = Input text sequences are fed
into N-stream text encoders in parallel
● zero-shot cross-lingual transfer strategy =
fine-tune also with target-lingual data +
language-balanced sampling strategy
Tackling data scarcity in speech translation using zero-shot
multilingual machine translation techniques
● To tackle data scarcity, it is useful to make use of ASR and MT data for
end-to-end ST models. We explore techniques from zero-shot multilingual text
translation and apply them to speech side.
● Use tokens & augmentation
methods to make the model
decide output language based
on language tokens.
Multi-Lingual Multi-Task Speech Emotion Recognition Using
wav2vec 2.0
● Multi-task learning to increase emotion recognition performance
● Additional tasks
○ Gender Prediction (Ge)
○ Language Prediction (La)
○ F0 mean and standard deviation regression task (F0-me, F0-st)
○ Energy mean and standard deviation regression task (En-me, En-st)
○ Voice ratio regression task (Vr)
ADIMA: Abuse Detection In Multilingual Audio
● ADIMA, a novel, linguistically diverse, ethically sourced, expert annotated and
well-balanced multilingual abuse detection audio dataset comprising 11,775
audio samples in 10 Indic languages spanning 65 hours and spoken by 6,446
unique users.
SERAB: A multi-lingual benchmark for speech emotion recognition
● Speech Emotion Recognition Adaptation Benchmark (SERAB), a framework for
evaluating the performance and generalization capacity of different approaches
for utterance-level SER.
Keyword Spotting
Why still keyword spotting?
● For many voice-enabled platforms, queries follow a highly Zipfian distribution.
On the Comcast X1 entertainment system, for example, the top-20 commands
constitute around 30% of the traffic.
● Using an ASR system is excessive for targeting phonetically distinct commands
with a small vocabulary.
● Audio-only based wake word spotting (WWS), a special case of KWS, is
challenging under noisy conditions due to the environmental interference.
Temporal early exiting for streaming speech commands
recognition (Comcast)
● Add extra prediction heads and stop inference midway based on prediction entropy.
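A minimal sketch of entropy-based early exiting, assuming each prediction head emits a probability vector at every time step. The function names and the threshold value of 0.5 are hypothetical; the slide does not state the paper's exact exit criterion.

```python
import numpy as np

def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    probs = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    return float(-(probs * np.log(probs)).sum())

def early_exit_predict(per_step_probs, threshold=0.5):
    """Return (label, step) for the first step whose entropy is below threshold.

    Falls back to the final step if no step is confident enough.
    """
    for step, probs in enumerate(per_step_probs):
        if entropy(probs) < threshold:
            return int(np.argmax(probs)), step
    return int(np.argmax(per_step_probs[-1])), len(per_step_probs) - 1

# Toy stream: the prediction sharpens as more audio is consumed.
stream = [
    [0.40, 0.30, 0.30],
    [0.60, 0.20, 0.20],
    [0.97, 0.02, 0.01],
    [0.99, 0.005, 0.005],
]
label, step = early_exit_predict(stream)
print(label, step)
```

Lowering the threshold trades latency for accuracy: the model waits for a sharper distribution before committing to a command.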
A Study of Designing Compact Audio-Visual Wake Word
Spotting System Based on Iterative Fine-Tuning in Neural
Network Pruning
● Audio-visual keyword
● Using both is helpful
Text Adaptive Detection for Customizable Keyword Spotting
● Novel text adaptive detection
framework to directly formulate
KWS as a detection rather than a
classification problem
● Text prompt is used as input, i.e.,
customizable wake words
Joint Ego-Noise Suppression and Keyword Spotting on
Sweeping Robots (Alibaba)
● a novel approach for joint ego-noise (self-created noise) suppression and
keyword detection
● Small footprint keyword spotting (KWS) on sweeping robot, i.e., the
conversation triggering module of the audio interface
● A circular microphone array of M = 6 → Multiple minimum variance
distortionless response (MVDR) beamformers
● If the keyword is present, noise adaptation will be slowed down to prevent
keyword speech being cancelled.
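For reference, the standard MVDR weights for one look direction are w = R⁻¹d / (dᴴR⁻¹d), where R is the noise covariance and d the steering vector. The sketch below assumes a far-field source and a uniform circular array; the array radius, frequency, and random noise covariance are made-up values, and the paper's ego-noise adaptation is not modeled.

```python
import numpy as np

def circular_array_steering(n_mics, radius_m, azimuth_rad, freq_hz, c=343.0):
    """Far-field steering vector for a uniform circular array (assumed geometry)."""
    mic_angles = 2 * np.pi * np.arange(n_mics) / n_mics
    delays = -radius_m * np.cos(azimuth_rad - mic_angles) / c
    return np.exp(-2j * np.pi * freq_hz * delays)

def mvdr_weights(noise_cov, steering):
    """MVDR beamformer: minimize noise power subject to w^H d = 1."""
    r_inv_d = np.linalg.solve(noise_cov, steering)
    return r_inv_d / (steering.conj() @ r_inv_d)

rng = np.random.default_rng(0)
n_mics = 6  # M = 6 as in the slide
d = circular_array_steering(n_mics, radius_m=0.05, azimuth_rad=0.3, freq_hz=1000.0)
# Random Hermitian positive-definite matrix as a stand-in noise covariance.
a = rng.standard_normal((n_mics, n_mics)) + 1j * rng.standard_normal((n_mics, n_mics))
noise_cov = a @ a.conj().T + np.eye(n_mics)
w = mvdr_weights(noise_cov, d)
# Distortionless constraint: the response in the look direction is 1.
print(abs(np.vdot(w, d)))
```

Running several such beamformers, one per candidate direction, would give the "multiple MVDR beamformers" setup mentioned above.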
Unified Speculation, Detection, and Verification Keyword
Spotting (Alexa)
● Speculation → early decision (giving a head start, reduce system latency)
● Detection → keyword trigger task, more accurate decision
● Verification → verifies previous decision (correct mistakes)
● The proposed latency-aware max-pooling loss can effectively control the
latency-accuracy trade-off.
Few-shot / Low-shot
An Adapter Based Pre-Training for Efficient and Scalable
Self-Supervised Speech Representation Learning (Huawei)
Apply adapters (B) to original w2v2 (A) to combat language forgetting.
Efficient Adapter Transfer of Self-Supervised Speech
Models for Automatic Speech Recognition (Huawei)
● Fine-tune on ASR task
● Apply adapters
Large-scale ASR Domain Adaptation by Self-and
Semi-supervised Learning (Google)
● Joint training with both RNN-T &
Self-supervised loss (wav2vec 2.0)
● Confidence Estimation Module (CEM)
→ To filter out low confidence samples in
pseudo-labels for Noisy student training
○ binary cross entropy between the estimated
confidence p and the binary target sequence c
● It utilizes Wav2vec2.0 loss on the causal
encoder, so there is no transition gap from
non-causal to causal.
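The CEM objective and the filtering step can be sketched as below. The binary cross entropy follows the slide; the utterance-level rule (mean token confidence against a 0.9 threshold) and all names are assumptions for illustration only.

```python
import numpy as np

def confidence_bce(p, c):
    """Binary cross entropy between estimated confidences p and binary targets c.

    c[i] = 1 if token i of the hypothesis is correct, else 0.
    """
    p = np.clip(np.asarray(p, dtype=float), 1e-7, 1 - 1e-7)
    c = np.asarray(c, dtype=float)
    return float(-(c * np.log(p) + (1 - c) * np.log(1 - p)).mean())

def filter_pseudo_labels(hypotheses, token_confidences, threshold=0.9):
    """Keep pseudo-labeled utterances whose mean token confidence is high enough.

    The mean-confidence rule and the threshold are illustrative assumptions.
    """
    return [
        hyp for hyp, conf in zip(hypotheses, token_confidences)
        if np.mean(conf) >= threshold
    ]

# Well-calibrated confidences give a lower loss than inverted ones.
print(confidence_bce([0.9, 0.1], [1, 0]))
print(confidence_bce([0.1, 0.9], [1, 0]))

kept = filter_pseudo_labels(
    ["turn on the tv", "play some moosic"],
    [[0.98, 0.97, 0.99, 0.95], [0.96, 0.90, 0.40]],
)
print(kept)
```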
Learning Domain-Invariant Transformation for Speaker
● Meta-learning to generate domain-invariant embeddings without pre-training
and fine-tuning
● Use both metric loss & classification loss together
Magic dust for cross-lingual adaptation of monolingual wav2vec-2.0
● Monolingual wav2vec-2.0 is a good few-shot ASR learner in several languages
○ English → 8 Target languages
○ Performance up to 86% compared to XLSR
○ ASR Fine-Tuning on English hurts other languages
● Monolingual wav2vec2 model pre-trained on a high-resource language using
moderately-sized unlabeled data and small-sized labeled data in the target
language yields performance similar to XLSR
● Dropout Uncertainty-Driven Self-Training (DUST)
○ Leverages unlabeled data by pseudo-labeling (semi-supervised)
○ Student from a previous round becomes the teacher for the next round
Audio Augmentation
Filteraugment: An acoustic environmental data
augmentation method
● FilterAugment mimics acoustic filters by applying different weights to
frequency bands, which enables the model to extract relevant information from a
wider frequency region.
● Improved version of frequency
masking which masks information
on random frequency bands.
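A rough sketch of the idea: split the frequency axis into a few random bands and add a random gain to each band of a log-mel spectrogram. The band count, the ±6 dB gain range, and the step-shaped filter are assumptions for this example and may differ from the paper's settings.

```python
import numpy as np

def filter_augment(log_mel_db, rng, n_bands=4, gain_db_range=(-6.0, 6.0)):
    """Apply a random piecewise-constant gain over frequency.

    log_mel_db: (n_freq, n_time) log-mel spectrogram in dB
    Band count and gain range are illustrative assumptions.
    """
    n_freq, _ = log_mel_db.shape
    # Random interior band edges, sorted, plus the two ends of the spectrum.
    inner = np.sort(rng.choice(np.arange(1, n_freq), size=n_bands - 1, replace=False))
    edges = np.concatenate([[0], inner, [n_freq]])
    gains = rng.uniform(*gain_db_range, size=n_bands)
    out = log_mel_db.copy()
    for lo, hi, gain in zip(edges[:-1], edges[1:], gains):
        out[lo:hi, :] += gain  # additive in dB = multiplicative weight in power
    return out

rng = np.random.default_rng(0)
spec = rng.standard_normal((64, 100))
augmented = filter_augment(spec, rng)
print(augmented.shape)
```

Unlike frequency masking, no band is zeroed out; every band is kept but re-weighted.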
Auditory-Based Data Augmentation for end-to-end
Automatic Speech Recognition
● Spectral smearing smooths the
speech spectrum and suppresses
details by broadening the
bandwidths of the auditory filters.
● Loudness recruitment compresses
amplitudes of different frequency
bands, simulates damaged ear.
Intermix: An Interference-Based Data Augmentation and
Regularization Technique for Automatic Deep Sound
● Prev work: BC learning
○ Taking sound energy into account
● Prev work: SpeechMix
○ Similar to manifold mixup,
mix intermediate representations
● This work: InterMix
○ Also apply phase shifts to inputs
& use it when mixing
Robust Speaker Verification Using Population-Based Data Augmentation
● A population-based searching strategy for optimizing the augmentation
● Instead of finding a fixed set of hyper-parameters, PBA learns a scheduler for
setting the hyper-parameters.
● List of augmentations used
○ Reverberation: Convolve with room impulse response (RIR)
○ Music: Music from a randomly selected MUSAN
○ Noise: Noise from MUSAN is added
○ Babble: Babble noise is added
○ Frequency masking
○ Time masking
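The listed augmentations can be sketched with plain NumPy. This is a simplified illustration: the function names, the SNR parameterization, and the mask widths are assumptions, and random arrays stand in for MUSAN clips and measured RIRs.

```python
import numpy as np

def add_reverb(wave, rir):
    """Reverberation: convolve with a room impulse response, keep original length."""
    return np.convolve(wave, rir)[: len(wave)]

def add_noise(wave, noise, snr_db):
    """Add noise (music, babble, ...) scaled to a target signal-to-noise ratio."""
    noise = noise[: len(wave)]
    signal_power = np.mean(wave ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return wave + scale * noise

def mask_spectrogram(spec, rng, max_freq_width=8, max_time_width=20):
    """Frequency masking followed by time masking on a (n_freq, n_time) array."""
    out = spec.copy()
    n_freq, n_time = out.shape
    f = rng.integers(0, max_freq_width + 1)
    f0 = rng.integers(0, n_freq - f + 1)
    out[f0:f0 + f, :] = 0.0
    t = rng.integers(0, max_time_width + 1)
    t0 = rng.integers(0, n_time - t + 1)
    out[:, t0:t0 + t] = 0.0
    return out

rng = np.random.default_rng(0)
wave = rng.standard_normal(16000)
rir = np.exp(-np.arange(800) / 100.0) * rng.standard_normal(800)  # toy decaying RIR
noisy = add_noise(add_reverb(wave, rir), rng.standard_normal(16000), snr_db=10.0)
masked = mask_spectrogram(np.ones((64, 100)), rng)
```

In a population-based setup, values such as `snr_db` or the mask widths would be the hyper-parameters that the learned scheduler adjusts over training.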
Various augmentations
● LPC Augment: an LPC-based ASR Data Augmentation Algorithm for Low and
Zero-Resource Children's Dialects
○ The data augmentation procedure consists of perturbing the formant peaks of the Linear
predictive coding (LPC) spectrum during LPC analysis and reconstruction.
○ Compared with SpecAugment & speed perturbation. Did not show absolute advantage.
● ImportantAug: A Data Augmentation Agent for Speech
○ Adding noise to unimportant regions of the speech and not to important regions.
○ Importance is predicted for each utterance by a data augmentation agent that is trained to
maximize the amount of noise it adds while minimizing its impact on recognition performance.
● Fraug: A Frame Rate Based Data Augmentation Method for Depression
Detection from Speech Signals
○ Changing the frame-width and the frame-shift parameters during the feature extraction process
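To make the frame-rate idea concrete, the sketch below frames the same waveform with two different frame-width / frame-shift settings, which yields two different feature sequences for one utterance. The helper name and the specific millisecond values are assumptions, not the paper's configuration.

```python
import numpy as np

def frame_signal(wave, sample_rate, width_ms, shift_ms):
    """Slice a waveform into overlapping frames of the given width and shift."""
    width = int(round(sample_rate * width_ms / 1000))
    shift = int(round(sample_rate * shift_ms / 1000))
    n_frames = 1 + (len(wave) - width) // shift
    # Row i holds samples [i * shift, i * shift + width).
    idx = np.arange(width)[None, :] + shift * np.arange(n_frames)[:, None]
    return wave[idx]

wave = np.random.default_rng(0).standard_normal(16000)  # 1 s at 16 kHz
baseline = frame_signal(wave, 16000, width_ms=25, shift_ms=10)
augmented = frame_signal(wave, 16000, width_ms=20, shift_ms=8)
print(baseline.shape, augmented.shape)
```

Features computed per frame (e.g. a spectrum) would then differ in both time resolution and sequence length between the two views.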
Task-specific Augmentations
● Cross-speaker style transfer for text-to-speech using data augmentation
○ Cross-speaker style transfer for TTS using data augmentation via voice conversion
● Spatial mixup: Directional loudness modification as data augmentation for
sound event localization and detection
○ application of parametric spatial audio effects for data augmentation, which modifies the
directional properties of a multi-channel spatial audio signal encoded in the ambisonics domain.
● Spatial Data Augmentation with Simulated Room Impulse Responses for Sound
Event Localization and Detection
○ Augments spatial characteristics using simulated room impulse responses (RIR). simulated RIRs
are convolved with the source signals to obtain an augmented multi-channel training dataset.
● Distribution augmentation for low-resource expressive text-to-speech
○ Data augmentation through word permutations & Constituency parse based tree substitutions
Federated Learning
Federated learning challenges and opportunities: An
outlook (Amazon Alexa)
● Finding the lower limit of the number of communication rounds
○ Many local updates (for communication efficiency) can still converge to a desirable model.
○ Overly aggressive local updates will harm the performance due to the data heterogeneity
● Constraints
○ Memory constraint (each on-device model needs to be small in size)
○ Computation constraint (devices may perform only a limited number of gradient updates)
● Personalized FL
○ Conventional FL trains one model, personalized FL maintains a collection of client-specific
○ Will reduce test errors beyond what is possible with a single global model.
● Challenges of Lifelong FL
○ Online updates with single-pass data
○ Coupling of model training and data generation.
● Challenges on data
○ Data polarity (collected data does not represent the whole data distribution)
○ Data dependency (data are collected from time series with inevitable dependency)
Learnings from Federated Learning in the Real world (Alexa)
● Skewness: “heavy devices” with large amounts of data while there are many
“light users” with only a handful of data points.
● Non-uniform device selection outperforms uniform sampling of FL where it
utilizes the number of input points per device.
● We compare one-shot FL (uses the full range of data, single training) with continual
FL (avoids storing data, multiple training rounds). We show that continual FL
outperforms the one-shot strategy in some settings, and is overall most
beneficial for heavy devices.
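One way to picture non-uniform device selection is to sample devices with probability proportional to their number of data points. This proportional rule is an assumption made for illustration; the slide only says that selection uses the number of input points per device.

```python
import numpy as np

def sample_devices(num_points, n_selected, rng, uniform=False):
    """Pick devices for one FL round, optionally weighted by local data size."""
    num_points = np.asarray(num_points, dtype=float)
    if uniform:
        probs = np.full(len(num_points), 1.0 / len(num_points))
    else:
        probs = num_points / num_points.sum()  # heavy devices are picked more often
    return rng.choice(len(num_points), size=n_selected, replace=False, p=probs)

rng = np.random.default_rng(0)
points_per_device = [1000] + [5] * 9  # one heavy device, nine light ones
print(sample_devices(points_per_device, 3, rng))
print(sample_devices(points_per_device, 3, rng, uniform=True))
```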
Enabling on-device training of speech recognition models
with federated dropout (Google)
● Communication/computation costs are strongly correlated with the size of the
model being trained. We propose using federated dropout to reduce the size of
client models while training a full-size model server-side.
● Furthermore, we find that federated dropout gives
smaller sub-models a lower WER, making it
easier to dynamically adjust the model size.
● We use a realistic setting for federated training
of ASR models, wherein a well trained server-side
model is adapted to a new domain with FL on
edge devices.
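The core mechanic of federated dropout, as commonly described, is that the server sends each client a sub-model containing only a subset of hidden units and writes the updated slice back afterwards. The sketch below shows this for a single hidden layer; the function names, the keep fraction, and the fake client update are hypothetical and much simpler than an ASR model.

```python
import numpy as np

def extract_submodel(w_in, w_out, keep_frac, rng):
    """Select a random subset of hidden units and slice both weight matrices."""
    hidden = w_in.shape[1]
    n_keep = max(1, int(round(hidden * keep_frac)))
    kept = np.sort(rng.choice(hidden, size=n_keep, replace=False))
    return kept, w_in[:, kept].copy(), w_out[kept, :].copy()

def merge_submodel(w_in, w_out, kept, sub_in, sub_out):
    """Write the client's updated sub-model back into the full server model."""
    w_in, w_out = w_in.copy(), w_out.copy()
    w_in[:, kept] = sub_in
    w_out[kept, :] = sub_out
    return w_in, w_out

rng = np.random.default_rng(0)
w_in = rng.standard_normal((10, 32))   # input -> hidden
w_out = rng.standard_normal((32, 5))   # hidden -> output
kept, sub_in, sub_out = extract_submodel(w_in, w_out, keep_frac=0.5, rng=rng)
# Stand-in for local training on the client: nudge the smaller matrices.
sub_in, sub_out = sub_in - 0.01, sub_out - 0.01
new_in, new_out = merge_submodel(w_in, w_out, kept, sub_in, sub_out)
print(sub_in.shape, sub_out.shape)
```

The client only stores and trains the sliced matrices, which is where the communication and computation savings come from.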
Federated Self-supervised Learning
● Federated Self-Training for Data-Efficient Audio Recognition (Philips Research)
○ Self-training approach to exploit large-scale on-device unlabeled data to improve the
generalization of audio recognition models
○ Generate pseudo labels & train with softened labels
● Federated Self-Supervised Learning for Acoustic Event Classification (Amazon)
○ Applying FL to improve acoustic event classification (AEC) performance while no customer data
can be directly uploaded to the server
○ No pseudo labels (Common in AEC)
○ Solve the task of predicting the future
audio frame via feature representation

More Related Content

What's hot

Trends of ICASSP 2022

  • 2. Before we start ● ICASSP? ○ De-facto conference for audio/speech (with InterSpeech) ● 2022 ICASSP Stats ○ 3967 papers submitted, 1785 (45.0%) accepted ○ About 100 papers skimmed ● Slides ○ General topics & Service-related topics ○ Can’t go through everything, will try to finish quickly
  • 3. Contents 1. General Trend 1. 2022 SotA 2. Contrastive / Self-supervised 3. Security 4. Post-COVID Teleconferencing 5. Applications 2. Topics related to our tasks 1. Multilingualism / Cross-lingualism 2. Keyword Spotting 3. Few-shot / Low-shot 4. Audio Augmentation 5. Federated Learning
  • 6. General Models ● Wav2vec (Facebook) ○ Raw audio + 1D Convs + BERT + CTC Loss (Similar to MLM, feature distractors) ● Wav2vec 2.0 (Facebook) ○ Raw audio + 1D Convs + BERT + Codebooks + Contrastive loss (quantized distractors) ● HuBERT (Facebook) ○ Raw audio + 1D Convs + BERT + MFCC-based Clustering + Cluster Ensembles ● BigSSL/CAP12 (Google) ○ Spectrogram + 1D Convs + Conformers + Contrastive loss (quantized distractors) ● Data2vec (Facebook) ○ Raw audio + 1D Convs + BERT + Student-teacher (EMA) prediction ● WavLM (Microsoft) ○ HuBERT + Speech denoising + Gated relative position bias
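The student-teacher (EMA) prediction listed for Data2vec relies on a teacher whose weights are an exponential moving average of the student's. A minimal sketch of that update follows; it is not Data2vec's actual code, and the function name `ema_update`, the toy weight lists, and the decay value `tau` are assumptions made for illustration.

```python
def ema_update(teacher, student, tau=0.999):
    """Return new teacher weights as an exponential moving average.

    teacher, student: lists of floats standing in for model parameters.
    tau: decay rate (hypothetical value); closer to 1 means a slower teacher.
    """
    return [tau * t + (1.0 - tau) * s for t, s in zip(teacher, student)]


teacher = [0.0, 0.0]
student = [1.0, -1.0]
for _ in range(3):
    # The student would normally change through gradient descent here;
    # it is held fixed so the teacher's drift toward it is easy to see.
    teacher = ema_update(teacher, student, tau=0.5)
print(teacher)
```

With a decay close to 1 the teacher changes slowly, which is what makes its outputs a stable prediction target for the student.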
  • 7. GLUE-like Benchmarks ● SUPERB (NTU) ○ Speech processing Universal PERformance Benchmark ○ Recognition: Phoneme Recognition, Automatic Speech Recognition ○ Detection: Keyword Spotting, Query by Example Spoken Term Detection ○ Semantics: Intent Classification, Slot Filling, Speech Translation ○ Speaker: Speaker Identification, Automatic Speaker Verification, Speaker Diarization ○ Paralinguistics: Emotion Recognition ○ Generation: Speech enhancement, Speech Separation ● NOSS (Google) ○ NOn-Semantic Speech Benchmark ○ Speaker identification, Language identification, Command, Emotion, Dementia/healthy ● HARES (DeepMind) ○ Holistic audio representation evaluation suite ○ Environment: Audio tagging, Animal/Scene classification ○ Speech: Keyword Spotting, Intention Classification, Language identification, Speaker identification ○ Music: Instrument Identification, Pitch estimation, Music tagging
  • 8. 1.2 Contrastive / Self-supervised
  • 9. Towards Learning Universal Audio Representations (DeepMind) ● HARES: New GLUE-like benchmark ● SlowFast (from Video) + NFNet (from Vision) appears to perform well. ○ SlowFast: Two branches with bigger/smaller kernel width ○ NFNet: Normalizer-Free ResNets ● CPC (contrastive learning) works quite well.
  • 10. Universal paralinguistic speech representations using self-supervised conformers (Google) ● Contrastive learning (of w2v2) on Conformers ○ Follow-up work has already distilled this model (TRILLsson) ● Closely follows previous work, BigSSL (Google) ○ Trained with speech-heavy YouTube videos ○ Their conclusion: SSL + Large Models are especially helpful for small datasets ● Best performance didn’t come from the final layer’s feature vector. (Same conclusion as BigSSL) → CAP12 (12th layer feature outputs)
  • 11. A Noise-Robust Self-supervised Pre-training Model Based Speech Representation Learning for Automatic Speech Recognition ● Making the w2v feature encoder robust to additional noise via contrastive loss
  • 12. Wav2vec-switch: Contrastive learning from original-noisy speech pairs for robust speech recognition (Microsoft) ● Makes the contextualized representation robust to noise ● Figure label: original w2v2 loss
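The contrastive objective that slides 11 and 12 build on can be sketched as follows. This is a simplified illustration under stated assumptions, not the papers' code: the function names, the temperature value, and the toy 2-D vectors are hypothetical, and `switched_loss` assumes that "switch" means the quantized targets of the original and noisy utterances are swapped as extra prediction targets. Shared distractors and equal weighting of the terms are further simplifications.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def contrastive_loss(context, target, distractors, temperature=0.1):
    """InfoNCE-style loss: identify the true target among distractors."""
    logits = [cosine(context, target) / temperature]
    logits += [cosine(context, d) / temperature for d in distractors]
    log_denominator = math.log(sum(math.exp(x) for x in logits))
    return log_denominator - logits[0]


def switched_loss(c_clean, q_clean, c_noisy, q_noisy, distractors):
    """Original contrastive terms plus terms with the targets swapped."""
    original = (contrastive_loss(c_clean, q_clean, distractors)
                + contrastive_loss(c_noisy, q_noisy, distractors))
    # Swapped targets: the clean context predicts the noisy target and
    # vice versa, which pushes both views toward the same representation.
    switched = (contrastive_loss(c_clean, q_noisy, distractors)
                + contrastive_loss(c_noisy, q_clean, distractors))
    return original + switched


distractors = [[0.0, 1.0], [-1.0, 0.0]]
aligned = contrastive_loss([1.0, 0.0], [1.0, 0.0], distractors)
misaligned = contrastive_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
```

The loss is small when the context vector points at its true target and large when a distractor is the better match, which is the behavior the noise-robust variants exploit.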
  • 13. DistilHuBERT: Speech representation learning by layer-wise distillation of hidden-unit BERT (NTU)
  • 14. Improving Self-Supervised Learning for Speech Recognition with Intermediate Layer Supervision (Microsoft) ● The common practice in SSL, as in wav2vec 2.0 and HuBERT, is to compute the self-supervised loss on the top layer. ● However, the lower layers of such a pre-trained model are shown to have a low correlation with phonetic information. ● In this work, we propose to apply intermediate layer supervision to encourage lower layers to learn content knowledge → Apply the same HuBERT loss to lower layers as well
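The idea on slide 14 amounts to summing the same self-supervised loss over several layers instead of only the top one. A minimal sketch, assuming the per-layer losses are already computed; the function name `ils_loss`, the layer indices, and the loss values are placeholders rather than values from the paper.

```python
def ils_loss(per_layer_losses, supervised_layers):
    """Sum the self-supervised loss over the chosen layers.

    per_layer_losses: dict mapping layer index -> loss computed from that
        layer's output (stand-in numbers in this sketch).
    supervised_layers: layer indices that receive supervision; the top
        layer is normally among them.
    """
    return sum(per_layer_losses[i] for i in supervised_layers)


# Hypothetical per-layer losses for a 12-layer encoder.
losses = {4: 2.0, 8: 1.5, 12: 1.0}
top_only = ils_loss(losses, [12])
with_intermediate = ils_loss(losses, [4, 8, 12])
print(top_only, with_intermediate)
```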
  • 15. Exploring Heterogeneous Characteristics of Layers in ASR Models for More Efficient Training (Google) ● Based on “Are All Layers Created Equal?” (Bengio) ○ Replace one intermediate layer’s trained weights with other values and measure the performance drop ○ Re-initialization: reset the layer to its initial values ○ Re-randomization: draw new random weights for the layer ● Ambient layers (those that tolerate this replacement with little performance loss) were present in all model sizes, but larger, i.e., overparameterized, models had more of them. ● During early rounds, the ambient layers were more spread throughout the model; only later did the separation become more distinct. ● GN was more robust to re-randomization than BN.
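The two probes named on slide 15 can be sketched on a toy model. This is only an illustration: weights are plain lists of floats, "training" is faked by shifting every weight, and the names `make_weights` and `perturb_layer` are hypothetical. In a real experiment each probed model would then be evaluated, and a layer whose replacement barely changes the score would count as ambient.

```python
import random


def make_weights(rng, n_layers=3, width=4):
    """Draw random weights for a toy model with n_layers layers."""
    return [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(n_layers)]


def perturb_layer(trained, initial, layer, mode, rng):
    """Return a copy of `trained` with one layer replaced.

    mode="reinit": restore that layer's values from the start of training.
    mode="rerandom": draw fresh random values for that layer.
    """
    probed = [list(w) for w in trained]
    if mode == "reinit":
        probed[layer] = list(initial[layer])
    elif mode == "rerandom":
        probed[layer] = [rng.gauss(0.0, 1.0) for _ in initial[layer]]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return probed


rng = random.Random(0)
initial = make_weights(rng)
# Stand-in for training: shift every weight by a constant.
trained = [[w + 1.0 for w in layer] for layer in initial]

reinit = perturb_layer(trained, initial, layer=1, mode="reinit", rng=rng)
rerandom = perturb_layer(trained, initial, layer=1, mode="rerandom", rng=rng)
```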
  • 16. Investigation of Robustness of Hubert Features from Different Layers to Domain, Accent and Language Variations ● Our experiments indicate that as domain, accent, bandwidth and language deviate from the source domain, the relative improvement decreases. ● The last layer of HuBERT is very specific to the dataset on which it is trained. The second-to-last layer seems to be better when there are domain and accent differences. ● Middle layers are more suited when data is from a different language.
  • 17. Don't speak too fast: The impact of data bias on self-supervised speech models (NTU) ● Uses the SUPERB benchmark while varying the gender, content, and speech speed of the pre-training datasets ● Gender → Adding a few minority-class samples mitigates the performance drop ● Content → The model was insensitive to perplexity ● Speech speed → Faster speech is worse
  • 19. Speech anonymization (Emmanuel Vincent) ● Speech information ○ Verbal content (identifiers, private info, etc) ○ Speaker (identity, gender, age, ethnic origin, etc) ○ Nonverbal content (emotion, health, etc) ○ Acoustic environment (acoustics, other speakers, etc) ● Risks ○ User profiling, user identification, voice cloning, information leakage ● Methods ○ Embedded systems, Cryptography, Obfuscation, Anonymization, Federated Learning, etc ○ Simple modifications (ex. Pitch shifts) utterly fail for knowledgeable attackers ● Current speech anonymization challenge != Legal definition ○ It seems that many big companies do not anonymize speech (collected from various sources) ○ Task: (1) ASR (2) Emotion recognition
  • 20. Preserving Trajectory Privacy in Driving Data Release ● What comes with the innovative services provided by intelligent transport systems (ITS) are potential privacy attacks. ● For example, in traffic monitoring systems, individual users send anonymized personal location traces continuously to aid in traffic state estimation. ● However, an adversary may link an anonymous GPS trace to a particular person provided additional knowledge of the person’s residence or working location. ● This can not be achieved by data encryption or hiding the driver identity. We resort to the notion of inference privacy that sanitizes raw data to limit the amount of contained private information.
  • 21. Audio Deepfake Detection 2022: the First Audio Deep Synthesis Detection Challenge ● Low-quality fake audio detection: focuses on dealing with bona fide and fully fake utterances with various real-world noises etc ○ Fully generated utterances ● Partially fake audio detection: distinguish the partially fake audio from the real ○ Generated by manipulating the genuine utterances ● Audio fake game: Solve both an audio generation task and an audio fake detection task
  • 22. Aasist: Audio anti-spoofing using integrated spectro-temporal graph attention networks (Naver) ● Spoofing detection solutions can be an important consideration when automatic speaker verification systems are deployed in real-world applications. ● Two major scenarios: ○ Logical access (LA): spoofing attacks mounted with voice conversion and TTS ○ Physical access (PA): bona fide utterances are captured and then replayed ● Recent studies show that discriminative information (i.e., spoofing artefacts) can reside in specific temporal and spectral intervals
  • 23. Characterizing the adversarial vulnerability of speech self-supervised learning (Helen Meng) ● Speech processing Universal PERformance Benchmark (SUPERB) ○ Upstream model (self-supervised models) + Downstream models (directly uses features, ex. finetuning) ● Adversarial Attacks ○ Limited-knowledge adversaries: Attackers can access the internals of the target model (parameters and gradients). But they do not know which downstream task will be conducted. ○ Zero-knowledge adversaries: Target model is unavailable to the attackers. In such a case, the substitute model is used for approximating gradients for adversarial sample generation. ○ XAB listening test: check if humans can distinguish adversarial samples ● Results: Attacks are effective, humans cannot easily distinguish.
  • 24. Adversarial Sample Detection for Speaker Verification by Neural Vocoders (Tencent) ● Automatic speaker verification (ASV), one of the most important technologies for biometric identification, has been widely adopted in security-critical applications. ● However, ASV is seriously vulnerable to recently emerged adversarial attacks, yet effective countermeasures against them are limited.
  • 25. Source Mixing and Separation Robust Audio Steganography (Sony) ● Audio steganography is the science of concealing secret messages inside a host audio called a carrier in such a way that the concealment is unnoticeable to human ears. ● Recently, deep neural networks (DNNs) have been used as a steganographic function for hiding data inside images to achieve high capacity. ● The network learns to conceal a hidden message inside the carrier without manually specifying a particular redundancy to exploit. PixInWav: Residual Steganography for Hiding Pixels in Audio
  • 26. Exploiting language model for efficient linguistic steganalysis ● Linguistic steganography (LS) ○ Natural language is actually quite suitable for steganography. ○ The advantage is that LS can be easily concealed by the huge number of social activities. ○ (1) modification based and (2) generation based ○ Latter allows more data to be embedded ● Steganalysis = to detect whether there is secret data embedded in the media ● Significant difference between automatically generated stego texts and carrier texts in terms of the conditional probability distribution of individual words.
  • 28. Acoustic Echo Cancellation ● Acoustic echo refers to the phenomenon that occurs when a microphone picks up the far-end signal that is played by a loudspeaker. ● This phenomenon can cause a slight annoyance or a significant breakdown in a communication system. ● ICASSP 2022 AEC Challenge by Microsoft ● Various scenarios ○ Long or varying delays ○ Strong speaker/mic distortions ○ Stationary/non-stationary noise ○ Glitches (due to high CPU usage) ○ etc.
  • 29. Deep Noise Suppression ● Audio calls in the presence of background noises get significantly degraded in terms of quality/intelligibility of the perceived speech. ● ICASSP 2022 Deep Noise Suppression Challenge by Microsoft
  • 30. Multi-Channel Multi-Party Meeting Transcription ● Speaker Diarization ○ Partitioning an input audio stream into homogeneous segments according to the speaker identity, i.e. "who spoke when?” ● Multi-speaker ASR ○ Hard to do overlapped speech recognition due to the interfering speakers or background noise ● ICASSP 2022 M2MeT Challenge by Alibaba
  • 31. VarArray: Array-geometry-agnostic continuous speech separation (Microsoft) ● Continuous speech separation using a microphone array was shown to be promising in dealing with the speech overlap problem. ● Signals highly depend on the position of the microphones. ● In meetings, we can assume only two or fewer speakers to be active for the majority of the meeting time.
  • 32. Multimodal Systems ● Audio-Visual Object Classification For Human-Robot Collaboration ● Multimodal Information Based Speech Processing ● Machine Translation for Spoken and Written Language ● Image and Video Understanding ● Multimodal Signal Processing, Analysis, and Synthesis ● Audio Security and Multi-Modal Systems ● Multi-modal Analysis and Synthesis ● Multimodal Data Fusion and Processing ● Multimodal Analysis in Audio Applications
  • 34. Emotion Recognition ● Speech emotion recognition using self-supervised features ○ A modular End-to-End SER system based on an Upstream + Downstream architecture paradigm, which allows easy use/integration of a large variety of self-supervised features. ● Memobert: Pre-training model with prompt-based learning for multimodal emotion recognition ○ learns multimodal joint representations through self-supervised learning ○ prompt-based method that reformulates emotion classification as a masked text prediction ● Multimodal Emotion Recognition with Surgical and Fabric Masks ○ investigate how muffled speech and occluded facial expressions change the prediction of emotions
  • 35. Speech as a Disease Biomarker ● Fraug: A Frame Rate Based Data Augmentation Method for Depression Detection from Speech Signals ○ Among others, the speech signal is an important biomarker of our mental state and can be collected remotely, in a non-invasive manner with no expert supervision. ○ Recently, speech-based automatic diagnosis of depression has gained significant momentum. ● Exploring Dementia Detection from Speech: Cross Corpus Analysis ○ Population aging is responsible for an increase of new Alzheimer’s disease (AD) cases, and creates the need for scalable, cost-effective methods that are able to detect early stage AD. ○ Speech and language biomarkers are strong indicators of dementia, and provide a low-cost and widespread alternative for the assessment of cognitive states. ● The Second Dicova Challenge: Dataset and Performance Analysis for Diagnosis of Covid-19 Using Acoustics ○ Dataset of audio recordings consisting of breathing, cough and speech signals ○ Providing a point-of-care, rapid, easy to use, and cost-effective tool to help contain COVID-19 spread.
  • 36. Voice Conversion ● Robust disentangled variational speech representation learning for zero-shot voice conversion (Tencent) ○ Feeding an arbitrary speaker embedding and content embeddings to the VAE decoder ● Controllable Speech Representation Learning Via Voice Conversion and AIC Loss (Adobe) ○ Its disentangled components (content, pitch, speaker identity, and energy) can be controlled independently to alter the synthesis result. ● An Investigation of Streaming Non-Autoregressive sequence-to-sequence Voice Conversion ● Voice Filter: Few-shot text-to-speech speaker adaptation using voice conversion as a post-processing module (Amazon) ○ It uses voice conversion (VC) as a post-processing module appended to a pre-existing high-quality TTS system, framing the few-shot TTS problem as a VC task.
  • 37. Music Applications ● HiFi-SVC: Fast High Fidelity Cross-Domain Singing Voice Conversion ● Music Enhancement via Image Translation and Vocoding (Adobe) ● Source Separation By Steering Pretrained Music Models ● MELONS: generating melody with long-term structure using transformers and structure graph ● Genre-Conditioned Long-Term 3D Dance Generation Driven by Music ● Deep Performer: Score-to-Audio Music Performance Synthesis (Dolby) ● SleepGAN: Towards Personalized Sleep Therapy Music (Nokia) ● Modeling beats and downbeats with a time-frequency Transformer (ByteDance)
  • 38. Quantum Machine Learning ● Languages: Google Cirq / Microsoft Q# / IBM Qiskit ● Services: Google Quantum AI / Azure Quantum / IBM Quantum ● The dawn of quantum natural language processing ○ We successfully train a quantum-enhanced Long Short-Term Memory network to perform the parts-of-speech tagging task via numerical simulations. ○ Practical applications are more likely to be a hybrid of classical and quantum operations. This hybrid approach is not too different from what has been done in the past decade with GPUs. ○ The main idea behind Quantum Machine Learning (QML) is to replace parts of a neural network (e.g. linear layers) with a quantum counterpart. ● Quantum federated learning with quantum data ○ Hybrid models fall short when dealing with the highly complex purely quantum data. ○ Thus, purely quantum ML models that can address these challenges were developed, such as quantum neural networks (QNNs). ○ However, due to the fragile nature of the carriers of quantum data, i.e., qubits, there is a natural need for distributed learning solutions such as federated learning (FL).
  • 39. Machine Learning is All You Need ● Audio Representations ○ Learnable Wavelet Packet Transform for Data-Adapted Spectrograms ● Encodings ○ A Low-Parametric Model for Bit-Rate Estimation of VVC Residual Coding ○ Low-Complexity Multi-Model CNN in-Loop Filter for AVS3 ● Digital Signal Processing ○ Learning Structured Sparsity For Time-Frequency Reconstruction ○ Learning Approach For Fast Approximate Matrix Factorizations ● Communication Systems ○ Adaptive Wireless Power Allocation with Graph Neural Networks ○ Deep Joint Source-Channel Coding for Wireless Image Transmission with Adaptive Rate Control ● Beamforming ○ Deep learning for location based beamforming with NLOS channels ○ Phase-Only Reconfigurable Sparse Array Beamforming Using Deep Learning
  • 40. II. Topics related to our tasks
  • 42. Joint Unsupervised and Supervised Training for Multilingual ASR (Google) ● Most existing methods adopt a 2-stage scheme where the self-supervised loss is optimized in the first pretraining stage, and the standard supervised fine-tuning resumes in the second stage. ● In this paper, we propose an end-to-end (E2E) Joint Unsupervised and Supervised Training (JUST) method to combine the supervised loss and the self-supervised contrastive and masked language modeling (MLM) losses. ● Spectrogram + Quantizer (wav2vec 2.0) + RNN-T ● Wins over XLSR-53!
  • 43. Pseudo-Labeling for Massively Multilingual Speech Recognition (Facebook) ● Prev works (from Facebook, similar authors) ○ Iterative Pseudo-Labeling for Speech Recognition (IPL) → LM + beam search to generate pseudo labels ○ slimIPL: Language-model-free iterative pseudo-labeling (slimIPL) → Use self-predictions ● Utilizing unlabeled data is helpful, even with trivial methods.
  • 44. Multilingual Text-To-Speech Training Using Cross Language Voice Conversion And Self-Supervised Learning Of Speech Representations (Facebook) ● It’s hard to find speakers who have native proficiency in several languages. ● Using HifiGAN-like model to augment data (Synthetic generation of target speaker speaking different language)
  • 45. A Configurable Multilingual Model is All You Need to Recognize All Languages (Microsoft) ● Configurable multilingual model (CMM) to recognize speech from any combination of languages based on a multi-hot LID vector selected by users ● Language-specific vocabulary strategy (making vocab smaller) ● Language-specific transformer cell (one per language)
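A small sketch of the user-selected multi-hot LID vector and of how it could shrink the active vocabulary. The language list, the toy vocabularies, and the function names are all hypothetical, and taking the union of the selected languages' vocabularies is an assumption about what the language-specific vocabulary strategy does, not something the slide states.

```python
LANGUAGES = ["en", "de", "fr"]  # hypothetical language inventory
VOCAB_BY_LANG = {               # hypothetical per-language token sets
    "en": {"the", "cat"},
    "de": {"die", "katze"},
    "fr": {"le", "chat"},
}

def multi_hot_lid(selected, languages=LANGUAGES):
    """Encode the user's language selection as a multi-hot vector."""
    unknown = set(selected) - set(languages)
    if unknown:
        raise ValueError(f"unsupported languages: {sorted(unknown)}")
    return [1 if lang in selected else 0 for lang in languages]

def active_vocabulary(lid, languages=LANGUAGES, vocab_by_lang=VOCAB_BY_LANG):
    """Union of the vocabularies of the languages switched on in `lid`."""
    active = set()
    for flag, lang in zip(lid, languages):
        if flag:
            active |= vocab_by_lang[lang]
    return active

lid = multi_hot_lid({"en", "fr"})
vocab = active_vocabulary(lid)
```

With English and French selected, the German tokens never enter the output vocabulary, which is the sense in which the vocabulary gets smaller.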
  • 46. Zero-Shot Cross-Lingual Transfer Using Multi-Stream Encoder and Efficient Speaker Representation (Tencent) ● Extract speaker embedding features that are independent of both content information and language identity. ● Multi-stream = Input text sequences are fed into N-stream text encoders in parallel ● zero-shot cross-lingual transfer strategy = fine-tune also with target-lingual data + language-balanced sampling strategy
  • 47. Tackling data scarcity in speech translation using zero-shot multilingual machine translation techniques ● To tackle data scarcity, it is useful to make use of ASR and MT data for end-to-end ST models. We explore techniques from zero-shot multilingual text translation and apply them to speech side. ● Use tokens & augmentation methods to make the model decide output language based on language tokens.
  • 48. Multi-Lingual Multi-Task Speech Emotion Recognition Using wav2vec 2.0 ● Multi-task learning to increase emotion recognition performance ● Additional tasks ○ Gender Prediction (Ge) ○ Language Prediction (La) ○ F0 mean and standard deviation regression task (F0-me, F0-st) ○ Energy mean and standard deviation regression task (En–me, En-st) ○ Voice ratio regression task (Vr)
  • 49. ADIMA: Abuse Detection In Multilingual Audio ● ADIMA, a novel, linguistically diverse, ethically sourced, expert annotated and well-balanced multilingual abuse detection audio dataset comprising 11,775 audio samples in 10 Indic languages spanning 65 hours and spoken by 6,446 unique users.
  • 50. SERAB: A multi-lingual benchmark for speech emotion recognition ● Speech Emotion Recognition Adaptation Benchmark (SERAB), a framework for evaluating the performance and generalization capacity of different approaches for utterance-level SER.
  • 52. Why still keyword spotting? ● For many voice-enabled platforms, queries follow a highly Zipfian distribution. On the Comcast X1 entertainment system, for example, the top-20 commands constitute around 30% of the traffic. ● Using an ASR system is excessive for targeting phonetically distinct commands with a small vocabulary. ● Audio-only based wake word spotting (WWS), a special case of KWS, is challenging under noisy conditions due to the environmental interference.
  • 53. Temporal early exiting for streaming speech commands recognition (Comcast) ● Additionally add prediction heads, stop inference mid-way based on entropy.
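Entropy-based early exiting can be sketched as follows, assuming one prediction head emits logits at each time step and inference stops at the first step whose softmax entropy falls below a threshold. The function names and the threshold values are illustrative assumptions, not the paper's settings.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; low entropy means a confident prediction."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def early_exit(per_step_logits, threshold):
    """Return (exit_step, predicted_class).

    Stops at the first step whose entropy is below `threshold`; if no step
    qualifies, the prediction of the last step is returned.
    """
    probs = None
    step = -1
    for step, logits in enumerate(per_step_logits):
        probs = softmax(logits)
        if entropy(probs) < threshold:
            break
    if probs is None:
        raise ValueError("per_step_logits is empty")
    predicted = max(range(len(probs)), key=probs.__getitem__)
    return step, predicted

# Logits become more peaked as more audio is consumed.
stream = [[0.1, 0.0, 0.2], [1.0, 0.0, 0.5], [6.0, 0.0, 0.5]]
result = early_exit(stream, threshold=0.3)
```

A looser threshold exits earlier at the cost of accuracy, which is the latency/accuracy trade-off the extra heads expose.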
  • 54. A Study of Designing Compact Audio-Visual Wake Word Spotting System Based on Iterative Fine-Tuning in Neural Network Pruning ● Audio-visual keyword spotting ● Using both is helpful
  • 55. Text Adaptive Detection for Customizable Keyword Spotting ● Novel text adaptive detection framework to directly formulate KWS as a detection rather than a classification problem ● Text prompt is used as input, i.e., customizable wake words
  • 56. Joint Ego-Noise Suppression and Keyword Spotting on Sweeping Robots (Alibaba) ● a novel approach for joint ego-noise (self-created noise) suppression and keyword detection ● Small footprint keyword spotting (KWS) on sweeping robot, i.e., the conversation triggering module of the audio interface ● A circular microphone array of M = 6 → Multiple minimum variance distortionless response (MVDR) beamformers ● If the keyword is present, noise adaptation will be slowed down to prevent keyword speech being cancelled.
  • 57. Unified Speculation, Detection, and Verification Keyword Spotting (Alexa) ● Speculation → early decision (giving a head start, reduce system latency) ● Detection → keyword trigger task, more accurate decision ● Verification → verifies previous decision (correct mistakes) ● The proposed latency-aware max-pooling loss can control latency accuracy trade-off effectively.
  • 59. An Adapter Based Pre-Training for Efficient and Scalable Self-Supervised Speech Representation Learning (Huawei) Apply adapters (B) to original w2v2 (A) to combat language forgetting.
  • 60. Efficient Adapter Transfer of Self-Supervised Speech Models for Automatic Speech Recognition (Huawei) ● Fine-tune on ASR task ● Apply adapters
  • 61. Large-scale ASR Domain Adaptation by Self-and Semi-supervised Learning (Google) ● Joint training with both RNN-T & Self-supervised loss (wav2vec 2.0) ● Confidence Estimation Module (CEM) → To filter out low confidence samples in pseudo-labels for Noisy student training ○ binary cross entropy between the estimated confidence p and the binary target sequence c ● It utilizes Wav2vec2.0 loss on the causal encoder, so there is no transition gap from non-causal to causal.
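The confidence estimation loss and the filtering step can be sketched as below. The binary cross entropy follows the slide (estimated confidence p against binary target c); averaging token confidences into an utterance score and the 0.8 threshold are assumptions made only for illustration.

```python
import math

def binary_cross_entropy(p, c, eps=1e-7):
    """Mean BCE between estimated confidences p and binary targets c."""
    total = 0.0
    for p_i, c_i in zip(p, c):
        p_i = min(max(p_i, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= c_i * math.log(p_i) + (1 - c_i) * math.log(1.0 - p_i)
    return total / len(p)

def filter_pseudo_labels(utterances, min_confidence):
    """Keep pseudo-labeled utterances whose mean token confidence is high enough.

    `utterances` is a list of (text, token_confidences) pairs; using the mean
    as the utterance-level score is an assumption of this sketch.
    """
    kept = []
    for text, confidences in utterances:
        if sum(confidences) / len(confidences) >= min_confidence:
            kept.append(text)
    return kept

loss = binary_cross_entropy([0.9, 0.2, 0.7], [1, 0, 1])
pseudo = [("play some jazz", [0.95, 0.9, 0.85]), ("pay sum chess", [0.4, 0.3, 0.5])]
kept = filter_pseudo_labels(pseudo, min_confidence=0.8)
```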
  • 62. Learning Domain-Invariant Transformation for Speaker Verification ● Meta-learning to generate domain-invariant embeddings without pre-training and fine-tuning ● Use both metric loss & classification loss together
  • 63. Magic dust for cross-lingual adaptation of monolingual wav2vec-2.0 ● Monolingual wav2vec-2.0 is a good few-shot ASR learner in several languages ○ English → 8 Target languages ○ Performance up to 86% compared to XLSR ○ ASR Fine-Tuning on English hurts other languages ● A monolingual wav2vec2 model pre-trained on a high-resource language, using moderately-sized unlabeled data and small-sized labeled data in the target language, yields performance similar to XLSR ● Dropout Uncertainty-Driven Self-Training (DUST) ○ Leverages unlabeled data by pseudo-labeling (semi-supervised) ○ Student from a previous round becomes the teacher for the next round
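The slide does not spell out how DUST decides which pseudo-labels to trust. One plausible reading, stated here as an assumption, is that the teacher decodes each utterance once without dropout and several times with dropout, and the pseudo-label is kept only if the decodings agree closely. The sketch below implements that agreement test with a word-level edit distance; the threshold `tau` and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def keep_pseudo_label(reference, dropout_hyps, tau):
    """Accept `reference` if every dropout decoding stays within `tau`.

    Distances are word-level and normalized by the reference length.
    """
    ref_words = reference.split()
    norm = max(1, len(ref_words))
    worst = max(edit_distance(ref_words, h.split()) / norm for h in dropout_hyps)
    return worst <= tau

reference = "turn on the light"
sampled = ["turn on the light", "turn on a light"]
accepted = keep_pseudo_label(reference, sampled, tau=0.3)
```

Accepted utterances would then join the training set for the next round, where the current student acts as the teacher.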
  • 65. FilterAugment: An acoustic environmental data augmentation method ● FilterAugment mimics acoustic filters by applying different weights to frequency bands, thereby enabling the model to extract relevant information from a wider frequency region. ● Improved version of frequency masking, which masks information on random frequency bands.
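A rough sketch of the idea of weighting frequency bands differently: split the frequency axis of a log-mel spectrogram into random bands and add one random gain (in dB) per band. The band count, the gain range, and the step-shaped filter are assumptions of this sketch, not the published settings.

```python
import random

def filter_augment(log_mel, n_bands=3, db_range=(-6.0, 6.0), rng=None):
    """Apply a random per-band gain to a log-mel spectrogram.

    `log_mel` is a list of frequency bins, each a list of frame values in dB.
    Adding a gain in the log domain equals scaling in the linear domain.
    """
    rng = rng or random.Random()
    n_freq = len(log_mel)
    # Random band boundaries along the frequency axis.
    cuts = sorted(rng.sample(range(1, n_freq), n_bands - 1))
    edges = [0] + cuts + [n_freq]
    augmented = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        gain = rng.uniform(*db_range)  # one gain shared by the whole band
        for f in range(lo, hi):
            augmented.append([v + gain for v in log_mel[f]])
    return augmented

spec = [[0.0] * 4 for _ in range(8)]  # 8 mel bins x 4 frames, all 0 dB
aug = filter_augment(spec, n_bands=3, rng=random.Random(0))
```

Unlike frequency masking, no band is zeroed out; every band survives with a different emphasis.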
  • 66. Auditory-Based Data Augmentation for end-to-end Automatic Speech Recognition ● Spectral smearing smooths the speech spectrum and suppresses details by broadening the bandwidths of the auditory filters. ● Loudness recruitment compresses amplitudes of different frequency bands, simulates damaged ear.
  • 67. Intermix: An Interference-Based Data Augmentation and Regularization Technique for Automatic Deep Sound Classification ● Prev work: BC learning ○ Taking sound energy into account ● Prev work: SpeechMix ○ Similar to manifold mixup, mix intermediate representations ● This work: InterMix ○ Also apply phase shifts to inputs & use it when mixing
  • 68. Robust Speaker Verification Using Population-Based Data Augmentation ● A population-based searching strategy for optimizing the augmentation parameters. ● Instead of finding a fixed set of hyper-parameters, PBA learns a scheduler for setting the hyper-parameters. ● List of augmentation used ○ Reverberation: Convolve with room impulse response (RIR) ○ Music: Music from a randomly selected MUSAN ○ Noise: Noise from MUSAN is added ○ Babble: Babble noise is added ○ Frequency masking ○ Time masking
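Three of the listed augmentations are simple enough to sketch in plain Python: reverberation as convolution with a room impulse response, and time/frequency masking on a spectrogram stored as `[frequency][time]`. The toy RIR and the mask positions are made up for illustration; in the paper these parameters would come from the learned schedule rather than being fixed.

```python
def reverberate(signal, rir):
    """Convolve a waveform with a room impulse response (full convolution)."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(rir):
            out[i + j] += s * h
    return out

def time_mask(spec, start, width):
    """Zero out `width` consecutive frames starting at `start`."""
    return [[0.0 if start <= t < start + width else v
             for t, v in enumerate(row)] for row in spec]

def frequency_mask(spec, start, width):
    """Zero out `width` consecutive frequency bins starting at `start`."""
    return [[0.0] * len(row) if start <= f < start + width else list(row)
            for f, row in enumerate(spec)]

wet = reverberate([1.0, 0.0, 0.0], [1.0, 0.5])  # direct path plus one echo
spec = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
t_masked = time_mask(spec, start=1, width=1)
f_masked = frequency_mask(spec, start=0, width=1)
```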
  • 69. Various augmentations ● LPC Augment: an LPC-based ASR Data Augmentation Algorithm for Low and Zero-Resource Children's Dialects ○ The data augmentation procedure consists of perturbing the formant peaks of the Linear predictive coding (LPC) spectrum during LPC analysis and reconstruction. ○ Compared with SpecAugment & Speed perturbation. Did not show an absolute advantage. ● ImportantAug: A Data Augmentation Agent for Speech ○ Adding noise to unimportant regions of the speech and not to important regions. ○ Importance is predicted for each utterance by a data augmentation agent that is trained to maximize the amount of noise it adds while minimizing its impact on recognition performance. ● Fraug: A Frame Rate Based Data Augmentation Method for Depression Detection from Speech Signals ○ Changing the frame-width and the frame-shift parameters during the feature extraction process
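The Fraug idea of varying frame width and frame shift at feature-extraction time can be illustrated with a plain framing function: the same utterance yields a different sequence of frames for each setting. The sample counts below are arbitrary; a real front end would specify them in milliseconds and compute features per frame.

```python
def frame_signal(signal, frame_width, frame_shift):
    """Slice a waveform into overlapping frames (no padding of the tail)."""
    last_start = len(signal) - frame_width
    return [signal[s:s + frame_width]
            for s in range(0, last_start + 1, frame_shift)]

def framing_views(signal, settings):
    """One framed view of the signal per (frame_width, frame_shift) pair."""
    return [frame_signal(signal, width, shift) for width, shift in settings]

utterance = list(range(10))  # stand-in for audio samples
views = framing_views(utterance, [(4, 2), (5, 3)])
```

Each view is treated as an extra training example with the same label as the original utterance.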
  • 70. Task-specific Augmentations ● Cross-speaker style transfer for text-to-speech using data augmentation ○ Cross-speaker style transfer for TTS using data augmentation via voice conversion ● Spatial mixup: Directional loudness modification as data augmentation for sound event localization and detection ○ Application of parametric spatial audio effects for data augmentation, which modifies the directional properties of a multi-channel spatial audio signal encoded in the ambisonics domain. ● Spatial Data Augmentation with Simulated Room Impulse Responses for Sound Event Localization and Detection ○ Augments spatial characteristics using simulated room impulse responses (RIR). Simulated RIRs are convolved with the source signals to obtain an augmented multi-channel training dataset. ● Distribution augmentation for low-resource expressive text-to-speech ○ Data augmentation through word permutations & Constituency parse based tree substitutions
  • 72. Federated learning challenges and opportunities: An outlook (Amazon Alexa) ● Finding the lower limit of the number of communication rounds ○ Many local updates (for communication efficiency) can still converge to a desirable model. ○ Overly aggressive local updates will harm the performance due to the data heterogeneity ● Constraints ○ Memory constraint (each on-device model needs to be small in size) ○ Computation constraint (devices may perform only a limited number of gradient updates) ● Personalized FL ○ Conventional FL trains one model, personalized FL maintains a collection of client-specific models ○ Will reduce test errors beyond what is possible with a single global model. ● Challenges of Lifelong FL ○ Online updates with single-pass data ○ Coupling of model training and data generation. ● Challenges on data ○ Data polarity (collected data does not represent the whole data distribution) ○ Data dependency (data are collected from time series with inevitable dependency)
  • 73. Learnings from Federated Learning in the Real world (Alexa) ● Skewness: a few “heavy devices” hold large amounts of data, while many “light users” have only a handful of data points. ● Non-uniform device selection, which uses the number of input points per device, outperforms the uniform sampling of standard FL. ● We compare one-shot FL (Uses full range of data, single training) with continual FL (Avoid storing data, multiple training rounds). We show that continual FL outperforms the one-shot strategy in some settings, and is overall most beneficial for heavy devices.
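A minimal sketch of non-uniform device selection, assuming the sampling probability is simply proportional to each device's number of data points. The slide only says the selection uses that count, so the proportional rule, the device names, and the function name are assumptions.

```python
import random

def select_devices(num_points, k, rng):
    """Sample k distinct devices, weighted by their number of data points.

    `num_points` maps device id -> data count (all counts must be positive).
    """
    remaining = dict(num_points)
    chosen = []
    for _ in range(min(k, len(remaining))):
        ids = list(remaining)
        weights = [remaining[i] for i in ids]
        pick = rng.choices(ids, weights=weights, k=1)[0]
        chosen.append(pick)
        del remaining[pick]  # sample without replacement
    return chosen

counts = {"heavy": 1000, "light_a": 1, "light_b": 1, "light_c": 1}
rng = random.Random(0)
cohort = select_devices(counts, k=2, rng=rng)
first_picks = [select_devices(counts, 1, rng)[0] for _ in range(200)]
```

Under this rule the heavy device is chosen in almost every round, whereas uniform sampling would pick it only a quarter of the time when selecting one device.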
  • 74. Enabling on-device training of speech recognition models with federated dropout (Google) ● Communication/computation costs are strongly correlated with the size of the model being trained. We propose using federated dropout to reduce the size of client models while training a full-size model server-side. ● Furthermore, we find that federated dropout gives smaller sub-models a lower WER, making it easier to dynamically adjust the model size. ● We use a realistic setting for federated training of ASR models, wherein a well trained server-side model is adapted to a new domain with FL on edge devices.
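The mechanics of federated dropout can be sketched for a single weight matrix: the server keeps the full matrix, a client receives only the rows of a random subset of units, and the trained rows are written back. This is a simplified single-client illustration with hypothetical function names; a real system would average updates from many clients and drop units consistently across connected layers.

```python
import random

def sample_kept_units(n_units, keep_fraction, rng):
    """Pick which units (rows) the client sub-model keeps."""
    k = max(1, int(round(n_units * keep_fraction)))
    return sorted(rng.sample(range(n_units), k))

def extract_submodel(weights, kept):
    """Smaller matrix sent to the client: only the kept rows."""
    return [list(weights[i]) for i in kept]

def merge_submodel(weights, kept, sub_weights):
    """Write the client's trained rows back into a copy of the full matrix."""
    merged = [list(row) for row in weights]
    for row_idx, new_row in zip(kept, sub_weights):
        merged[row_idx] = list(new_row)
    return merged

rng = random.Random(0)
server = [[0.0] * 3 for _ in range(4)]      # full-size layer: 4 units x 3 inputs
kept = sample_kept_units(4, 0.5, rng)
client = extract_submodel(server, kept)     # half-size layer on the device
client = [[w + 1.0 for w in row] for row in client]  # stand-in for local training
server_after = merge_submodel(server, kept, client)
```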
  • 75. Federated Self-supervised Learning ● Federated Self-Training for Data-Efficient Audio Recognition (Philips Research) ○ Self-training approach to exploit large-scale on-device unlabeled data to improve the generalization of audio recognition models ○ Generate pseudo labels & train with softened labels ● Federated Self-Supervised Learning for Acoustic Event Classification (Amazon) ○ Applying FL to improve acoustic event classification (AEC) performance while no customer data can be directly uploaded to the server ○ No pseudo labels (Common in AEC) ○ Solve the task of predicting the future audio frame via feature representation
  • 76. EOD