ICASSP 2022
Kwanghee Choi
Before we start
● ICASSP?
○ De-facto conference for audio/speech (with InterSpeech)
● 2022 ICASSP Stats
○ 3967 papers submitted, 1785 (45.0%) accepted
○ https://github.com/lixin4ever/Conference-Acceptance-Rate
○ About 100 papers skimmed
● Slides
○ General topics & Service-related topics
○ Can’t go through everything, will try to finish quickly
Contents
1. General Trend
1. 2022 SotA
2. Contrastive / Self-supervised
3. Security
4. Post-COVID Teleconferencing
5. Applications
2. Topics related to our tasks
1. Multilingualism / Cross-lingualism
2. Keyword Spotting
3. Few-shot / Low-shot
4. Audio Augmentation
5. Federated Learning
I. General Trend
1.1 2022 SotA
General Models
● Wav2vec (Facebook)
○ Raw audio + 1D Convs + BERT + CTC Loss (Similar to MLM, feature distractors)
● Wav2vec 2.0 (Facebook)
○ Raw audio + 1D Convs + BERT + Codebooks + Contrastive loss (quantized distractors; see the sketch after this list)
● HuBERT (Facebook)
○ Raw audio + 1D Convs + BERT + MFCC-based Clustering + Cluster Ensembles
● BigSSL/CAP12 (Google)
○ Spectrogram + 1D Convs + Conformers + Contrastive loss (quantized distractors)
● Data2vec (Facebook)
○ Raw audio + 1D Convs + BERT + Student-teacher (EMA) prediction
● WavLM (Microsoft)
○ HuBERT + Speech denoising + Gated relative position bias
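Several of the models above share a wav2vec 2.0-style contrastive objective. Below is a minimal PyTorch sketch of that loss (InfoNCE over a quantized positive plus sampled distractors); the shapes, names, and temperature are illustrative assumptions, not the original implementation.
```python
import torch
import torch.nn.functional as F

def contrastive_loss(context, quantized, distractors, temperature=0.1):
    """context:     (B, T, D) transformer outputs at masked positions
       quantized:   (B, T, D) quantized targets for those positions
       distractors: (B, T, K, D) K negatives sampled from other positions"""
    # Stack the positive target with the K distractors: (B, T, K+1, D)
    candidates = torch.cat([quantized.unsqueeze(2), distractors], dim=2)
    # Cosine similarity between the context and every candidate: (B, T, K+1)
    logits = F.cosine_similarity(context.unsqueeze(2), candidates, dim=-1) / temperature
    # The positive candidate always sits at index 0
    targets = torch.zeros(logits.shape[:-1], dtype=torch.long)
    return F.cross_entropy(logits.flatten(0, 1), targets.flatten())

# Dummy shapes just to show the call
loss = contrastive_loss(torch.randn(2, 10, 64), torch.randn(2, 10, 64),
                        torch.randn(2, 10, 5, 64))
```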
GLUE-like Benchmarks
● SUPERB (NTU)
○ Speech processing Universal PERformance Benchmark https://superbbenchmark.org/
○ Recognition: Phoneme Recognition, Automatic Speech Recognition
○ Detection: Keyword Spotting, Query by Example Spoken Term Detection
○ Semantics: Intent Classification, Slot Filling, Speech Translation
○ Speaker: Speaker Identification, Automatic Speaker Verification, Speaker Diarization
○ Paralinguistics: Emotion Recognition
○ Generation: Speech enhancement, Speech Separation
● NOSS (Google)
○ NOn-Semantic Speech Benchmark
https://ai.googleblog.com/2020/06/improving-speech-representations-and.html
○ Speaker identification, Language identification, Command, Emotion, Dementia/healthy
● HARES (DeepMind)
○ Holistic audio representation evaluation suite
○ Environment: Audio tagging, Animal/Scene classification
○ Speech: Keyword Spotting, Intention Classification, Language identification, Speaker identification
○ Music: Instrument Identification, Pitch estimation, Music tagging
1.2 Contrastive / Self-supervised
Towards Learning Universal Audio Representations (DeepMind)
● HARES: New GLUE-like benchmark
● SlowFast (from Video) + NFNet (from Vision) seem to work well.
○ SlowFast: Two branches with bigger/smaller kernel width
○ NFNet: Normalizer-Free ResNets
● CPC (contrastive learning) works quite well.
https://www.notion.so/hpcnt/Towards-Learning-Universal-Audio-Representations-ed8774b85de143c097175b3646cd84e1
Universal paralinguistic speech representations using self-supervised conformers (Google)
● Contrastive learning (of w2v2) on Conformers
○ Follow-up work has already distilled this model (TRILLsson)
○ https://ai.googleblog.com/2022/03/trillsson-small-universal-speech.html
● Closely follows the previous work BigSSL (Google)
○ Trained on speech-heavy YouTube videos
○ Their conclusion: SSL + large models are especially helpful for small datasets
● Best performance wasn’t from the final layer’s feature vector (same conclusion as BigSSL)
→ CAP12 (12th layer feature outputs, as sketched below)
https://www.notion.so/hpcnt/Universal-paralinguistic-speech-representations-using-self-supervised-conformers-d621d75b95eb4369ab34cc5237603393
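The CAP12 recipe (probe an intermediate layer rather than the last one) is easy to reproduce with any stacked SSL encoder. The sketch below uses a public wav2vec 2.0 checkpoint from Hugging Face as a stand-in for the unreleased Conformer; the checkpoint name and the mean-pooling readout are assumptions.
```python
import torch
from transformers import Wav2Vec2Model

model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
wav = torch.randn(1, 16000)  # one second of 16 kHz audio (dummy input)
with torch.no_grad():
    out = model(wav, output_hidden_states=True)
# hidden_states[0] is the conv-feature projection; entries [1:] are the
# transformer layers, so index 12 is the 12th transformer layer's output.
layer12 = out.hidden_states[12]   # (batch, frames, dim)
embedding = layer12.mean(dim=1)   # time-pooled utterance embedding for probes
```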
A Noise-Robust Self-supervised Pre-training Model Based Speech Representation Learning for Automatic Speech Recognition
● Making the w2v feature encoder robust to additional noise via a contrastive loss
https://www.notion.so/hpcnt/A-Noise-Robust-Self-supervised-Pre-training-Model-Based-Speech-Representation-Learning-for-Automatic-537ec0ccbd874303840b582db90a3a9d
Wav2vec-switch: Contrastive learning from original-noisy speech pairs for robust speech recognition (Microsoft)
● Makes the contextualized representation robust to noise (the w2v2 loss is applied to original-noisy speech pairs)
DistilHuBERT: Speech representation learning by layer-wise distillation of hidden-unit BERT (NTU)
Improving Self-Supervised Learning for Speech Recognition with Intermediate Layer Supervision (Microsoft)
● The common practice in SSL is to compute the self-supervised loss on the top layer, as in wav2vec 2.0 and HuBERT.
● However, the lower layers of such pre-trained models are shown to have low correlation with phonetic information.
● This work applies intermediate layer supervision to encourage lower layers to learn content knowledge → apply the same HuBERT objective to lower layers (see the sketch below)
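A minimal sketch of the idea, assuming HuBERT-style cluster targets and per-layer projection heads (both names are illustrative); for brevity the loss is averaged over all frames, whereas HuBERT computes it at masked positions.
```python
import torch
import torch.nn.functional as F

def multi_layer_loss(layer_outputs, proj_heads, target_ids,
                     supervised_layers=(4, 8, 12)):
    """layer_outputs: list of (B, T, D) tensors, one per transformer layer
       proj_heads:    dict {layer index: nn.Linear(D, num_clusters)}
       target_ids:    (B, T) cluster assignments (e.g., k-means on MFCCs)"""
    loss = 0.0
    for l in supervised_layers:
        logits = proj_heads[l](layer_outputs[l])  # (B, T, num_clusters)
        # Same HuBERT-style prediction loss, just applied at a lower layer
        loss = loss + F.cross_entropy(logits.transpose(1, 2), target_ids)
    return loss / len(supervised_layers)
```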
Exploring Heterogeneous Characteristics of Layers in ASR Models for More Efficient Training (Google)
● Based on “Are All Layers Created Equal?” (Bengio)
○ Fix intermediate layers’ weights to some other weights (see the sketch below)
○ Re-initialization: Come back to initial values
○ Re-randomization: Get random weights
● Ambient layers were present in all model sizes, but larger (i.e., overparameterized) models had more of them.
● During early rounds, the ambient layers were spread throughout the model; only later did the separation become more distinct.
● GN was more robust (against re-randomization) than BN.
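A sketch of the probing protocol, assuming a torch model with an `encoder.layers` list (the module path and init choices are assumptions): reset one layer, re-evaluate, and call the layer "ambient" if accuracy barely drops.
```python
import copy
import torch.nn as nn

def probe_layer(model, layer_idx, init_snapshot=None):
    """Return a copy of `model` with one layer re-initialized or re-randomized."""
    probed = copy.deepcopy(model)
    layer = probed.encoder.layers[layer_idx]  # assumed module path
    if init_snapshot is not None:
        # Re-initialization: restore the weights the layer had at step 0
        layer.load_state_dict(init_snapshot[layer_idx])
    else:
        # Re-randomization: draw fresh random weights
        for m in layer.modules():
            if isinstance(m, nn.Linear):
                nn.init.xavier_uniform_(m.weight)
                if m.bias is not None:
                    nn.init.zeros_(m.bias)
    return probed  # evaluate this copy to measure the layer's sensitivity
```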
Investigation of Robustness of Hubert Features from Different Layers to Domain, Accent and Language Variations
● Our experiments indicate that as domain, accent, bandwidth and language deviate from the source domain, the relative improvement decreases.
● The last layer of HuBERT is very specific to the dataset on which it is trained. The second-to-last layer seems to be better when there are domain and accent differences.
● Middle layers are better suited when the data is from a different language.
Don't speak too fast: The impact of data bias on self-supervised speech models (NTU)
● Use the SUPERB benchmark to vary the gender, content, and speech speed of pre-training datasets
● Gender → Adding a few minority-class samples mitigates the performance drop
● Content → The model was insensitive to content perplexity
● Speech speed → Faster speech is worse
1.3 Security
Speech anonymization (Emmanuel Vincent)
● Speech information
○ Verbal content (identifiers, private info, etc.)
○ Speaker (identity, gender, age, ethnic origin, etc.)
○ Nonverbal content (emotion, health, etc.)
○ Acoustic environment (acoustics, other speakers, etc.)
● Risks
○ User profiling, user identification, voice cloning, information leakage
● Methods
○ Embedded systems, cryptography, obfuscation, anonymization, federated learning, etc.
○ Simple modifications (e.g., pitch shifts) utterly fail against knowledgeable attackers
● Current speech anonymization challenge != legal definition
○ It seems that many big companies don’t anonymize speech (collected from various sources)
○ Tasks: (1) ASR, (2) emotion recognition
Preserving Trajectory Privacy in Driving Data Release
● The innovative services provided by intelligent transport systems (ITS) come with potential privacy attacks.
● For example, in traffic monitoring systems, individual users continuously send anonymized personal location traces to aid in traffic state estimation.
● However, an adversary may link an anonymous GPS trace to a particular person given additional knowledge of the person’s residence or working location.
● Protecting against this cannot be achieved by data encryption or by hiding the driver identity. We resort to the notion of inference privacy, which sanitizes raw data to limit the amount of contained private information.
Audio Deepfake Detection 2022: the First Audio Deep Synthesis Detection Challenge
● http://addchallenge.cn/
● Low-quality fake audio detection: focuses on bona fide vs. fully fake utterances with various real-world noises, etc.
○ Fully generated utterances
● Partially fake audio detection: distinguish partially fake audio from real audio
○ Generated by manipulating genuine utterances
● Audio fake game: solve both an audio generation task and an audio fake detection task
Aasist: Audio anti-spoofing using integrated spectro-temporal graph attention networks (Naver)
● Spoofing detection solutions can be an important consideration when automatic speaker verification systems are deployed in real-world applications.
● Two major scenarios:
○ Logical access (LA): spoofing attacks mounted with voice conversion and TTS
○ Physical access (PA): bona fide utterances are captured and then replayed
● Recent studies show that discriminative information (i.e., spoofing artefacts) can reside in specific temporal and spectral intervals
Characterizing the adversarial vulnerability of speech self-supervised learning (Helen Meng)
● Speech processing Universal PERformance Benchmark (SUPERB)
○ Upstream model (self-supervised) + downstream models (directly use features, or e.g. fine-tune)
● Adversarial Attacks
○ Limited-knowledge adversaries: attackers can access the internals of the target model (parameters and gradients), but do not know which downstream task will be conducted.
○ Zero-knowledge adversaries: the target model is unavailable to the attackers; a substitute model is used to approximate gradients for adversarial sample generation.
○ XAB listening test: check whether humans can distinguish adversarial samples
● Results: attacks are effective, and humans cannot easily distinguish them.
Adversarial Sample Detection for Speaker Verification by Neural Vocoders (Tencent)
● Automatic speaker verification (ASV), one of the most important technologies for biometric identification, has been widely adopted in security-critical applications.
● However, ASV is seriously vulnerable to recently emerged adversarial attacks, yet effective countermeasures against them are limited.
Source Mixing and Separation Robust Audio Steganography (Sony)
● Audio steganography is the science of concealing secret messages inside a host audio (the carrier) in such a way that the concealment is unnoticeable to human ears.
● Recently, deep neural networks (DNNs) have been used as steganographic functions for hiding data inside images at high capacity.
● The network learns to conceal a hidden message inside the carrier without manually specifying a particular redundancy to exploit.
● PixInWav: Residual Steganography for Hiding Pixels in Audio
Exploiting language model for efficient linguistic steganalysis
● Linguistic steganography (LS)
○ Natural language is actually quite suitable for steganography.
○ The advantage is that LS can be easily concealed by the huge number of social activities.
○ Two flavors: (1) modification-based and (2) generation-based; the latter allows more data to be embedded
● Steganalysis = detecting whether secret data is embedded in the media
● There is a significant difference between automatically generated stego texts and carrier texts in terms of the conditional probability distribution of individual words.
1.4 Post-COVID Teleconferencing
Acoustic Echo Cancellation
● Acoustic echo refers to the phenomenon that occurs when a microphone picks up the far-end signal played by a loudspeaker.
● This can cause anything from a slight annoyance to a significant breakdown of a communication system.
● ICASSP 2022 AEC Challenge by Microsoft
● Various scenarios
○ Long or varying delays
○ Strong speaker/mic distortions
○ Stationary/non-stationary noise
○ Glitches (due to high CPU usage)
○ etc.
Deep Noise Suppression
● Audio calls in the presence of background noise degrade significantly in the quality/intelligibility of the perceived speech.
● ICASSP 2022 Deep Noise Suppression Challenge by Microsoft
Multi-Channel Multi-Party Meeting Transcription
● Speaker Diarization
○ Partitioning an input audio stream into homogeneous segments according to speaker identity, i.e., "who spoke when?"
● Multi-speaker ASR
○ Overlapped speech recognition is hard due to interfering speakers and background noise
● ICASSP 2022 M2MeT Challenge by Alibaba
VarArray: Array-geometry-agnostic continuous speech separation (Microsoft)
● Continuous speech separation using a microphone array has been shown to be promising for the speech overlap problem.
● Signals depend heavily on the positions of the microphones.
● In meetings, we can assume that at most two speakers are active for the majority of the meeting time.
Multimodal Systems
● Audio-Visual Object Classification For Human-Robot Collaboration
● Multimodal Information Based Speech Processing
● Machine Translation for Spoken and Written Language
● Image and Video Understanding
● Multimodal Signal Processing, Analysis, and Synthesis
● Audio Security and Multi-Modal Systems
● Multi-modal Analysis and Synthesis
● Multimodal Data Fusion and Processing
● Multimodal Analysis in Audio Applications
1.5 Applications
Emotion Recognition
● Speech emotion recognition using self-supervised features
○ A modular end-to-end SER system based on an upstream + downstream architecture paradigm, which allows easy use/integration of a large variety of self-supervised features.
● Memobert: Pre-training model with prompt-based learning for multimodal emotion recognition
○ Learns multimodal joint representations through self-supervised learning
○ Prompt-based method that reformulates emotion classification as masked text prediction
● Multimodal Emotion Recognition with Surgical and Fabric Masks
○ Investigates how muffled speech and occluded facial expressions change the prediction of emotions
Speech as a Disease Biomarker
● Fraug: A Frame Rate Based Data Augmentation Method for Depression Detection from Speech Signals
○ Among others, the speech signal is an important biomarker of our mental state and can be collected remotely, in a non-invasive manner with no expert supervision.
○ Recently, speech-based automatic diagnosis of depression has gained significant momentum.
● Exploring Dementia Detection from Speech: Cross Corpus Analysis
○ Population aging is responsible for an increase in new Alzheimer’s disease (AD) cases, creating the need for scalable, cost-effective methods that can detect early-stage AD.
○ Speech and language biomarkers are strong indicators of dementia, and provide a low-cost and widespread alternative for the assessment of cognitive states.
● The Second Dicova Challenge: Dataset and Performance Analysis for Diagnosis of Covid-19 Using Acoustics
○ Dataset of audio recordings consisting of breathing, cough and speech signals
○ Provides a point-of-care, rapid, easy-to-use, and cost-effective tool to help contain COVID-19 spread.
Voice Conversion
● Robust disentangled variational speech representation learning for zero-shot voice conversion (Tencent)
○ Feeds an arbitrary speaker embedding and content embeddings to the VAE decoder
● Controllable Speech Representation Learning Via Voice Conversion and AIC Loss (Adobe)
○ Its disentangled components (content, pitch, speaker identity, and energy) can be controlled independently to alter the synthesis result.
● An Investigation of Streaming Non-Autoregressive sequence-to-sequence Voice Conversion
● Voice Filter: Few-shot text-to-speech speaker adaptation using voice conversion as a post-processing module (Amazon)
○ Uses voice conversion (VC) as a post-processing module appended to a pre-existing high-quality TTS system, framing the few-shot TTS problem as a VC task.
Music Applications
● HiFi-SVC: Fast High Fidelity Cross-Domain Singing Voice Conversion
● Music Enhancement via Image Translation and Vocoding (Adobe)
● Source Separation By Steering Pretrained Music Models
● MELONS: generating melody with long-term structure using transformers and
structure graph
● Genre-Conditioned Long-Term 3D Dance Generation Driven by Music
● Deep Performer: Score-to-Audio Music Performance Synthesis (Dolby)
● SleepGAN: Towards Personalized Sleep Therapy Music (Nokia)
● Modeling beats and downbeats with a time-frequency Transformer (ByteDance)
Quantum Machine Learning
● Languages: Google Cirq / Microsoft Q# / IBM Qiskit
● Services: Google Quantum AI / Azure Quantum / IBM Quantum
● The dawn of quantum natural language processing
○ Successfully trains a quantum-enhanced Long Short-Term Memory network to perform part-of-speech tagging via numerical simulations.
○ Practical applications are more likely to be a hybrid of classical and quantum operations. This hybrid approach is not too different from what has been done in the past decade with GPUs.
○ The main idea behind Quantum Machine Learning (QML) is to replace parts of a neural network (e.g., linear layers) with a quantum counterpart.
● Quantum federated learning with quantum data
○ Hybrid models fall short when dealing with highly complex, purely quantum data.
○ Thus, purely quantum ML models such as quantum neural networks (QNNs) were developed to address these challenges.
○ However, due to the fragile nature of the carriers of quantum data, i.e., qubits, there is a natural need for distributed learning solutions such as federated learning (FL).
Machine Learning is All You Need
● Audio Representations
○ Learnable Wavelet Packet Transform for Data-Adapted Spectrograms
● Encodings
○ A Low-Parametric Model for Bit-Rate Estimation of VVC Residual Coding
○ Low-Complexity Multi-Model CNN in-Loop Filter for AVS3
● Digital Signal Processing
○ Learning Structured Sparsity For Time-Frequency Reconstruction
○ Learning Approach For Fast Approximate Matrix Factorizations
● Communication Systems
○ Adaptive Wireless Power Allocation with Graph Neural Networks
○ Deep Joint Source-Channel Coding for Wireless Image Transmission with Adaptive Rate Control
● Beamforming
○ Deep learning for location based beamforming with NLOS channels
○ Phase-Only Reconfigurable Sparse Array Beamforming Using Deep Learning
II. Topics related to our tasks
2.1 Multilingualism / Cross-lingualism
Joint Unsupervised and Supervised Training for Multilingual ASR (Google)
● Most existing methods adopt a 2-stage scheme: the self-supervised loss is optimized in the first, pretraining stage, and standard supervised fine-tuning resumes in the second stage.
● This paper proposes an end-to-end (E2E) Joint Unsupervised and Supervised Training (JUST) method that combines the supervised loss with the self-supervised contrastive and masked language modeling (MLM) losses.
● Spectrogram + Quantizer (wav2vec 2.0) + RNN-T
● Wins over XLSR-53!
Pseudo-Labeling for Massively Multilingual Speech Recognition (Facebook)
● Prior works (from Facebook, similar authors)
○ Iterative Pseudo-Labeling for Speech Recognition (IPL) → LM + beam search to generate pseudo labels
○ slimIPL: Language-model-free iterative pseudo-labeling → use the model's own predictions (see the sketch below)
● Utilizing unlabeled data is helpful, even with trivial methods.
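A deliberately coarse sketch of one slimIPL-style round (the paper adds a dynamic pseudo-label cache and careful scheduling; `train_fn` and `transcribe_fn` are assumed helpers):
```python
def ipl_round(model, labeled, unlabeled, train_fn, transcribe_fn):
    """One round: the current model pseudo-labels the unlabeled pool with its
    own greedy predictions (no LM, no beam search), then retrains on the mix."""
    pseudo = [(x, transcribe_fn(model, x)) for x in unlabeled]
    train_fn(model, labeled + pseudo)
    return model  # iterate: this round's student is the next round's teacher
```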
Multilingual Text-To-Speech Training Using Cross Language Voice Conversion And Self-Supervised Learning Of Speech Representations (Facebook)
● It’s hard to find speakers with native proficiency in several languages.
● Uses a HifiGAN-like model to augment data (synthetic generation of the target speaker speaking a different language)
A Configurable Multilingual Model is All You Need to Recognize All Languages (Microsoft)
● Configurable multilingual model (CMM) recognizes speech from any combination of languages based on a multi-hot LID vector selected by users (see the sketch below)
● Language-specific vocabulary strategy (making the vocab smaller)
● Language-specific transformer cell (one per language)
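A minimal sketch of the configurable idea: a multi-hot language-ID vector gates language-specific branches on top of a shared transformation. Module names and sizes are illustrative, not the paper's architecture.
```python
import torch
import torch.nn as nn

class CMMLayer(nn.Module):
    def __init__(self, dim=256, n_langs=10):
        super().__init__()
        self.shared = nn.Linear(dim, dim)
        self.lang_cells = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_langs)])

    def forward(self, x, lid):
        """x: (B, T, dim); lid: (B, n_langs) multi-hot language selector"""
        out = self.shared(x)
        for i, cell in enumerate(self.lang_cells):
            # Each language-specific cell contributes only if selected
            out = out + lid[:, i].view(-1, 1, 1) * cell(x)
        return out

layer = CMMLayer()
lid = torch.zeros(2, 10)
lid[0, 3] = 1.0               # utterance 0: a single language
lid[1, 2] = lid[1, 7] = 1.0   # utterance 1: a two-language mix
y = layer(torch.randn(2, 5, 256), lid)
```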
Zero-Shot Cross-Lingual Transfer Using Multi-Stream Encoder and Efficient Speaker Representation (Tencent)
● Extracts speaker embedding features that are independent of both content information and language identity.
● Multi-stream = input text sequences are fed into N-stream text encoders in parallel
● Zero-shot cross-lingual transfer strategy = also fine-tune with target-lingual data + a language-balanced sampling strategy
Tackling data scarcity in speech translation using zero-shot multilingual machine translation techniques
● To tackle data scarcity, it is useful to leverage ASR and MT data for end-to-end ST models. Techniques from zero-shot multilingual text translation are explored and applied to the speech side.
● Uses language tokens & augmentation methods to make the model decide the output language based on the token.
Multi-Lingual Multi-Task Speech Emotion Recognition Using wav2vec 2.0
● Multi-task learning to increase emotion recognition performance
● Additional tasks
○ Gender Prediction (Ge)
○ Language Prediction (La)
○ F0 mean and standard deviation regression task (F0-me, F0-st)
○ Energy mean and standard deviation regression task (En-me, En-st)
○ Voice ratio regression task (Vr)
ADIMA: Abuse Detection In Multilingual Audio
● ADIMA: a novel, linguistically diverse, ethically sourced, expert-annotated and well-balanced multilingual abuse detection audio dataset comprising 11,775 audio samples in 10 Indic languages, spanning 65 hours and spoken by 6,446 unique users.
SERAB: A multi-lingual benchmark for speech emotion recognition
● Speech Emotion Recognition Adaptation Benchmark (SERAB): a framework for evaluating the performance and generalization capacity of different approaches to utterance-level SER.
2.2 Keyword Spotting
Why still keyword spotting?
● For many voice-enabled platforms, queries follow a highly Zipfian distribution. On the Comcast X1 entertainment system, for example, the top-20 commands constitute around 30% of the traffic.
● Using a full ASR system is excessive for targeting phonetically distinct commands with a small vocabulary.
● Audio-only wake word spotting (WWS), a special case of KWS, is challenging under noisy conditions due to environmental interference.
Temporal early exiting for streaming speech commands recognition (Comcast)
● Adds extra prediction heads and stops inference mid-way based on prediction entropy, as sketched below.
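A minimal sketch of entropy-based early exiting, assuming one classifier head per exit point and a single streaming utterance (names and the threshold are illustrative):
```python
import torch

def early_exit(per_layer_features, heads, threshold=0.5):
    """per_layer_features: non-empty list of feature tensors, one per exit
       heads: matching list of classifier heads (single-utterance batch)"""
    for features, head in zip(per_layer_features, heads):
        probs = torch.softmax(head(features), dim=-1)
        entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)
        if entropy.item() < threshold:
            return probs, True   # confident enough: stop computing here
    return probs, False          # fell through: use the final head's output
```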
A Study of Designing Compact Audio-Visual Wake Word Spotting System Based on Iterative Fine-Tuning in Neural Network Pruning
● Audio-visual keyword spotting; using both modalities is helpful
Text Adaptive Detection for Customizable Keyword Spotting
● Novel text-adaptive detection framework that directly formulates KWS as a detection rather than a classification problem
● A text prompt is used as input, i.e., customizable wake words
Joint Ego-Noise Suppression and Keyword Spotting on Sweeping Robots (Alibaba)
● A novel approach for joint ego-noise (self-created noise) suppression and keyword detection
● Small-footprint keyword spotting (KWS) on a sweeping robot, i.e., the conversation-triggering module of the audio interface
● A circular microphone array of M = 6 → multiple minimum variance distortionless response (MVDR) beamformers (see the sketch below)
● If the keyword is present, noise adaptation is slowed down to prevent keyword speech from being cancelled.
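For reference, the MVDR beamformer weights have the closed form w = R_n^{-1} d / (d^H R_n^{-1} d), applied per frequency bin; a NumPy sketch with illustrative shapes:
```python
import numpy as np

def mvdr_weights(noise_cov, steering):
    """noise_cov: (F, M, M) noise spatial covariance per frequency bin
       steering:  (F, M) steering vector toward the target direction"""
    inv = np.linalg.inv(noise_cov)                     # R_n^{-1}
    num = np.einsum("fmn,fn->fm", inv, steering)       # R_n^{-1} d
    den = np.einsum("fm,fm->f", steering.conj(), num)  # d^H R_n^{-1} d
    return num / den[:, None]                          # (F, M)

def apply_beamformer(weights, spectra):
    """spectra: (F, M, T) multi-channel STFT -> (F, T) enhanced output"""
    return np.einsum("fm,fmt->ft", weights.conj(), spectra)
```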
Unified Speculation, Detection, and Verification Keyword Spotting (Alexa)
● Speculation → early decision (giving a head start, reducing system latency)
● Detection → keyword trigger task, a more accurate decision
● Verification → verifies the previous decision (corrects mistakes)
● The proposed latency-aware max-pooling loss can control the latency-accuracy trade-off effectively.
2.3 Few-shot / Low-shot
An Adapter Based Pre-Training for Efficient and Scalable Self-Supervised Speech Representation Learning (Huawei)
● Apply adapters to the original w2v2 model to combat language forgetting.
https://www.notion.so/hpcnt/An-Adapter-Based-Pre-Training-for-Efficient-and-Scalable-Self-Supervised-Speech-Representation-Learn-0046747a578d4899b914e520959e01e8
Efficient Adapter Transfer of Self-Supervised Speech Models for Automatic Speech Recognition (Huawei)
● Fine-tune on the ASR task
● Apply adapters (a minimal sketch follows)
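A minimal sketch of the bottleneck adapter both papers rely on (dimensions and placement are illustrative assumptions): the pretrained encoder stays frozen and only the small down/up projections are trained.
```python
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        # Residual bottleneck: the layer starts out close to identity
        return x + self.up(self.act(self.down(x)))

# Typical usage: freeze the pretrained encoder, train only the adapters.
# for p in wav2vec2.parameters():
#     p.requires_grad = False
```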
Large-scale ASR Domain Adaptation by Self- and Semi-supervised Learning (Google)
● Joint training with both the RNN-T and self-supervised (wav2vec 2.0) losses
● Confidence Estimation Module (CEM) → filters out low-confidence samples in pseudo-labels for noisy student training (see the sketch below)
○ Binary cross-entropy between the estimated confidence p and the binary target sequence c
● The wav2vec 2.0 loss is applied on the causal encoder, so there is no transition gap from non-causal to causal.
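A sketch of the CEM training signal and the resulting pseudo-label filter, with assumed shapes and an assumed confidence threshold:
```python
import torch.nn.functional as F

def cem_loss(confidence, correct):
    """confidence: (B, U) estimated per-token confidence in [0, 1]
       correct:    (B, U) 1 where the hypothesized token matched the reference"""
    return F.binary_cross_entropy(confidence, correct.float())

def filter_pseudo_labels(utterances, confidences, threshold=0.9):
    # Keep utterances whose mean token confidence clears the threshold
    return [u for u, c in zip(utterances, confidences) if c.mean() >= threshold]
```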
Learning Domain-Invariant Transformation for Speaker Verification
● Meta-learning to generate domain-invariant embeddings without pre-training and fine-tuning
● Uses both a metric loss & a classification loss together
Magic dust for cross-lingual adaptation of monolingual wav2vec-2.0
● Monolingual wav2vec-2.0 is a good few-shot ASR learner in several languages
○ English → 8 target languages
○ Performance up to 86% compared to XLSR
○ ASR fine-tuning on English hurts other languages
● A monolingual wav2vec2 model pre-trained on a high-resource language, using moderately-sized unlabeled data and small-sized labeled data in the target language, yields performance similar to XLSR
● Dropout Uncertainty-Driven Self-Training (DUST), sketched below
○ Leverages unlabeled data by pseudo-labeling (semi-supervised)
○ The student from a previous round becomes the teacher for the next round
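A sketch of the DUST filter: transcribe once deterministically, then a few times with dropout enabled, and keep the pseudo-label only if the stochastic hypotheses stay close to the deterministic one. `transcribe_fn`, the number of passes, and the tolerance are assumptions.
```python
def edit_distance(a, b):
    # Plain Levenshtein distance between two token sequences
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def dust_keep(model, x, transcribe_fn, n_passes=3, tol=0.1):
    """Return the pseudo-label for utterance x, or None if the model is
    uncertain (dropout-on passes disagree with the dropout-off reference)."""
    model.eval()
    reference = transcribe_fn(model, x)
    model.train()  # enable dropout for the stochastic passes
    for _ in range(n_passes):
        hyp = transcribe_fn(model, x)
        if edit_distance(reference, hyp) / max(len(reference), 1) > tol:
            return None
    return reference
```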
2.4 Audio Augmentation
Filteraugment: An acoustic environmental data augmentation method
● FilterAugment mimics acoustic filters by applying different weights on frequency bands, enabling the model to extract relevant information from a wider frequency region (see the sketch below).
● An improved version of frequency masking, which masks information on random frequency bands.
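A minimal sketch of the band-weighting idea (roughly the step variant), assuming a log-mel input: split the frequency axis into a random number of bands and add a random per-band gain. Parameter ranges are illustrative.
```python
import torch

def filter_augment(spec, n_bands=(2, 5), gain_db=(-6.0, 6.0)):
    """spec: (batch, n_mels, time) log-mel spectrogram"""
    n_mels = spec.shape[1]
    n = int(torch.randint(n_bands[0], n_bands[1] + 1, (1,)))
    # Random band boundaries along the frequency axis
    cuts = torch.sort(torch.randint(1, n_mels, (n - 1,))).values
    edges = [0] + cuts.tolist() + [n_mels]
    out = spec.clone()
    for lo, hi in zip(edges[:-1], edges[1:]):
        gain = float(torch.empty(1).uniform_(*gain_db))
        out[:, lo:hi, :] += gain  # additive gain in the log domain
    return out

augmented = filter_augment(torch.randn(4, 80, 100))
```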
Auditory-Based Data Augmentation for end-to-end Automatic Speech Recognition
● Spectral smearing smooths the speech spectrum and suppresses details by broadening the bandwidths of the auditory filters.
● Loudness recruitment compresses the amplitudes of different frequency bands, simulating a damaged ear.
Intermix: An Interference-Based Data Augmentation and Regularization Technique for Automatic Deep Sound Classification
● Prev work: BC learning
○ Takes sound energy into account
● Prev work: SpeechMix
○ Similar to manifold mixup: mixes intermediate representations
● This work: InterMix
○ Also applies phase shifts to the inputs & uses them when mixing (see the sketch below)
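A rough sketch of interference-style mixing. Note the assumption: the paper applies phase shifts, and a circular time shift is used here as a crude stand-in (a time shift is a linear phase shift in the frequency domain); labels are mixed as in mixup.
```python
import torch

def intermix(x1, y1, x2, y2, alpha=0.2, max_shift=1600):
    """x1, x2: (T,) waveforms; y1, y2: one-hot (or soft) label vectors"""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    shift = int(torch.randint(0, max_shift, (1,)))
    x2 = torch.roll(x2, shifts=shift)  # shifted interferer
    x = lam * x1 + (1 - lam) * x2      # mixed waveform
    y = lam * y1 + (1 - lam) * y2      # mixed target
    return x, y
```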
Robust Speaker Verification Using Population-Based Data Augmentation
● A population-based searching strategy for optimizing the augmentation parameters.
● Instead of finding a fixed set of hyper-parameters, PBA learns a schedule for setting the hyper-parameters.
● List of augmentations used
○ Reverberation: Convolve with a room impulse response (RIR)
○ Music: A randomly selected music file from MUSAN is added
○ Noise: Noise from MUSAN is added
○ Babble: Babble noise is added
○ Frequency masking
○ Time masking
Various augmentations
● LPC Augment: an LPC-based ASR Data Augmentation Algorithm for Low and Zero-Resource Children's Dialects
○ The data augmentation procedure perturbs the formant peaks of the linear predictive coding (LPC) spectrum during LPC analysis and reconstruction.
○ Compared with SpecAugment & speed perturbation; did not show a clear advantage.
● ImportantAug: A Data Augmentation Agent for Speech
○ Adds noise to unimportant regions of the speech and not to important regions.
○ Importance is predicted per utterance by a data augmentation agent trained to maximize the amount of noise it adds while minimizing its impact on recognition performance.
● Fraug: A Frame Rate Based Data Augmentation Method for Depression Detection from Speech Signals
○ Changes the frame-width and frame-shift parameters during feature extraction
Task-specific Augmentations
● Cross-speaker style transfer for text-to-speech using data augmentation
○ Cross-speaker style transfer for TTS using data augmentation via voice conversion
● Spatial mixup: Directional loudness modification as data augmentation for sound event localization and detection
○ Applies parametric spatial audio effects for data augmentation, modifying the directional properties of a multi-channel spatial audio signal encoded in the ambisonics domain.
● Spatial Data Augmentation with Simulated Room Impulse Responses for Sound Event Localization and Detection
○ Augments spatial characteristics using simulated room impulse responses (RIRs); simulated RIRs are convolved with the source signals to obtain an augmented multi-channel training dataset.
● Distribution augmentation for low-resource expressive text-to-speech
○ Data augmentation through word permutations & constituency-parse-based tree substitutions
2.5 Federated Learning
Federated learning challenges and opportunities: An outlook (Amazon Alexa)
● Finding the lower limit of the number of communication rounds
○ Many local updates (for communication efficiency) can still converge to a desirable model.
○ Overly aggressive local updates will harm performance due to data heterogeneity
● Constraints
○ Memory constraint (each on-device model needs to be small in size)
○ Computation constraint (devices may perform only a limited number of gradient updates)
● Personalized FL
○ Conventional FL trains one model; personalized FL maintains a collection of client-specific models
○ Reduces test errors beyond what is possible with a single global model.
● Challenges of lifelong FL
○ Online updates with single-pass data
○ Coupling of model training and data generation
● Challenges on data
○ Data polarity (collected data does not represent the whole data distribution)
○ Data dependency (data are collected from time series with inevitable dependency)
Learnings from Federated Learning in the Real world (Alexa)
● Skewness: a few “heavy devices” hold large amounts of data while there are many “light users” with only a handful of data points.
● Non-uniform device selection, which utilizes the number of input points per device, outperforms uniform sampling.
● One-shot FL (uses the full range of data, single training) is compared with continual FL (avoids storing data, multiple training rounds). Continual FL outperforms the one-shot strategy in some settings, and is overall most beneficial for heavy devices.
Enabling on-device training of speech recognition models with federated dropout (Google)
● Communication/computation costs are strongly correlated with the size of the model being trained. Federated dropout is used to reduce the size of client models while training a full-size model server-side (see the sketch below).
● Furthermore, federated dropout causes smaller sub-models to have lower WER, making it easier to dynamically adjust the model size.
● A realistic setting for federated training of ASR models is used: a well-trained server-side model is adapted to a new domain with FL on edge devices.
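A minimal sketch of the server side of federated dropout for a single linear layer: sample a sub-model by keeping a random subset of output units, ship only those rows, and map the client's update back. Names, the per-layer granularity, and the merge rule are assumptions.
```python
import torch

def sample_submodel(weight, keep_frac=0.5):
    """weight: (out_dim, in_dim) full server-side layer -> (sub-weight, index map)"""
    out_dim = weight.shape[0]
    kept = torch.randperm(out_dim)[: int(out_dim * keep_frac)].sort().values
    return weight[kept].clone(), kept

def merge_update(full_weight, client_weight, kept):
    # Write the client's updated rows back into the full-size model
    full_weight[kept] = client_weight
    return full_weight
```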
Federated Self-supervised Learning
● Federated Self-Training for Data-Efficient Audio Recognition (Philips Research)
○ Self-training approach that exploits large-scale on-device unlabeled data to improve the generalization of audio recognition models
○ Generates pseudo labels & trains with softened labels
● Federated Self-Supervised Learning for Acoustic Event Classification (Amazon)
○ Applies FL to improve acoustic event classification (AEC) performance while no customer data can be directly uploaded to the server
○ No pseudo labels (common in AEC)
○ Solves the task of predicting the future audio frame via feature representations
EOD

More Related Content

What's hot

What's hot (20)

[DL輪読会]Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
[DL輪読会]Swin Transformer: Hierarchical Vision Transformer using Shifted Windows[DL輪読会]Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
[DL輪読会]Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
 
【DL輪読会】Patches Are All You Need? (ConvMixer)
【DL輪読会】Patches Are All You Need? (ConvMixer)【DL輪読会】Patches Are All You Need? (ConvMixer)
【DL輪読会】Patches Are All You Need? (ConvMixer)
 
【メタサーベイ】Neural Fields
【メタサーベイ】Neural Fields【メタサーベイ】Neural Fields
【メタサーベイ】Neural Fields
 
【DL輪読会】Segment Anything
【DL輪読会】Segment Anything【DL輪読会】Segment Anything
【DL輪読会】Segment Anything
 
Coreset+SVM (論文紹介)
Coreset+SVM (論文紹介)Coreset+SVM (論文紹介)
Coreset+SVM (論文紹介)
 
自己教師学習(Self-Supervised Learning)
自己教師学習(Self-Supervised Learning)自己教師学習(Self-Supervised Learning)
自己教師学習(Self-Supervised Learning)
 
微分可能な信号処理に基づく音声合成器を用いた DNN 音声パラメータ推定の検討
微分可能な信号処理に基づく音声合成器を用いた DNN 音声パラメータ推定の検討微分可能な信号処理に基づく音声合成器を用いた DNN 音声パラメータ推定の検討
微分可能な信号処理に基づく音声合成器を用いた DNN 音声パラメータ推定の検討
 
Deeplearning輪読会
Deeplearning輪読会Deeplearning輪読会
Deeplearning輪読会
 
実装レベルで学ぶVQVAE
実装レベルで学ぶVQVAE実装レベルで学ぶVQVAE
実装レベルで学ぶVQVAE
 
【DL輪読会】A Time Series is Worth 64 Words: Long-term Forecasting with Transformers
【DL輪読会】A Time Series is Worth 64 Words: Long-term Forecasting with Transformers【DL輪読会】A Time Series is Worth 64 Words: Long-term Forecasting with Transformers
【DL輪読会】A Time Series is Worth 64 Words: Long-term Forecasting with Transformers
 
WaveNetが音声合成研究に与える影響
WaveNetが音声合成研究に与える影響WaveNetが音声合成研究に与える影響
WaveNetが音声合成研究に与える影響
 
機械学習モデルのハイパパラメータ最適化
機械学習モデルのハイパパラメータ最適化機械学習モデルのハイパパラメータ最適化
機械学習モデルのハイパパラメータ最適化
 
SSII2021 [SS1] Transformer x Computer Visionの 実活用可能性と展望 〜 TransformerのCompute...
SSII2021 [SS1] Transformer x Computer Visionの 実活用可能性と展望 〜 TransformerのCompute...SSII2021 [SS1] Transformer x Computer Visionの 実活用可能性と展望 〜 TransformerのCompute...
SSII2021 [SS1] Transformer x Computer Visionの 実活用可能性と展望 〜 TransformerのCompute...
 
東京大学2020年度深層学習(Deep learning基礎講座) 第9回「深層学習と自然言語処理」
東京大学2020年度深層学習(Deep learning基礎講座) 第9回「深層学習と自然言語処理」東京大学2020年度深層学習(Deep learning基礎講座) 第9回「深層学習と自然言語処理」
東京大学2020年度深層学習(Deep learning基礎講座) 第9回「深層学習と自然言語処理」
 
Active Learning と Bayesian Neural Network
Active Learning と Bayesian Neural NetworkActive Learning と Bayesian Neural Network
Active Learning と Bayesian Neural Network
 
[DL輪読会]Wavenet a generative model for raw audio
[DL輪読会]Wavenet a generative model for raw audio[DL輪読会]Wavenet a generative model for raw audio
[DL輪読会]Wavenet a generative model for raw audio
 
Bayesian Neural Networks : Survey
Bayesian Neural Networks : SurveyBayesian Neural Networks : Survey
Bayesian Neural Networks : Survey
 
深層学習の不確実性 - Uncertainty in Deep Neural Networks -
深層学習の不確実性 - Uncertainty in Deep Neural Networks -深層学習の不確実性 - Uncertainty in Deep Neural Networks -
深層学習の不確実性 - Uncertainty in Deep Neural Networks -
 
High-impact Papers in Computer Vision: 歴史を変えた/トレンドを創る論文
High-impact Papers in Computer Vision: 歴史を変えた/トレンドを創る論文High-impact Papers in Computer Vision: 歴史を変えた/トレンドを創る論文
High-impact Papers in Computer Vision: 歴史を変えた/トレンドを創る論文
 
[DL輪読会]BANMo: Building Animatable 3D Neural Models from Many Casual Videos
[DL輪読会]BANMo: Building Animatable 3D Neural Models from Many Casual Videos[DL輪読会]BANMo: Building Animatable 3D Neural Models from Many Casual Videos
[DL輪読会]BANMo: Building Animatable 3D Neural Models from Many Casual Videos
 

Similar to Trends of ICASSP 2022

Learning with limited labelled data in NLP: multi-task learning and beyond
Learning with limited labelled data in NLP: multi-task learning and beyondLearning with limited labelled data in NLP: multi-task learning and beyond
Learning with limited labelled data in NLP: multi-task learning and beyond
Isabelle Augenstein
 
LiDeng-BerlinOct2015-ASR-GenDisc-4by3.pptx
LiDeng-BerlinOct2015-ASR-GenDisc-4by3.pptxLiDeng-BerlinOct2015-ASR-GenDisc-4by3.pptx
LiDeng-BerlinOct2015-ASR-GenDisc-4by3.pptx
VishnuRajuV
 
"Large-Scale Deep Learning for Building Intelligent Computer Systems," a Keyn...
"Large-Scale Deep Learning for Building Intelligent Computer Systems," a Keyn..."Large-Scale Deep Learning for Building Intelligent Computer Systems," a Keyn...
"Large-Scale Deep Learning for Building Intelligent Computer Systems," a Keyn...
Edge AI and Vision Alliance
 
EXPLORING NATURAL LANGUAGE PROCESSING (1).pptx
EXPLORING NATURAL LANGUAGE PROCESSING (1).pptxEXPLORING NATURAL LANGUAGE PROCESSING (1).pptx
EXPLORING NATURAL LANGUAGE PROCESSING (1).pptx
AtulKumarUpadhyay4
 

Similar to Trends of ICASSP 2022 (20)

Crafting Recommenders: the Shallow and the Deep of it!
Crafting Recommenders: the Shallow and the Deep of it! Crafting Recommenders: the Shallow and the Deep of it!
Crafting Recommenders: the Shallow and the Deep of it!
 
2017 Tutorial - Deep Learning for Dialogue Systems
2017 Tutorial - Deep Learning for Dialogue Systems2017 Tutorial - Deep Learning for Dialogue Systems
2017 Tutorial - Deep Learning for Dialogue Systems
 
#1 Berlin Students in AI, Machine Learning & NLP presentation
#1 Berlin Students in AI, Machine Learning & NLP presentation#1 Berlin Students in AI, Machine Learning & NLP presentation
#1 Berlin Students in AI, Machine Learning & NLP presentation
 
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
 
Learning with limited labelled data in NLP: multi-task learning and beyond
Learning with limited labelled data in NLP: multi-task learning and beyondLearning with limited labelled data in NLP: multi-task learning and beyond
Learning with limited labelled data in NLP: multi-task learning and beyond
 
Thomas Wolf "Transfer learning in NLP"
Thomas Wolf "Transfer learning in NLP"Thomas Wolf "Transfer learning in NLP"
Thomas Wolf "Transfer learning in NLP"
 
End-to-end Speech Recognition with Recurrent Neural Networks (D3L6 Deep Learn...
End-to-end Speech Recognition with Recurrent Neural Networks (D3L6 Deep Learn...End-to-end Speech Recognition with Recurrent Neural Networks (D3L6 Deep Learn...
End-to-end Speech Recognition with Recurrent Neural Networks (D3L6 Deep Learn...
 
Deep Learning | Speaker Indentification
Deep Learning | Speaker IndentificationDeep Learning | Speaker Indentification
Deep Learning | Speaker Indentification
 
Conversational Agents in Portuguese: A Study Using Deep Learning
Conversational Agents in Portuguese: A Study Using Deep LearningConversational Agents in Portuguese: A Study Using Deep Learning
Conversational Agents in Portuguese: A Study Using Deep Learning
 
LiDeng-BerlinOct2015-ASR-GenDisc-4by3.pptx
LiDeng-BerlinOct2015-ASR-GenDisc-4by3.pptxLiDeng-BerlinOct2015-ASR-GenDisc-4by3.pptx
LiDeng-BerlinOct2015-ASR-GenDisc-4by3.pptx
 
neural based_context_representation_learning_for_dialog_act_classification
neural based_context_representation_learning_for_dialog_act_classificationneural based_context_representation_learning_for_dialog_act_classification
neural based_context_representation_learning_for_dialog_act_classification
 
Deep Learning in practice : Speech recognition and beyond - Meetup
Deep Learning in practice : Speech recognition and beyond - MeetupDeep Learning in practice : Speech recognition and beyond - Meetup
Deep Learning in practice : Speech recognition and beyond - Meetup
 
Deep learning: the future of recommendations
Deep learning: the future of recommendationsDeep learning: the future of recommendations
Deep learning: the future of recommendations
 
Esa act
Esa actEsa act
Esa act
 
deepnet-lourentzou.ppt
deepnet-lourentzou.pptdeepnet-lourentzou.ppt
deepnet-lourentzou.ppt
 
Dealing with Data Scarcity in Natural Language Processing - Belgium NLP Meetup
Dealing with Data Scarcity in Natural Language Processing - Belgium NLP MeetupDealing with Data Scarcity in Natural Language Processing - Belgium NLP Meetup
Dealing with Data Scarcity in Natural Language Processing - Belgium NLP Meetup
 
Deep Learning Architectures for NLP (Hungarian NLP Meetup 2016-09-07)
Deep Learning Architectures for NLP (Hungarian NLP Meetup 2016-09-07)Deep Learning Architectures for NLP (Hungarian NLP Meetup 2016-09-07)
Deep Learning Architectures for NLP (Hungarian NLP Meetup 2016-09-07)
 
"Large-Scale Deep Learning for Building Intelligent Computer Systems," a Keyn...
"Large-Scale Deep Learning for Building Intelligent Computer Systems," a Keyn..."Large-Scale Deep Learning for Building Intelligent Computer Systems," a Keyn...
"Large-Scale Deep Learning for Building Intelligent Computer Systems," a Keyn...
 
EXPLORING NATURAL LANGUAGE PROCESSING (1).pptx
EXPLORING NATURAL LANGUAGE PROCESSING (1).pptxEXPLORING NATURAL LANGUAGE PROCESSING (1).pptx
EXPLORING NATURAL LANGUAGE PROCESSING (1).pptx
 
Rise of AI through DL
Rise of AI through DLRise of AI through DL
Rise of AI through DL
 

More from Kwanghee Choi

More from Kwanghee Choi (19)

Visual Transformers
Visual TransformersVisual Transformers
Visual Transformers
 
추천 시스템 한 발짝 떨어져 살펴보기 (3)
추천 시스템 한 발짝 떨어져 살펴보기 (3)추천 시스템 한 발짝 떨어져 살펴보기 (3)
추천 시스템 한 발짝 떨어져 살펴보기 (3)
 
Recommendation systems: Vertical and Horizontal Scrolls
Recommendation systems: Vertical and Horizontal ScrollsRecommendation systems: Vertical and Horizontal Scrolls
Recommendation systems: Vertical and Horizontal Scrolls
 
추천 시스템 한 발짝 떨어져 살펴보기 (1)
추천 시스템 한 발짝 떨어져 살펴보기 (1)추천 시스템 한 발짝 떨어져 살펴보기 (1)
추천 시스템 한 발짝 떨어져 살펴보기 (1)
 
추천 시스템 한 발짝 떨어져 살펴보기 (2)
추천 시스템 한 발짝 떨어져 살펴보기 (2)추천 시스템 한 발짝 떨어져 살펴보기 (2)
추천 시스템 한 발짝 떨어져 살펴보기 (2)
 
Before and After the AI Winter - Recap
Before and After the AI Winter - RecapBefore and After the AI Winter - Recap
Before and After the AI Winter - Recap
 
Mastering Gomoku - Recap
Mastering Gomoku - RecapMastering Gomoku - Recap
Mastering Gomoku - Recap
 
Teachings of Ada Lovelace
Teachings of Ada LovelaceTeachings of Ada Lovelace
Teachings of Ada Lovelace
 
div, grad, curl, and all that - a review
div, grad, curl, and all that - a reviewdiv, grad, curl, and all that - a review
div, grad, curl, and all that - a review
 
Gaussian processes
Gaussian processesGaussian processes
Gaussian processes
 
Neural Architecture Search: Learning How to Learn
Neural Architecture Search: Learning How to LearnNeural Architecture Search: Learning How to Learn
Neural Architecture Search: Learning How to Learn
 
Duality between OOP and RL
Duality between OOP and RLDuality between OOP and RL
Duality between OOP and RL
 
JFEF encoding
JFEF encodingJFEF encoding
JFEF encoding
 
Bandit algorithms for website optimization - A summary
Bandit algorithms for website optimization - A summaryBandit algorithms for website optimization - A summary
Bandit algorithms for website optimization - A summary
 
Dummy log generation using poisson sampling
 Dummy log generation using poisson sampling Dummy log generation using poisson sampling
Dummy log generation using poisson sampling
 
Azure functions: Quickstart
Azure functions: QuickstartAzure functions: Quickstart
Azure functions: Quickstart
 
Modern convolutional object detectors
Modern convolutional object detectorsModern convolutional object detectors
Modern convolutional object detectors
 
Usage of Moving Average
Usage of Moving AverageUsage of Moving Average
Usage of Moving Average
 
Jpl coding standard for the c programming language
Jpl coding standard for the c programming languageJpl coding standard for the c programming language
Jpl coding standard for the c programming language
 

Recently uploaded

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Recently uploaded (20)

DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 

Trends of ICASSP 2022

  • 2. Before we start ● ICASSP? ○ De-facto conference for audio/speech (with InterSpeech) ● 2022 ICASSP Stats ○ 3967 papers submitted, 1785 (45.0%) accepted ○ https://github.com/lixin4ever/Conference-Acceptance-Rate ○ About ~100 papers skimmed ● Slides ○ General topics & Service-related topics ○ Can’t go through everything, will try to finish quickly
  • 3. Contents 1. General Trend 1. 2022 SotA 2. Contrastive / Self-supervised 3. Security 4. Post-COVID Teleconferencing 5. Applications 2. Topics related to our tasks 1. Multilingualism / Cross-lingualism 2. Keyword Spotting 3. Few-shot / Low-shot 4. Audio Augmentation 5. Federated Learning
  • 6. General Models ● Wav2vec (Facebook) ○ Raw audio + 1D Convs + BERT + CTC Loss (Similar to MLM, feature distractors) ● Wav2vec 2.0 (Facebook) ○ Raw audio + 1D Convs + BERT + Codebooks + Contrastive loss (quantized distractors) ● HuBERT (Facebook) ○ Raw audio + 1D Convs + BERT + MFCC-based Clustering + Cluster Ensembles ● BigSSL/CAP12 (Google) ○ Spectrogram + 1D Convs + Conformers + Contrastive loss (quantized distractors) ● Data2vec (Facebook) ○ Raw audio + 1D Convs + BERT + Student-teacher (EMA) prediction ● WavLM (Microsoft) ○ HuBERT + Speech denoising + Gated relative position bias
  • 7. GLUE-like Benchmarks ● SUPERB (NTU) ○ Speech processing Universal PERformance Benchmark https://superbbenchmark.org/ ○ Recognition: Phoneme Recognition, Automatic Speech Recognition ○ Detection: Keyword Spotting, Query by Example Spoken Term Detection ○ Semantics: Intent Classification, Slot Filling, Speech Translation ○ Speaker: Speaker Identification, Automatic Speaker Verification, Speaker Diarization ○ Paralinguistics: Emotion Recognition ○ Generation: Speech enhancement, Speech Separation ● NOSS (Google) ○ NOn-Semantic Speech Benchmark https://ai.googleblog.com/2020/06/improving-speech-representations-and.html ○ Speaker identification, Language identification, Command, Emotion, Dementia/healthy ● HARES (Deepmind) ○ Holistic audio representation evaluation suite ○ Environment: Audio tagging, Animal/Scene classification ○ Speech: Keyword Spotting, Intention Classification, Language identification, Speaker identification ○ Music: Instrument Identification, Pitch estimation, Music tagging
  • 8. 1.2 Contrastive / Self-supervised
  • 9. Towards Learning Universal Audio Representations (DeepMind) ● HARES: New BLEU-like benchmark ● SlowFast (from Video) + NFNet (from Vision) seems to be great. ○ SlowFast: Two branches with bigger/smaller kernel width ○ NFNet: Normalizer-Free ResNets ● CPC (contrastive learning) works quite well. https://www.notion.so/hpcnt/Towards-Learning-Universal-Audio-Repres entations-ed8774b85de143c097175b3646cd84e1
  • 10. Universal paralinguistic speech representations using self-supervised conformers (Google) ● Contrastive learning (of w2v2) on Conformers ○ Future works already conducted on distilling this model (TRILLsson) ○ https://ai.googleblog.com/2022/03/trillsson-small-universal-speech.html ● Closely following previous work BigSSL (Google) ○ Trained with speech-heavy youtube videos ○ Their conclusion: SSL + Large Models are especially helpful for small datasets ● Best performance wasn’t from the final layer’s feature vector. (Same conclusion from BigSSL) → CAP12 (12th layer feature outputs) https://www.notion.so/hpcnt/Universal-paralinguistic-speech-representations-using-self-supervised-conformers-d621d75b95eb 4369ab34cc5237603393
  • 11. A Noise-Robust Self-supervised Pre-training Model Based Speech Representation Learning for Automatic Speech Recognition Making the w2v feature encoder robust to additional noise via contrastive loss https://www.notion.so/hpcnt/A-Noise-Robust-Self-supervised-Pr e-training-Model-Based-Speech-Representation-Learning-for-Aut omatic-537ec0ccbd874303840b582db90a3a9d
  • 12. Wav2vec-switch: Contrastive learning from original-noisy speech pairs for robust speech recognition (Microsoft) ● Contextualized representation being robust to noise Original w2v2 loss
  • 13. DistilHuBERT: Speech representation learning by layer-wise distillation of hidden-unit BERT (NTU)
  • 14. Improving Self-Supervised Learning for Speech Recognition with Intermediate Layer Supervision (Microsoft) ● The common practice of SSL is to compute the self-supervised loss on the top layer, such as wav2vec 2.0 and HuBERT. ● However, the lower layers of such a pre-trained model is shown to have a low correlation with phonetic information. ● In this work, we propose to apply intermediate layer supervision to encourage lower layers to learn content knowledge → Apply exact same thing of HUBERT to lower layers
  • 15. Exploring Heterogeneous Characteristics of Layers in ASR Models for More Efficient Training (Google) ● Based on “Are All Layers Created Equal?” (Bengio) ○ Fix intermediate layers’ weights to some other weights ○ Re-initialization: Come back to initial values ○ Re-randomization: Get random weights ● While ambient layers were present in all model sizes, we observed that larger models had more ambient layers, i.e., overparameterized models. ● During early rounds, the ambient layers were more spread throughout the model; only later the separation become more distinct. ● GN was more robust (against re-random) than BN.
  • 16. Investigation of Robustness of Hubert Features from Different Layers to Domain, Accent and Language Variations ● Our experiments indicate that as domain, accent, bandwidth and language deviates from the source domain, the relative improvement decreases. ● The last layer of HuBERT is very specific to the dataset on which it is trained. The second last layer seems to be better when there is domain and accent differences. ● Middle layers are more suited when data is from a different language.
  • 17. Don't speak too fast: The impact of data bias on self-supervised speech models (NTU) ● Use SUPERB benchmark to differ Gender, Content, Speech speed of pre-trained datasets ● Gender → Adding few minor class samples will mitigate performance drop ● Content → Model didn’t care perplexity ● Speech speed → Faster speech is worse
  • 19. Speech anonymization (Emmanuel Vincent) ● Speech information ○ Verbal content (identifiers, private info, etc) ○ Speaker (identity, gender, age, ethnic origin, etc) ○ Nonverbal content (emotion, health, etc) ○ Acoustic environment (acoustics, other speakers, etc) ● Risks ○ User profiling, user identification, voice cloning, information leakage ● Methods ○ Embedded systems, Cryptography, Obfuscation, Anonymization, Federated Learning, etc ○ Simple modifications (ex. Pitch shifts) utterly fail for knowledgeable attackers ● Current speech anonymization challenge != Legal defn. ○ It seems that many big companies doesn’t anonymize speech (collected from various sources) ○ Task: (1) ASR (2) Emotion recognition
  • 20. Preserving Trajectory Privacy in Driving Data Release ● What comes with the innovative services provided by intelligent transport systems (ITS) are potential privacy attacks. ● For example, in traffic monitoring systems, individual users send anonymized personal location traces continuously to aid in traffic state estimation. ● However, an adversary may link an anonymous GPS trace to a particular person provided additional knowledge of the person’s residence or working location. ● This can not be achieved by data encryption or hiding the driver identity. We resort to the notion of inference privacy that sanitizes raw data to limit the amount of contained private information.
  • 21. Audio Deepfake Detection 2022: the First Audio Deep Synthesis Detection Challenge ● http://addchallenge.cn/ ● Low-quality fake audio detection: focuses on dealing with bona fide and fully fake utterances with various real-world noises etc ○ Fully generated utterances ● Partially fake audio detection: distinguish the partially fake audio from the real ○ Generated by manipulating the genuine utterances ● Audio fake game: Solve both an audio generation task and an audio fake detection task
  • 22. Aasist: Audio anti-spoofing using integrated spectro-temporal graph attention networks (Naver) ● Spoofing detection solutions can be an important consideration when automatic speaker verification systems are deployed in real-world applications. ● Two major scenarios: ○ Logical access (LA): spoofing attacks mounted with voice conversion and TTS ○ Physical access (PA): bona fide utterances are captured and then replayed ● Recent studies show that discriminative information (i.e., spoofing artefacts) can reside in specific temporal and spectral intervals
• 23. Characterizing the adversarial vulnerability of speech self-supervised learning (Helen Meng)
● Speech processing Universal PERformance Benchmark (SUPERB)
○ Upstream model (self-supervised) + downstream models (directly use the features, e.g., fine-tuning)
● Adversarial attacks
○ Limited-knowledge adversaries: attackers can access the internals of the target model (parameters and gradients), but do not know which downstream task will be conducted.
○ Zero-knowledge adversaries: the target model is unavailable to the attackers; a substitute model is used to approximate gradients for adversarial sample generation.
○ XAB listening test: check whether humans can distinguish adversarial samples
● Results: the attacks are effective, and humans cannot easily distinguish the samples.
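A minimal FGSM-style sketch of the kind of attack studied, assuming a differentiable upstream encoder and a downstream classification head; the names, the cross-entropy objective, and the step size are illustrative stand-ins, not the paper's exact attack.

```python
# Hypothetical one-step adversarial perturbation of a waveform.
import torch
import torch.nn.functional as F

def fgsm_attack(encoder, head, wav, label, epsilon=1e-3):
    """One signed-gradient step that increases the downstream loss."""
    wav = wav.clone().detach().requires_grad_(True)
    logits = head(encoder(wav))            # upstream SSL model + task head
    loss = F.cross_entropy(logits, label)  # loss w.r.t. the true label
    loss.backward()
    # Small epsilon keeps the perturbation hard to hear (cf. the XAB test).
    return (wav + epsilon * wav.grad.sign()).detach()
```

In the zero-knowledge setting, the attacker would run the same step against a substitute encoder and head, then transfer the perturbed waveform to the real target.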
• 24. Adversarial Sample Detection for Speaker Verification by Neural Vocoders (Tencent)
● Automatic speaker verification (ASV), one of the most important technologies for biometric identification, has been widely adopted in security-critical applications.
● However, ASV is seriously vulnerable to recently emerged adversarial attacks, and effective countermeasures against them remain limited.
• 25. Source Mixing and Separation Robust Audio Steganography (Sony)
● Audio steganography is the science of concealing secret messages inside a host audio, called a carrier, such that the concealment is unnoticeable to human ears.
● Recently, deep neural networks (DNNs) have been used as steganographic functions for hiding data inside images to achieve high capacity.
● The network learns to conceal a hidden message inside the carrier without manually specifying a particular redundancy to exploit.
● Related: PixInWav: Residual Steganography for Hiding Pixels in Audio
• 26. Exploiting language model for efficient linguistic steganalysis
● Linguistic steganography (LS)
○ Natural language is actually quite suitable for steganography.
○ The advantage is that LS can be easily concealed among the huge number of social activities.
○ Two flavors: (1) modification-based and (2) generation-based
○ The latter allows more data to be embedded
● Steganalysis = detecting whether secret data is embedded in the media
● There is a significant difference between automatically generated stego texts and carrier texts in terms of the conditional probability distributions of individual words.
• 28. Acoustic Echo Cancellation
● Acoustic echo refers to the phenomenon that occurs when a microphone picks up the far-end signal played by a loudspeaker.
● This can cause anything from a slight annoyance to a significant breakdown of a communication system.
● ICASSP 2022 AEC Challenge by Microsoft
● Various scenarios
○ Long or varying delays
○ Strong speaker/mic distortions
○ Stationary/non-stationary noise
○ Glitches (due to high CPU usage)
○ etc.
• 29. Deep Noise Suppression
● Audio calls in the presence of background noise get significantly degraded in terms of quality/intelligibility of the perceived speech.
● ICASSP 2022 Deep Noise Suppression Challenge by Microsoft
• 30. Multi-Channel Multi-Party Meeting Transcription
● Speaker diarization
○ Partitioning an input audio stream into homogeneous segments according to speaker identity, i.e., “who spoke when?”
● Multi-speaker ASR
○ Overlapped speech recognition is hard due to interfering speakers and background noise
● ICASSP 2022 M2MeT Challenge by Alibaba
• 31. VarArray: Array-geometry-agnostic continuous speech separation (Microsoft)
● Continuous speech separation using a microphone array has been shown to be promising for the speech overlap problem.
● Signals depend heavily on the positions of the microphones.
● In meetings, we can assume only two or fewer speakers are active for the majority of the meeting time.
• 32. Multimodal Systems
● Audio-Visual Object Classification For Human-Robot Collaboration
● Multimodal Information Based Speech Processing
● Machine Translation for Spoken and Written Language
● Image and Video Understanding
● Multimodal Signal Processing, Analysis, and Synthesis
● Audio Security and Multi-Modal Systems
● Multi-modal Analysis and Synthesis
● Multimodal Data Fusion and Processing
● Multimodal Analysis in Audio Applications
• 34. Emotion Recognition
● Speech emotion recognition using self-supervised features
○ A modular end-to-end SER system based on an upstream + downstream architecture paradigm, allowing easy use/integration of a large variety of self-supervised features.
● Memobert: Pre-training model with prompt-based learning for multimodal emotion recognition
○ Learns multimodal joint representations through self-supervised learning
○ A prompt-based method reformulates emotion classification as masked text prediction
● Multimodal Emotion Recognition with Surgical and Fabric Masks
○ Investigates how muffled speech and occluded facial expressions change emotion predictions
• 35. Speech as a Disease Biomarker
● Fraug: A Frame Rate Based Data Augmentation Method for Depression Detection from Speech Signals
○ Among others, the speech signal is an important biomarker of our mental state and can be collected remotely, in a non-invasive manner with no expert supervision.
○ Recently, speech-based automatic diagnosis of depression has gained significant momentum.
● Exploring Dementia Detection from Speech: Cross Corpus Analysis
○ Population aging is responsible for an increase in new Alzheimer’s disease (AD) cases and creates the need for scalable, cost-effective methods that can detect early-stage AD.
○ Speech and language biomarkers are strong indicators of dementia, and provide a low-cost and widespread alternative for the assessment of cognitive states.
● The Second Dicova Challenge: Dataset and Performance Analysis for Diagnosis of Covid-19 Using Acoustics
○ A dataset of audio recordings consisting of breathing, cough, and speech signals
○ Provides a point-of-care, rapid, easy-to-use, and cost-effective tool to help contain the spread of COVID-19.
• 36. Voice Conversion
● Robust disentangled variational speech representation learning for zero-shot voice conversion (Tencent)
○ Feeds an arbitrary speaker embedding and content embeddings to the VAE decoder
● Controllable Speech Representation Learning Via Voice Conversion and AIC Loss (Adobe)
○ Its disentangled components (content, pitch, speaker identity, and energy) can be controlled independently to alter the synthesis result.
● An Investigation of Streaming Non-Autoregressive sequence-to-sequence Voice Conversion
● Voice Filter: Few-shot text-to-speech speaker adaptation using voice conversion as a post-processing module (Amazon)
○ Uses voice conversion (VC) as a post-processing module appended to a pre-existing high-quality TTS system, framing the few-shot TTS problem as a VC task.
• 37. Music Applications
● HiFi-SVC: Fast High Fidelity Cross-Domain Singing Voice Conversion
● Music Enhancement via Image Translation and Vocoding (Adobe)
● Source Separation By Steering Pretrained Music Models
● MELONS: generating melody with long-term structure using transformers and structure graph
● Genre-Conditioned Long-Term 3D Dance Generation Driven by Music
● Deep Performer: Score-to-Audio Music Performance Synthesis (Dolby)
● SleepGAN: Towards Personalized Sleep Therapy Music (Nokia)
● Modeling beats and downbeats with a time-frequency Transformer (ByteDance)
• 38. Quantum Machine Learning
● Languages: Google Cirq / Microsoft Q# / IBM Qiskit
● Services: Google Quantum AI / Azure Quantum / IBM Quantum
● The dawn of quantum natural language processing
○ Successfully trains a quantum-enhanced Long Short-Term Memory network to perform part-of-speech tagging via numerical simulations.
○ Practical applications are more likely to be a hybrid of classical and quantum operations; this hybrid approach is not too different from what has been done in the past decade with GPUs.
○ The main idea behind Quantum Machine Learning (QML) is to replace parts of a neural network (e.g., linear layers) with a quantum counterpart.
● Quantum federated learning with quantum data
○ Hybrid models fall short when dealing with highly complex, purely quantum data.
○ Thus, purely quantum ML models that can address these challenges were developed, such as quantum neural networks (QNNs).
○ However, due to the fragile nature of the carriers of quantum data, i.e., qubits, there is a natural need for distributed learning solutions such as federated learning (FL).
• 39. Machine Learning is All You Need
● Audio Representations
○ Learnable Wavelet Packet Transform for Data-Adapted Spectrograms
● Encodings
○ A Low-Parametric Model for Bit-Rate Estimation of VVC Residual Coding
○ Low-Complexity Multi-Model CNN in-Loop Filter for AVS3
● Digital Signal Processing
○ Learning Structured Sparsity For Time-Frequency Reconstruction
○ Learning Approach For Fast Approximate Matrix Factorizations
● Communication Systems
○ Adaptive Wireless Power Allocation with Graph Neural Networks
○ Deep Joint Source-Channel Coding for Wireless Image Transmission with Adaptive Rate Control
● Beamforming
○ Deep learning for location based beamforming with NLOS channels
○ Phase-Only Reconfigurable Sparse Array Beamforming Using Deep Learning
  • 40. II. Topics related to our tasks
• 42. Joint Unsupervised and Supervised Training for Multilingual ASR (Google)
● Most existing methods adopt a two-stage scheme: the self-supervised loss is optimized in a first pre-training stage, and standard supervised fine-tuning resumes in the second stage.
● This paper proposes an end-to-end (E2E) Joint Unsupervised and Supervised Training (JUST) method that combines the supervised loss with the self-supervised contrastive and masked language modeling (MLM) losses.
● Spectrogram + quantizer (wav2vec 2.0) + RNN-T
● Wins over XLSR-53!
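A minimal sketch of the single-stage objective, assuming a model object that exposes the three losses; the method names and loss weights are illustrative stand-ins, not the exact JUST configuration.

```python
# Hypothetical joint objective: all losses optimized together, end to end.
def just_loss(model, batch, w_c=1.0, w_mlm=1.0, w_rnnt=1.0):
    feats = model.encode(batch["audio"])             # shared encoder
    l_c = model.contrastive_loss(feats)              # wav2vec 2.0-style
    l_mlm = model.mlm_loss(feats)                    # masked prediction
    l_rnnt = model.rnnt_loss(feats, batch["text"])   # supervised RNN-T
    # One training stage instead of pretrain-then-finetune.
    return w_c * l_c + w_mlm * l_mlm + w_rnnt * l_rnnt
```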
• 43. Pseudo-Labeling for Massively Multilingual Speech Recognition (Facebook)
● Previous works (from Facebook, similar authors)
○ Iterative Pseudo-Labeling for Speech Recognition (IPL) → LM + beam search to generate pseudo labels
○ slimIPL: Language-model-free iterative pseudo-labeling → use the model’s own predictions
● Utilizing unlabeled data is helpful, even with trivial methods.
• 44. Multilingual Text-To-Speech Training Using Cross Language Voice Conversion And Self-Supervised Learning Of Speech Representations (Facebook)
● It’s hard to find speakers with native proficiency in several languages.
● Uses a HifiGAN-like model to augment data (synthetic generation of the target speaker speaking a different language)
• 45. A Configurable Multilingual Model is All You Need to Recognize All Languages (Microsoft)
● A configurable multilingual model (CMM) recognizes speech from any combination of languages, based on a multi-hot LID vector selected by users
● Language-specific vocabulary strategy (making the vocabulary smaller)
● Language-specific transformer cell (one per language)
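A minimal sketch of conditioning on a user-selected multi-hot LID vector, assuming a fixed language inventory; the inventory, projection layer, and dimensions are illustrative, not the CMM architecture.

```python
import torch
import torch.nn as nn

LANGS = ["en", "de", "zh", "es"]  # hypothetical language inventory

def multi_hot_lid(selected):
    """Users select any combination of languages -> multi-hot vector."""
    v = torch.zeros(len(LANGS))
    for lang in selected:
        v[LANGS.index(lang)] = 1.0
    return v

lid_proj = nn.Linear(len(LANGS), 256)  # map LID vector into model dimension
lid_vec = multi_hot_lid(["en", "de"])  # e.g., an English/German bilingual user
conditioning = lid_proj(lid_vec)       # combined with the acoustic features
```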
• 46. Zero-Shot Cross-Lingual Transfer Using Multi-Stream Encoder and Efficient Speaker Representation (Tencent)
● Extract speaker embedding features that are independent of both content information and language identity.
● Multi-stream = input text sequences are fed into N parallel text-encoder streams
● Zero-shot cross-lingual transfer strategy = also fine-tune with target-lingual data + a language-balanced sampling strategy
• 47. Tackling data scarcity in speech translation using zero-shot multilingual machine translation techniques
● To tackle data scarcity, it is useful to leverage ASR and MT data for end-to-end ST models; the paper explores techniques from zero-shot multilingual text translation and applies them to the speech side.
● Uses language tokens & augmentation methods so the model decides the output language based on the tokens.
• 48. Multi-Lingual Multi-Task Speech Emotion Recognition Using wav2vec 2.0
● Multi-task learning to increase emotion recognition performance
● Additional tasks
○ Gender prediction (Ge)
○ Language prediction (La)
○ F0 mean and standard deviation regression (F0-me, F0-st)
○ Energy mean and standard deviation regression (En-me, En-st)
○ Voice ratio regression (Vr)
• 49. ADIMA: Abuse Detection In Multilingual Audio
● ADIMA: a novel, linguistically diverse, ethically sourced, expert-annotated, and well-balanced multilingual abuse detection audio dataset comprising 11,775 audio samples in 10 Indic languages, spanning 65 hours and spoken by 6,446 unique users.
• 50. SERAB: A multi-lingual benchmark for speech emotion recognition
● Speech Emotion Recognition Adaptation Benchmark (SERAB): a framework for evaluating the performance and generalization capacity of different approaches to utterance-level SER.
• 52. Why still keyword spotting?
● For many voice-enabled platforms, queries follow a highly Zipfian distribution. On the Comcast X1 entertainment system, for example, the top-20 commands constitute around 30% of the traffic.
● Using an ASR system is excessive for targeting phonetically distinct commands with a small vocabulary.
● Audio-only wake word spotting (WWS), a special case of KWS, is challenging under noisy conditions due to environmental interference.
• 53. Temporal early exiting for streaming speech commands recognition (Comcast)
● Attach additional prediction heads at intermediate layers and stop inference mid-way when the prediction entropy is low enough.
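A minimal sketch of entropy-based early exiting, assuming per-layer classification heads and a pooled utterance representation with batch size 1; the block/head structure and threshold are illustrative, not the paper's exact model.

```python
import torch

def entropy(probs, eps=1e-8):
    """Shannon entropy of a probability vector (low = confident)."""
    return -(probs * (probs + eps).log()).sum(dim=-1)

def early_exit_forward(blocks, heads, x, threshold=0.5):
    """Run blocks in order; exit as soon as a head is confident enough."""
    for block, head in zip(blocks, heads):
        x = block(x)
        probs = torch.softmax(head(x), dim=-1)
        if entropy(probs).item() < threshold:  # confident -> stop early
            return probs
    return probs  # fell through to the final head
```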
• 54. A Study of Designing Compact Audio-Visual Wake Word Spotting System Based on Iterative Fine-Tuning in Neural Network Pruning
● Audio-visual keyword spotting
● Using both modalities is helpful
• 55. Text Adaptive Detection for Customizable Keyword Spotting
● A novel text-adaptive detection framework that directly formulates KWS as a detection rather than a classification problem
● A text prompt is used as input, i.e., customizable wake words
• 56. Joint Ego-Noise Suppression and Keyword Spotting on Sweeping Robots (Alibaba)
● A novel approach for joint ego-noise (self-created noise) suppression and keyword detection
● Small-footprint keyword spotting (KWS) on a sweeping robot, i.e., the conversation-triggering module of the audio interface
● A circular microphone array of M = 6 → multiple minimum variance distortionless response (MVDR) beamformers
● If the keyword is present, noise adaptation is slowed down to prevent the keyword speech from being cancelled.
• 57. Unified Speculation, Detection, and Verification Keyword Spotting (Alexa)
● Speculation → early decision (gives a head start, reduces system latency)
● Detection → the keyword trigger task, a more accurate decision
● Verification → verifies the previous decision (corrects mistakes)
● The proposed latency-aware max-pooling loss can control the latency-accuracy trade-off effectively.
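A minimal sketch of a max-pooling KWS loss with a latency constraint, assuming frame-level keyword posteriors; the windowing is only illustrative of trading accuracy for earlier decisions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def latency_aware_maxpool_loss(logits, is_keyword, end_frame, max_latency=20):
    """logits: (T, 2) frame posteriors; positives must fire within a window."""
    if is_keyword:
        # Only frames up to end_frame + max_latency may carry the trigger,
        # pushing the model toward earlier, lower-latency decisions.
        window = logits[: end_frame + max_latency]
        frame = window[:, 1].argmax()          # most keyword-like frame
        return F.cross_entropy(window[frame: frame + 1], torch.tensor([1]))
    # Negatives: suppress the most keyword-like frame anywhere.
    frame = logits[:, 1].argmax()
    return F.cross_entropy(logits[frame: frame + 1], torch.tensor([0]))
```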
• 59. An Adapter Based Pre-Training for Efficient and Scalable Self-Supervised Speech Representation Learning (Huawei)
● Apply adapters to the original wav2vec 2.0 to combat language forgetting.
● https://www.notion.so/hpcnt/An-Adapter-Based-Pre-Training-for-Efficient-and-Scalable-Self-Supervised-Speech-Representation-Learn-0046747a578d4899b914e520959e01e8
• 60. Efficient Adapter Transfer of Self-Supervised Speech Models for Automatic Speech Recognition (Huawei)
● Fine-tune on the ASR task
● Apply adapters
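A minimal sketch of the generic bottleneck-adapter design these papers build on; the hidden size and bottleneck width are illustrative, not the papers' exact settings.

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Residual bottleneck inserted after a frozen transformer layer."""
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)  # project down
        self.up = nn.Linear(bottleneck, dim)    # project back up
        self.act = nn.GELU()

    def forward(self, x):
        # Only these small layers are trained; the pre-trained
        # backbone stays frozen, which keeps transfer cheap.
        return x + self.up(self.act(self.down(x)))
```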
• 61. Large-scale ASR Domain Adaptation by Self- and Semi-supervised Learning (Google)
● Joint training with both the RNN-T and a self-supervised loss (wav2vec 2.0)
● Confidence Estimation Module (CEM) → filters out low-confidence pseudo-labeled samples for noisy student training
○ Trained with binary cross entropy between the estimated confidence p and the binary target sequence c
● The wav2vec 2.0 loss is applied on the causal encoder, so there is no transition gap from non-causal to causal.
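A minimal sketch of confidence-based filtering for noisy student training; `transcribe_with_confidence` is a hypothetical stand-in for an ASR model plus CEM, and the threshold is illustrative.

```python
def make_pseudo_labels(model, unlabeled_audio, threshold=0.9):
    """Keep only pseudo-labels the confidence estimator trusts."""
    pseudo = []
    for audio in unlabeled_audio:
        hyp, confidence = model.transcribe_with_confidence(audio)
        if confidence >= threshold:       # drop low-confidence hypotheses
            pseudo.append((audio, hyp))
    return pseudo
```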
• 62. Learning Domain-Invariant Transformation for Speaker Verification
● Meta-learning to generate domain-invariant embeddings without pre-training and fine-tuning
● Uses metric loss & classification loss together
• 63. Magic dust for cross-lingual adaptation of monolingual wav2vec-2.0
● Monolingual wav2vec 2.0 is a good few-shot ASR learner in several languages
○ English → 8 target languages
○ Performance up to 86% compared to XLSR
○ ASR fine-tuning on English hurts other languages
● A monolingual wav2vec 2.0 model pre-trained on a high-resource language, using moderately sized unlabeled data and small labeled data in the target language, yields performance similar to XLSR
● Dropout Uncertainty-Driven Self-Training (DUST)
○ Leverages unlabeled data via pseudo-labeling (semi-supervised)
○ The student from one round becomes the teacher for the next round
• 65. Filteraugment: An acoustic environmental data augmentation method
● FilterAugment mimics acoustic filters by applying different weights to frequency bands, enabling the model to extract relevant information from a wider frequency region.
● An improved version of frequency masking, which masks information in random frequency bands.
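A minimal sketch of the band-weighting idea, assuming a (freq, time) magnitude spectrogram; the band count and gain range are illustrative, not the paper's configuration.

```python
import torch

def filter_augment(spec, n_bands=4, db_range=(-6.0, 6.0)):
    """Apply a random per-band gain instead of zeroing bands outright."""
    n_freq = spec.shape[0]
    # Random band boundaries over the frequency axis.
    edges = torch.sort(torch.randint(1, n_freq, (n_bands - 1,))).values
    edges = torch.cat([torch.tensor([0]), edges, torch.tensor([n_freq])])
    gains_db = torch.empty(n_bands).uniform_(*db_range)
    out = spec.clone()
    for i in range(n_bands):
        # dB gain -> amplitude scale for this frequency band.
        out[edges[i]: edges[i + 1]] *= 10 ** (gains_db[i] / 20)
    return out
```

Unlike frequency masking, every band keeps some signal, just at a different level, which is closer to what a real acoustic filter does.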
• 66. Auditory-Based Data Augmentation for end-to-end Automatic Speech Recognition
● Spectral smearing smooths the speech spectrum and suppresses details by broadening the bandwidths of the auditory filters.
● Loudness recruitment compresses the amplitudes of different frequency bands, simulating a damaged ear.
• 67. Intermix: An Interference-Based Data Augmentation and Regularization Technique for Automatic Deep Sound Classification
● Previous work: BC learning
○ Takes sound energy into account
● Previous work: SpeechMix
○ Similar to manifold mixup; mixes intermediate representations
● This work: InterMix
○ Also applies phase shifts to the inputs & uses them when mixing
• 68. Robust Speaker Verification Using Population-Based Data Augmentation
● A population-based searching strategy for optimizing the augmentation parameters.
● Instead of finding a fixed set of hyper-parameters, PBA learns a schedule for setting the hyper-parameters.
● List of augmentations used
○ Reverberation: convolve with a room impulse response (RIR)
○ Music: music from a randomly selected MUSAN clip is added
○ Noise: noise from MUSAN is added
○ Babble: babble noise is added
○ Frequency masking
○ Time masking
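A minimal sketch of two of the listed waveform augmentations, assuming MUSAN noise and RIR tensors are already loaded; the SNR handling and shapes are illustrative.

```python
import torch

def add_noise(speech, noise, snr_db):
    """Mix noise into speech at a target signal-to-noise ratio."""
    noise = noise[: len(speech)]
    p_speech = speech.pow(2).mean()
    p_noise = noise.pow(2).mean().clamp_min(1e-10)
    scale = torch.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

def add_reverb(speech, rir):
    """Convolve the waveform with a room impulse response."""
    return torch.nn.functional.conv1d(
        speech.view(1, 1, -1), rir.flip(-1).view(1, 1, -1),
        padding=rir.numel() - 1).view(-1)[: len(speech)]
```

Under PBA, parameters like `snr_db` would follow a learned schedule rather than staying fixed across training.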
• 69. Various augmentations
● LPC Augment: an LPC-based ASR Data Augmentation Algorithm for Low and Zero-Resource Children's Dialects
○ The augmentation procedure perturbs the formant peaks of the linear predictive coding (LPC) spectrum during LPC analysis and reconstruction.
○ Compared against SpecAugment & speed perturbation; did not show an absolute advantage.
● ImportantAug: A Data Augmentation Agent for Speech
○ Adds noise to unimportant regions of the speech and not to important regions.
○ Importance is predicted per utterance by a data augmentation agent trained to maximize the amount of noise it adds while minimizing its impact on recognition performance.
● Fraug: A Frame Rate Based Data Augmentation Method for Depression Detection from Speech Signals
○ Changes the frame-width and frame-shift parameters during feature extraction
• 70. Task-specific Augmentations
● Cross-speaker style transfer for text-to-speech using data augmentation
○ Cross-speaker style transfer for TTS using data augmentation via voice conversion
● Spatial mixup: Directional loudness modification as data augmentation for sound event localization and detection
○ Applies parametric spatial audio effects for data augmentation, modifying the directional properties of a multi-channel spatial audio signal encoded in the ambisonics domain.
● Spatial Data Augmentation with Simulated Room Impulse Responses for Sound Event Localization and Detection
○ Augments spatial characteristics using simulated room impulse responses (RIRs); the simulated RIRs are convolved with the source signals to obtain an augmented multi-channel training dataset.
● Distribution augmentation for low-resource expressive text-to-speech
○ Data augmentation through word permutations & constituency-parse-based tree substitutions
• 72. Federated learning challenges and opportunities: An outlook (Amazon Alexa)
● Finding the lower limit on the number of communication rounds
○ Many local updates (for communication efficiency) can still converge to a desirable model.
○ Overly aggressive local updates will harm performance due to data heterogeneity
● Constraints
○ Memory constraint (each on-device model needs to be small in size)
○ Computation constraint (devices may perform only a limited number of gradient updates)
● Personalized FL
○ Conventional FL trains one model; personalized FL maintains a collection of client-specific models
○ Can reduce test errors beyond what is possible with a single global model.
● Challenges of lifelong FL
○ Online updates with single-pass data
○ Coupling of model training and data generation
● Challenges on data
○ Data polarity (collected data does not represent the whole data distribution)
○ Data dependency (data are collected from time series with inevitable dependency)
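A minimal FedAvg-style sketch of the "many local updates per communication round" trade-off discussed above; the `client.loss` interface is a hypothetical stand-in, and the averaging assumes an all-float state_dict (toy models only).

```python
import copy
import torch

def federated_round(global_model, clients, local_steps=10, lr=0.01):
    states = []
    for client in clients:
        model = copy.deepcopy(global_model)  # local copy per client
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        # More local steps cut communication, but too many can hurt
        # convergence under heterogeneous (non-IID) client data.
        for _ in range(local_steps):
            opt.zero_grad()
            client.loss(model).backward()
            opt.step()
        states.append(model.state_dict())
    # Server aggregation: element-wise average of the client weights.
    avg = {k: torch.stack([s[k] for s in states]).mean(0) for k in states[0]}
    global_model.load_state_dict(avg)
    return global_model
```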
• 73. Learnings from Federated Learning in the Real world (Alexa)
● Skewness: there are “heavy devices” with large amounts of data and many “light users” with only a handful of data points.
● Non-uniform device selection, which utilizes the number of input points per device, outperforms uniform sampling.
● Comparing one-shot FL (uses the full range of data, single training) with continual FL (avoids storing data, multiple training rounds): continual FL outperforms the one-shot strategy in some settings and is overall most beneficial for heavy devices.
• 74. Enabling on-device training of speech recognition models with federated dropout (Google)
● Communication/computation costs are strongly correlated with the size of the model being trained; the paper proposes federated dropout to reduce the size of client models while training a full-size model server-side.
● Federated dropout also gives the smaller sub-models lower WER, making it easier to adjust the model size dynamically.
● Uses a realistic setting for federated training of ASR models: a well-trained server-side model is adapted to a new domain with FL on edge devices.
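A minimal sketch of the federated-dropout idea for a single weight matrix: each round a client receives and trains a random sub-model, and the server scatters the update back into the full model. The single-layer focus and `keep_frac` are illustrative, not Google's implementation.

```python
import torch

def make_submodel(weight, keep_frac=0.5):
    """Sample a random subset of output units to ship to a client."""
    n_out = weight.shape[0]
    idx = torch.randperm(n_out)[: int(n_out * keep_frac)]
    return weight[idx].clone(), idx   # smaller matrix = cheaper client round

def merge_update(weight, sub_weight, idx):
    """Write the client's trained sub-matrix back into the full model."""
    with torch.no_grad():
        weight[idx] = sub_weight
    return weight
```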
• 75. Federated Self-supervised Learning
● Federated Self-Training for Data-Efficient Audio Recognition (Philips Research)
○ A self-training approach that exploits large-scale on-device unlabeled data to improve the generalization of audio recognition models
○ Generate pseudo labels & train with softened labels
● Federated Self-Supervised Learning for Acoustic Event Classification (Amazon)
○ Applies FL to improve acoustic event classification (AEC) performance while no customer data can be directly uploaded to the server
○ No pseudo labels (common in AEC)
○ Solves the task of predicting the future audio frame via feature representations
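A minimal sketch of pseudo-labeling with softened labels, assuming a trained teacher that outputs class logits; the temperature value is illustrative.

```python
import torch

def soft_pseudo_labels(teacher, unlabeled_x, temperature=2.0):
    """Teacher predictions softened by temperature for self-training."""
    with torch.no_grad():
        logits = teacher(unlabeled_x)
    # Higher temperature = softer targets, which damps pseudo-label noise.
    return torch.softmax(logits / temperature, dim=-1)
```

The on-device student would then minimize cross-entropy (or KL divergence) against these soft targets instead of hard argmax labels.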
  • 76. EOD