Trends of ICASSP 2022

  1. 1. ICASSP 2022 Kwanghee Choi
  2. 2. Before we start ● ICASSP? ○ De-facto conference for audio/speech (with InterSpeech) ● 2022 ICASSP Stats ○ 3967 papers submitted, 1785 (45.0%) accepted ○ https://github.com/lixin4ever/Conference-Acceptance-Rate ○ Skimmed around 100 papers ● Slides ○ General topics & service-related topics ○ Can’t go through everything, will try to finish quickly
  3. 3. Contents 1. General Trend 1. 2022 SotA 2. Contrastive / Self-supervised 3. Security 4. Post-COVID Teleconferencing 5. Applications 2. Topics related to our tasks 1. Multilingualism / Cross-lingualism 2. Keyword Spotting 3. Few-shot / Low-shot 4. Audio Augmentation 5. Federated Learning
  4. 4. I. General Trend
  5. 5. 1.1 2022 SotA
  6. 6. General Models ● Wav2vec (Facebook) ○ Raw audio + 1D Convs + BERT + CTC Loss (Similar to MLM, feature distractors) ● Wav2vec 2.0 (Facebook) ○ Raw audio + 1D Convs + BERT + Codebooks + Contrastive loss (quantized distractors) ● HuBERT (Facebook) ○ Raw audio + 1D Convs + BERT + MFCC-based Clustering + Cluster Ensembles ● BigSSL/CAP12 (Google) ○ Spectrogram + 1D Convs + Conformers + Contrastive loss (quantized distractors) ● Data2vec (Facebook) ○ Raw audio + 1D Convs + BERT + Student-teacher (EMA) prediction ● WavLM (Microsoft) ○ HuBERT + Speech denoising + Gated relative position bias
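The checkpoints above are public, so a minimal feature-extraction sketch may help for orientation. This assumes the Hugging Face transformers package and the public facebook/wav2vec2-base checkpoint; it is not tied to any particular paper in this deck.

```python
# Minimal sketch: pulling self-supervised features from a public wav2vec 2.0
# checkpoint (assumes `pip install transformers torch`; the checkpoint name is
# the public "facebook/wav2vec2-base", not a model from the papers above).
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

waveform = torch.randn(16000)  # 1 second of dummy 16 kHz audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

features = out.last_hidden_state  # (1, frames, 768) contextual features
per_layer = out.hidden_states     # CNN output + every transformer layer
```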
  7. 7. GLUE-like Benchmarks ● SUPERB (NTU) ○ Speech processing Universal PERformance Benchmark https://superbbenchmark.org/ ○ Recognition: Phoneme Recognition, Automatic Speech Recognition ○ Detection: Keyword Spotting, Query by Example Spoken Term Detection ○ Semantics: Intent Classification, Slot Filling, Speech Translation ○ Speaker: Speaker Identification, Automatic Speaker Verification, Speaker Diarization ○ Paralinguistics: Emotion Recognition ○ Generation: Speech enhancement, Speech Separation ● NOSS (Google) ○ NOn-Semantic Speech Benchmark https://ai.googleblog.com/2020/06/improving-speech-representations-and.html ○ Speaker identification, Language identification, Command, Emotion, Dementia/healthy ● HARES (Deepmind) ○ Holistic audio representation evaluation suite ○ Environment: Audio tagging, Animal/Scene classification ○ Speech: Keyword Spotting, Intention Classification, Language identification, Speaker identification ○ Music: Instrument Identification, Pitch estimation, Music tagging
  8. 8. 1.2 Contrastive / Self-supervised
  9. 9. Towards Learning Universal Audio Representations (DeepMind) ● HARES: New GLUE-like benchmark ● SlowFast (from Video) + NFNet (from Vision) seems to be great. ○ SlowFast: Two branches with bigger/smaller kernel width ○ NFNet: Normalizer-Free ResNets ● CPC (contrastive learning) works quite well. https://www.notion.so/hpcnt/Towards-Learning-Universal-Audio-Representations-ed8774b85de143c097175b3646cd84e1
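For context, CPC-style pretraining boils down to an InfoNCE objective; a generic sketch follows (shapes and temperature are assumptions, not DeepMind's exact setup).

```python
# Generic InfoNCE loss as used in CPC-like contrastive pretraining (sketch only).
import torch
import torch.nn.functional as F

def info_nce(context, positives, negatives, temperature=0.1):
    """context, positives: (B, D); negatives: (B, K, D) distractors."""
    pos = F.cosine_similarity(context, positives, dim=-1).unsqueeze(1)   # (B, 1)
    neg = F.cosine_similarity(context.unsqueeze(1), negatives, dim=-1)   # (B, K)
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(logits.size(0), dtype=torch.long)               # positive sits at index 0
    return F.cross_entropy(logits, labels)
```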
  10. 10. Universal paralinguistic speech representations using self-supervised conformers (Google) ● Contrastive learning (of w2v2) on Conformers ○ Follow-up work has already distilled this model (TRILLsson) ○ https://ai.googleblog.com/2022/03/trillsson-small-universal-speech.html ● Closely follows the previous work BigSSL (Google) ○ Trained with speech-heavy YouTube videos ○ Their conclusion: SSL + large models are especially helpful for small datasets ● Best performance wasn’t from the final layer’s feature vector (same conclusion as BigSSL) → CAP12 (12th-layer feature outputs) https://www.notion.so/hpcnt/Universal-paralinguistic-speech-representations-using-self-supervised-conformers-d621d75b95eb4369ab34cc5237603393
  11. 11. A Noise-Robust Self-supervised Pre-training Model Based Speech Representation Learning for Automatic Speech Recognition Making the w2v feature encoder robust to additional noise via contrastive loss https://www.notion.so/hpcnt/A-Noise-Robust-Self-supervised-Pre-training-Model-Based-Speech-Representation-Learning-for-Automatic-537ec0ccbd874303840b582db90a3a9d
  12. 12. Wav2vec-switch: Contrastive learning from original-noisy speech pairs for robust speech recognition (Microsoft) ● Makes the contextualized representation robust to noise, trained on top of the original w2v2 loss
  13. 13. DistilHuBERT: Speech representation learning by layer-wise distillation of hidden-unit BERT (NTU)
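The gist of DistilHuBERT is that small student prediction heads regress selected teacher layers. A hedged sketch of such a layer-wise objective follows (L1 plus a cosine term; the exact weighting and layer choice are assumptions).

```python
# Layer-wise distillation objective in the spirit of DistilHuBERT (sketch; the
# paper's exact loss weighting and distilled-layer choice may differ).
import torch
import torch.nn.functional as F

def layerwise_distill_loss(student_preds, teacher_feats, lam=1.0):
    """student_preds, teacher_feats: lists of (B, T, D) tensors, one per distilled layer."""
    loss = 0.0
    for s, t in zip(student_preds, teacher_feats):
        l1 = F.l1_loss(s, t)                      # feature regression
        cos = F.cosine_similarity(s, t, dim=-1)   # (B, T) per-frame similarity
        loss = loss + l1 - lam * torch.log(torch.sigmoid(cos)).mean()
    return loss
```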
  14. 14. Improving Self-Supervised Learning for Speech Recognition with Intermediate Layer Supervision (Microsoft) ● The common practice in SSL is to compute the self-supervised loss on the top layer, as in wav2vec 2.0 and HuBERT. ● However, the lower layers of such a pre-trained model are shown to have a low correlation with phonetic information. ● In this work, we propose to apply intermediate layer supervision to encourage lower layers to learn content knowledge → apply the exact same HuBERT objective to lower layers as well
  15. 15. Exploring Heterogeneous Characteristics of Layers in ASR Models for More Efficient Training (Google) ● Based on “Are All Layers Created Equal?” (Bengio) ○ Fix intermediate layers’ weights to some other weights ○ Re-initialization: Come back to the initial values ○ Re-randomization: Get random weights ● While ambient layers were present in all model sizes, larger models had more ambient layers, i.e., overparameterized models. ● During early rounds, the ambient layers were spread throughout the model; only later did the separation become more distinct. ● GN (GroupNorm) was more robust (against re-randomization) than BN (BatchNorm).
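A sketch of the re-initialization / re-randomization probes, assuming a hypothetical model that exposes its transformer blocks as model.layers (not the paper's code).

```python
# Reset one layer either to its weights at initialization ("re-init") or to
# fresh random weights ("re-random"); `model.layers` is a hypothetical layout.
import torch.nn as nn

def probe_layer(model, init_state_dict, layer_idx, mode="re-init"):
    layer = model.layers[layer_idx]
    if mode == "re-init":
        # restore the weights this layer had at initialization
        prefix = f"layers.{layer_idx}."
        layer.load_state_dict({k[len(prefix):]: v for k, v in init_state_dict.items()
                               if k.startswith(prefix)})
    else:  # "re-random": draw fresh random weights
        for m in layer.modules():
            if isinstance(m, nn.Linear):
                nn.init.xavier_uniform_(m.weight)
                if m.bias is not None:
                    nn.init.zeros_(m.bias)
```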
  16. 16. Investigation of Robustness of Hubert Features from Different Layers to Domain, Accent and Language Variations ● Experiments indicate that as the domain, accent, bandwidth, and language deviate from the source domain, the relative improvement decreases. ● The last layer of HuBERT is very specific to the dataset on which it is trained. The second-to-last layer seems to be better when there are domain and accent differences. ● Middle layers are better suited when the data is from a different language.
  17. 17. Don't speak too fast: The impact of data bias on self-supervised speech models (NTU) ● Use the SUPERB benchmark to vary the gender, content, and speech speed of pre-training datasets ● Gender → Adding a few minority-class samples mitigates the performance drop ● Content → The model was insensitive to text perplexity ● Speech speed → Faster speech is worse
  18. 18. Security
  19. 19. Speech anonymization (Emmanuel Vincent) ● Speech information ○ Verbal content (identifiers, private info, etc.) ○ Speaker (identity, gender, age, ethnic origin, etc.) ○ Nonverbal content (emotion, health, etc.) ○ Acoustic environment (acoustics, other speakers, etc.) ● Risks ○ User profiling, user identification, voice cloning, information leakage ● Methods ○ Embedded systems, Cryptography, Obfuscation, Anonymization, Federated Learning, etc. ○ Simple modifications (e.g., pitch shifts) utterly fail against knowledgeable attackers ● Current speech anonymization challenge != legal definition ○ It seems that many big companies don’t anonymize speech (collected from various sources) ○ Tasks: (1) ASR (2) Emotion recognition
  20. 20. Preserving Trajectory Privacy in Driving Data Release ● The innovative services provided by intelligent transport systems (ITS) come with potential privacy attacks. ● For example, in traffic monitoring systems, individual users continuously send anonymized personal location traces to aid in traffic state estimation. ● However, an adversary may link an anonymous GPS trace to a particular person given additional knowledge of the person’s residence or working location. ● Preventing this cannot be achieved by data encryption or hiding the driver identity. We resort to the notion of inference privacy, which sanitizes raw data to limit the amount of contained private information.
  21. 21. Audio Deepfake Detection 2022: the First Audio Deep Synthesis Detection Challenge ● http://addchallenge.cn/ ● Low-quality fake audio detection: focuses on dealing with bona fide and fully fake utterances with various real-world noises etc ○ Fully generated utterances ● Partially fake audio detection: distinguish the partially fake audio from the real ○ Generated by manipulating the genuine utterances ● Audio fake game: Solve both an audio generation task and an audio fake detection task
  22. 22. Aasist: Audio anti-spoofing using integrated spectro-temporal graph attention networks (Naver) ● Spoofing detection solutions can be an important consideration when automatic speaker verification systems are deployed in real-world applications. ● Two major scenarios: ○ Logical access (LA): spoofing attacks mounted with voice conversion and TTS ○ Physical access (PA): bona fide utterances are captured and then replayed ● Recent studies show that discriminative information (i.e., spoofing artefacts) can reside in specific temporal and spectral intervals
  23. 23. Characterizing the adversarial vulnerability of speech self-supervised learning (Helen Meng) ● Speech processing Universal PERformance Benchmark (SUPERB) ○ Upstream model (self-supervised models) + Downstream models (directly uses features, ex. finetuning) ● Adversarial Attacks ○ Limited-knowledge adversaries: Attackers can access the internals of the target model (parameters and gradients). But they do not know which downstream task will be conducted. ○ Zero-knowledge adversaries: Target model is unavailable to the attackers. In such a case, the substitute model is used for approximating gradients for adversarial sample generation. ○ XAB listening test: check if humans can distinguish adversarial samples ● Results: Attacks are effective, humans cannot easily distinguish.
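As background, the simplest white-box attack of this kind is FGSM on the raw waveform; a generic sketch follows (the model, loss, and epsilon are placeholders, not the paper's attack setup).

```python
# FGSM-style adversarial waveform (sketch; not the specific attacks in the paper).
import torch

def fgsm_attack(model, loss_fn, waveform, target, epsilon=1e-3):
    """waveform: (1, samples) in [-1, 1]; target: label/transcript tensor."""
    waveform = waveform.clone().detach().requires_grad_(True)
    loss = loss_fn(model(waveform), target)
    loss.backward()
    adv = waveform + epsilon * waveform.grad.sign()   # small, hard-to-hear perturbation
    return adv.detach().clamp(-1.0, 1.0)              # keep a valid audio range
```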
  24. 24. Adversarial Sample Detection for Speaker Verification by Neural Vocoders (Tencent) ● Automatic speaker verification (ASV), one of the most important technologies for biometric identification, has been widely adopted in security-critical applications. ● However, ASV is seriously vulnerable to recently emerged adversarial attacks, yet effective countermeasures against them are limited.
  25. 25. Source Mixing and Separation Robust Audio Steganography (Sony) ● Audio steganography is the science of concealing secret messages inside a host audio called a carrier in such a way that the concealment is unnoticeable to human ears. ● Recently, deep neural networks (DNNs) have been used as a steganographic function for hiding data inside images to achieve high capacity. ● The network learns to conceal a hidden message inside the carrier without manually specifying a particular redundancy to exploit. PixInWav: Residual Steganography for Hiding Pixels in Audio
  26. 26. Exploiting language model for efficient linguistic steganalysis ● Linguistic steganography (LS) ○ Natural language is actually quite suitable for steganography. ○ The advantage is that LS can be easily concealed by the huge number of social activities. ○ (1) modification based and (2) generation based ○ Latter allows more data to be embedded ● Steganalysis = to detect whether there is secret data embedded in the media ● Significant difference between automatically generated stego texts and carrier texts in terms of the conditional probability distribution of individual words.
  27. 27. Post-COVID Teleconferencing
  28. 28. Acoustic Echo Cancellation ● Acoustic echo refers to the phenomenon that occurs when a microphone picks up the far-end signal that is played by a loudspeaker. ● This phenomenon can cause a slight annoyance or a significant breakdown in a communication system. ● ICASSP 2022 AEC Challenge by Microsoft ● Various scenarios ○ Long- or varying delays ○ Strong speaker/mic distortions ○ Stationary/non-stationary noise ○ Glitches (due to high CPU usages) ○ etc.
  29. 29. Deep Noise Suppression ● Audio calls in the presence of background noises get significantly degraded in terms of quality/intelligibility of the perceived speech. ● ICASSP 2022 Deep Noise Suppression Challenge by Microsoft
  30. 30. Multi-Channel Multi-Party Meeting Transcription ● Speaker Diarization ○ Partitioning an input audio stream into homogeneous segments according to the speaker identity, i.e. "who spoke when?” ● Multi-speaker ASR ○ Hard to do overlapped speech recognition due to the interfering speakers or background noise ● ICASSP 2022 M2MeT Challenge by Alibaba
  31. 31. VarArray: Array-geometry-agnostic continuous speech separation (Microsoft) ● Continuous speech separation using a microphone array was shown to be promising in dealing with the speech overlap problem. ● Signals highly depend on the position of the microphones. ● In meetings, we can assume that two or fewer speakers are active for the majority of the meeting time.
  32. 32. Multimodal Systems ● Audio-Visual Object Classification For Human-Robot Collaboration ● Multimodal Information Based Speech Processing ● Machine Translation for Spoken and Written Language ● Image and Video Understanding ● Multimodal Signal Processing, Analysis, and Synthesis ● Audio Security and Multi-Modal Systems ● Multi-modal Analysis and Synthesis ● Multimodal Data Fusion and Processing ● Multimodal Analysis in Audio Applications
  33. 33. Applications
  34. 34. Emotion Recognition ● Speech emotion recognition using self-supervised features ○ A modular End-to-End SER system based on an Upstream + Downstream architecture paradigm, which allows easy use/integration of a large variety of self-supervised features. ● Memobert: Pre-training model with prompt-based learning for multimodal emotion recognition ○ learns multimodal joint representations through self-supervised learning ○ prompt-based method that reformulates emotion classification as a masked text prediction ● Multimodal Emotion Recognition with Surgical and Fabric Masks ○ investigate how muffled speech and occluded facial expressions change the prediction of emotions
  35. 35. Speech as a Disease Biomarker ● Fraug: A Frame Rate Based Data Augmentation Method for Depression Detection from Speech Signals ○ Among others, the speech signal is an important biomarker of our mental state and can be collected remotely, in a non-invasive manner with no expert supervision. ○ Recently, speech-based automatic diagnosis of depression has gained significant momentum. ● Exploring Dementia Detection from Speech: Cross Corpus Analysis ○ Population aging is responsible for an increase of new Alzheimer’s disease (AD) cases, and creates the need for scalable, cost-effective methods that are able to detect early stage AD. ○ Speech and language biomarkers are strong indicators of dementia, and provide a low-cost and widespread alternative for the assessment of cognitive states. ● The Second Dicova Challenge: Dataset and Performance Analysis for Diagnosis of Covid-19 Using Acoustics ○ Dataset of audio recordings consisting of breathing, cough and speech signals ○ Providing a point-of-care, rapid, easy to use, and cost-effective tool to help contain COVID-19 spread.
  36. 36. Voice Conversion ● Robust disentangled variational speech representation learning for zero-shot voice conversion (Tencent) ○ Feeding an arbitrary speaker embedding and content embeddings to the VAE decoder ● Controllable Speech Representation Learning Via Voice Conversion and AIC Loss (Adobe) ○ Its disentangled components (content, pitch, speaker identity, and energy) can be controlled independently to alter the synthesis result. ● An Investigation of Streaming Non-Autoregressive sequence-to-sequence Voice Conversion ● Voice Filter: Few-shot text-to-speech speaker adaptation using voice conversion as a post-processing module (Amazon) ○ It uses voice conversion (VC) as a post-processing module appended to a pre-existing high-quality TTS system, framing the few-shot TTS problem as a VC task.
  37. 37. Music Applications ● HiFi-SVC: Fast High Fidelity Cross-Domain Singing Voice Conversion ● Music Enhancement via Image Translation and Vocoding (Adobe) ● Source Separation By Steering Pretrained Music Models ● MELONS: generating melody with long-term structure using transformers and structure graph ● Genre-Conditioned Long-Term 3D Dance Generation Driven by Music ● Deep Performer: Score-to-Audio Music Performance Synthesis (Dolby) ● SleepGAN: Towards Personalized Sleep Therapy Music (Nokia) ● Modeling beats and downbeats with a time-frequency Transformer (ByteDance)
  38. 38. Quantum Machine Learning ● Languages: Google Cirq / Microsoft Q# / IBM Qiskit ● Services: Google Quantum AI / Azure Quantum / IBM Quantum ● The dawn of quantum natural language processing ○ We successfully train a quantum-enhanced Long Short-Term Memory network to perform the parts-of-speech tagging task via numerical simulations. ○ Practical applications are more likely to be a hybrid of classical and quantum operations. This hybrid approach is not too different from what has been done in the past decade with GPUs. ○ The main idea behind Quantum Machine Learning (QML) is to replace parts of a neural network (e.g. linear layers) with a quantum counterpart. ● Quantum federated learning with quantum data ○ Hybrid models fall short when dealing with the highly complex purely quantum data. ○ Thus, purely quantum ML models that can address these challenges were developed, such as quantum neural networks (QNNs). ○ However, due to the fragile nature of the carriers of quantum data, i.e., qubits, there is a natural need for distributed learning solutions such as federated learning (FL).
  39. 39. Machine Learning is All You Need ● Audio Representations ○ Learnable Wavelet Packet Transform for Data-Adapted Spectrograms ● Encodings ○ A Low-Parametric Model for Bit-Rate Estimation of VVC Residual Coding ○ Low-Complexity Multi-Model CNN in-Loop Filter for AVS3 ● Digital Signal Processing ○ Learning Structured Sparsity For Time-Frequency Reconstruction ○ Learning Approach For Fast Approximate Matrix Factorizations ● Communication Systems ○ Adaptive Wireless Power Allocation with Graph Neural Networks ○ Deep Joint Source-Channel Coding for Wireless Image Transmission with Adaptive Rate Control ● Beamforming ○ Deep learning for location based beamforming with NLOS channels ○ Phase-Only Reconfigurable Sparse Array Beamforming Using Deep Learning
  40. 40. II. Topics related to our tasks
  41. 41. Multilingualism / Cross-lingualism
  42. 42. Joint Unsupervised and Supervised Training for Multilingual ASR (Google) ● Most existing methods adopt a 2-stage scheme where the self-supervised loss is optimized in the first pretraining stage, and the standard supervised fine-tuning resumes in the second stage. ● In this paper, we propose an end-to-end (E2E) Joint Unsupervised and Supervised Training (JUST) method to combine the supervised loss and the self-supervised contrastive and masked language modeling (MLM) losses. ● Spectrogram + Quantizer (wav2vec 2.0) + RNN-T ● Wins over XLSR-53!
  43. 43. Pseudo-Labeling for Massively Multilingual Speech Recognition (Facebook) ● Prev works (from Facebook, similar authors) ○ Iterative Pseudo-Labeling for Speech Recognition (IPL) → LM + beam search to generate pseudo labels ○ slimIPL: Language-model-free iterative pseudo-labeling → Use self-predictions ● Utilizing unlabeled data is helpful, even with trivial methods.
  44. 44. Multilingual Text-To-Speech Training Using Cross Language Voice Conversion And Self-Supervised Learning Of Speech Representations (Facebook) ● It’s hard to find speakers who have native proficiency in several languages. ● Using HifiGAN-like model to augment data (Synthetic generation of target speaker speaking different language)
  45. 45. A Configurable Multilingual Model is All You Need to Recognize All Languages (Microsoft) ● Configurable multilingual model (CMM) to recognize speech from any combination of languages based on a multi-hot LID vector selected by users ● Language-specific vocabulary strategy (making vocab smaller) ● Language-specific transformer cell (one per language)
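One plausible reading of the multi-hot LID gating, sketched with per-language experts on top of a shared transform; the module layout is an assumption, not Microsoft's architecture.

```python
# Hypothetical "configurable" layer: shared transform plus language experts
# gated by a user-supplied multi-hot language-ID vector (sketch only).
import torch
import torch.nn as nn

class ConfigurableLayer(nn.Module):
    def __init__(self, d_model, num_languages):
        super().__init__()
        self.shared = nn.Linear(d_model, d_model)
        self.lang_experts = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(num_languages)])

    def forward(self, x, lid_multi_hot):
        # lid_multi_hot: (num_languages,) with 1s for the user-selected languages
        out = self.shared(x)
        weights = lid_multi_hot / lid_multi_hot.sum().clamp_min(1e-6)
        for w, expert in zip(weights, self.lang_experts):
            if w > 0:
                out = out + w * expert(x)
        return out
```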
  46. 46. Zero-Shot Cross-Lingual Transfer Using Multi-Stream Encoder and Efficient Speaker Representation (Tencent) ● Extract speaker embedding features that are independent of both content information and language identity. ● Multi-stream = Input text sequences are fed into N-stream text encoders in parallel ● zero-shot cross-lingual transfer strategy = fine-tune also with target-lingual data + language-balanced sampling strategy
  47. 47. Tackling data scarcity in speech translation using zero-shot multilingual machine translation techniques ● To tackle data scarcity, it is useful to make use of ASR and MT data for end-to-end ST models. We explore techniques from zero-shot multilingual text translation and apply them to the speech side. ● Use language tokens & augmentation methods to make the model decide the output language based on the tokens.
  48. 48. Multi-Lingual Multi-Task Speech Emotion Recognition Using wav2vec 2.0 ● Multi-task learning to increase emotion recognition performance ● Additional tasks ○ Gender Prediction (Ge) ○ Language Prediction (La) ○ F0 mean and standard deviation regression task (F0-me, F0-st) ○ Energy mean and standard deviation regression task (En–me, En-st) ○ Voice ratio regression task (Vr)
  49. 49. ADIMA: Abuse Detection In Multilingual Audio ● ADIMA, a novel, linguistically diverse, ethically sourced, expert-annotated and well-balanced multilingual abuse detection audio dataset comprising 11,775 audio samples in 10 Indic languages, spanning 65 hours and spoken by 6,446 unique users.
  50. 50. SERAB: A multi-lingual benchmark for speech emotion recognition ● Speech Emotion Recognition Adaptation Benchmark (SERAB), a framework for evaluating the performance and generalization capacity of different approaches for utterance-level SER.
  51. 51. Keyword Spotting
  52. 52. Why still keyword spotting? ● For many voice-enabled platforms, queries follow a highly Zipfian distribution. On the Comcast X1 entertainment system, for example, the top-20 commands constitute around 30% of the traffic. ● Using an ASR system is excessive for targeting phonetically distinct commands with a small vocabulary. ● Audio-only wake word spotting (WWS), a special case of KWS, is challenging under noisy conditions due to environmental interference.
  53. 53. Temporal early exiting for streaming speech commands recognition (Comcast) ● Add intermediate prediction heads and stop inference mid-way based on prediction entropy.
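A minimal sketch of entropy-based early exiting over streaming frames; the threshold and head interface are assumptions rather than the paper's settings.

```python
# Stop scoring a streaming command once an intermediate head is confident enough.
import torch
import torch.nn.functional as F

def early_exit(frame_logits_stream, entropy_threshold=0.5):
    """frame_logits_stream: iterable of per-frame logits tensors, shape (num_classes,)."""
    decision = None
    for t, logits in enumerate(frame_logits_stream):
        probs = F.softmax(logits, dim=-1)
        entropy = -(probs * probs.clamp_min(1e-9).log()).sum()
        decision = (t, probs.argmax().item())
        if entropy < entropy_threshold:   # confident enough: exit early
            break
    return decision
```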
  54. 54. A Study of Designing Compact Audio-Visual Wake Word Spotting System Based on Iterative Fine-Tuning in Neural Network Pruning ● Audio-visual keyword spotting ● Using both modalities is helpful
  55. 55. Text Adaptive Detection for Customizable Keyword Spotting ● Novel text adaptive detection framework to directly formulate KWS as a detection rather than a classification problem ● Text prompt is used as input, i.e., customizable wake words
  56. 56. Joint Ego-Noise Suppression and Keyword Spotting on Sweeping Robots (Alibaba) ● a novel approach for joint ego-noise (self-created noise) suppression and keyword detection ● Small footprint keyword spotting (KWS) on sweeping robot, i.e., the conversation triggering module of the audio interface ● A circular microphone array of M = 6 → Multiple minimum variance distortionless response (MVDR) beamformers ● If the keyword is present, noise adaptation will be slowed down to prevent keyword speech being cancelled.
  57. 57. Unified Speculation, Detection, and Verification Keyword Spotting (Alexa) ● Speculation → early decision (giving a head start, reduce system latency) ● Detection → keyword trigger task, more accurate decision ● Verification → verifies previous decision (correct mistakes) ● The proposed latency-aware max-pooling loss can control latency accuracy trade-off effectively.
  58. 58. Few-shot / Low-shot
  59. 59. An Adapter Based Pre-Training for Efficient and Scalable Self-Supervised Speech Representation Learning (Huawei) Apply adapters (B) to original w2v2 (A) to combat language forgetting. https://www.notion.so/hpcnt/An-Adapter-Based-Pre-Training-for-Efficient-and-Scalable-Self-Supervised-Speech-Representation-Learn-0046747a578d4899b914e520959e01e8
  60. 60. Efficient Adapter Transfer of Self-Supervised Speech Models for Automatic Speech Recognition (Huawei) ● Fine-tune on ASR task ● Apply adapters
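Both Huawei papers rely on bottleneck adapters; a generic sketch of such a module follows (hidden and bottleneck sizes are assumptions).

```python
# Bottleneck adapter inserted inside a frozen transformer block (generic sketch).
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, d_model=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))   # residual around the bottleneck
```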
  61. 61. Large-scale ASR Domain Adaptation by Self-and Semi-supervised Learning (Google) ● Joint training with both RNN-T & Self-supervised loss (wav2vec 2.0) ● Confidence Estimation Module (CEM) → To filter out low confidence samples in pseudo-labels for Noisy student training ○ binary cross entropy between the estimated confidence p and the binary target sequence c ● It utilizes Wav2vec2.0 loss on the causal encoder, so there is no transition gap from non-causal to causal.
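A sketch of the confidence-estimation idea: train with a binary cross-entropy between the predicted confidence p and a 0/1 correctness target c, then keep only confident pseudo-labeled utterances (shapes and the filtering threshold are assumptions).

```python
# Confidence-estimation loss and pseudo-label filtering (generic sketch).
import torch
import torch.nn.functional as F

def confidence_loss(pred_confidence, correctness):
    """pred_confidence: (B, T) in [0, 1]; correctness: (B, T) binary targets."""
    return F.binary_cross_entropy(pred_confidence, correctness.float())

def keep_for_noisy_student(pred_confidence, threshold=0.9):
    """Keep utterances whose average token confidence clears the threshold."""
    return pred_confidence.mean(dim=1) >= threshold
```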
  62. 62. Learning Domain-Invariant Transformation for Speaker Verification ● Meta-learning to generate domain-invariant embeddings without pre-training and fine-tuning ● Use both metric loss & classification loss together
  63. 63. Magic dust for cross-lingual adaptation of monolingual wav2vec-2.0 ● Monolingual wav2vec-2.0 is a good few-shot ASR learner in several languages ○ English → 8 Target languages ○ Performance up to 86% compared to XLSR ○ ASR Fine-Tuning on English hurts other languages ● A monolingual wav2vec2 model pre-trained on a high-resource language using moderately-sized unlabeled data and small-sized labeled data in the target language yields performance similar to XLSR ● Dropout Uncertainty-Driven Self-Training (DUST) ○ Leverages unlabeled data by pseudo-labeling (semi-supervised) ○ Student from a previous round becomes the teacher for the next round
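A hedged sketch of the dropout-uncertainty filtering idea behind DUST: decode the same utterance several times with dropout left on and keep it as a pseudo-label only if the hypotheses agree. The model.decode() interface, the run count, and the agreement measure are assumptions.

```python
# DUST-style filtering sketch: agreement across dropout-perturbed decodes.
import difflib
import torch

def dust_keep(model, utterance, num_runs=3, min_agreement=0.9):
    model.train()                                  # keep dropout active at inference
    with torch.no_grad():
        hyps = [model.decode(utterance) for _ in range(num_runs)]  # hypothetical decode()
    ref = hyps[0]
    agreement = min(difflib.SequenceMatcher(None, ref, h).ratio() for h in hyps[1:])
    return agreement >= min_agreement              # keep as pseudo-label only if stable
```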
  64. 64. Audio Augmentation
  65. 65. Filteraugment: An acoustic environmental data augmentation method ● FilterAugment mimics acoustic filters by applying different weights to frequency bands, enabling the model to extract relevant information from wider frequency regions. ● An improved version of frequency masking, which masks information on random frequency bands.
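A rough sketch of the idea: split the mel axis into a few random bands and give each band a random gain. The band count and gain range are assumptions, not the paper's exact recipe.

```python
# FilterAugment-flavored band weighting on a log-mel spectrogram (rough sketch).
import torch

def filter_augment(log_mel, n_bands=4, db_range=6.0):
    """log_mel: (n_mels, frames) log-mel spectrogram."""
    n_mels = log_mel.size(0)
    cuts = torch.sort(torch.randint(1, n_mels, (n_bands - 1,))).values
    edges = torch.cat([torch.tensor([0]), cuts, torch.tensor([n_mels])])
    out = log_mel.clone()
    for lo, hi in zip(edges[:-1], edges[1:]):
        gain = (torch.rand(1) * 2 - 1) * db_range   # per-band gain in [-db_range, db_range]
        out[lo:hi] = out[lo:hi] + gain              # additive in the log domain ≈ filtering
    return out
```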
  66. 66. Auditory-Based Data Augmentation for end-to-end Automatic Speech Recognition ● Spectral smearing smooths the speech spectrum and suppresses details by broadening the bandwidths of the auditory filters. ● Loudness recruitment compresses the amplitudes of different frequency bands, simulating a damaged ear.
  67. 67. Intermix: An Interference-Based Data Augmentation and Regularization Technique for Automatic Deep Sound Classification ● Prev work: BC learning ○ Taking sound energy into account ● Prev work: SpeechMix ○ Similar to manifold mixup, mix intermediate representations ● This work: InterMix ○ Also apply phase shifts to inputs & use it when mixing
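As a baseline for these interference-based methods, plain mixup on waveforms and soft labels looks like this (no energy weighting or phase shifting here).

```python
# Plain mixup of two waveforms and their one-hot/soft labels (baseline sketch).
import torch

def mix_waveforms(x1, x2, y1, y2, alpha=0.2):
    lam = torch.distributions.Beta(alpha, alpha).sample()
    x = lam * x1 + (1 - lam) * x2
    y = lam * y1 + (1 - lam) * y2       # soft label mix
    return x, y
```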
  68. 68. Robust Speaker Verification Using Population-Based Data Augmentation ● A population-based searching strategy for optimizing the augmentation parameters. ● Instead of finding a fixed set of hyper-parameters, PBA learns a schedule for setting the hyper-parameters. ● List of augmentations used ○ Reverberation: Convolve with a room impulse response (RIR) ○ Music: A randomly selected music file from MUSAN is added ○ Noise: Noise from MUSAN is added ○ Babble: Babble noise is added ○ Frequency masking ○ Time masking
  69. 69. Various augmentations ● LPC Augment: an LPC-based ASR Data Augmentation Algorithm for Low and Zero-Resource Children's Dialects ○ The data augmentation procedure consists of perturbing the formant peaks of the linear predictive coding (LPC) spectrum during LPC analysis and reconstruction. ○ Compared with SpecAugment & speed perturbation; did not show a clear advantage. ● ImportantAug: A Data Augmentation Agent for Speech ○ Adds noise to unimportant regions of the speech and not to important regions. ○ Importance is predicted for each utterance by a data augmentation agent that is trained to maximize the amount of noise it adds while minimizing its impact on recognition performance. ● Fraug: A Frame Rate Based Data Augmentation Method for Depression Detection from Speech Signals ○ Changes the frame-width and frame-shift parameters during the feature extraction process
  70. 70. Task-specific Augmentations ● Cross-speaker style transfer for text-to-speech using data augmentation ○ Cross-speaker style transfer for TTS using data augmentation via voice conversion ● Spatial mixup: Directional loudness modification as data augmentation for sound event localization and detection ○ application of parametric spatial audio effects for data augmentation, which modifies the directional properties of a multi-channel spatial audio signal encoded in the ambisonics domain. ● Spatial Data Augmentation with Simulated Room Impulse Responses for Sound Event Localization and Detection ○ Augments spatial characteristics using simulated room impulse responses (RIR). simulated RIRs are convolved with the source signals to obtain an augmented multi-channel training dataset. ● Distribution augmentation for low-resource expressive text-to-speech ○ Data augmentation through word permutations & Constituency parse based tree substitutions
  71. 71. Federated Learning
  72. 72. Federated learning challenges and opportunities: An outlook (Amazon Alexa) ● Finding the lower limit of the number of communication rounds ○ Many local updates (for communication efficiency) can still converge to a desirable model. ○ Overly aggressive local updates will harm the performance due to the data heterogeneity ● Constraints ○ Memory constraint (each on-device model needs to be small in size) ○ Computation constraint (devices may perform only a limited number of gradient updates) ● Personalized FL ○ Conventional FL trains one model, personalized FL maintains a collection of client-specific models ○ Will reduce test errors beyond what is possible with a single global model. ● Challenges of Lifelong FL ○ Online updates with single-pass data ○ Coupling of model training and data generation. ● Challenges on data ○ Data polarity (collected data does not represent the whole data distribution) ○ Data dependency (data are collected from time series with inevitable dependency)
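For orientation, the FedAvg aggregation step that most of these papers build on, as a generic sketch (float parameters assumed; not any specific paper's system).

```python
# FedAvg-style aggregation: average client weights, weighted by local data size.
import torch

def fedavg(client_state_dicts, client_num_examples):
    total = float(sum(client_num_examples))
    avg = {}
    for key in client_state_dicts[0]:
        # weighted average of each parameter tensor across clients
        avg[key] = sum(sd[key] * (n / total)
                       for sd, n in zip(client_state_dicts, client_num_examples))
    return avg
```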
  73. 73. Learnings from Federated Learning in the Real world (Alexa) ● Skewness: “heavy devices” with large amounts of data, while there are many “light users” with only a handful of data points. ● Non-uniform device selection, which utilizes the number of input points per device, outperforms uniform sampling. ● We compare one-shot FL (uses the full range of data, single training) with continual FL (avoids storing data, multiple training rounds). Continual FL outperforms the one-shot strategy in some settings, and is overall most beneficial for heavy devices.
  74. 74. Enabling on-device training of speech recognition models with federated dropout (Google) ● Communication/computation costs are strongly correlated with the size of the model being trained. We propose using federated dropout to reduce the size of client models while training a full-size model server-side. ● Furthermore, we find that federated dropout makes smaller sub-models have lower WER, making it easier to dynamically adjust the model size. ● We use a realistic setting for federated training of ASR models, wherein a well-trained server-side model is adapted to a new domain with FL on edge devices.
  75. 75. Federated Self-supervised Learning ● Federated Self-Training for Data-Efficient Audio Recognition (Philips Research) ○ Self-training approach to exploit large-scale on-device unlabeled data to improve the generalization of audio recognition models ○ Generate pseudo labels & train with softened labels ● Federated Self-Supervised Learning for Acoustic Event Classification (Amazon) ○ Applying FL to improve acoustic event classification (AEC) performance while no customer data can be directly uploaded to the server ○ No pseudo labels (Common in AEC) ○ Solve the task of predicting the future audio frame via feature representation
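A sketch of the pseudo-labeling step used in the self-training approach above (the Amazon AEC paper avoids pseudo labels): a teacher labels unlabeled on-device audio, keeping only confident, softened targets. The threshold and temperature are assumptions.

```python
# Confidence-filtered, softened pseudo-labels for self-training (generic sketch).
import torch
import torch.nn.functional as F

@torch.no_grad()
def pseudo_label(teacher, unlabeled_batch, threshold=0.8, temperature=2.0):
    logits = teacher(unlabeled_batch)
    probs = F.softmax(logits, dim=-1)
    confidence, _ = probs.max(dim=-1)
    keep = confidence >= threshold                                  # drop uncertain clips
    soft_targets = F.softmax(logits[keep] / temperature, dim=-1)    # softened labels
    return unlabeled_batch[keep], soft_targets
```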
  76. 76. EOD
