The document summarizes key topics from ICASSP 2022, including general trends in speech and audio processing, self-supervised and contrastive learning approaches, security applications, and topics related to tasks like multilingualism and keyword spotting. Some of the main models and techniques discussed are Wav2vec, HuBERT, contrastive learning using Conformers, intermediate layer supervision in self-supervised learning, and anonymization of speech data for privacy.
9. Towards Learning Universal Audio Representations (DeepMind)
● HARES: a new GLUE-like benchmark for audio representations
● SlowFast (from video) + NFNet (from vision) seems to work well.
  ○ SlowFast: two branches with bigger/smaller kernel widths
  ○ NFNet: Normalizer-Free ResNets
● CPC (contrastive predictive coding) works quite well.
https://www.notion.so/hpcnt/Towards-Learning-Universal-Audio-Representations-ed8774b85de143c097175b3646cd84e1
10. Universal paralinguistic speech representations using self-supervised Conformers (Google)
● Contrastive learning (as in wav2vec 2.0) on Conformers
  ○ Follow-up work already distills this model (TRILLsson)
  ○ https://ai.googleblog.com/2022/03/trillsson-small-universal-speech.html
● Closely follows the previous work BigSSL (Google)
  ○ Trained on speech-heavy YouTube videos
  ○ Their conclusion: SSL + large models are especially helpful for small datasets
● The best performance wasn't from the final layer's feature vector (same conclusion as BigSSL) → CAP12 (the 12th layer's feature outputs)
https://www.notion.so/hpcnt/Universal-paralinguistic-speech-representations-using-self-supervised-conformers-d621d75b95eb4369ab34cc5237603393
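A minimal sketch of the intermediate-layer extraction trick behind CAP12, using HuggingFace's wav2vec 2.0 as a stand-in (the paper's Conformer checkpoints are not assumed to be available here); the model name and the average pooling are illustrative assumptions.

```python
import torch
from transformers import Wav2Vec2Model

# Stand-in model; CAP12 itself comes from a Google Conformer, not wav2vec 2.0.
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

wav = torch.randn(1, 16000)  # one second of dummy 16 kHz audio

with torch.no_grad():
    out = model(wav, output_hidden_states=True)

# hidden_states[0] is the CNN feature projection; entries 1..12 are the
# transformer layers, so index 12 is the 12th transformer layer's output.
layer12 = out.hidden_states[12]            # (batch, frames, dim)
utterance_embedding = layer12.mean(dim=1)  # average-pool over time
```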
11. A Noise-Robust Self-supervised Pre-training Model Based Speech Representation Learning for Automatic Speech Recognition
● Makes the wav2vec feature encoder robust to added noise via a contrastive loss
https://www.notion.so/hpcnt/A-Noise-Robust-Self-supervised-Pre-training-Model-Based-Speech-Representation-Learning-for-Automatic-537ec0ccbd874303840b582db90a3a9d
12. Wav2vec-switch: Contrastive learning from original-noisy speech pairs for robust speech recognition (Microsoft)
● Trains the contextualized representations to be robust to noise by feeding original and noisy versions of the same speech through the wav2vec 2.0 loss
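A minimal InfoNCE-style sketch of the original/noisy pairing idea: each frame of the noisy view must pick out the matching frame of the clean view among distractors. This is an illustrative loss, not the exact wav2vec-switch objective (which swaps the quantized targets of the original and noisy views).

```python
import torch
import torch.nn.functional as F

def contrastive_loss(clean, noisy, temperature=0.1):
    """clean, noisy: (frames, dim) representations of the same utterance."""
    clean = F.normalize(clean, dim=-1)
    noisy = F.normalize(noisy, dim=-1)
    logits = noisy @ clean.T / temperature  # (frames, frames) similarity matrix
    targets = torch.arange(clean.size(0))   # frame i should match frame i
    return F.cross_entropy(logits, targets)

loss = contrastive_loss(torch.randn(50, 256), torch.randn(50, 256))
```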
14. Improving Self-Supervised Learning for Speech Recognition with Intermediate Layer Supervision (Microsoft)
● The common practice in SSL is to compute the self-supervised loss on the top layer, as in wav2vec 2.0 and HuBERT.
● However, the lower layers of such pre-trained models are shown to have a low correlation with phonetic information.
● This work proposes intermediate layer supervision to encourage the lower layers to learn content knowledge → apply exactly the HuBERT objective at the lower layers as well.
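A minimal sketch of the layer-supervision idea, assuming `layer_outputs` is a list of per-layer hidden states and `ssl_loss` is a HuBERT-style masked-prediction loss; both names and the layer choice are placeholders, not the paper's API.

```python
# Apply the same masked-prediction loss at selected lower layers and sum.
def total_loss(layer_outputs, targets, ssl_loss, supervised_layers=(4, 8, 12)):
    # Equal weights here; the paper may weight or subset layers differently.
    return sum(ssl_loss(layer_outputs[l], targets) for l in supervised_layers)
```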
15. Exploring Heterogeneous Characteristics of Layers in ASR Models for More Efficient Training (Google)
● Based on "Are All Layers Created Equal?" (Zhang, Bengio, and Singer)
  ○ Fix intermediate layers' weights to some other values:
  ○ Re-initialization: reset them to their initial values
  ○ Re-randomization: replace them with fresh random weights
● Ambient layers (those that can be reset with little loss) were present at all model sizes, but larger, overparameterized models had more of them.
● During early training rounds the ambient layers were spread throughout the model; only later did the separation become more distinct.
● GroupNorm (GN) was more robust to re-randomization than BatchNorm (BN).
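A sketch of the re-initialization / re-randomization probe on a toy stand-in network: snapshot the initial weights, train, then reset one layer at a time and watch how much the metric drops; layers whose reset barely hurts are the ambient ones. The model and metric are placeholder assumptions, not the paper's ASR setup.

```python
import copy
import torch
import torch.nn as nn

# Toy stand-in network; the paper probes full ASR models.
model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(),
                      nn.Linear(16, 16), nn.ReLU(),
                      nn.Linear(16, 2))

def evaluate(m):
    # Placeholder metric on fixed synthetic data (assumption).
    torch.manual_seed(0)
    x, y = torch.randn(64, 16), torch.randint(0, 2, (64,))
    with torch.no_grad():
        return -nn.functional.cross_entropy(m(x), y).item()

init_state = copy.deepcopy(model.state_dict())  # snapshot before training
# ... training of `model` would happen here ...
baseline = evaluate(model)

for name, module in model.named_modules():
    if not isinstance(module, nn.Linear):
        continue
    backup = copy.deepcopy(module.state_dict())
    # Re-initialization: restore this layer's weights to their initial values.
    module.load_state_dict({k: init_state[f"{name}.{k}"] for k in backup})
    print(name, "re-init drop:", baseline - evaluate(model))
    # Re-randomization: draw fresh random weights instead.
    module.reset_parameters()
    print(name, "re-random drop:", baseline - evaluate(model))
    module.load_state_dict(backup)  # restore the trained weights
```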
16. Investigation of Robustness of HuBERT Features from Different Layers to Domain, Accent and Language Variations
● The experiments indicate that as the domain, accent, bandwidth, and language deviate from the source domain, the relative improvement decreases.
● The last layer of HuBERT is very specific to the dataset it was trained on; the second-to-last layer seems better when there are domain and accent differences.
● Middle layers are better suited when the data is from a different language.
17. Don't speak too fast: The impact of data bias on self-supervised speech models (NTU)
● Uses the SUPERB benchmark while varying the gender, content, and speech speed of the pre-training datasets
● Gender → adding a few minority-class samples mitigates the performance drop
● Content → the model was insensitive to content perplexity
● Speech speed → faster speech hurts performance
19. Speech anonymization (Emmanuel Vincent)
● Speech carries many kinds of information:
  ○ Verbal content (identifiers, private info, etc.)
  ○ Speaker traits (identity, gender, age, ethnic origin, etc.)
  ○ Nonverbal content (emotion, health, etc.)
  ○ Acoustic environment (acoustics, other speakers, etc.)
● Risks: user profiling, user identification, voice cloning, information leakage
● Methods: embedded systems, cryptography, obfuscation, anonymization, federated learning, etc.
  ○ Simple modifications (e.g., pitch shifts, as sketched below) utterly fail against knowledgeable attackers
● The current speech anonymization challenge's notion of anonymization != the legal definition
  ○ Many big companies appear not to anonymize the speech they collect from various sources
  ○ Challenge tasks: (1) ASR, (2) emotion recognition
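For illustration, the naive pitch-shift baseline the talk warns about, as a minimal librosa sketch (the file path is a placeholder). It changes how the voice sounds to a casual listener, but a knowledgeable attacker can largely invert or ignore it.

```python
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)  # placeholder path
# Shift the voice up four semitones: superficially different, easily attacked.
y_anon = librosa.effects.pitch_shift(y, sr=sr, n_steps=4)
```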
20. Preserving Trajectory Privacy in Driving Data Release
● The innovative services provided by intelligent transport systems (ITS) come with potential privacy attacks.
● For example, in traffic monitoring systems, individual users continuously send anonymized personal location traces to aid traffic state estimation.
● However, an adversary may link an anonymous GPS trace to a particular person given additional knowledge of that person's home or workplace.
● This cannot be prevented by data encryption or by hiding the driver's identity; the authors resort to the notion of inference privacy, which sanitizes raw data to limit the amount of private information it contains.
21. Audio Deepfake Detection 2022: the First Audio Deep Synthesis Detection Challenge
● http://addchallenge.cn/
● Low-quality fake audio detection: deals with bona fide vs. fully fake utterances under various real-world noises
  ○ Fully generated utterances
● Partially fake audio detection: distinguish partially fake audio from real audio
  ○ Generated by manipulating genuine utterances
● Audio fake game: solve both an audio generation task and an audio fake detection task
22. AASIST: Audio anti-spoofing using integrated spectro-temporal graph attention networks (Naver)
● Spoofing detection is an important consideration when automatic speaker verification systems are deployed in real-world applications.
● Two major scenarios:
  ○ Logical access (LA): spoofing attacks mounted with voice conversion and TTS
  ○ Physical access (PA): bona fide utterances are captured and then replayed
● Recent studies show that discriminative information (i.e., spoofing artefacts) can reside in specific temporal and spectral intervals.
23. Characterizing the adversarial vulnerability of speech self-supervised learning (Helen Meng)
● Speech processing Universal PERformance Benchmark (SUPERB)
  ○ Upstream model (self-supervised) + downstream models that consume its features directly (e.g., via fine-tuning)
● Adversarial attacks
  ○ Limited-knowledge adversaries: attackers can access the internals of the target model (parameters and gradients) but do not know which downstream task will be conducted.
  ○ Zero-knowledge adversaries: the target model is unavailable to the attackers, so a substitute model is used to approximate gradients for adversarial sample generation.
  ○ XAB listening test: checks whether humans can distinguish adversarial samples
● Results: the attacks are effective, and humans cannot easily tell the adversarial samples apart.
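A minimal FGSM-style sketch of how such adversarial samples are generated: take one signed-gradient step on the waveform in the direction that increases the downstream loss, small enough to be hard to hear. `model` and `loss_fn` stand in for the SUPERB upstream-plus-downstream stack and are assumptions here.

```python
import torch

def fgsm_attack(model, loss_fn, wav, label, epsilon=1e-3):
    wav = wav.clone().detach().requires_grad_(True)
    loss = loss_fn(model(wav), label)
    loss.backward()
    # One signed-gradient step; imperceptibly small yet often effective.
    return (wav + epsilon * wav.grad.sign()).detach()
```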
24. Adversarial Sample Detection for Speaker Verification by Neural Vocoders (Tencent)
● Automatic speaker verification (ASV), one of the most important technologies for biometric identification, has been widely adopted in security-critical applications.
● However, ASV is seriously vulnerable to recently emerged adversarial attacks, and effective countermeasures against them are limited.
25. Source Mixing and Separation Robust Audio Steganography (Sony)
● Audio steganography is the science of concealing secret messages inside a host audio signal (the carrier) in such a way that the concealment is unnoticeable to human ears.
● Recently, deep neural networks (DNNs) have been used as steganographic functions for hiding data inside images, achieving high capacity.
● The network learns to conceal a hidden message inside the carrier without manually specifying a particular redundancy to exploit.
● Related: PixInWav: Residual Steganography for Hiding Pixels in Audio
26. Exploiting language model for efficient linguistic steganalysis
● Linguistic steganography (LS)
  ○ Natural language is actually quite suitable for steganography: LS can be easily concealed within the huge volume of social activity.
  ○ Two families: (1) modification-based and (2) generation-based; the latter allows more data to be embedded.
● Steganalysis = detecting whether secret data is embedded in the media
● There is a significant difference between automatically generated stego texts and carrier texts in terms of the conditional probability distribution of individual words.
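A minimal sketch of that cue: score each token's conditional log-probability under a language model and feed the profile to a classifier. GPT-2 from HuggingFace is used as the scoring LM here, which is an assumption; the paper's setup differs in detail.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def token_logprobs(text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits
    logp = torch.log_softmax(logits[:, :-1], dim=-1)  # predict the next token
    return logp.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)

# Generated stego text tends to show a flatter per-token probability profile
# than natural carrier text; these values serve as steganalysis features.
feats = token_logprobs("The quick brown fox jumps over the lazy dog.")
```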
28. Acoustic Echo Cancellation
● Acoustic echo is the phenomenon that occurs when a microphone picks up the far-end signal played by a loudspeaker.
● It can cause anything from slight annoyance to a significant breakdown of a communication system.
● ICASSP 2022 AEC Challenge by Microsoft
● Various scenarios:
  ○ Long or varying delays
  ○ Strong speaker/mic distortions
  ○ Stationary/non-stationary noise
  ○ Glitches (due to high CPU usage)
  ○ etc.
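A minimal sketch of the classic baseline behind AEC, an NLMS adaptive filter: estimate the echo path from the far-end signal and subtract the predicted echo from the microphone signal. Challenge-grade systems are deep models; this only illustrates the problem setup.

```python
import numpy as np

def nlms_aec(far, mic, taps=256, mu=0.5, eps=1e-8):
    """far, mic: equal-length 1-D arrays of far-end and microphone samples."""
    w = np.zeros(taps)        # echo-path estimate
    out = np.zeros(len(mic))  # echo-cancelled output
    for n in range(taps, len(mic)):
        x = far[n - taps:n][::-1]        # most recent far-end samples
        e = mic[n] - w @ x               # residual = near end + estimation error
        w += mu * e * x / (x @ x + eps)  # normalized LMS update
        out[n] = e
    return out
```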
29. Deep Noise Suppression
● Audio calls in the presence of background noise are significantly degraded in terms of the quality and intelligibility of the perceived speech.
● ICASSP 2022 Deep Noise Suppression Challenge by Microsoft
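A minimal spectral-subtraction sketch of the problem DNS models solve: estimate the noise floor from the first few frames and gate the spectrogram. Deep suppressors replace this hand-crafted rule with a learned mask; the parameters here are illustrative assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtract(noisy, sr=16000, noise_frames=10):
    _, _, Z = stft(noisy, fs=sr, nperseg=512)
    mag, phase = np.abs(Z), np.angle(Z)
    # Assume the first frames are noise-only and use them as the noise floor.
    floor = mag[:, :noise_frames].mean(axis=1, keepdims=True)
    clean_mag = np.maximum(mag - floor, 0.0)
    _, out = istft(clean_mag * np.exp(1j * phase), fs=sr, nperseg=512)
    return out
```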
30. Multi-Channel Multi-Party Meeting Transcription
● Speaker diarization
  ○ Partitioning an input audio stream into homogeneous segments according to speaker identity, i.e., "who spoke when?"
● Multi-speaker ASR
  ○ Overlapped speech recognition is hard due to interfering speakers and background noise
● ICASSP 2022 M2MeT Challenge by Alibaba
31. VarArray: Array-geometry-agnostic continuous speech separation (Microsoft)
● Continuous speech separation using a microphone array has been shown to be promising for dealing with the speech overlap problem.
● The signals depend heavily on the positions of the microphones.
● In meetings, we can assume that only two or fewer speakers are active for the majority of the meeting time.
32. Multimodal Systems
● Audio-Visual Object Classification For Human-Robot Collaboration
● Multimodal Information Based Speech Processing
● Machine Translation for Spoken and Written Language
● Image and Video Understanding
● Multimodal Signal Processing, Analysis, and Synthesis
● Audio Security and Multi-Modal Systems
● Multi-modal Analysis and Synthesis
● Multimodal Data Fusion and Processing
● Multimodal Analysis in Audio Applications
34. Emotion Recognition
● Speech emotion recognition using self-supervised features
  ○ A modular end-to-end SER system based on an upstream + downstream architecture paradigm, which allows easy use/integration of a large variety of self-supervised features.
● MEmoBERT: Pre-training model with prompt-based learning for multimodal emotion recognition
  ○ Learns multimodal joint representations through self-supervised learning
  ○ A prompt-based method that reformulates emotion classification as masked-text prediction
● Multimodal Emotion Recognition with Surgical and Fabric Masks
  ○ Investigates how muffled speech and occluded facial expressions change the prediction of emotions
35. Speech as a Disease Biomarker
● FrAUG: A Frame Rate Based Data Augmentation Method for Depression Detection from Speech Signals (a rough augmentation sketch follows this list)
  ○ Among other signals, speech is an important biomarker of our mental state and can be collected remotely and non-invasively with no expert supervision.
  ○ Recently, speech-based automatic diagnosis of depression has gained significant momentum.
● Exploring Dementia Detection from Speech: Cross Corpus Analysis
  ○ Population aging is increasing the number of new Alzheimer's disease (AD) cases, creating the need for scalable, cost-effective methods that can detect early-stage AD.
  ○ Speech and language biomarkers are strong indicators of dementia and provide a low-cost, widely applicable alternative for assessing cognitive state.
● The Second DiCOVA Challenge: Dataset and Performance Analysis for Diagnosis of COVID-19 Using Acoustics
  ○ A dataset of audio recordings consisting of breathing, cough, and speech signals
  ○ Aims to provide a point-of-care, rapid, easy-to-use, and cost-effective tool to help contain the spread of COVID-19.
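A rough illustration of frame-rate-based augmentation in the spirit of FrAUG: extract features from the same recording with several frame shifts (hop lengths), yielding multiple feature-rate views. The actual FrAUG recipe differs in detail, and the file path is a placeholder.

```python
import librosa

y, sr = librosa.load("speech.wav", sr=16000)  # placeholder path
views = [
    librosa.feature.melspectrogram(y=y, sr=sr, hop_length=hop, n_mels=64)
    for hop in (128, 160, 200, 256)  # roughly 8-16 ms frame shifts
]
```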
36. Voice Conversion
● Robust disentangled variational speech representation learning for zero-shot voice conversion (Tencent)
  ○ Feeds an arbitrary speaker embedding together with the content embeddings to the VAE decoder (see the decoder sketch after this list)
● Controllable Speech Representation Learning via Voice Conversion and AIC Loss (Adobe)
  ○ Its disentangled components (content, pitch, speaker identity, and energy) can be controlled independently to alter the synthesis result.
● An Investigation of Streaming Non-Autoregressive Sequence-to-Sequence Voice Conversion
● Voice Filter: Few-shot text-to-speech speaker adaptation using voice conversion as a post-processing module (Amazon)
  ○ Uses voice conversion (VC) as a post-processing module appended to a pre-existing high-quality TTS system, framing the few-shot TTS problem as a VC task.
37. Music Applications
● HiFi-SVC: Fast High Fidelity Cross-Domain Singing Voice Conversion
● Music Enhancement via Image Translation and Vocoding (Adobe)
● Source Separation By Steering Pretrained Music Models
● MELONS: generating melody with long-term structure using transformers and
structure graph
● Genre-Conditioned Long-Term 3D Dance Generation Driven by Music
● Deep Performer: Score-to-Audio Music Performance Synthesis (Dolby)
● SleepGAN: Towards Personalized Sleep Therapy Music (Nokia)
● Modeling beats and downbeats with a time-frequency Transformer (ByteDance)
38. Quantum Machine Learning
● Languages: Google Cirq / Microsoft Q# / IBM Qiskit
● Services: Google Quantum AI / Azure Quantum / IBM Quantum
● The dawn of quantum natural language processing
○ We successfully train a quantum-enhanced Long Short-Term Memory network to perform the
part-of-speech tagging task via numerical simulations.
○ Practical applications are more likely to be a hybrid of classical and quantum operations. This
hybrid approach is not too different from what has been done in the past decade with GPUs.
○ The main idea behind Quantum Machine Learning (QML) is to replace parts of a neural network
(e.g. linear layers) with a quantum counterpart.
● Quantum federated learning with quantum data
○ Hybrid models fall short when dealing with highly complex, purely quantum data.
○ Thus, purely quantum ML models that can address these challenges were developed, such as
quantum neural networks (QNNs).
○ However, due to the fragile nature of the carriers of quantum data, i.e., qubits, there is a natural
need for distributed learning solutions such as federated learning (FL).
39. Machine Learning is All You Need
● Audio Representations
○ Learnable Wavelet Packet Transform for Data-Adapted Spectrograms
● Encodings
○ A Low-Parametric Model for Bit-Rate Estimation of VVC Residual Coding
○ Low-Complexity Multi-Model CNN in-Loop Filter for AVS3
● Digital Signal Processing
○ Learning Structured Sparsity For Time-Frequency Reconstruction
○ Learning Approach For Fast Approximate Matrix Factorizations
● Communication Systems
○ Adaptive Wireless Power Allocation with Graph Neural Networks
○ Deep Joint Source-Channel Coding for Wireless Image Transmission with Adaptive Rate Control
● Beamforming
○ Deep learning for location based beamforming with NLOS channels
○ Phase-Only Reconfigurable Sparse Array Beamforming Using Deep Learning
42. Joint Unsupervised and Supervised Training for Multilingual
ASR (Google)
● Most existing methods adopt a 2-stage scheme where
the self-supervised loss is optimized in the first
pretraining stage, and the standard supervised
fine-tuning resumes in the second stage.
● In this paper, we propose an end-to-end (E2E) Joint
Unsupervised and Supervised Training (JUST) method
to combine the supervised loss and the
self-supervised contrastive and masked language
modeling (MLM) losses.
● Spectrogram + Quantizer (wav2vec 2.0) + RNN-T
● Wins over XLSR-53!
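As a rough illustration of the JUST objective, here is a minimal sketch of combining the supervised RNN-T loss with the self-supervised contrastive and MLM losses in a single training step; the weight and function names are my own assumptions, not taken from the paper:

```python
# Sketch of a JUST-style joint objective: one end-to-end loss that sums the
# supervised RNN-T loss and the self-supervised terms, instead of a 2-stage
# pretrain-then-finetune scheme. The weight w_ssl is illustrative.
import torch

def just_loss(rnnt_loss: torch.Tensor,
              contrastive_loss: torch.Tensor,
              mlm_loss: torch.Tensor,
              w_ssl: float = 0.1) -> torch.Tensor:
    # All three losses come from the same batch and are backpropagated
    # together, so the encoder is never trained on the SSL losses alone.
    return rnnt_loss + w_ssl * (contrastive_loss + mlm_loss)
```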
43. Pseudo-Labeling for Massively Multilingual Speech
Recognition (Facebook)
● Previous works (from Facebook, similar authors)
○ Iterative Pseudo-Labeling for Speech Recognition
(IPL) → LM + beam search to generate pseudo
labels
○ slimIPL: Language-model-free iterative
pseudo-labeling → uses the model's own predictions as labels
● Utilizing unlabeled data is helpful, even
with trivial methods (see the sketch below).
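A minimal sketch of the iterative pseudo-labeling loop in the slimIPL spirit; the function names are placeholders, and details such as the hypothesis cache and dropout handling are omitted:

```python
# slimIPL-style loop: the model's own greedy predictions (no LM, no beam
# search) on unlabeled audio become the targets for the next round.
from typing import Callable, List, Tuple

def iterative_pseudo_labeling(
    predict: Callable[[object], str],                   # greedy decode
    train: Callable[[List[Tuple[object, str]]], None],  # one training round
    labeled: List[Tuple[object, str]],
    unlabeled: List[object],
    rounds: int = 3,
) -> None:
    for _ in range(rounds):
        # Re-label the unlabeled pool with the current model (the "teacher").
        pseudo = [(x, predict(x)) for x in unlabeled]
        # The student trained here becomes the teacher of the next round.
        train(labeled + pseudo)
```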
44. Multilingual Text-To-Speech Training Using Cross Language
Voice Conversion And Self-Supervised Learning Of Speech
Representations (Facebook)
● It’s hard to find speakers who have
native proficiency in several
languages.
● A HiFi-GAN-like model is used to
augment data (synthetic generation
of the target speaker speaking a
different language)
45. A Configurable Multilingual Model is All You Need to
Recognize All Languages (Microsoft)
● Configurable multilingual model
(CMM) to recognize speech from
any combination of languages
based on a multi-hot LID vector
selected by users
● Language-specific vocabulary
strategy (making vocab smaller)
● Language-specific transformer
cell (one per language)
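A toy sketch of the configurable idea; the shapes, gating, and normalization here are my assumptions, not the paper's exact design. Language-specific cells are activated according to a user-selected multi-hot LID vector, e.g. `lid = torch.tensor([1., 0., 1.])` selects languages 0 and 2:

```python
# Language-specific cells gated by a user-provided multi-hot LID vector.
import torch
import torch.nn as nn

class ConfigurableLayer(nn.Module):
    def __init__(self, dim: int, num_langs: int):
        super().__init__()
        self.shared = nn.Linear(dim, dim)
        self.lang_cells = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_langs)])

    def forward(self, x: torch.Tensor, lid: torch.Tensor) -> torch.Tensor:
        # lid: multi-hot vector of shape (num_langs,) chosen by the user
        out = self.shared(x)
        weights = lid / lid.sum()        # normalize over selected languages
        for i, cell in enumerate(self.lang_cells):
            if lid[i] > 0:               # only selected languages contribute
                out = out + weights[i] * cell(x)
        return out
```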
46. Zero-Shot Cross-Lingual Transfer Using Multi-Stream
Encoder and Efficient Speaker Representation (Tencent)
● Extract speaker embedding features that are
independent of both content information and
language identity.
● Multi-stream = Input text sequences are fed
into N-stream text encoders in parallel
● Zero-shot cross-lingual transfer strategy =
also fine-tune with target-language data +
a language-balanced sampling strategy
47. Tackling data scarcity in speech translation using zero-shot
multilingual machine translation techniques
● To tackle data scarcity, it is useful to make use of ASR and MT data for
end-to-end ST models. We explore techniques from zero-shot multilingual text
translation and apply them to the speech side.
● Use tokens & augmentation
methods to make the model
decide output language based
on language tokens.
48. Multi-Lingual Multi-Task Speech Emotion Recognition Using
wav2vec 2.0
● Multi-task learning to increase emotion recognition performance
● Additional tasks
○ Gender Prediction (Ge)
○ Language Prediction (La)
○ F0 mean and standard deviation regression task (F0-me, F0-st)
○ Energy mean and standard deviation regression task (En-me, En-st)
○ Voice ratio regression task (Vr)
49. ADIMA: Abuse Detection In Multilingual Audio
● ADIMA, a novel, linguistically diverse, ethically sourced, expert-annotated and
well-balanced multilingual abuse detection audio dataset comprising 11,775
audio samples in 10 Indic languages, spanning 65 hours and spoken by 6,446
unique users.
50. SERAB: A multi-lingual benchmark for speech emotion
recognition
● Speech Emotion Recognition Adaptation Benchmark (SERAB), a framework for
evaluating the performance and generalization capacity of different approaches
for utterance-level SER.
52. Why still keyword spotting?
● For many voice-enabled platforms, queries follow a highly Zipfian distribution.
On the Comcast X1 entertainment system, for example, the top-20 commands
constitute around 30% of the traffic.
● Using an ASR system is excessive for targeting phonetically distinct commands
with a small vocabulary.
● Audio-only wake word spotting (WWS), a special case of KWS, is
challenging under noisy conditions due to environmental interference.
53. Temporal early exiting for streaming speech commands
recognition (Comcast)
● Adds intermediate prediction heads and stops inference mid-way once the
prediction entropy is low enough (see the sketch below).
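A minimal sketch of the entropy-based exit decision; the threshold value and shapes are illustrative, not from the paper:

```python
# Early exiting for streaming KWS: an intermediate head produces logits at
# each step; inference stops once the predictive entropy is low enough.
import torch

def should_exit(logits: torch.Tensor, threshold: float = 0.5) -> bool:
    # logits: (num_classes,) from an intermediate prediction head
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum()
    return bool(entropy < threshold)   # confident enough to commit early
```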
54. A Study of Designing Compact Audio-Visual Wake Word
Spotting System Based on Iterative Fine-Tuning in Neural
Network Pruning
● Audio-visual keyword spotting
● Using both modalities is helpful
55. Text Adaptive Detection for Customizable Keyword Spotting
● Novel text adaptive detection
framework to directly formulate
KWS as a detection rather than a
classification problem
● A text prompt is used as input, i.e.,
wake words are customizable
56. Joint Ego-Noise Suppression and Keyword Spotting on
Sweeping Robots (Alibaba)
● A novel approach for joint ego-noise (self-created noise) suppression and
keyword detection
● Small-footprint keyword spotting (KWS) on a sweeping robot, i.e., the
conversation-triggering module of the audio interface
● A circular microphone array with M = 6 elements → multiple minimum variance
distortionless response (MVDR) beamformers
● If the keyword is present, noise adaptation is slowed down to prevent the
keyword speech from being cancelled.
57. Unified Speculation, Detection, and Verification Keyword
Spotting (Alexa)
● Speculation → early decision (gives a head start, reducing system latency)
● Detection → keyword trigger task, a more accurate decision
● Verification → verifies the previous decision (corrects mistakes)
● The proposed latency-aware max-pooling loss can control the latency-accuracy
trade-off effectively (a rough sketch follows).
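A rough sketch of a latency-aware max-pooling loss under my own assumptions: frame-level binary posteriors, with the latency constraint imposed by restricting the pooling window; this is not the paper's exact formulation:

```python
# Max-pooling KWS loss: apply cross-entropy at the frame with the highest
# keyword posterior; restricting pooling to the first `max_latency` frames
# pushes the trigger decision earlier.
from typing import Optional
import torch
import torch.nn.functional as F

def max_pooling_loss(logits: torch.Tensor, is_keyword: bool,
                     max_latency: Optional[int] = None) -> torch.Tensor:
    # logits: (time, 2), class 1 = keyword
    window = logits if max_latency is None else logits[:max_latency]
    t = int(torch.argmax(torch.softmax(window, dim=-1)[:, 1]))
    target = torch.tensor([1 if is_keyword else 0])
    # For negative clips this penalizes the most keyword-like frame.
    return F.cross_entropy(window[t].unsqueeze(0), target)
```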
59. An Adapter Based Pre-Training for Efficient and Scalable
Self-Supervised Speech Representation Learning (Huawei)
Apply adapters (B) to original w2v2 (A) to combat language forgetting.
https://www.notion.so/hpcnt/An-Adapter-Based-Pre-Training-for-Efficient-and-Scalable-Self-Supervised-Speech-Representation-Learn-0046747a578d4899b914e520959e01e8
60. Efficient Adapter Transfer of Self-Supervised Speech
Models for Automatic Speech Recognition (Huawei)
● Fine-tune on ASR task
● Apply adapters
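Both papers insert small bottleneck modules into a (mostly frozen) wav2vec 2.0 backbone. A generic residual adapter sketch in the Houlsby style; the dimensions and placement are assumptions, not the papers' exact configuration:

```python
# Residual bottleneck adapter: only these few parameters are trained per
# language/task, which combats forgetting in the frozen backbone.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The residual connection preserves the pretrained features.
        return x + self.up(torch.relu(self.down(x)))
```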
61. Large-scale ASR Domain Adaptation by Self-and
Semi-supervised Learning (Google)
● Joint training with both the RNN-T loss &
a self-supervised loss (wav2vec 2.0)
● Confidence Estimation Module (CEM)
→ filters out low-confidence samples in
pseudo-labels for noisy student training (see the sketch below)
○ Binary cross-entropy between the estimated
confidence p and the binary target sequence c
● It applies the wav2vec 2.0 loss on the causal
encoder, so there is no transition gap from
non-causal to causal.
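A minimal sketch of the confidence-based filtering step before noisy student training; the cutoff value and the tuple layout are illustrative, not from the paper:

```python
# Keep only pseudo-labeled utterances whose CEM-estimated confidence
# exceeds a threshold before they enter noisy student training.
from typing import List, Tuple

def filter_pseudo_labels(
    batch: List[Tuple[object, str, float]],   # (audio, hypothesis, confidence)
    min_confidence: float = 0.9,
) -> List[Tuple[object, str]]:
    return [(x, y) for x, y, c in batch if c >= min_confidence]
```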
62. Learning Domain-Invariant Transformation for Speaker
Verification
● Meta-learning to generate domain-invariant embeddings without pre-training
and fine-tuning
● Use both metric loss & classification loss together
63. Magic dust for cross-lingual adaptation of monolingual
wav2vec-2.0
● Monolingual wav2vec 2.0 is a good few-shot ASR learner in several languages
○ English → 8 target languages
○ Up to 86% of XLSR's performance
○ ASR fine-tuning on English hurts other languages
● A monolingual wav2vec 2.0 model pre-trained on a high-resource language, using
moderately sized unlabeled data and small-sized labeled data in the target
language, yields performance similar to XLSR
● Dropout Uncertainty-Driven Self-Training (DUST)
○ Leverages unlabeled data by pseudo-labeling (semi-supervised)
○ Student from a previous round becomes the teacher for the next round
65. Filteraugment: An acoustic environmental data
augmentation method
● FilterAugment mimics acoustic filters by applying different weights to
frequency bands, thereby enabling the model to extract relevant information from
a wider frequency region (see the sketch below).
● It is an improved version of frequency
masking, which masks information
in random frequency bands.
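A toy version of the idea, where the band count and gain range are my own choices: split the frequency axis into random bands and apply a random per-band gain in the dB domain:

```python
# FilterAugment-style transform: random band boundaries, random per-band
# gain, mimicking the effect of different acoustic filters.
import torch

def filter_augment(spec: torch.Tensor, n_bands: int = 4,
                   max_gain_db: float = 6.0) -> torch.Tensor:
    # spec: (freq, time) log-mel spectrogram in dB
    freq = spec.shape[0]
    edges = torch.sort(torch.randint(1, freq, (n_bands - 1,))).values
    edges = torch.cat([torch.tensor([0]), edges, torch.tensor([freq])])
    out = spec.clone()
    for lo, hi in zip(edges[:-1].tolist(), edges[1:].tolist()):
        gain = (torch.rand(1).item() * 2 - 1) * max_gain_db  # [-max, +max] dB
        out[lo:hi] += gain      # additive offset per band in the dB domain
    return out
```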
66. Auditory-Based Data Augmentation for end-to-end
Automatic Speech Recognition
● Spectral smearing smooths the
speech spectrum and suppresses
details by broadening the
bandwidths of the auditory filters.
● Loudness recruitment compresses
the amplitudes of different frequency
bands, simulating a damaged ear.
67. Intermix: An Interference-Based Data Augmentation and
Regularization Technique for Automatic Deep Sound
Classification
● Previous work: BC learning
○ Takes sound energy into account
● Previous work: SpeechMix
○ Similar to manifold mixup,
mixes intermediate representations
● This work: InterMix
○ Also applies phase shifts to inputs
& uses them when mixing (see the sketch below)
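A rough sketch of interference-based mixing; a circular time shift is used here as a crude stand-in for the paper's phase shift, and the fixed mixing weight is illustrative:

```python
# Shift one waveform before mixing so the two signals interfere rather
# than simply average.
import torch

def intermix(x1: torch.Tensor, x2: torch.Tensor, lam: float = 0.5) -> torch.Tensor:
    # x1, x2: (samples,) waveforms of equal length
    shift = int(torch.randint(0, x2.shape[-1], (1,)))
    return lam * x1 + (1 - lam) * torch.roll(x2, shifts=shift, dims=-1)
```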
68. Robust Speaker Verification Using Population-Based Data
Augmentation
● A population-based search strategy for optimizing the augmentation
parameters.
● Instead of finding a fixed set of hyper-parameters, PBA learns a schedule for
setting the hyper-parameters.
● Augmentations used
○ Reverberation: convolve with a room impulse response (RIR)
○ Music: a randomly selected music file from MUSAN is added
○ Noise: noise from MUSAN is added
○ Babble: babble noise is added
○ Frequency masking
○ Time masking
69. Various augmentations
● LPC Augment: an LPC-based ASR Data Augmentation Algorithm for Low and
Zero-Resource Children's Dialects
○ The data augmentation procedure consists of perturbing the formant peaks of the Linear
predictive coding (LPC) spectrum during LPC analysis and reconstruction.
○ Compared with SpecAugment & speed perturbation; did not show a clear advantage.
● ImportantAug: A Data Augmentation Agent for Speech
○ Adds noise to unimportant regions of the speech and not to important regions
(see the sketch after this list).
○ Importance is predicted for each utterance by a data augmentation agent that is trained to
maximize the amount of noise it adds while minimizing its impact on recognition performance.
● Fraug: A Frame Rate Based Data Augmentation Method for Depression
Detection from Speech Signals
○ Changing the frame-width and the frame-shift parameters during the feature extraction process
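A minimal sketch of the ImportantAug idea under my own simplifications: in the paper the importance mask comes from a trained agent, whereas here it is simply given as input:

```python
# Add noise only where the importance mask says the speech is unimportant.
import torch

def important_aug(speech: torch.Tensor, noise: torch.Tensor,
                  importance: torch.Tensor, noise_gain: float = 0.5) -> torch.Tensor:
    # speech, noise, importance: (samples,), importance values in [0, 1]
    return speech + noise_gain * (1.0 - importance) * noise
```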
70. Task-specific Augmentations
● Cross-speaker style transfer for text-to-speech using data augmentation
○ Cross-speaker style transfer for TTS using data augmentation via voice conversion
● Spatial mixup: Directional loudness modification as data augmentation for
sound event localization and detection
○ application of parametric spatial audio effects for data augmentation, which modifies the
directional properties of a multi-channel spatial audio signal encoded in the ambisonics domain.
● Spatial Data Augmentation with Simulated Room Impulse Responses for Sound
Event Localization and Detection
○ Augments spatial characteristics using simulated room impulse responses (RIRs). Simulated RIRs
are convolved with the source signals to obtain an augmented multi-channel training dataset
(see the sketch after this list).
● Distribution augmentation for low-resource expressive text-to-speech
○ Data augmentation through word permutations & constituency-parse-based tree substitutions
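A minimal sketch of the RIR-based spatial augmentation step; the array shapes and the use of scipy are my assumptions:

```python
# Convolve a dry source signal with one simulated RIR per channel to
# synthesize a multi-channel training example.
import numpy as np
from scipy.signal import fftconvolve

def apply_rir(source: np.ndarray, rirs: np.ndarray) -> np.ndarray:
    # source: (samples,) dry signal; rirs: (channels, rir_len)
    return np.stack([fftconvolve(source, rir)[: source.shape[0]] for rir in rirs])
```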
72. Federated learning challenges and opportunities: An
outlook (Amazon Alexa)
● Finding the lower limit of the number of communication rounds
○ Many local updates (for communication efficiency) can still converge to a desirable model.
○ Overly aggressive local updates will harm performance due to data heterogeneity.
● Constraints
○ Memory constraint (each on-device model needs to be small in size)
○ Computation constraint (devices may perform only a limited number of gradient updates)
● Personalized FL
○ Conventional FL trains one model; personalized FL maintains a collection of client-specific
models
○ This can reduce test errors beyond what is possible with a single global model.
● Challenges of Lifelong FL
○ Online updates with single-pass data
○ Coupling of model training and data generation.
● Challenges on data
○ Data polarity (collected data does not represent the whole data distribution)
○ Data dependency (data are collected from time series with inevitable dependency)
73. Learnings from Federated Learning in the Real world (Alexa)
● Skewness: there are “heavy devices” with large amounts of data, while many
“light users” have only a handful of data points.
● Non-uniform device selection, which utilizes the number of data points per
device, outperforms uniform sampling in FL.
● We compare one-shot FL (uses the full range of data, a single training run) with
continual FL (avoids storing data, multiple training rounds), and show that continual FL
outperforms the one-shot strategy in some settings and is overall most
beneficial for heavy devices.
74. Enabling on-device training of speech recognition models
with federated dropout (Google)
● Communication/computation costs are strongly correlated with the size of the
model being trained. We propose using federated dropout to reduce the size of
client models while training a full-size model server-side.
● Furthermore, we find that federated dropout enables
smaller sub-models to achieve lower WER, making it
easier to dynamically adjust the model size (see the sketch below).
● We use a realistic setting for federated training
of ASR models, wherein a well-trained server-side
model is adapted to a new domain with FL on
edge devices.
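A toy illustration of the federated-dropout mechanism on a single weight matrix; the real system operates on full ASR models, and the per-layer unit selection here is my simplification:

```python
# The server extracts a smaller sub-matrix of each weight matrix via random
# unit selection, ships it to the client, and writes the update back.
from typing import Optional, Tuple
import numpy as np

def make_submodel(w_full: np.ndarray, keep: float = 0.5,
                  rng: Optional[np.random.Generator] = None):
    rng = rng or np.random.default_rng()
    rows = rng.choice(w_full.shape[0], int(w_full.shape[0] * keep), replace=False)
    cols = rng.choice(w_full.shape[1], int(w_full.shape[1] * keep), replace=False)
    return w_full[np.ix_(rows, cols)], (rows, cols)

def merge_update(w_full: np.ndarray, w_sub: np.ndarray,
                 idx: Tuple[np.ndarray, np.ndarray]) -> np.ndarray:
    rows, cols = idx
    w_full[np.ix_(rows, cols)] = w_sub   # write the client's update back
    return w_full
```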
75. Federated Self-supervised Learning
● Federated Self-Training for Data-Efficient Audio Recognition (Philips Research)
○ Self-training approach to exploit large-scale on-device unlabeled data to improve the
generalization of audio recognition models
○ Generate pseudo labels & train with softened labels
● Federated Self-Supervised Learning for Acoustic Event Classification (Amazon)
○ Applying FL to improve acoustic event classification (AEC) performance while no customer data
can be directly uploaded to the server
○ No pseudo labels (which are common in AEC)
○ Instead, solves the pretext task of predicting the future
audio frame from the feature representation (see the sketch below)
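A minimal sketch of the future-frame pretext task; the prediction horizon, loss, and head are illustrative choices, not the paper's exact setup:

```python
# Predict the representation k frames ahead from the current frame and
# train with an MSE loss; no labels are needed on the device.
import torch
import torch.nn as nn

class FuturePredictor(nn.Module):
    def __init__(self, dim: int, k: int = 3):
        super().__init__()
        self.k = k
        self.head = nn.Linear(dim, dim)

    def loss(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (time, dim) encoder outputs for one utterance
        pred = self.head(feats[: -self.k])
        target = feats[self.k :].detach()   # stop-gradient on the target
        return torch.mean((pred - target) ** 2)
```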