Intro to Auto Speech Recognition -- How ML Learns Speech-to-Text (Yoshiyuki Igarashi)
Automatic speech recognition (ASR) is the technology that converts speech to written text. There are two main approaches: statistical pipeline systems that apply acoustic, pronunciation, and language models sequentially; and end-to-end systems that use deep neural networks for feature extraction, acoustic modeling, and language modeling. Challenges for ASR systems include noise, variation in accents and speaker ages, transferring learning across dialects, and operating locally on devices without an internet connection.
- Text-to-speech software was first created in Japan in the late 1960s and came to America in 1973 when Bell Atlantic developed their own version. It allowed printed text to be read aloud.
- Since the 1980s, development has focused on expanding what can be read and making the voices sound more natural. In recent years, the goal has been to make the voices as human-like as possible.
- Text-to-speech software can be used to read any text on a computer for students with disabilities or language barriers as an inclusion tool. It allows for adjustment of reading speed and voice used.
This document discusses natural language processing (NLP) and feature extraction. It explains that NLP can be used for applications like search, translation, and question answering. The document then discusses extracting features from text like paragraphs, sentences, words, parts of speech, entities, sentiment, topics, and assertions. Specific features discussed in more detail include frequency, relationships between words, language features, supervised machine learning, classifiers, encoding words, word vectors, and parse trees. Tools mentioned for NLP include Google Cloud NLP, Spacy, OpenNLP, and Stanford Core NLP.
Active learning is a process that guides the sampling of training data by querying for certain types of instances based on previously seen data. It is important because labeling training data is time-consuming and expensive, so using active learning can reduce the amount of training data needed. Active learning has applications in text classification, web page classification, and junk mail recognition. The core problem in active learning is how to select the most informative training points to help the model, as opposed to random selection in passive learning.
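The query-selection step at the heart of active learning can be sketched in a few lines (an illustrative toy, not code from the presentation; the names and probabilities are hypothetical): rank the unlabeled pool by the entropy of the model's predicted class distribution and query the most uncertain points first.

```python
import math

def uncertainty(probs):
    """Entropy of a predicted class distribution; higher means less certain."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_query(pool_probs, k=1):
    """Pick the k unlabeled points whose predictions are least certain."""
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: uncertainty(pool_probs[i]),
                    reverse=True)
    return ranked[:k]

# Three unlabeled points with model-predicted class probabilities:
pool = [[0.99, 0.01], [0.55, 0.45], [0.80, 0.20]]
print(select_query(pool, k=1))  # [1]: the near-50/50 point is queried first
```

Random (passive) selection would instead sample the pool uniformly, spending labels on points the model already classifies confidently.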
The document discusses generative models and their applications in artificial intelligence. Generative adversarial networks (GANs) use two neural networks, a generator and discriminator, that compete against each other. The generator learns to generate new data that looks real by fooling the discriminator, while the discriminator learns to better identify real from fake data. GANs have been used for tasks like image generation and neural style transfer. They show potential to generate art, music and other creative forms through machine learning.
This document provides an overview of autoencoders and their use in unsupervised learning for deep neural networks. It discusses the history and development of neural networks, including early work in the 1940s-1980s and more recent advances in deep learning. It then explains how autoencoders work by setting the target values equal to the inputs, describes variants like denoising autoencoders, and how stacking autoencoders can create deep architectures for tasks like document retrieval, facial recognition, and signal denoising.
An optimized scientific workflow scheduling in cloud computing (Digvijay Shinde)
The document discusses optimizing scientific workflow scheduling in cloud computing. It begins with definitions of workflow and cloud computing. Workflow is a group of repeatable dependent tasks, while cloud computing provides applications and hardware resources over the Internet. There are three cloud service models: SaaS, PaaS, and IaaS. The document explores how to efficiently schedule workflows in the cloud to reduce makespan, cost, and energy consumption. It reviews different scheduling algorithms like FCFS, genetic algorithms, and discusses optimizing objectives like time and cost. The document provides a literature review comparing various workflow scheduling methods and algorithms. It concludes with discussing open issues and directions for future work in optimizing workflow scheduling for cloud computing.
What is the Expectation Maximization (EM) Algorithm? (Kazuki Yoshida)
Review of Do and Batzoglou, "What is the expectation maximization algorithm?" Nat. Biotechnol. 2008;26:897. Also covers the Data Augmentation algorithm and a Stan implementation. Resources at https://github.com/kaz-yos/em_da_repo
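The paper's running example is a mixture of two biased coins, which makes a compact self-contained illustration of EM (a sketch in my own notation; the heads counts follow the paper's worked figure, and the function name is mine):

```python
def em_two_coins(tosses, n, theta_a, theta_b, iters=20):
    """EM for the two-coin mixture: each trial is n tosses of one of two
    biased coins whose identity is hidden; estimate both heads probabilities."""
    for _ in range(iters):
        ha = hb = ta = tb = 0.0
        for heads in tosses:
            # E-step: posterior probability that this trial used coin A
            la = theta_a ** heads * (1 - theta_a) ** (n - heads)
            lb = theta_b ** heads * (1 - theta_b) ** (n - heads)
            wa = la / (la + lb)
            ha += wa * heads
            ta += wa * (n - heads)
            hb += (1 - wa) * heads
            tb += (1 - wa) * (n - heads)
        # M-step: re-estimate each coin's bias from expected counts
        theta_a = ha / (ha + ta)
        theta_b = hb / (hb + tb)
    return theta_a, theta_b

# Heads counts in five 10-toss trials; coin identity is unobserved
print(em_two_coins([5, 9, 8, 4, 7], n=10, theta_a=0.6, theta_b=0.5))
```

Starting from (0.6, 0.5), the estimates should settle near the values the paper reports, roughly 0.80 and 0.52.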
This document provides an overview of deep learning, including definitions of AI, machine learning, and deep learning. It discusses neural network models like artificial neural networks, convolutional neural networks, and recurrent neural networks. The document explains key concepts in deep learning like activation functions, pooling techniques, and the inception model. It provides steps for fitting a deep learning model, including loading data, defining the model architecture, adding layers and functions, compiling, and fitting the model. Examples and visualizations are included to demonstrate how neural networks work.
This document provides an overview of natural language processing (NLP) research trends presented at ACL 2020, including shifting away from large labeled datasets towards unsupervised and data augmentation techniques. It discusses the resurgence of retrieval models combined with language models, the focus on explainable NLP models, and reflections on current achievements and limitations in the field. Key papers on BERT and XLNet are summarized, outlining their main ideas and achievements in advancing the state-of-the-art on various NLP tasks.
This document discusses the installation and use of ANTLR, a parser generator tool. It begins with steps to install ANTLR and Java on a Windows system. It then provides an example of running ANTLR on a C++ grammar file to generate lexer and parser files in C#. The document explains that ANTLR automatically generates lexical and syntax analyzers from grammar files written in ANTLR's grammar language. It also discusses using the generated listener and visitor classes to traverse the parse tree.
This document contains information about a visual speech recognition project undertaken by M. Vignesh and S. Mahadevan, guided by Ms. Bhavani R. The project aims to extract lip features using neural networks and transcribe lip movements into text without using audio. It discusses the objectives to develop a speaker-independent system that can recognize and classify ten different words. The document also includes a literature review on existing audio-visual speech recognition methods and the proposed architecture for this project.
This is an introductory slide deck on pattern recognition. The presentation was quite emotional for me because it was my last academic presentation at Green University of Bangladesh; for that reason, I placed a sad emoji on the first slide.
Text-independent speaker recognition system (Deepesh Lekhak)
This document outlines a project to develop a text-independent speaker recognition system. It lists the project members and provides an overview of the presentation sections, which include the system architecture, methodology, results and analysis, and applications. The methodology section describes implementing the system in MATLAB, including voice capturing, pre-processing, MFCC feature extraction, GMM matching, and identification/verification. It also outlines implementing the system on an FPGA, including analog conversion, storage, framing, FFT, mel spectrum, MFCC extraction, and UART transmission to MATLAB for further processing. The results show over 99% recognition accuracy with longer training and test data.
In decision tree learning, ID3 (Iterative Dichotomiser 3) is an algorithm invented by Ross Quinlan used to generate a decision tree from a dataset. ID3 is the precursor to the C4.5 algorithm, and is typically used in the machine learning and natural language processing domains.
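ID3's splitting criterion, information gain, is short enough to show directly (a minimal sketch with a made-up toy dataset, not Quinlan's original code): the algorithm splits on the attribute whose partition most reduces label entropy.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label multiset, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """ID3's criterion: entropy reduction from splitting on one attribute."""
    n = len(labels)
    by_value = {}
    for row, y in zip(rows, labels):
        by_value.setdefault(row[attr], []).append(y)
    remainder = sum(len(part) / n * entropy(part)
                    for part in by_value.values())
    return entropy(labels) - remainder

# Toy data: 'outlook' separates the classes perfectly, 'windy' not at all
rows = [{'outlook': 'sunny', 'windy': 'no'},
        {'outlook': 'sunny', 'windy': 'yes'},
        {'outlook': 'rain',  'windy': 'no'},
        {'outlook': 'rain',  'windy': 'yes'}]
labels = ['play', 'play', 'stay', 'stay']
print(information_gain(rows, labels, 'outlook'))  # 1.0
print(information_gain(rows, labels, 'windy'))    # 0.0
```

ID3 greedily picks the highest-gain attribute at each node and recurses on the resulting partitions.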
An autoencoder is an artificial neural network that is trained to copy its input to its output. It consists of an encoder that compresses the input into a lower-dimensional latent-space encoding, and a decoder that reconstructs the output from this encoding. Autoencoders are useful for dimensionality reduction, feature learning, and generative modeling. When constrained by limiting the latent space or adding noise, autoencoders are forced to learn efficient representations of the input data. For example, a linear autoencoder trained with mean squared error performs principal component analysis.
Genetic algorithms are optimization techniques inspired by Darwin's theory of evolution. They use operations like selection, crossover and mutation to evolve solutions to problems by iteratively trying random variations. The document outlines the history, concepts, process and applications of genetic algorithms, including using them to optimize engineering design, routing, computer games and more. It describes how genetic algorithms encode potential solutions and use fitness functions to guide the evolution toward better outcomes.
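The selection/crossover/mutation loop described above fits in a short sketch (a minimal GA on the classic OneMax toy problem; every name and parameter here is illustrative, not from the document):

```python
import random

def genetic_max(fitness, n_bits=16, pop_size=30, generations=60, seed=1):
    """Minimal GA: binary tournament selection, one-point crossover,
    per-bit mutation; returns the best individual in the final population."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]

    def select():
        a, b = rng.sample(pop, 2)           # binary tournament
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        nxt = []
        while len(nxt) < pop_size:
            p1, p2 = select(), select()
            cut = rng.randrange(1, n_bits)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            for i in range(n_bits):         # bit-flip mutation
                if rng.random() < 1 / n_bits:
                    child[i] ^= 1
            nxt.append(child)
        pop = nxt
    return max(pop, key=fitness)

# OneMax: the fitness of a bit string is simply its number of 1s
best = genetic_max(fitness=sum)
print(sum(best))  # at or near the optimum of 16
```

Swapping in a different fitness function is all it takes to point the same loop at a routing or design-optimization encoding.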
This document summarizes Melanie Swan's presentation on deep learning. It began with defining key deep learning concepts and techniques, including neural networks, supervised vs. unsupervised learning, and convolutional neural networks. It then explained how deep learning works by using multiple processing layers to extract higher-level features from data and make predictions. Deep learning has various applications like image recognition and speech recognition. The presentation concluded by discussing how deep learning is inspired by concepts from physics and statistical mechanics.
The document discusses the BERT model for natural language processing. It begins with an introduction to BERT and how it achieved state-of-the-art results on 11 NLP tasks in 2018. The document then covers related work on language representation models including ELMo and GPT. It describes the key aspects of the BERT model, including its bidirectional Transformer architecture, pre-training using masked language modeling and next sentence prediction, and fine-tuning for downstream tasks. Experimental results are presented showing BERT outperforming previous models on the GLUE benchmark, SQuAD 1.1, SQuAD 2.0, and SWAG. Ablation studies examine the importance of the pre-training tasks and the effect of model size.
Build an LLM-powered application using LangChain.pdf (AnastasiaSteele10)
LangChain is an advanced framework that allows developers to create language model-powered applications. It provides a set of tools, components, and interfaces that make building LLM-based applications easier. With LangChain, managing interactions with language models, chaining together various components, and integrating resources like APIs and databases is a breeze. The platform includes a set of APIs that can be integrated into applications, allowing developers to add language processing capabilities without having to start from scratch.
This is a deep learning presentation based on deep neural networks. It reviews the deep learning concept, related work, and specific application areas. It describes a use-case scenario of deep learning and highlights current trends and research issues in deep learning.
Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning (Yoshitaka Ushiku)
Slide used on 11/11/2017 for the keynote in International Conference on Document Analysis and Recognition Workshop on Machine Learning.
(ICDAR WML 2017, https://icdarwml.wixsite.com/icdarwml2017)
This is a translated and updated version of https://www.slideshare.net/YoshitakaUshiku/deep-learning-73499744, which is written in Japanese.
This document summarizes a student presentation on a face recognition lecture attendance system. The system uses image processing and comparison to recognize students' faces from a high-definition camera feed and compare them to a database to take attendance. It is controlled by the faculty member, who instructs the system to start and end recording. The system is intended to smartly track that students remain for the entire lecture session and also function as surveillance. At the end, it reverts a full attendance report back to a central database. Diagrams including class, activity, sequence and use case diagrams are also presented to depict the system workflow and actors.
The document discusses AND/OR graphs, which are a type of graph or tree used to represent solutions to problems that can be decomposed into smaller subproblems. AND/OR graphs have nodes that represent goals or states, with successors labeled as either AND or OR branches. AND branches signify subgoals that must all be achieved to satisfy the parent goal, while OR branches indicate alternative subgoals that could achieve the parent goal. The graph helps model how decomposed subproblems relate and their solutions combine to solve the overall problem.
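The AND/OR evaluation rule can be captured in a few lines (a toy sketch with hypothetical goal names; it assumes the graph is acyclic): an OR node is solved by any one alternative, while an AND group requires every subgoal in the group.

```python
def solvable(node, graph, leaves):
    """A goal is solvable if it is a primitive leaf, or if at least one
    OR alternative has every one of its AND subgoals solvable."""
    if node in leaves:
        return True
    # graph maps a goal to alternative (OR) groups of joint (AND) subgoals
    return any(all(solvable(s, graph, leaves) for s in group)
               for group in graph.get(node, []))

# 'tea' needs water AND leaves; 'water' has two OR alternatives
graph = {
    'tea':   [['water', 'leaves']],
    'water': [['tap'], ['well']],
}
print(solvable('tea', graph, leaves={'tap', 'leaves'}))   # True
print(solvable('tea', graph, leaves={'leaves'}))          # False
```

Search algorithms such as AO* explore this same structure while also accumulating costs for the AND groups.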
A Test Analysis Method for Black Box Testing Using AUT and Fault Knowledge (Tsuyoshi Yumoto)
With the rapid increase in the size and complexity of software today, the scope of software testing is also expanding. The efficiency of software testing needs to improve in order to meet the delivery deadlines and cost targets of software development. To improve efficiency, the test must be designed so that the number of test cases is sufficient yet not excessive. Test analysis is the activity of refining the Application Under Test (AUT) into pieces of a proper size to which test design techniques can be applied, so that the test can be designed properly. However, judging what size is proper depends on the individual. This paper proposes a test analysis method for black box testing using a test category, a classification based on fault and AUT knowledge.
This document summarizes a thesis on automating test routine creation through natural language processing. The author proposes using word embeddings and recommender systems to automatically generate test cases from requirements documents and link them together. The methodology involves representing text as word vectors, calculating similarity between requirements and test blocks, and applying association rule mining on test block sequences. An experiment on a space operations dataset showed the approach improved productivity in test creation and requirements tracing over manual methods. Future work could explore using deep learning models and collecting additional evaluation metrics from users.
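The linking step described above reduces to comparing averaged word vectors, typically by cosine similarity (a toy sketch; the vectors are made up for illustration, not taken from the thesis):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical averaged word vectors for one requirement and two test blocks
req    = [0.9, 0.1, 0.3]
block1 = [0.8, 0.2, 0.4]   # similar wording, so it should score higher
block2 = [0.1, 0.9, 0.0]
print(cosine(req, block1) > cosine(req, block2))  # True
```

The requirement would be linked to whichever test block scores highest, with association rules then suggesting which blocks tend to follow it in a sequence.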
DEBS 2013 - "Ariadne: Managing Fine-Grained Provenance on Data Streams" (Boris Glavic)
Managing fine-grained provenance is a critical requirement for data stream management systems (DSMS), not only to address complex applications that require diagnostic capabilities and assurance, but also for providing advanced functionality such as revision processing or query debugging. This paper introduces a novel approach that uses operator instrumentation, i.e., modifying the behavior of operators, to generate and propagate fine-grained provenance through several operators of a query network. In addition to applying this technique to compute provenance eagerly during query execution, we also study how to decouple provenance computation from query processing to reduce run-time overhead and avoid unnecessary provenance retrieval. This includes computing a concise superset of the provenance to allow lazily replaying a query network and reconstruct its provenance as well as lazy retrieval to avoid unnecessary reconstruction of provenance. We develop stream-specific compression methods to reduce the computational and storage overhead of provenance generation and retrieval. Ariadne, our provenance-aware extension of the Borealis DSMS implements these techniques. Our experiments confirm that Ariadne manages provenance with minor overhead and clearly outperforms query rewrite, the current state-of-the-art.
This document discusses VLSI testing and analysis. It defines key terms like defect, fault, and error and describes typical types of defects. It also discusses logical fault models and the role of testing in quality control. Different types of tests like production testing and burn-in testing are described. The testing process, fault simulation, design for testability techniques, and built-in self-test are summarized.
This document discusses cell verification metrics for a new microprocessor developed by Sony, Toshiba, and IBM. It outlines the verification process used, including a hierarchical approach from block to system level. Key metrics for verification are identified as effective passing rate, coverage, use of checkers, review completion, bug rate, and simulation cycles. Weights are assigned to different metrics and environments based on complexity. Overall verification progress is tracked using a weighted calculation of these metrics.
The document summarizes the cell verification process used by Sony, Toshiba, and IBM for a new high performance microprocessor. It provides an overview of the cell architecture and its multi-core, non-homogeneous design. It then describes the hierarchical verification process used from block to chip level. Key metrics like effective passing rate, coverage, checkers, and bug rate were used to measure verification progress, with different weights assigned to different verification environments based on their complexity.
Presentation for Interspeech 2022: Spoofing-aware Attention Back-end with Multiple Enrollment and Novel Trials Sampling Strategy for SASVC 2022
Presenter: Chang Zeng (National Institute of Informatics and SOKENDAI)
Wed-SS-OS-6-5
Presentation video: https://youtu.be/gXxP1nn5X6E
The Spoofing-Aware Speaker Verification Challenge (SASVC) 2022 was organized to explore the relation between automatic speaker verification (ASV) and spoofing countermeasures (CM). In this paper, we introduce our proposed spoofing-aware attention back-end developed for SASVC 2022. First, we design a novel sampling strategy for simulating real verification scenarios. Then, in order to fully leverage the information derived from multiple enrollments, a spoofing-aware attention back-end is proposed. Finally, a joint decision strategy is applied to introduce mutual interaction between the ASV module and the CM module. Compared with the trial sampling method used in the baseline systems, our proposed sampling method shows an effective improvement without any attention modules. The experimental results show that our proposed spoofing-aware attention back-end improves performance from 6.37% for the best baseline system to 1.19% on the evaluation dataset in terms of the SASV-EER (equal error rate) metric.
This document is a project report on implementing built-in self-test (BIST) for testing combinational circuits using an FPGA. It discusses linear feedback shift registers (LFSRs) that can generate pseudo-random test patterns, describes various combinational circuits that can serve as the circuit under test, and analyzes output response analyzers like multiple input signature registers (MISRs) that can compress the circuit's output responses. It also covers different types of read-only memory that can be used to store test patterns or results. Simulation results and schematic diagrams are provided to validate the BIST approach.
This document describes a case study of a staged migration from e to SystemVerilog at a company designing SERDES chips. It discusses advantages like reduced risk and training staff in small groups. Technical challenges addressed include coordinating simulation timelines and communicating between testbench parts in different languages. Solutions involved making SVTB the master, writing testcases as if fully converted, and using Verilog to pass info between languages. A proof of concept showed the converted approach. Supporting multiple simulators involved using a tool that connects a Pioneer-based SVTB to DUTs in other simulators to avoid lowest common denominator issues.
Improving Batch-Process Testing Techniques with a Domain-Specific Language (Dr. Spock)
The document proposes using a domain-specific language (DSL) to improve testing of batch processes. It discusses challenges in batch process testing and principles for good test automation. The document then describes two case studies where DSLs were used to simplify test setup and writing for batch systems at a bank. An internal DSL using Selenium simplified visual testing, while an external DSL with Spring Remoting provided faster and more precise batch execution control. Both approaches made test automation easier but required effort to prepare isolated test environments.
1) The document describes experiments on visual search for musical performances and endoscopic videos using global and local features.
2) Two datasets were used - a musical performances dataset with 473 videos and an endoscopic videos dataset containing anonymized recordings from medical procedures.
3) Experiments evaluated different feature fusion methods for global features and the SIMPLE local feature descriptor, and involved both quantitative and qualitative testing on the datasets.
This document summarizes a keynote presentation on timing analysis and testing. It discusses several topics:
- Timing analysis techniques including worst-case execution time analysis, detailed architectural modeling, and the Chronos timing analysis tool.
- Cache analysis including identifying thrashing scenarios, instrumenting assertions, and using symbolic execution to generate tests that expose cache performance issues.
- Applications to multi-core timing analysis, analyzing cache side channels, and generating tests or attack scenarios rather than just worst-case execution bounds.
The document advocates leveraging advances in constraint solving and symbolic execution to develop additional timing analysis applications beyond traditional worst-case execution time analysis.
6 Top Tips to a Testing Strategy That Works (Eggplant)
Chuck Schneider, VP Application Platform Development & Testing Strategy at Cerner, shares his 6 top tips to ensure that your software testing strategy works the first time.
Chuck presented these slides at the Gartner Application Strategies & Solutions Summit 2019 in Las Vegas.
This presentation highlights the challenges encountered throughout Cerner's testing journey, and how you can avoid them!
Cerner is a world leading Healthcare Information Technology company working on securing better healthcare for the world.
Best Practices for Hyperparameter Tuning with MLflow (Databricks)
Hyperparameter tuning and optimization is a powerful tool in the area of AutoML, for both traditional statistical learning models and deep learning. Many existing tools help drive this process, including both black-box and white-box tuning. In this talk, we'll start with a brief survey of the most popular techniques for hyperparameter tuning (e.g., grid search, random search, Bayesian optimization, and Parzen estimators) and then discuss the open source tools which implement each of these techniques. Finally, we will discuss how we can leverage MLflow with these tools and techniques to analyze how our search is performing and to productionize the best models.
Speaker: Joseph Bradley
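As a toy illustration of the simplest technique surveyed above, here is a minimal random-search sketch in plain Python. This is my own example, not code from the talk: `objective` is a stand-in for a real training run, and in practice each trial would additionally be logged with MLflow's `mlflow.log_params` / `mlflow.log_metric`.

```python
import random

# Toy stand-in for a model-training run that returns a validation loss;
# this quadratic has its optimum at lr=0.1, reg=0.01.
def objective(params):
    return (params["lr"] - 0.1) ** 2 + (params["reg"] - 0.01) ** 2

def random_search(n_trials, seed=0):
    """Sample hyperparameters uniformly at random and keep the best trial."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        params = {"lr": rng.uniform(0.001, 1.0),
                  "reg": rng.uniform(0.0, 0.1)}
        loss = objective(params)
        # In a real setup: mlflow.log_params(params); mlflow.log_metric("loss", loss)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss

best, best_loss = random_search(200)
```

Grid search would replace the uniform sampling with a fixed Cartesian product of values; Bayesian methods replace it with a model-guided proposal.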
This document presents a selective Huffman coding scheme for efficient test vector compression. It begins with an introduction discussing the challenges of testing increasingly complex SOC designs. It then describes Huffman coding and the proposed selective Huffman coding approach, which only encodes the most frequently occurring test vectors. Experimental results show the selective Huffman scheme achieves nearly the same compression ratio as standard Huffman coding but with a simpler decoder requiring fewer logic elements.
The document discusses the development of benchmarks for the International Criticality Safety Benchmark Evaluation Project (ICSBEP). It describes the process for developing an ICSBEP benchmark, which includes describing the experiment, evaluating the data and uncertainties, specifying the benchmark model, providing sample calculations, and ensuring quality assurance. Benchmarks undergo rigorous internal and external review to be included in the ICSBEP Handbook, which contains over 500 evaluated experiments from 20 countries to validate nuclear data and computer codes.
Machine Translation is an emerging field of Computer Science. Research has been done to build Machine Translation systems for different language pairs using different approaches, including rule-based machine translation and Statistical Machine Translation (SMT). The goal of the project is to design a Statistical Machine Translation system for software localization using the Moses decoder. The system is expected to automatically localize (translate) software content from English into Tamil using Statistical Machine Translation.
A large and growing amount of speech content in real-life scenarios is being recorded on consumer-grade devices in uncontrolled environments, resulting in degraded speech quality. Transforming such low-quality device-degraded speech into high-quality speech is a goal of speech enhancement (SE). This paper introduces a new speech dataset, DDS, to facilitate research on SE. DDS provides aligned parallel recordings of high-quality speech (recorded in professional studios) and a number of versions of low-quality speech, producing approximately 2,000 hours of speech data. The DDS dataset covers 27 realistic recording conditions by combining diverse acoustic environments and microphone devices, and each condition includes multiple recordings from six microphone positions to simulate different noise and reverberation levels. We also test several SE baseline systems on the DDS dataset and show the impact of recording diversity on performance.
Paper: https://arxiv.org/abs/2109.07931
Presentation for Interspeech 2022: "The VoiceMOS Challenge 2022"
Presenter: Dr. Erica Cooper, National Institute of Informatics
Preprint: https://arxiv.org/abs/2203.11389
Video: https://youtu.be/99ZQ-SLUvKE
Challenge website: https://voicemos-challenge-2022.github.io
Thu-SS-OS-9-5
We present the first edition of the VoiceMOS Challenge, a scientific event that aims to promote the study of automatic prediction of the mean opinion score (MOS) of synthetic speech. This challenge drew 22 participating teams from academia and industry who tried a variety of approaches to tackle the problem of predicting human ratings of synthesized speech. The listening test data for the main track of the challenge consisted of samples from 187 different text-to-speech and voice conversion systems spanning over a decade of research, and the out-of-domain track consisted of data from more recent systems rated in a separate listening test. Results of the challenge show the effectiveness of fine-tuning self-supervised speech models for the MOS prediction task, as well as the difficulty of predicting MOS ratings for unseen speakers and listeners, and for unseen systems in the out-of-domain setting.
Presentation for Interspeech 2022: "Analyzing Language-Independent Speaker Anonymization Framework under Unseen Conditions"
Presenter: Dr. Xiaoxiao Miao, National Institute of Informatics
Thu-O-OS-9-1
Video: https://youtu.be/wVIxyLiQa1Y
Preprint: https://arxiv.org/abs/2203.14834
In our previous work, we proposed a language-independent speaker anonymization system based on self-supervised learning models. Although the system can anonymize speech data of any language, the anonymization was imperfect, and the speech content of the anonymized speech was distorted. This limitation is more severe when the input speech is from a domain unseen in the training data. This study analyzed the bottleneck of the anonymization system under unseen conditions. It was found that the domain (e.g., language and channel) mismatch between the training and test data affected the neural waveform vocoder and anonymized speaker vectors, which limited the performance of the whole system. Increasing the training data diversity for the vocoder was found to be helpful to reduce its implicit language and channel dependency. Furthermore, a simple correlation-alignment-based domain adaption strategy was found to be significantly effective to alleviate the mismatch on the anonymized speaker vectors. Audio samples and source code are available online.
Presenter: Dr. Xiaoxiao Miao, NII
Paper: https://arxiv.org/abs/2202.13097
Speaker anonymization aims to protect the privacy of speakers while preserving the spoken linguistic information in speech. Current mainstream neural network speaker anonymization systems are complicated, containing an F0 extractor, speaker encoder, automatic speech recognition acoustic model (ASR AM), speech synthesis acoustic model, and speech waveform generation model. Moreover, as an ASR AM is language-dependent and trained on English data, it is hard to adapt it to another language. In this paper, we propose a simpler self-supervised learning (SSL)-based method for language-independent speaker anonymization without any explicit language-dependent model, which can easily be used for other languages. Extensive experiments were conducted on the VoicePrivacy Challenge 2020 datasets in English and the AISHELL-3 dataset in Mandarin to demonstrate the effectiveness of our proposed SSL-based language-independent speaker anonymization method.
Presenter: Dr. Xin Wang, NII
Paper: https://arxiv.org/abs/2111.07725
Self-supervised speech modeling is a rapidly progressing research topic, and many pre-trained models have been released and used in various downstream tasks. For speech anti-spoofing, most countermeasures (CMs) use signal processing algorithms to extract acoustic features for classification. In this study, we use pre-trained self-supervised speech models as the front end of spoofing CMs. We investigated different back-end architectures to be combined with the self-supervised front end, the effectiveness of fine-tuning the front end, and the performance of using different pre-trained self-supervised models. Our findings showed that, when a good pre-trained front end was fine-tuned with either a shallow or a deep neural network-based back end on the ASVspoof 2019 logical access (LA) training set, the resulting CM not only achieved a low EER score on the 2019 LA test set but also significantly outperformed the baseline on the ASVspoof 2015, 2021 LA, and 2021 deepfake test sets. A sub-band analysis further demonstrated that the CM mainly used the information in a specific frequency band to discriminate the bona fide and spoofed trials across the test sets.
SSW11 presentation: How do Voices from Past Speech Synthesis Challenges Compare Today?
Presenter: Erica Cooper
Preprint: https://arxiv.org/abs/2105.02373
Presentation for SSW11: "Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance"
Presenter: Hieu-Thi Luong
Preprint: https://arxiv.org/abs/2106.13479
Tutorial on neural vocoders at the 2021 Speech Processing Courses in Crete, "Inclusive Neural Speech Synthesis."
Presenters: Xin Wang and Junichi Yamagishi, National Institute of Informatics, Japan
The document proposes a neural source-filter waveform model (NSF) for speech synthesis. The NSF model consists of three modules: a condition module that upsamples spectral features and F0, a source module that generates a sine excitation signal, and a filter module with dilated convolutional blocks. The model is trained directly on waveforms using a spectral distance criterion in the STFT domain. Experiments show the NSF model generates high quality waveforms comparable to WaveNet, with faster generation speed. Ablation tests analyze the importance of the sine excitation source and different spectral loss terms. The NSF provides a simpler alternative to autoregressive models for neural speech synthesis.
This document provides an overview of end-to-end text-to-speech synthesis models including Char2Wav, Tacotron, Tacotron2, and the development of a Japanese Tacotron model. It describes the typical encoder-decoder architecture with attention, improvements made in subsequent models like Tacotron2 using larger models and more regularization, and the implementation of a Japanese Tacotron to model Japanese pitch accents using an accent embedding and self-attention.
These are slides used for invited tutorial on "end-to-end text-to-speech synthesis", given at IEICE SP workshop held on 27th Jan 2019.
Part 1: Neural waveform modeling
Presenters: Xin Wang, Yusuke Yasuda (National Institute of Informatics, Japan)
This document discusses end-to-end text-to-speech synthesis models and summarizes several key models:
- Char2Wav was one of the earliest end-to-end models using an encoder-decoder with attention and a neural vocoder. It helped prove the concept but had limitations in target features and architecture.
- Tacotron improved upon Char2Wav with its CBHG encoder, attention mechanisms, and predicting mel spectrograms as targets. However, training was slow and waveform generation was limited.
- Tacotron2 achieved near-human quality by extending Tacotron and generating waveforms with WaveNet conditioned on predicted mel spectrograms.
The document also describes a Japanese Tacotron model that incorporates an accent embedding and self-attention to model Japanese pitch accents.
Attention Back-end for Automatic Speaker Verification with Multiple Enrollment Utterances
Chang Zeng1,2, Xin Wang1, Erica Cooper1, Xiaoxiao Miao1, Junichi Yamagishi1,2
1National Institute of Informatics, Japan; 2SOKENDAI, Japan
Paper ID: 2742
Monday, 9th May 2022
Outline
• 1. Introduction
• 2. Conventional Back-end Models
• 3. Attention Back-end Model
• 3.1 Model architecture
• 3.2 Trials sampling method
• 3.3 Loss functions
• 4. Experimental Results
• 4.1 Datasets
• 4.2 Result and analysis
• 4.3 Ablation study
• 5. Conclusions
1. Introduction
• ASV framework
  • Front-end speaker encoder
    • TDNN, ECAPA-TDNN
    • ResNet series
  • Back-end model
    • Cosine similarity
    • Probabilistic linear discriminant analysis (PLDA)
    • Neural PLDA (NPLDA) [1]
• Single enrollment vs. multiple enrollments
  • The case of multiple enrollments is more robust in real-world settings;
  • Multiple enrollments contain more acoustic variations of the enrolled speaker.
• Common operations for the multiple enrollments case
  • Concatenate waveforms or acoustic features
  • Average speaker embeddings
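The "average" operation above can be sketched as follows. This is an illustration with synthetic vectors, not the paper's code: pool K enrollment embeddings into one speaker model, then score a test embedding by cosine similarity.

```python
import numpy as np

def average_enrollment(embeddings):
    """embeddings: (K, D) array of enrollment speaker embeddings."""
    return np.mean(embeddings, axis=0)

def cosine_score(enroll_vec, test_vec):
    """Cosine similarity between the pooled model and a test embedding."""
    return float(np.dot(enroll_vec, test_vec)
                 / (np.linalg.norm(enroll_vec) * np.linalg.norm(test_vec)))

rng = np.random.default_rng(0)
spk = rng.normal(size=32)                       # "true" speaker direction
enrolls = spk + 0.1 * rng.normal(size=(3, 32))  # K=3 noisy enrollments
target = spk + 0.1 * rng.normal(size=32)        # same-speaker test utterance
nontarget = rng.normal(size=32)                 # different-speaker test utterance

model = average_enrollment(enrolls)
same = cosine_score(model, target)              # should score high
diff = cosine_score(model, nontarget)           # should score lower
```

Concatenation, the other common operation, instead joins the raw waveforms or features before a single encoder pass; averaging operates after embedding extraction.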
1. Introduction
• Can we use multiple enrollments more efficiently in a neural back-end?
  • Learning intra-relationships among the speaker embeddings of multiple enrollments;
  • Aggregating a varying number of enrollment speaker embeddings with adaptive weights.
[Figure: conventional back-ends. Left: single enrollment — a speaker encoder Enc(·) maps the test utterance y1:T and an enrollment utterance x(1)1:T to embeddings, which are scored by cosine similarity or PLDA. Right: multiple enrollments — embeddings e1, e2, e3 extracted from x(1)1:T, x(2)1:T, x(3)1:T are averaged before cosine/PLDA scoring against the test embedding q.]
2. Conventional Back-end Models
• Multiple enrollments case: the enrollment speaker has an enrollment set of 𝐾 utterances, where 𝐾 varies.
  • Concatenating all enrollment utterances into one long waveform
  • Extracting a speaker embedding from each utterance, then averaging all speaker embeddings
Note: In theory, the PLDA scoring method can be extended to the multiple-enrollment case without the concatenation or averaging operations. For details, please refer to [2,3].
3. Attention Back-end Model
• Model architecture
  • Extract the enrollment speaker embeddings and the testing speaker embedding 𝒒
[Figure: attention back-end architecture — speaker encoders extract the enrollment embeddings and the test embedding; the enrollment embeddings pass through scaled-dot self-attention and feed-forward self-attention, the pooled vector is compared with the test embedding by Cos(·,·), and the score is calibrated by logistic regression.]
• Exploring intra-relationships among all enrollment speaker embeddings
  • Scaled-dot self-attention (SDSA) [3]
  • Stacking all enrollment speaker embeddings as matrix 𝑬
• Aggregating a varying number of enrollment speaker embeddings by adaptive weights
  • Feed-forward self-attention (FFSA) [4]
• Score calibration
  • Logistic regression (LR) for score calibration
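The two attention steps can be sketched in NumPy as below. This is a simplified illustration, not the paper's implementation: the parameters w, b, v stand in for the learnable weights of the FFSA module and are random here, and the inputs stand in for encoder embeddings.

```python
import numpy as np

# Scaled-dot self-attention over the K enrollment embeddings E (K x D):
# each embedding attends to all others, exposing intra-relationships.
def scaled_dot_self_attention(E):
    d = E.shape[1]
    scores = E @ E.T / np.sqrt(d)                  # (K, K) pairwise scores
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)              # row-wise softmax
    return w @ E                                   # refined embeddings (K, D)

# Feed-forward self-attention: score each refined embedding, softmax the
# scores into adaptive weights, and pool a varying number K into one vector.
def feed_forward_attention_pool(H, w, b, v):
    scores = np.tanh(H @ w + b) @ v                # (K,) one score per utterance
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                           # adaptive weights, sum to 1
    return alpha @ H                               # (D,) pooled speaker vector

rng = np.random.default_rng(0)
K, D = 4, 16
E = rng.normal(size=(K, D))                        # K enrollment embeddings
H = scaled_dot_self_attention(E)
w, b, v = rng.normal(size=(D, D)), rng.normal(size=D), rng.normal(size=D)
pooled = feed_forward_attention_pool(H, w, b, v)
```

Because the softmax pooling normalizes over however many rows it receives, the same code handles any number K of enrollments, which is the point of the adaptive weighting.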
3. Attention Back-end Model
• Trials sampling method for training
  • Why: introduce the multiple-enrollment process in the training stage
  • How: load a speaker-balanced mini-batch from the dataset, then rearrange the mini-batch to form trial pairs (test, enrollments)
• Positive pair: the test trial and the enrollment data are from the same speaker; one test trial is selected from the speaker's data, and the rest are left for enrollment.
• Negative pair: the test trial is paired with the enrollment sets of the other speakers included in the mini-batch.
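The sampling procedure above can be sketched as follows, under a hypothetical data layout (a speaker-balanced mini-batch represented as a dict from speaker ID to utterance IDs); this is an illustration, not the authors' implementation.

```python
import random

# Hold out one utterance per speaker as the test trial; the rest become
# that speaker's enrollment set. Same-speaker pairs are positive (label 1),
# pairs against other speakers' enrollment sets are negative (label 0).
def make_trials(batch, seed=0):
    """batch: dict speaker_id -> list of utterance ids (speaker-balanced)."""
    rng = random.Random(seed)
    held_out = {}
    for spk, utts in batch.items():
        utts = utts[:]
        rng.shuffle(utts)
        held_out[spk] = (utts[0], utts[1:])        # (test trial, enrollments)
    trials = []
    for spk, (test, enroll) in held_out.items():
        trials.append((test, enroll, 1))           # positive pair
        for other, (_, other_enroll) in held_out.items():
            if other != spk:
                trials.append((test, other_enroll, 0))  # negative pair
    return trials

batch = {"spkA": ["a1", "a2", "a3"], "spkB": ["b1", "b2", "b3"]}
trials = make_trials(batch)
```

With two speakers of three utterances each, this yields two positive and two negative trials, each with a two-utterance enrollment set.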
3. Attention Back-end Model
• Loss functions
  • A weighted sum of binary cross-entropy (BCE) loss and generalized end-to-end (GE2E) loss
  • Binary cross-entropy loss
  • Generalized end-to-end loss
[Figure: the logistic regression output and the label are fed to the combined loss (1−𝜆)·BCE + 𝜆·GE2E.]
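The weighted combination can be sketched numerically as below. The BCE term follows the standard definition on a raw score; the GE2E term here is a simplified softmax-cross-entropy stand-in for the full generalized end-to-end loss, so treat this as an illustration of the (1−𝜆)/𝜆 weighting, not the exact loss.

```python
import math

def bce(score, label):
    """Binary cross-entropy on a raw score passed through a sigmoid."""
    p = 1.0 / (1.0 + math.exp(-score))
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

def ge2e_like(scores, target_idx):
    """Softmax cross-entropy over similarities to all enrolled speakers
    (a simplified stand-in for the GE2E loss)."""
    m = max(scores)
    logsumexp = m + math.log(sum(math.exp(s - m) for s in scores))
    return -(scores[target_idx] - logsumexp)

def combined_loss(score, label, scores, target_idx, lam=0.5):
    """Weighted sum (1 - lam) * BCE + lam * GE2E, as on the slide."""
    return (1 - lam) * bce(score, label) + lam * ge2e_like(scores, target_idx)

loss = combined_loss(2.0, 1, [2.0, -1.0, 0.5], 0, lam=0.3)
```

Setting 𝜆 = 0 recovers pure BCE training; 𝜆 = 1 trains only on the speaker-discriminative term.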
11. Attention Back-end for Automatic Speaker Verification with Multiple Enrollment Utterances Chang Zeng, Xin Wang, Erica Cooper, Xiaoxiao Miao, Junichi Yamagishi Paper ID: 2742
Outline
• 1. Introduction
• 2. Conventional Back-end Models
• 3. Attention Back-end Model
• 3.1 Model architecture
• 3.2 Trials sampling method
• 3.3 Loss functions
• 4. Experimental Results
• 4.1 Datasets
• 4.2 Result and analysis
• 4.3 Ablation study
• 5. Conclusions
4. Experimental Results
• Datasets
• Single enrollment case: VoxCeleb1&2 [5,6] (source: https://www.robots.ox.ac.uk/~vgg/data/voxceleb/)
• Multiple enrollments case: CNCeleb1&2 [7,8] (source: http://cnceleb.org/dataset)
Note: the number of enrollment utterances in the CNCeleb evaluation protocol varies from trial to trial!
4. Experimental Results
• Speaker encoders
• Without frame-level attention:
• TDNN with statistics pooling (TDNN)
• ResNet34
• With frame-level attention:
• TDNN with attentive statistics pooling (TDNN-ASP)
• ECAPA-TDNN
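The two pooling families above differ in whether frames are weighted uniformly or by learned attention. A toy NumPy sketch (the attention projection `w` is a placeholder for learned parameters, and real attentive pooling, e.g. in ECAPA-TDNN, uses a richer channel-dependent attention network):

```python
import numpy as np

def statistics_pooling(X):
    """Plain statistics pooling (TDNN / ResNet34 style): concatenate the
    mean and std over frames. X: (T, d) -> (2d,) utterance vector."""
    return np.concatenate([X.mean(axis=0), X.std(axis=0)])

def attentive_statistics_pooling(X, w):
    """Attentive statistics pooling (TDNN-ASP / ECAPA-TDNN style):
    frames get attention weights before the weighted mean/std."""
    s = X @ w
    a = np.exp(s - s.max()); a /= a.sum()       # frame attention weights
    mu = a @ X                                  # weighted mean
    var = a @ (X ** 2) - mu ** 2                # weighted variance
    return np.concatenate([mu, np.sqrt(np.maximum(var, 1e-9))])

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 8))   # 200 frames of 8-dim features
w = rng.standard_normal(8)          # toy attention projection
print(statistics_pooling(X).shape, attentive_statistics_pooling(X, w).shape)
# both produce a (16,) utterance-level vector
```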
4. Experimental Results
• Result and analysis
• Single enrollment case on VoxCeleb
• Our back-end model is comparable with PLDA in terms of EER, even though there are no intra-relationships to exploit when only one enrollment utterance is available.
• In terms of minDCF, the proposed model slightly outperforms PLDA, which shows that our model works reasonably well even in the single-enrollment case.
• Multiple enrollments case on CNCeleb
• For the Concat and Mean operations, we report the best performance obtained with either cosine similarity or the PLDA model.
• Regardless of the criterion, the proposed attention back-end achieves the best performance.
4. Experimental Results
• Ablation study
• The second and third rows of the table show that both self-attention mechanisms contributed to the improvement.
• The fourth row shows that LR-based score calibration was another essential component of the proposed model.
• Comparing the fifth and sixth rows, the model trained with BCE loss only had a lower EER than the one trained with GE2E loss only, while the GE2E-only model achieved a lower minDCF(0.001). Combining BCE and GE2E thus yielded the lowest values on all metrics.
[Figure: the attention back-end architecture with the combined (1−𝜆)·BCE + 𝜆·GE2E loss, annotated with the mean-pooling variant used in the ablation study.]
5. Conclusions
• Conventional back-end models such as cosine similarity and PLDA do not work well in the multiple-enrollment scenario.
• The utterance-level attention mechanism improves performance further, even when frame-level and spatial attention mechanisms have already been applied in the speaker encoder.
6. References
1) S. Ramoji, P. Krishnan, and S. Ganapathy, "Neural PLDA modeling for end-to-end speaker verification," arXiv preprint arXiv:2008.04527, 2020.
2) Kong Aik Lee, Anthony Larcher, Chang Huai You, Bin Ma, and Haizhou Li, "Multi-session PLDA scoring of i-vector for partially open-set speaker detection," in Interspeech, 2013.
3) Padmanabhan Rajan, Anton Afanasyev, Ville Hautamäki, and Tomi Kinnunen, "From single to multiple enrollment i-vectors: Practical PLDA scoring variants for speaker verification," Digital Signal Processing, vol. 31, pp. 93–101, 2014.
4) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin, "Attention is all you need," in NIPS, 2017.
5) Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio, "A structured self-attentive sentence embedding," in ICLR, 2017.
6) A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: A large-scale speaker identification dataset," in Interspeech, 2017.
7) J. S. Chung, A. Nagrani, and A. Zisserman, "VoxCeleb2: Deep speaker recognition," in Interspeech, 2018.
8) Yue Fan, JW Kang, LT Li, KC Li, HL Chen, ST Cheng, PY Zhang, ZY Zhou, YQ Cai, and Dong Wang, "CN-Celeb: A challenging Chinese speaker recognition dataset," in ICASSP. IEEE, 2020, pp. 7604–7608.
9) Lantian Li, Ruiqi Liu, Jiawen Kang, Yue Fan, Hao Cui, Yunqi Cai, Ravichander Vipperla, Thomas Fang Zheng, and Dong Wang, "CN-Celeb: Multi-genre speaker recognition," arXiv preprint arXiv:2012.12468, 2020.
Thanks!
Editor's Notes
Hello everyone. Today I will present our paper, "Attention Back-end for Automatic Speaker Verification with Multiple Enrollment Utterances."
First, let's look at this figure. It shows the general architecture of a speaker verification system, including the front-end speaker encoder for extracting speaker embeddings and the back-end model for measuring pair-wise similarity.
Typical speaker encoders include the time delay neural network (TDNN) and its variant ECAPA-TDNN. ResNet is another choice of speaker encoder.
As for the back-end model, cosine similarity is the simplest one; it measures the similarity between enrollment and testing utterances. The other is probabilistic linear discriminant analysis (PLDA), which can alleviate the channel mismatch problem. Inspired by PLDA, the neural PLDA model was proposed to use a neural network to simulate the behavior of PLDA.
Previous studies have mostly focused on the single-enrollment case, and little literature shows how to handle multiple enrollments.
There are two common operations for the multiple-enrollment case.
The first is concatenating the waveforms or acoustic features of the enrollment utterances.
The other is averaging the speaker embeddings of the enrollment utterances.
Let's look at the multiple-enrollment case. Suppose an enrolled speaker has K utterances; note that K varies.
We can concatenate all utterances into one long utterance, as the figure shows. With this operation, the multiple-enrollment case is reduced to the single-enrollment case.
Alternatively, we can extract a speaker embedding for each utterance and then average these embeddings to get a center vector representing the speaker. The following process is then the same as in the single-enrollment case.
Whichever method we use, the multiple-enrollment case is reduced to the single-enrollment case, and we can use the cosine score or a PLDA model to measure the similarity between the enrolled speaker and the test speaker.
However, there is no interaction among enrollment utterances.
First, let's look at the model architecture.
As in the previous description, we extract a speaker embedding for each enrollment utterance. These embeddings are then stacked as a matrix, denoted by capital E.
Then scaled-dot self-attention is used to compute the intra-relationships among these embeddings. We believe that in this way some latent factors are emphasized and others are de-emphasized.
The output matrix of the scaled-dot self-attention is aggregated by the following feed-forward attention module, which outputs a vector, denoted h, to represent the enrolled speaker.
Finally, we compute the cosine similarity between h and the test speaker embedding q. A logistic regression is used here for score calibration.
Next I will introduce our trials sampling method.
With this method, we can introduce the multiple-enrollment process into the training stage, and the interactions among embeddings can update the parameters of the neural network through backpropagation.
OK, I will introduce this method step by step.
First, we load a speaker-balanced mini-batch from the dataset, meaning that the number of embeddings for each speaker in a mini-batch is the same. For example, in the table on the right there are three speakers A, B, and C, and each speaker has four utterances indexed 1 to 4.
Then the data in this mini-batch are rearranged to form (testing, enrollments) trial pairs.
For a positive pair, the test embedding and the enrollment data come from the same speaker: when we select one embedding as the test embedding, the rest of that speaker's embeddings are used as enrollment data.
For a negative pair, the test embedding is marked by the check symbol, and the enrollment data, marked by the three cross symbols, come from the other speakers in the mini-batch.
This page shows the datasets we used in our experiments.
On the left are VoxCeleb1 & 2, which contain more than 7,000 speakers and more than 2,000 hours of speech. All utterances come from YouTube videos of celebrities speaking English.
On the right are CNCeleb1 & 2, Chinese datasets with more than 1,200 hours of speech. Unlike the VoxCeleb datasets, which only include interview data, CNCeleb covers 11 different genres, including hard scenarios such as singing and entertainment.
There are several variants of the PLDA model. Here we only consider the widely used Gaussian PLDA model.
Epsilon
Although some literature, such as references 1 & 2, has adapted the PLDA scoring method to multi-session enrollment, the reported performance was not ideal even compared with simple averaging or concatenation.