These are the slides used for an invited tutorial on "end-to-end text-to-speech synthesis," given at the IEICE SP workshop held on 27 January 2019.
Part 2: Tacotron and related end-to-end systems
Presenters: Xin Wang, Yusuke Yasuda (National Institute of Informatics, Japan)
Tutorial on neural vocoders at the 2021 Speech Processing Courses in Crete, "Inclusive Neural Speech Synthesis."
Presenters: Xin Wang and Junichi Yamagishi, National Institute of Informatics, Japan
Presentation for SSW11: "Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance"
Presenter: Hieu-Thi Luong
Preprint: https://arxiv.org/abs/2106.13479
These are the slides for the presentation titled "Neural source-waveform model," given at ICASSP 2019 in Brighton, UK.
Presenter: Xin Wang, National Institute of Informatics, Japan
Deep Learning Based Voice Activity Detection and Speech Enhancement (NAVER Engineering)
Presenter: Juntae Kim (Ph.D. candidate, KAIST)
Date: October 2018
Voice activity detection (VAD) and speech enhancement (SE) are important front-end technologies for noise-robust speech recognition systems.
From the incoming noisy signal, VAD detects only the speech segments, while SE removes the noise while preserving the speech signal.
For VAD and SE, this presentation will cover the traditional methods, deep learning based methods, and our papers as follows:
1. J. Kim and M. Hahn, "Voice Activity Detection Using an Adaptive Context Attention Model," in IEEE Signal Processing Letters, vol. 25, no. 8, pp. 1181-1185, Aug. 2018.
2. J. Kim and M. Hahn, "Speech Enhancement Using a Two Step Network," submitted to IEEE Signal Processing Letters, 2018.
This presentation will also briefly introduce some experimental results in a real-world environment (far-field, noisy), conducted on an embedded board.
For VAD,
Traditional VAD methods.
Deep learning based VAD methods.
Paper presentation: J. Kim and M. Hahn, "Voice Activity Detection Using an Adaptive Context Attention Model," in IEEE Signal Processing Letters, vol. 25, no. 8, pp. 1181-1185, Aug. 2018.
End point detection based on VAD.
Experimental results of DNN-EPD on embedded board in real-world environment.
For SE,
Traditional SE methods.
Deep learning based SE methods.
Paper presentation: J. Kim and M. Hahn, "Speech Enhancement Using a Two Step Network," submitted to IEEE Signal Processing Letters, 2018.
Experimental results in real-world environment.
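As a point of reference for the "traditional VAD methods" item above, a frame-energy threshold detector is the classic baseline. The following is a minimal sketch (the frame length, threshold ratio, and toy signal are illustrative assumptions, not taken from the talk):

```python
# Minimal energy-threshold VAD sketch: a frame is labeled "speech" when
# its short-time energy exceeds a multiple of the estimated noise floor.
from typing import List

def energy_vad(samples: List[float], frame_len: int = 160,
               ratio: float = 4.0) -> List[bool]:
    """Return one speech/non-speech decision per frame."""
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    energies = [sum(x * x for x in f) / frame_len for f in frames]
    noise_floor = min(energies) + 1e-12  # crude noise-floor estimate
    return [e > ratio * noise_floor for e in energies]

# Toy signal: quiet, then a louder "speech" burst, then quiet again.
sig = [0.01] * 320 + [0.5] * 320 + [0.01] * 320
print(energy_vad(sig))  # [False, False, True, True, False, False]
```

Deep-learning VAD replaces the hand-set threshold with a learned classifier over richer features, which is what the papers above address.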
Time-series forecasting of indoor temperature using pre-trained Deep Neural N... (Francisco Zamora-Martinez)
Artificial neural networks have proved to be good at time-series forecasting
problems and have been widely studied in the literature. Traditionally, shallow
architectures were used due to convergence problems when dealing with deep
models. Recent research findings enable the training of deep architectures,
opening an interesting new research area called deep learning. This paper
presents a study of deep learning techniques applied to time-series forecasting
in a real indoor temperature forecasting task, studying performance under
different hyper-parameter configurations. When using deep models, better
generalization performance on the test set and reduced over-fitting were observed.
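The forecasting task can be made concrete with a small baseline. The sketch below sets up sliding-window, one-step-ahead prediction with a plain least-squares autoregressive model; it is not the paper's deep architecture, and the series, window order, and noise level are invented for illustration:

```python
import numpy as np

# Sliding-window setup for one-step-ahead forecasting, with a plain
# least-squares autoregressive (AR) baseline standing in for the model.
def make_windows(series, order):
    X = np.array([series[i:i + order] for i in range(len(series) - order)])
    y = np.array(series[order:])
    return X, y

rng = np.random.default_rng(0)
t = np.arange(200)
series = np.sin(2 * np.pi * t / 24) + 0.05 * rng.standard_normal(200)

X, y = make_windows(series, order=24)      # 24-step history per example
w, *_ = np.linalg.lstsq(X, y, rcond=None)  # fit AR coefficients
rmse = float(np.sqrt(np.mean((X @ w - y) ** 2)))
print(f"train RMSE: {rmse:.3f}")
```

A deep model would replace the linear map `w` with a stacked network trained on the same window/target pairs.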
An Efficient DSP Based Implementation of a Fast Convolution Approach with non... (a3labdsp)
"Finite impulse response convolution is one of the most widely used operations in the digital signal processing field for filtering. In this context, techniques with low computational demands become essential for calculating convolutions with low input/output latency in real scenarios, considering that the real-time requirements are strictly related to the impulse response length. An efficient DSP implementation of a fast convolution approach is therefore presented with the aim of lowering the workload required in applications such as reverberation. It is based on a non-uniform partitioning of the impulse response and a psychoacoustic technique derived from the sensitivity of the human ear. Several results are reported in order to prove the effectiveness of the proposed approach, also introducing comparisons with existing state-of-the-art techniques."
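The core idea behind partitioned fast convolution can be sketched as follows. Note that the paper's contribution is a NON-uniform partitioning with a psychoacoustic technique; this sketch shows only the simpler uniformly partitioned FFT scheme, with illustrative block sizes:

```python
import numpy as np

# Uniformly partitioned FFT convolution: split the impulse response into
# equal blocks, convolve each block in the frequency domain, and
# overlap-add the delayed results. (The paper extends this idea with
# NON-uniform partitions to balance latency against workload.)
def partitioned_fft_convolve(x, h, block=64):
    n_fft = 2 * block
    parts = [h[i:i + block] for i in range(0, len(h), block)]
    H = [np.fft.rfft(p, n_fft) for p in parts]  # pre-computed spectra
    y = np.zeros(len(x) + len(h) - 1)
    for start in range(0, len(x), block):
        Xb = np.fft.rfft(x[start:start + block], n_fft)
        for k, Hk in enumerate(H):
            seg = np.fft.irfft(Xb * Hk, n_fft)
            pos = start + k * block           # delay of the k-th partition
            end = min(pos + n_fft, len(y))
            y[pos:end] += seg[:end - pos]
    return y

rng = np.random.default_rng(1)
x, h = rng.standard_normal(500), rng.standard_normal(200)
assert np.allclose(partitioned_fft_convolve(x, h), np.convolve(x, h))
```

Because each block's FFT is only twice the partition size, the input/output latency is one partition rather than the full impulse-response length.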
Resume of Gaurang Rathod, Embedded Software Developer (Gaurang Rathod)
o 2.5+ years of experience in the embedded systems domain
o Expertise in the C language, OS concepts, and ARM Cortex-M3/M4 architecture
o Strong electronics engineering and research background
https://telecombcn-dl.github.io/2017-dlsl/
Winter School on Deep Learning for Speech and Language. UPC BarcelonaTech ETSETB TelecomBCN.
The aim of this course is to train students in methods of deep learning for speech and language. Recurrent Neural Networks (RNN) will be presented and analyzed in detail to understand the potential of these state of the art tools for time series processing. Engineering tips and scalability issues will be addressed to solve tasks such as machine translation, speech recognition, speech synthesis or question answering. Hands-on sessions will provide development skills so that attendees can become competent in contemporary data analytics tools.
Relevance-Based Compression of Cataract Surgery Videos Using Convolutional Ne... (Alpen-Adria-Universität)
Recorded cataract surgery videos play a prominent role in training and investigating the surgery, and enhancing the surgical outcomes. Due to storage limitations in hospitals, however, the recorded cataract surgeries are deleted after a short time and this precious source of information cannot be fully utilized. Lowering the quality to reduce the required storage space is not advisable since the degraded visual quality results in the loss of relevant information that limits the usage of these videos. To address this problem, we propose a relevance-based compression technique consisting of two modules: (i) relevance detection, which uses neural networks for semantic segmentation and classification of the videos to detect relevant spatio-temporal information, and (ii) content-adaptive compression, which restricts the amount of distortion applied to the relevant content while allocating less bitrate to irrelevant content. The proposed relevance-based compression framework is implemented considering five scenarios based on the definition of relevant information from the target audience’s perspective. Experimental results demonstrate the capability of the proposed approach in relevance detection. We further show that the proposed approach can achieve high compression efficiency by abstracting substantial redundant information while retaining the high quality of the relevant content.
Digital Watermarking Applications and Techniques: A Brief Review (Editor IJCATR)
Digital data such as audio, images, and videos became widely available to the public through the expansion of the internet. Digital watermarking technology is being adopted to ensure and facilitate data authentication, security, and copyright protection of digital media, and it is regarded as a key technology for preventing the illegal copying of data. Digital watermarking can be applied to audio, video, text, or images. This paper includes a detailed study of the definition of watermarking and of the various watermarking applications and techniques used to enhance data security.
Non-autoregressive neural text-to-speech review (June-Woo Kim)
Peng, Kainan, et al. "Non-autoregressive neural text-to-speech." International Conference on Machine Learning. PMLR, 2020. Review by June-Woo Kim.
Parallel WaveGAN, Yamamoto, Ryuichi, Eunwoo Song, and Jae-Min Kim. "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram." ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020. review by June-Woo Kim
Bringing HPC Algorithms to Big Data Platforms: Spark Summit East talk by Niko... (Spark Summit)
The talk will present an MPI-based extension of the Spark platform developed in the context of light source facilities. The background and rationale of this extension are described in the attached paper “Bringing the HPC reconstruction algorithms to Big Data platforms” [1], which was presented at the New York Scientific Data Summit (NYSDS), August 14-17, 2016 (talk: https://www.bnl.gov/nysds16/files/pdf/talks/NYSDS16%20Malitsky.pdf). Specifically, the paper highlighted a gap between two modern driving forces of the scientific discovery process: HPC and Big Data technologies. As a result, it proposed to extend the Spark platform with inter-worker communication for supporting scientific-oriented parallel applications. The approach was illustrated in the context of the Spark-based deployment of the SHARP MPI/GPU ptychographic solver. Aside from its practical value, this application represents a reference use case that captures the major technical aspects of other reconstruction tasks. In the NYSDS’16 paper, the implemented approach followed the CaffeOnSpark RDMA peer-to-peer model and augmented it with an RDMA address exchange server. By the Spark Summit, we plan to further advance this direction with the Spark-MPI generic solution based on the Hydra process management framework for supporting two major MPI implementations, MPICH and MVAPICH.
Meetup MLDD: Machine Learning Dresden, 8th May 2018
Signals from outer space
How NASA Benefits from Graph-Powered NLP
Vlasta Kus talked about the advantages of graph-based natural language processing (NLP) using a public NASA dataset as example. From his abstract: "[...] we are building a platform (from large part open-source) that integrates Neo4j and NLP (such as Named Entity Recognition, sentiment analysis, word embeddings, LDA topic extraction), and we test and develop further related features and tools, lately, for example, integrating Neo4j and Tensorflow for employing deep learning techniques (such as deep auto-encoders for automatic text summarisation)."
Vlasta holds a Ph.D. in Physics from the Charles University in Prague and has worked for SecureOps, as a freelance Data Scientist, and since 2017 as a Data Scientist at GraphAware (https://graphaware.com/), a London-based company that builds solutions around Neo4j.
Neural Networks, Spark MLlib, Deep Learning (Asim Jalis)
What are neural networks? How to use the neural networks algorithm in Apache Spark MLlib? What is Deep Learning? Presented at Data Science Meetup at Galvanize on 2/17/2016.
For code see IPython/Jupyter/Toree notebook at http://nbviewer.jupyter.org/gist/asimjalis/4f911882a1ab963859ce
Introduction to Next-Generation Sequencing (NGS) Technology (QIAGEN)
The continuous evolution of NGS technology has led to an enormous diversification in NGS applications and dramatically decreased the costs to sequence a complete human genome.
In this presentation, we will discuss the following major topics:
• Basic overview of NGS sequencing technologies
• Next-generation sequencing workflow
• Spectrum of NGS applications
• QIAGEN universal NGS solutions
Robust Feature Learning with Deep Neural Networks
http://snu-primo.hosted.exlibrisgroup.com/primo_library/libweb/action/display.do?tabs=viewOnlineTab&doc=82SNU_INST21557911060002591
NumPyCNNAndroid: A Library for Straightforward Implementation of Convolutiona... (Ahmed Gad)
The presentation of my paper titled "NumPyCNNAndroid: A Library for Straightforward Implementation of Convolutional Neural Networks for Android Devices" at the second International Conference on Innovative Trends in Computer Engineering (ITCE 2019).
The paper proposes a library for implementing convolutional neural networks (CNNs) so that they can run on Android devices. Running the CNN on mobile devices is straightforward and does not require an intermediate model-conversion step, as the library uses the Kivy cross-platform framework.
The CNN layers are implemented in NumPy. You can find their implementation in my GitHub project at this link: https://github.com/ahmedfgad/NumPyCNN
The library is also open source and available here: https://github.com/ahmedfgad/NumPyCNNAndroid
There are two modes of operation for this work. The first is training the CNN on the mobile device, but this is very time-consuming, at least in the current version. The second and preferred way is to train the CNN on a desktop computer and then use it on the mobile device.
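The from-scratch spirit of a NumPy CNN layer can be illustrated with a minimal 2-D convolution; this is a hypothetical sketch of the idea, not the actual NumPyCNN API:

```python
import numpy as np

# Minimal 2-D "valid" convolution (cross-correlation) as used in a
# from-scratch CNN layer; illustrative helper, not the NumPyCNN API.
def conv2d(image, kernel):
    kh, kw = kernel.shape
    out = np.empty((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
print(conv2d(img, np.ones((3, 3))))  # each entry is a 3x3 patch sum
```

Because everything is plain NumPy, the same code runs unchanged under Kivy on a desktop or on an Android device.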
A large and growing amount of speech content in real-life scenarios is being recorded on consumer-grade devices in uncontrolled environments, resulting in degraded speech quality. Transforming such low-quality, device-degraded speech into high-quality speech is a goal of speech enhancement (SE). This paper introduces a new speech dataset, DDS, to facilitate research on SE. DDS provides aligned parallel recordings of high-quality speech (recorded in professional studios) and a number of versions of low-quality speech, producing approximately 2,000 hours of speech data. The DDS dataset covers 27 realistic recording conditions by combining diverse acoustic environments and microphone devices, and each version of a condition consists of multiple recordings from six microphone positions to simulate different noise and reverberation levels. We also test several SE baseline systems on the DDS dataset and show the impact of recording diversity on performance.
Paper: https://arxiv.org/abs/2109.07931
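A basic operation behind creating degraded/clean parallel data is mixing clean speech with noise at a target signal-to-noise ratio. The sketch below is illustrative only; DDS itself uses real re-recordings in diverse conditions rather than simulated mixing, and the signals here are toy sequences:

```python
import math

# Mix clean speech with noise at a target SNR (in dB) by scaling the
# noise to the required power; signals are plain sample lists here.
def mix_at_snr(clean, noise, snr_db):
    p_clean = sum(x * x for x in clean) / len(clean)
    p_noise = sum(n * n for n in noise) / len(noise)
    scale = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return [c + scale * n for c, n in zip(clean, noise)]

clean = [math.sin(0.1 * i) for i in range(1000)]
noise = [((i * 37) % 100) / 100 - 0.5 for i in range(1000)]  # toy noise
mixed = mix_at_snr(clean, noise, 10.0)
```

SE baselines are then trained on such (degraded, clean) pairs to learn the inverse mapping.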
Presentation for Interspeech 2022: "The VoiceMOS Challenge 2022"
Presenter: Dr. Erica Cooper, National Institute of Informatics
Preprint: https://arxiv.org/abs/2203.11389
Video: https://youtu.be/99ZQ-SLUvKE
Challenge website: https://voicemos-challenge-2022.github.io
Thu-SS-OS-9-5
We present the first edition of the VoiceMOS Challenge, a scientific event that aims to promote the study of automatic prediction of the mean opinion score (MOS) of synthetic speech. This challenge drew 22 participating teams from academia and industry who tried a variety of approaches to tackle the problem of predicting human ratings of synthesized speech. The listening test data for the main track of the challenge consisted of samples from 187 different text-to-speech and voice conversion systems spanning over a decade of research, and the out-of-domain track consisted of data from more recent systems rated in a separate listening test. Results of the challenge show the effectiveness of fine-tuning self-supervised speech models for the MOS prediction task, as well as the difficulty of predicting MOS ratings for unseen speakers and listeners, and for unseen systems in the out-of-domain setting.
Presentation for Interspeech 2022: "Analyzing Language-Independent Speaker Anonymization Framework under Unseen Conditions"
Presenter: Dr. Xiaoxiao Miao, National Institute of Informatics
Thu-O-OS-9-1
Video: https://youtu.be/wVIxyLiQa1Y
Preprint: https://arxiv.org/abs/2203.14834
In our previous work, we proposed a language-independent speaker anonymization system based on self-supervised learning models. Although the system can anonymize speech data of any language, the anonymization was imperfect, and the speech content of the anonymized speech was distorted. This limitation is more severe when the input speech is from a domain unseen in the training data. This study analyzed the bottleneck of the anonymization system under unseen conditions. It was found that the domain (e.g., language and channel) mismatch between the training and test data affected the neural waveform vocoder and anonymized speaker vectors, which limited the performance of the whole system. Increasing the training data diversity for the vocoder was found to be helpful to reduce its implicit language and channel dependency. Furthermore, a simple correlation-alignment-based domain adaption strategy was found to be significantly effective to alleviate the mismatch on the anonymized speaker vectors. Audio samples and source code are available online.
Presentation for Interspeech 2022: Spoofing-aware Attention Back-end with Multiple Enrollment and Novel Trials Sampling Strategy for SASVC 2022
Presenter: Chang Zeng (National Institute of Informatics and SOKENDAI)
Wed-SS-OS-6-5
Presentation video: https://youtu.be/gXxP1nn5X6E
The Spoofing-Aware Speaker Verification Challenge (SASVC) 2022 has been organized to explore the relationship between automatic speaker verification (ASV) and spoofing countermeasures (CM). In this paper, we introduce our spoofing-aware attention back-end developed for SASVC 2022. First, we design a novel sampling strategy for simulating a realistic verification scenario. Then, in order to fully leverage information derived from multiple enrollments, a spoofing-aware attention back-end is proposed. Finally, a joint decision strategy is applied to introduce mutual interaction between the ASV module and the CM module. Compared with the trial sampling method used in the baseline systems, our proposed sampling method shows effective improvement without any attention modules. The experimental results show that our proposed spoofing-aware attention back-end improves performance on the evaluation dataset from 6.37% for the best baseline system to 1.19% in terms of the SASV-EER (equal error rate) metric.
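The SASV-EER metric reported above is an equal error rate computed over verification scores. A minimal sketch of a generic EER computation (with invented scores, not challenge data) is:

```python
# Generic equal error rate (EER): sweep thresholds and find the point
# where the false acceptance and false rejection rates cross.
def compute_eer(target_scores, nontarget_scores):
    best_gap, eer = float("inf"), 1.0
    for th in sorted(set(target_scores) | set(nontarget_scores)):
        far = sum(s >= th for s in nontarget_scores) / len(nontarget_scores)
        frr = sum(s < th for s in target_scores) / len(target_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Invented scores: higher means "more likely a valid target speaker".
print(compute_eer([0.9, 0.8, 0.7, 0.6, 0.3], [0.5, 0.4, 0.2, 0.1, 0.65]))  # 0.2
```

In the SASV setting, the non-target pool contains both zero-effort impostor and spoofed trials, so a single EER reflects both ASV and CM errors.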
Presenter: Dr. Xiaoxiao Miao, NII
Paper: https://arxiv.org/abs/2202.13097
Speaker anonymization aims to protect the privacy of speakers while preserving spoken linguistic information from speech. Current mainstream neural network speaker anonymization systems are complicated, containing an F0 extractor, speaker encoder, automatic speech recognition acoustic model (ASR AM), speech synthesis acoustic model and speech waveform generation model. Moreover, as an ASR AM is language-dependent, trained on English data, it is hard to adapt it into another language. In this paper, we propose a simpler self-supervised learning (SSL)-based method for language-independent speaker anonymization without any explicit language-dependent model, which can be easily used for other languages. Extensive experiments were conducted on the VoicePrivacy Challenge 2020 datasets in English and AISHELL-3 datasets in Mandarin to demonstrate the effectiveness of our proposed SSL-based language-independent speaker anonymization method.
Presenter: Dr. Xin Wang, NII
Paper: https://arxiv.org/abs/2111.07725
Self-supervised speech models are a rapidly progressing research topic, and many pre-trained models have been released and used in various downstream tasks. For speech anti-spoofing, most countermeasures (CMs) use signal processing algorithms to extract acoustic features for classification. In this study, we use pre-trained self-supervised speech models as the front end of spoofing CMs. We investigated different back-end architectures to be combined with the self-supervised front end, the effectiveness of fine-tuning the front end, and the performance of using different pre-trained self-supervised models. Our findings showed that, when a good pre-trained front end was fine-tuned with either a shallow or a deep neural-network-based back end on the ASVspoof 2019 logical access (LA) training set, the resulting CM not only achieved a low EER score on the 2019 LA test set but also significantly outperformed the baseline on the ASVspoof 2015, 2021 LA, and 2021 deepfake test sets. A sub-band analysis further demonstrated that the CM mainly used the information in a specific frequency band to discriminate the bona fide and spoofed trials across the test sets.
SSW11 presentation: How do Voices from Past Speech Synthesis Challenges Compare Today?
Presenter: Erica Cooper
Preprint: https://arxiv.org/abs/2105.02373
These are the slides used for an invited tutorial on "end-to-end text-to-speech synthesis," given at the IEICE SP workshop held on 27 January 2019.
Part 1: Neural waveform modeling
Presenters: Xin Wang, Yusuke Yasuda (National Institute of Informatics, Japan)
UiPath Test Automation using UiPath Test Suite series, part 3 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Securing your Kubernetes cluster: a step-by-step guide to success! (KatiaHIMEUR1)
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Neuro-symbolic is not enough, we need neuro-*semantic* (Frank van Harmelen)
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
GraphRAG is All You Need? LLM & Knowledge Graph (Guy Korland)
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Elevating Tactical DDD Patterns Through Object Calisthenics (Dorra BARTAGUIZ)
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
UiPath Test Automation using UiPath Test Suite series, part 4 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Generating a custom Ruby SDK for your web service or Rails API using Smithy (g2nightmarescribd)
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
Connector Corner: Automate dynamic content and events by pushing a button
Neural Waveform Modeling
1. Slides by Xin Wang
National Institute of Informatics
Copyright (c) 2018 - 2019
National Institute of Informatics
Department of Computer Science
Some rights reserved.
This work is licensed under the Creative Commons Attribution 3.0 license.
See http://creativecommons.org/ for details.
Note: natural Japanese speech data belonging to the ATR Ximera corpus have been deleted
from this publicly available version
2. Neural Waveform Modeling
from our experience in text-to-speech applications
contact: wangxin@nii.ac.jp
we welcome critical comments, suggestions, and discussion
Xin WANG, with Shinji Takaki and Junichi Yamagishi
National Institute of Informatics, Japan
NLP lecture series, IIS
Erlangen, Germany, 2019
3. Postdoc, Yamagishi-lab, NII
Research keywords:
• Text-to-speech synthesis (TTS)
1. Neural network
2. Hidden Markov model
• Speech anti-spoofing
SELF-INTRODUCTION
WANG Xin (王鑫)
Pronunciation: one shin
☛ Research-map page: https://researchmap.jp/wangxin/?lang=english
☛ Personal page: http://tonywangx.github.io
6. INTRODUCTION
Text-to-speech synthesis: statistical parametric speech synthesis [1]
Pipeline: Text → Front-end (text analyzer) → linguistic features → Acoustic models → acoustic features → Waveform generator → Speech waveform
Example input: “Marianna made the marmalade”
• Linguistic features: phoneme sequence /m/ /ɛ/ /r/ …, accent H* on “Marianna”, parse (S (NP (N Marianna)) (VP (V made) (NP (ART the)) (N marmalade)))
• Acoustic features: Mel-spectrum, F0, band-aperiodicity, etc.
1. H. Zen, K. Tokuda, and A. W. Black. Statistical parametric speech synthesis. Speech Communication, 51:1039–1064, 2009.
7. INTRODUCTION
Text-to-speech synthesis: recent TTS frameworks
• Conventional pipeline: Text → Front-end (text analyzer) → Acoustic models → Waveform generator → Waveform
• Unified back-end: Text → Front-end (text analyzer) → [attention-based acoustic model + waveform module] → Waveform
• ‘End-to-end’ TTS system: Text → Pre-processing (trimmed front-end) → attention-based acoustic model → waveform module → Waveform
A. van den Oord, et al. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
Y. Wang, et al. Tacotron: Towards end-to-end speech synthesis. In Proc. Interspeech, pages 4006–4010, 2017.
J. Shen, et al. Natural TTS synthesis by conditioning WaveNet on Mel spectrogram predictions. In Proc. ICASSP, pages 4779–4783, 2018.
14. THEORY: AR NEURAL WAVEFORM MODEL
Overview
• AR neural models (rooted in the Jordan network): WaveNet, SampleRNN, WaveRNN, FFTNet, LPCNet, ExcitNet, GlotNet
• Flow-based models: WaveGlow, FloWaveNet, Parallel WaveNet, ClariNet
• Naïve model
• Neither AR nor flow: MCNN, GELP, neural source-filter model (NSF)
o No explicit AR structure
o Spectral-domain training criterion
o Source-filter architecture
Michael I. Jordan. Serial order: A parallel distributed processing approach. Technical Report 8604, Institute for Cognitive Science, 1986.
15. THEORY: AR NEURAL WAVEFORM MODEL
General idea
Training: teacher forcing [1] — for t = 1 … T, the network predicts waveform sample t while being conditioned on the natural waveform samples 1 … t−1, not on its own past outputs.
1 R. J. Williams and D. Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270–280, 1989.
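The teacher-forcing idea above can be sketched with a toy linear AR(2) predictor standing in for the neural network (the predictor, learning rate, and damped-sine “waveform” are all made up for illustration): training always conditions on natural past samples, while generation is free-running on the model's own outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the network: o_t ≈ w[0]*o_{t-1} + w[1]*o_{t-2}
w = rng.normal(size=2) * 0.1

def predict(prev2, prev1, w):
    return w[0] * prev1 + w[1] * prev2

# A damped sine satisfies an AR(2) recursion exactly, so the toy model can fit it.
t = np.arange(200)
o = np.sin(0.2 * t) * 0.99 ** t

# Teacher forcing: inputs at step t are the NATURAL samples o_{t-1}, o_{t-2}.
lr = 0.05
for epoch in range(300):
    for i in range(2, len(o)):
        err = predict(o[i - 2], o[i - 1], w) - o[i]
        w -= lr * err * np.array([o[i - 1], o[i - 2]])  # grad of squared error

preds = np.array([predict(o[i - 2], o[i - 1], w) for i in range(2, len(o))])
mse = float(np.mean((o[2:] - preds) ** 2))

# Generation is free-running: inputs are the model's OWN previous outputs.
gen = [o[0], o[1]]
for i in range(2, 50):
    gen.append(predict(gen[i - 2], gen[i - 1], w))
print("teacher-forced training MSE:", mse)
```

The mismatch between the two conditioning regimes (natural past at training, generated past at generation) is exactly what the teacher-forcing slides highlight.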
18. THEORY: AR NEURAL WAVEFORM MODEL
AR neural models: WaveNet, SampleRNN, WaveRNN, FFTNet, LPCNet, ExcitNet, GlotNet
WaveNet
• Tractable probability & powerful AR dependency
• Slow sequential generation & only left-to-right dependency
WaveRNN [1]
• Batch-sampling: faster generation
• Subscale dependency: more than left-to-right dependency
LPCNet & GlotNet [2,3]
• Classical AR + neural AR
1. N. Kalchbrenner, et al. Efficient neural audio synthesis. In Proc. ICML, volume 80, pages 2410–2419, 2018.
2. J.-M. Valin and J. Skoglund. LPCNet: Improving neural speech synthesis through linear prediction. In Proc. ICASSP, pages 5891–5895, 2019.
3. L. Juvela, et al. Speaker-independent raw waveform model for glottal excitation. In Proc. Interspeech, pages 2012–2016, 2018.
19. THEORY: FLOW-BASED MODELS
Flow-based models: WaveGlow, FloWaveNet, Parallel WaveNet, ClariNet
Fast generation?
20. THEORY: FLOW-BASED MODELS
Revisit the AR model
Consider an AR model using a Gaussian distribution: a network computes μ_t and σ_t from o_{<t}, i.e., p(o_t | o_{<t}) = N(o_t; μ_t, σ_t²).
Training: maximize Π_t N(o_t; μ_t, σ_t²) over the natural waveform o_{1:T}.
Generation: for t = 1 … T, sample n_t ~ N(0, 1) and set o_t = μ_t + σ_t · n_t.
Or equivalently, n_t = (o_t − μ_t) / σ_t.
G. Papamakarios, T. Pavlakou, and I. Murray. Masked autoregressive flow for density estimation. Proc. NIPS, pages 2338–2347, 2017.
21. THEORY: FLOW-BASED MODELS
Revisit the AR model
The two directions form a pair of invertible transforms (z⁻¹ denotes a time delay; see proof in appendix):
• Training: n_{1:T} = H(o_{1:T}), with n_t = (o_t − μ_t) / σ_t
• Generation: o_{1:T} = H⁻¹(n_{1:T}), with o_t = μ_t + σ_t · n_t
G. Papamakarios, T. Pavlakou, and I. Murray. Masked autoregressive flow for density estimation. Proc. NIPS, pages 2338–2347, 2017.
22. THEORY: FLOW-BASED MODELS
Revisit the AR model
Such an AR model is a flow-based model (z⁻¹ denotes a time delay; see proof in appendix):
Training:
1. Transform o_{1:T} to n_{1:T}
2. Maximize the likelihood of n_{1:T} under N(n_t; 0, 1)
Generation:
1. Sample n_t from N(n_t; 0, 1)
2. Transform n_t to o_t
3. Repeat from t = 1 to t = T
G. Papamakarios, T. Pavlakou, and I. Murray. Masked autoregressive flow for density estimation. Proc. NIPS, pages 2338–2347, 2017.
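The training and generation transforms above can be made concrete with a tiny numeric sketch; here a fixed AR(1)-style rule (`nn`, purely hypothetical) stands in for the trained network that outputs μ_t and σ_t from the past samples.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for the NN: maps past samples o_{<t} to (mu_t, sigma_t).
def nn(past):
    mu = 0.9 * past[-1] if len(past) else 0.0
    return mu, 0.5

# Training direction: transform o_{1:T} -> n_{1:T} with n_t = (o_t - mu_t)/sigma_t,
# then maximise the N(n_t; 0, 1) likelihood of every n_t.  (With a causal/masked
# network all t can be computed in one parallel pass over the natural waveform.)
def to_noise(o):
    n = np.empty_like(o)
    for t in range(len(o)):          # mu_t, sigma_t depend only on o_{<t}
        mu, sigma = nn(o[:t])
        n[t] = (o[t] - mu) / sigma
    return n

# Generation direction (inherently sequential): o_t = mu_t + sigma_t * n_t.
def to_waveform(n):
    o = []
    for t in range(len(n)):
        mu, sigma = nn(np.array(o))
        o.append(mu + sigma * n[t])
    return np.array(o)

n = rng.standard_normal(100)
o = to_waveform(n)       # sample a "waveform" from noise
n_back = to_noise(o)     # invert it back to the noise
print(float(np.max(np.abs(n - n_back))))
```

Because the two functions are exact inverses, the recovered noise matches the original, which is precisely what makes the Gaussian AR model a (normalizing) flow.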
23. THEORY: FLOW-BASED MODELS
From AR to inverse-AR flow-based model (z⁻¹ denotes a time delay)
• AR flow: training uses H(·), generation uses H⁻¹(·)
• Inverse-AR flow: the roles are swapped — training uses H⁻¹(·), generation uses H(·)
24. THEORY: FLOW-BASED MODELS
From AR to inverse-AR flow-based model (z⁻¹ denotes a time delay)
• AR flow: ✓ parallel training O(1), ! sequential generation O(T)
• Inverse-AR flow: ! sequential training O(T), ✓ parallel generation O(1)
Knowledge distilling combines the two: Parallel WaveNet & ClariNet
25. THEORY: FLOW-BASED MODELS
Flow-based models in practice
Inverse AR flow: Parallel WaveNet [3] & ClariNet [4]
• Knowledge distilling is complicated
WaveGlow [1] & FloWaveNet [2]
• Fast generation & slow training
1. R. Prenger, R. Valle, and B. Catanzaro. WaveGlow: A flow-based generative network for speech synthesis. In Proc. ICASSP, 2019.
2. S. Kim, S.-g. Lee, J. Song, and S. Yoon. FloWaveNet: A generative flow for raw audio. In Proc. ICML, 2019.
3. A. van den Oord, Y. Li, I. Babuschkin, et al. Parallel WaveNet: Fast high-fidelity speech synthesis. In Proc. ICML, pages 3918–3926, 2018.
4. W. Ping, K. Peng, and J. Chen. ClariNet: Parallel wave generation in end-to-end text-to-speech. In Proc. ICLR, 2019.
26. THEORY: NEURAL SOURCE-FILTER MODEL
Neural source-filter model (NSF) [1] — no AR, no flow:
• Source-filter architecture
• Spectral-domain training criterion
• Faster training & generation
• Easy to implement
Related non-AR, non-flow models: MCNN [2], GELP [3]
1. X. Wang, et al. Neural source-filter-based waveform model for statistical parametric speech synthesis. In Proc. ICASSP, pages 5916–5920, 2019.
2. S. Ö. Arık, et al. Fast spectrogram inversion using multi-head convolutional neural networks. IEEE Signal Processing Letters, 26(1):94–98, 2018.
3. L. Juvela, et al. GELP: GAN-excited linear prediction for speech synthesis from mel-spectrogram. In Proc. Interspeech, 2019.
27. THEORY: NEURAL SOURCE-FILTER MODEL
General idea
• No AR or inverse-AR flow
• A ‘source’ module converts F0/pitch into an excitation signal (samples 1 … T)
• A ‘filter’ module converts the excitation into the generated waveform, which is compared against the natural waveform
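A rough sketch of such a source signal, assuming a sine-based excitation whose phase is the running integral of the upsampled F0, with noise in unvoiced regions (the actual NSF source module is richer, e.g. it adds harmonics and trainable components, and all constants below are illustrative):

```python
import numpy as np

sr = 16000      # sampling rate (Hz)
hop = 80        # 5 ms frame shift

# Frame-level F0 in Hz; 0 marks unvoiced frames (values made up for illustration).
f0_frames = np.array([0.0, 220.0, 220.0, 230.0, 0.0])

# Upsample F0 to the sample level.
f0 = np.repeat(f0_frames, hop)

# Sine 'source': phase accumulates the instantaneous frequency; unvoiced
# regions get low-level Gaussian noise instead of the sine.
rng = np.random.default_rng(0)
phase = 2 * np.pi * np.cumsum(f0 / sr)
source = np.where(f0 > 0,
                  0.1 * np.sin(phase),            # voiced: sine excitation
                  0.003 * rng.standard_normal(len(f0)))  # unvoiced: noise

print(source.shape)
```

The filter module (a neural network in NSF) then transforms this excitation into the output waveform.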
28. THEORY: NEURAL SOURCE-FILTER MODEL
General idea
• Training criterion based on the short-time Fourier transform (STFT): a spectral distance between the generated waveform and the natural waveform (both conditioned on the same F0/pitch input)
29. THEORY: NEURAL SOURCE-FILTER MODEL
Probabilistic interpretation?
The model maps F0/pitch to a generated waveform and compares it with the natural waveform through a spectral distance — but what is the corresponding probabilistic model?
30. THEORY: NEURAL SOURCE-FILTER MODEL
Probabilistic interpretation?
• Spectral distance: both waveforms are framed (frame length D) and transformed with a K-point FFT; the distance is computed between the resulting spectra of the generated and natural frames.
31. THEORY: NEURAL SOURCE-FILTER MODEL
Probabilistic interpretation?
• After framing (frame length D) and a K-point FFT, minimizing the spectral distance can be interpreted as maximizing a likelihood over a Gaussian distribution in the spectral domain (for explanation, over the spectral power vectors).
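One plausible realization of the framing → FFT → distance pipeline described above (the Hann window, frame sizes, and log-power form here are illustrative choices, not necessarily the paper's exact settings):

```python
import numpy as np

def stft_power(x, frame_len=320, hop=80, n_fft=512):
    """Framing + windowing + K-point FFT -> per-frame power spectra."""
    win = np.hanning(frame_len)
    frames = [x[i:i + frame_len] * win
              for i in range(0, len(x) - frame_len + 1, hop)]
    spec = np.fft.rfft(np.stack(frames), n=n_fft, axis=1)
    return np.abs(spec) ** 2

def spectral_distance(generated, natural, floor=1e-5):
    """Mean squared distance between log power spectra of two waveforms."""
    pg = np.log(stft_power(generated) + floor)
    pn = np.log(stft_power(natural) + floor)
    return float(np.mean((pg - pn) ** 2))

t = np.arange(16000) / 16000.0
natural = np.sin(2 * np.pi * 220 * t)
d_same = spectral_distance(natural, natural)                  # identical -> 0
d_diff = spectral_distance(np.sin(2 * np.pi * 330 * t), natural)
print(d_same, d_diff)
```

Because the criterion lives in the spectral domain, it can be evaluated (and backpropagated through) in parallel over all frames, with no sequential AR loop.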
35. PRACTICE: WAVENET
WaveNet variants: discretized (softmax output) or continuous-valued (GMM/Gaussian output) waveforms
• Two practical issues:
1. How to generate waveform samples?
2. How to train WaveNet-Gaussian?
36. PRACTICE: WAVENET
WaveNet variants: discretized (softmax) or continuous-valued (GMM/Gaussian) waveforms
Other variants
• WaveNet using a mixture of logistic distributions [1]
• WaveNet + spline [2]
• Quantization noise shaping [3], related noise-shaping method [4]
1. T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017.
2. Y. Agiomyrgiannakis. B-spline PDF: A generalization of histograms to continuous density models for generative audio networks. In Proc. ICASSP, pages 5649–5653, 2018.
3. T. Yoshimura, et al. Mel-cepstrum-based quantization noise shaping applied to neural-network-based speech waveform synthesis. IEEE TASLP, 26(7):1173–1180, 2018.
4. K. Tachibana, T. Toda, Y. Shiga, and H. Kawai. An investigation of noise shaping with perceptual weighting for WaveNet-based speech generation. In Proc. ICASSP, pages 5664–5668, 2018.
37. PRACTICE: WAVENET
Generation strategy: WaveNet-softmax
• Generation as a search problem over the sequence of discrete samples
• Search space: 256^T for an 8-bit waveform of length T
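The 8-bit discretization behind the 256-way softmax is typically obtained with µ-law companding, as in the original WaveNet paper; a minimal sketch of the encode/decode pair:

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Map a waveform in [-1, 1] to 256 discrete levels (8-bit)."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((y + 1) / 2 * mu + 0.5).astype(np.int64)   # indices 0 .. 255

def mu_law_decode(idx, mu=255):
    """Map class indices back to waveform values in [-1, 1]."""
    y = 2 * (idx.astype(np.float64) / mu) - 1
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu

x = np.linspace(-1, 1, 1001)
enc = mu_law_encode(x)
x_hat = mu_law_decode(enc)
print("max reconstruction error:", float(np.max(np.abs(x - x_hat))))
```

The companding allocates more levels to small amplitudes, so the perceptually important low-amplitude region is quantized finely despite only 256 classes.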
38. PRACTICE: WAVENET
Generation strategy: WaveNet-softmax
• Sub-optimal search by
o Exploitation (greedy search)
o Exploration (random sampling)
o Or a mix of both
43. PRACTICE: WAVENET
Generation strategy
WaveNet-softmax
• Exploitation (is greedy best?) & exploration (sampling?)
• Other strategy: temperature of the softmax [1]
WaveNet-Gaussian
• Infinite search space: finding the exact best sample is impossible
• Same strategies as WaveNet-softmax
1. Z. Jin, A. Finkelstein, G. J. Mysore, and J. Lu. FFTNet: A real-time speaker-dependent neural vocoder. In Proc. ICASSP, pages 2251–2255, 2018.
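Greedy search, random sampling, and the softmax temperature can be combined in one sampling routine; a sketch (the logits and temperature values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_from_logits(logits, temperature=1.0, greedy=False):
    """Exploitation vs. exploration for a softmax output layer.

    greedy=True      -> pure exploitation (argmax)
    temperature -> 0 -> approaches greedy
    temperature = 1  -> plain random sampling from the softmax
    """
    if greedy:
        return int(np.argmax(logits))
    z = np.asarray(logits) / temperature
    p = np.exp(z - np.max(z))      # numerically stable softmax
    p /= p.sum()
    return int(rng.choice(len(p), p=p))

logits = np.array([0.1, 2.0, 0.5, 1.0])
greedy_pick = sample_from_logits(logits, greedy=True)
cool_picks = [sample_from_logits(logits, temperature=0.02) for _ in range(50)]
print(greedy_pick, set(cool_picks))   # low temperature concentrates on the argmax
```

The same idea carries over to WaveNet-Gaussian: scaling σ_t before sampling plays the role of the temperature.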
49. PRACTICE: WAVENET
Training stability: WaveNet-Gaussian
• Our two-step strategy:
1. Train the blue part (of the network diagram) first
2. Then train the red part only
• The gradient stays mild:
1. The loss is minimized while the gradient is kept mild
2. The gradient does not explode
89. PRACTICE: COMPARISON
Generation speed: how many waveform points can be generated in 1 s (Tesla P100)?
• Mem-save mode: allocate and release GPU memory layer by layer (limited by our CUDA implementation)
• Normal mode: allocate GPU memory once
Systems compared: WaveNet-softmax, WaveNet-Gaussian, b-NSF, s-NSF, hn-NSF (trainable MVF), hn-NSF (fixed MVF), WORLD vocoder
93. BEYOND SPEECH
Music performance
Training
• URMP dataset [1]
o Ground-truth F0
o 13 instruments
o Solo recordings
• One model for all instruments
• Input: F0 and Mel-spectra → neural waveform model
1 University of Rochester Multi-Modal Music Performance (URMP) Dataset http://www2.ece.rochester.edu/projects/air/projects/URMP.html
94. BEYOND SPEECH
Music performance: testing with natural Mel-spectra and F0 as input
Samples (natural, WaveNet, b-NSF, s-NSF, hn-NSF with trainable MVF): violin, viola, oboe, trumpet, saxophone
95. BEYOND SPEECH
Music performance: testing with natural Mel-spectra and F0 as input
Samples (natural, b-NSF, s-NSF, hn-NSF with trainable MVF): horn, trombone, tuba, clarinet, flute
97. Questions & Comments
are always Welcome!
97
https://nii-yamagishilab.github.io/samples-nsf/index.html
98. REFERENCE
WaveNet: A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
SampleRNN: S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio. SampleRNN: An unconditional end-to-end neural audio generation model. arXiv preprint arXiv:1612.07837, 2016.
WaveRNN: N. Kalchbrenner, E. Elsen, K. Simonyan, et al. Efficient neural audio synthesis. In Proc. ICML, volume 80 of Proceedings of Machine Learning Research, pages 2410–2419, 2018.
FFTNet: Z. Jin, A. Finkelstein, G. J. Mysore, and J. Lu. FFTNet: A real-time speaker-dependent neural vocoder. In Proc. ICASSP, pages 2251–2255, 2018.
Universal vocoder: J. Lorenzo-Trueba, T. Drugman, J. Latorre, T. Merritt, B. Putrycz, and R. Barra-Chicote. Robust universal neural vocoding. arXiv preprint arXiv:1811.06292, 2018.
Subband WaveNet: T. Okamoto, K. Tachibana, T. Toda, Y. Shiga, and H. Kawai. An investigation of subband WaveNet vocoder covering entire audible frequency range with limited acoustic features. In Proc. ICASSP, pages 5654–5658, 2018.
Parallel WaveNet: A. van den Oord, Y. Li, I. Babuschkin, et al. Parallel WaveNet: Fast high-fidelity speech synthesis. In Proc. ICML, pages 3918–3926, 2018.
ClariNet: W. Ping, K. Peng, and J. Chen. ClariNet: Parallel wave generation in end-to-end text-to-speech. arXiv preprint arXiv:1807.07281, 2018.
FloWaveNet: S. Kim, S.-g. Lee, J. Song, and S. Yoon. FloWaveNet: A generative flow for raw audio. arXiv preprint arXiv:1811.02155, 2018.
WaveGlow: R. Prenger, R. Valle, and B. Catanzaro. WaveGlow: A flow-based generative network for speech synthesis. arXiv preprint arXiv:1811.00002, 2018.
RNN+STFT: S. Takaki, T. Nakashika, X. Wang, and J. Yamagishi. STFT spectral loss for training a neural speech waveform model. In Proc. ICASSP, 2019.
NSF: X. Wang, S. Takaki, and J. Yamagishi. Neural source-filter-based waveform model for statistical parametric speech synthesis. arXiv preprint arXiv:1810.11946, 2018.
LP-WaveNet: M.-J. Hwang, F. Soong, F. Xie, X. Wang, and H.-G. Kang. LP-WaveNet: Linear prediction-based WaveNet speech synthesis. arXiv preprint arXiv:1811.11913, 2018.
GlotNet: L. Juvela, V. Tsiaras, B. Bollepalli, M. Airaksinen, J. Yamagishi, and P. Alku. Speaker-independent raw waveform model for glottal excitation. arXiv preprint arXiv:1804.09593, 2018.
ExcitNet: E. Song, K. Byun, and H.-G. Kang. ExcitNet vocoder: A neural excitation model for parametric speech synthesis systems. arXiv preprint arXiv:1811.04769, 2018.
LPCNet: J.-M. Valin and J. Skoglund. LPCNet: Improving neural speech synthesis through linear prediction. arXiv preprint arXiv:1810.11846, 2018.
MCNN: S. Ö. Arık, H. Jun, and G. Diamos. Fast spectrogram inversion using multi-head convolutional neural networks. IEEE Signal Processing Letters, 26(1):94–98, 2018.
GELP: L. Juvela, et al. GELP: GAN-excited linear prediction for speech synthesis from mel-spectrogram. In Proc. Interspeech, 2019.
103. FLOW-BASED MODELS (appendix)
Recap of the AR model: consider a WaveNet using a Gaussian distribution (z⁻¹ denotes a time delay).
1. Because o_t = μ_t + σ_t · n_t, we have n_t = (o_t − μ_t) / σ_t.
105. FLOW-BASED MODELS (appendix)
2. Because n_t depends on o_{<t} (and o_t), the Jacobian matrix ∂n_{1:T}/∂o_{1:T} is triangular.
3. Therefore the likelihood of o_{1:T} can be computed from the Gaussian likelihood of n_{1:T} and the determinant of the triangular Jacobian.
110. FLOW-BASED MODELS (appendix)
AR flow vs. inverse-AR flow: both use the same pair of transforms H(·) and H⁻¹(·), but swap which one is applied during training and which during generation (z⁻¹ denotes a time delay).
Editor's Notes
This work is licensed under the Creative Commons Attribution 3.0 License. All slides may be reused for non-commercial purposes provided full attribution is made to the National Institute of Informatics. (See http://creativecommons.org/ for details.)
Many questions to ask:
• How to design the frequency-domain distance?
• How to design the source module?
• How to design the condition module / what input features should be used?