A large and growing amount of speech content in real-life scenarios is recorded on consumer-grade devices in uncontrolled environments, resulting in degraded speech quality. Transforming such low-quality, device-degraded speech into high-quality speech is a goal of speech enhancement (SE). This paper introduces a new speech dataset, DDS, to facilitate research on SE. DDS provides aligned parallel recordings of high-quality speech (recorded in professional studios) and a number of low-quality versions, amounting to approximately 2,000 hours of speech data. The dataset covers 27 realistic recording conditions by combining diverse acoustic environments and microphone devices, and each condition includes recordings from six microphone positions to simulate different noise and reverberation levels. We also test several SE baseline systems on the DDS dataset and show the impact of recording diversity on performance.
Paper: https://arxiv.org/abs/2109.07931
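Because DDS pairs each degraded recording with a time-aligned studio reference, simple intrusive metrics can be computed directly. Below is a minimal sketch of segmental SNR between a reference and a degraded (or enhanced) utterance; the frame length and clipping range are common choices, not values taken from the paper, and the dummy signals stand in for real recordings.

import numpy as np

def segmental_snr(ref, deg, frame=512, eps=1e-8):
    # truncate both signals to a whole number of frames
    n = min(len(ref), len(deg)) // frame * frame
    ref, deg = ref[:n].reshape(-1, frame), deg[:n].reshape(-1, frame)
    noise = ref - deg                                   # residual w.r.t. the studio signal
    snr = 10 * np.log10((ref ** 2).sum(1) / ((noise ** 2).sum(1) + eps) + eps)
    return float(np.clip(snr, -10, 35).mean())          # clip per-frame SNRs, then average

# in practice ref/deg would be the time-aligned studio and device recordings
ref = np.random.randn(16000)
deg = ref + 0.1 * np.random.randn(16000)
print(segmental_snr(ref, deg))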
Presentation for Interspeech 2022: "The VoiceMOS Challenge 2022"
Presenter: Dr. Erica Cooper, National Institute of Informatics
Preprint: https://arxiv.org/abs/2203.11389
Video: https://youtu.be/99ZQ-SLUvKE
Challenge website: https://voicemos-challenge-2022.github.io
Thu-SS-OS-9-5
We present the first edition of the VoiceMOS Challenge, a scientific event that aims to promote the study of automatic prediction of the mean opinion score (MOS) of synthetic speech. This challenge drew 22 participating teams from academia and industry who tried a variety of approaches to tackle the problem of predicting human ratings of synthesized speech. The listening test data for the main track of the challenge consisted of samples from 187 different text-to-speech and voice conversion systems spanning over a decade of research, and the out-of-domain track consisted of data from more recent systems rated in a separate listening test. Results of the challenge show the effectiveness of fine-tuning self-supervised speech models for the MOS prediction task, as well as the difficulty of predicting MOS ratings for unseen speakers and listeners, and for unseen systems in the out-of-domain setting.
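As a rough illustration of the finding above, the sketch below fine-tunes a self-supervised speech model with a small regression head to predict utterance-level MOS. The checkpoint name, mean pooling, and L1 loss are assumptions for the sketch, not the exact recipe of any challenge team.

import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class MOSPredictor(nn.Module):
    def __init__(self, ssl_name="facebook/wav2vec2-base"):
        super().__init__()
        self.ssl = Wav2Vec2Model.from_pretrained(ssl_name)      # pre-trained front end
        self.head = nn.Linear(self.ssl.config.hidden_size, 1)   # utterance-level MOS

    def forward(self, wav):                          # wav: (batch, samples) at 16 kHz
        h = self.ssl(wav).last_hidden_state          # (batch, frames, dim)
        return self.head(h.mean(dim=1)).squeeze(-1)  # mean-pool frames, predict one score

model = MOSPredictor()
optim = torch.optim.Adam(model.parameters(), lr=1e-5)
wav = torch.randn(2, 16000)                          # dummy 1-second utterances
mos = torch.tensor([3.5, 4.2])                       # human ratings on a 1-5 scale
loss = nn.functional.l1_loss(model(wav), mos)        # regress toward listener scores
loss.backward()
optim.step()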
Presentation for Interspeech 2022: "Analyzing Language-Independent Speaker Anonymization Framework under Unseen Conditions"
Presenter: Dr. Xiaoxiao Miao, National Institute of Informatics
Thu-O-OS-9-1
Video: https://youtu.be/wVIxyLiQa1Y
Preprint: https://arxiv.org/abs/2203.14834
In our previous work, we proposed a language-independent speaker anonymization system based on self-supervised learning models. Although the system can anonymize speech data in any language, the anonymization was imperfect and the speech content of the anonymized speech was distorted. This limitation is more severe when the input speech comes from a domain unseen in the training data. This study analyzed the bottlenecks of the anonymization system under unseen conditions. It was found that the domain (e.g., language and channel) mismatch between training and test data affected both the neural waveform vocoder and the anonymized speaker vectors, which limited the performance of the whole system. Increasing the diversity of the vocoder's training data helped reduce its implicit language and channel dependency. Furthermore, a simple correlation-alignment-based domain adaptation strategy was effective in alleviating the mismatch in the anonymized speaker vectors. Audio samples and source code are available online.
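The sketch below shows the general correlation-alignment (CORAL) technique applied to speaker vectors: whiten vectors from the unseen domain with their own covariance, then re-color them with the training-domain covariance. It illustrates the technique named in the abstract; the embedding dimension and regularization are assumptions, and the paper's exact variant may differ.

import numpy as np

def _matrix_sqrt(m, inverse=False, eps=1e-5):
    # symmetric PSD matrix square root (or inverse square root) via eigendecomposition
    w, v = np.linalg.eigh(m)
    w = np.clip(w, eps, None)
    d = 1.0 / np.sqrt(w) if inverse else np.sqrt(w)
    return (v * d) @ v.T

def coral(source_vecs, target_vecs, eps=1e-5):
    """Align source-domain (unseen) vectors to the target (training) domain."""
    dim = source_vecs.shape[1]
    cs = np.cov(source_vecs, rowvar=False) + eps * np.eye(dim)
    ct = np.cov(target_vecs, rowvar=False) + eps * np.eye(dim)
    return source_vecs @ _matrix_sqrt(cs, inverse=True) @ _matrix_sqrt(ct)

# usage: adapt anonymized speaker vectors extracted from an unseen language/channel
unseen = np.random.randn(100, 192)   # e.g. 192-dim speaker embeddings (assumed size)
train = np.random.randn(500, 192)
adapted = coral(unseen, train)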
Presentation for Interspeech 2022: "Spoofing-aware Attention Back-end with Multiple Enrollment and Novel Trials Sampling Strategy for SASVC 2022"
Presenter: Chang Zeng (National Institute of Informatics and SOKENDAI)
Wed-SS-OS-6-5
Presentation video: https://youtu.be/gXxP1nn5X6E
The spoofing-aware speaker verification challenge (SASVC) 2022 was organized to explore the relation between automatic speaker verification (ASV) and spoofing countermeasures (CMs). In this paper, we introduce our spoofing-aware attention back-end developed for SASVC 2022. First, we design a novel sampling strategy to simulate realistic verification scenarios. Then, to fully leverage the information derived from multiple enrollments, we propose a spoofing-aware attention back-end. Finally, a joint decision strategy introduces mutual interaction between the ASV module and the CM module. Compared with the trial sampling method used in the baseline systems, our proposed sampling method yields a clear improvement even without any attention modules. Experimental results show that the proposed spoofing-aware attention back-end improves performance on the evaluation set from 6.37% (the best baseline system) to 1.19% in terms of the SASV-EER (equal error rate) metric.
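To make the idea of a joint ASV/CM decision concrete, here is a minimal sketch: a trial is accepted only if the countermeasure judges it bona fide and the speaker matches, with a soft product fusion as an alternative. The thresholds and fusion rule are illustrative choices, not the strategy proposed in the paper.

def sasv_decision(asv_score, cm_score, asv_thr=0.5, cm_thr=0.5):
    """Accept only trials that are both bona fide and from the target speaker."""
    return (cm_score >= cm_thr) and (asv_score >= asv_thr)

def fused_score(asv_score, cm_score):
    # soft alternative: multiply calibrated probabilities so either module can veto
    return asv_score * cm_score

print(sasv_decision(0.8, 0.9))   # target speaker, bona fide  -> accept
print(sasv_decision(0.8, 0.1))   # target speaker, spoofed    -> reject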
Presenter: Dr. Xiaoxiao Miao, NII
Paper: https://arxiv.org/abs/2202.13097
Speaker anonymization aims to protect the privacy of speakers while preserving the spoken linguistic information in the speech. Current mainstream neural-network speaker anonymization systems are complicated, comprising an F0 extractor, a speaker encoder, an automatic speech recognition acoustic model (ASR AM), a speech synthesis acoustic model, and a speech waveform generation model. Moreover, because the ASR AM is language-dependent and trained on English data, it is hard to adapt the system to other languages. In this paper, we propose a simpler self-supervised learning (SSL)-based method for language-independent speaker anonymization without any explicit language-dependent model, which can be easily applied to other languages. Extensive experiments were conducted on the VoicePrivacy Challenge 2020 datasets in English and the AISHELL-3 dataset in Mandarin to demonstrate the effectiveness of the proposed SSL-based language-independent speaker anonymization method.
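One widely used speaker-vector anonymization strategy (similar in spirit to the VoicePrivacy baselines) replaces the original speaker embedding with an average of distant embeddings from an external pool. The sketch below illustrates that general idea only; the pool size, distance measure, and selection rule are assumptions, not the exact method of this paper.

import numpy as np

def anonymize_speaker_vector(orig_vec, pool, n_farthest=200, n_average=100, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    # cosine similarity of the original vector to every pool vector
    sims = pool @ orig_vec / (np.linalg.norm(pool, axis=1) * np.linalg.norm(orig_vec))
    farthest = np.argsort(sims)[:n_farthest]             # least similar candidates
    chosen = rng.choice(farthest, n_average, replace=False)
    return pool[chosen].mean(axis=0)                     # pseudo-speaker vector

pool = np.random.randn(1000, 192)                        # dummy external x-vector pool
anon = anonymize_speaker_vector(np.random.randn(192), pool)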
Presenter: Dr. Xin Wang, NII
Paper: https://arxiv.org/abs/2111.07725
Self-supervised speech modeling is a rapidly progressing research topic, and many pre-trained models have been released and used in various downstream tasks. For speech anti-spoofing, most countermeasures (CMs) use signal processing algorithms to extract acoustic features for classification. In this study, we use pre-trained self-supervised speech models as the front end of spoofing CMs. We investigated different back-end architectures to be combined with the self-supervised front end, the effectiveness of fine-tuning the front end, and the performance of using different pre-trained self-supervised models. Our findings showed that, when a good pre-trained front end was fine-tuned with either a shallow or a deep neural-network-based back end on the ASVspoof 2019 logical access (LA) training set, the resulting CM not only achieved a low EER on the 2019 LA test set but also significantly outperformed the baseline on the ASVspoof 2015, 2021 LA, and 2021 deepfake test sets. A sub-band analysis further demonstrated that the CM mainly used information in a specific frequency band to discriminate between bona fide and spoofed trials across the test sets.
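A sub-band analysis of the kind mentioned above can be approximated by band-limiting evaluation waveforms before scoring them with the CM and tracking how the EER changes per band. This is a minimal sketch assuming a Butterworth band-pass filter; the band edges and filter order are illustrative, not the paper's settings.

import numpy as np
from scipy.signal import butter, sosfiltfilt

def band_limit(wav, sr, lo_hz, hi_hz, order=6):
    # zero-phase band-pass filtering so the trial's timing is preserved
    sos = butter(order, [lo_hz, hi_hz], btype="bandpass", fs=sr, output="sos")
    return sosfiltfilt(sos, wav)

sr = 16000
wav = np.random.randn(sr)                      # dummy 1-second trial
for lo, hi in [(20, 4000), (4000, 7800)]:      # example low/high band split
    filtered = band_limit(wav, sr, lo, hi)
    # score = countermeasure(filtered)         # run the CM on the band-limited trial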
SSW11 presentation: How do Voices from Past Speech Synthesis Challenges Compare Today?
Presenter: Erica Cooper
Preprint: https://arxiv.org/abs/2105.02373
Presentation for SSW11: "Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance"
Presenter: Hieu-Thi Luong
Preprint: https://arxiv.org/abs/2106.13479
Tutorial on neural vocoders at the 2021 Speech Processing Courses in Crete, "Inclusive Neural Speech Synthesis."
Presenters: Xin Wang and Junichi Yamagishi, National Institute of Informatics, Japan
The document proposes a neural source-filter waveform model (NSF) for speech synthesis. The NSF model consists of three modules: a condition module that upsamples spectral features and F0, a source module that generates a sine excitation signal, and a filter module with dilated convolutional blocks. The model is trained directly on waveforms using a spectral distance criterion in the STFT domain. Experiments show the NSF model generates high quality waveforms comparable to WaveNet, with faster generation speed. Ablation tests analyze the importance of the sine excitation source and different spectral loss terms. The NSF provides a simpler alternative to autoregressive models for neural speech synthesis.
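The two ideas named above, a sine-based excitation derived from F0 and a spectral training criterion in the STFT domain, can be sketched as follows. Shapes, the noise level, and the single-resolution STFT settings are illustrative assumptions, not the exact NSF configuration.

import math
import torch

def sine_excitation(f0, sr=16000, noise_std=0.003):
    """f0: F0 contour (Hz) already upsampled to the waveform rate; 0 marks unvoiced."""
    phase = 2 * math.pi * torch.cumsum(f0 / sr, dim=0)   # integrate instantaneous frequency
    voiced = (f0 > 0).float()
    return voiced * torch.sin(phase) + noise_std * torch.randn_like(f0)

def stft_distance(pred, ref, n_fft=512, hop=128):
    """Log-magnitude spectral distance between generated and reference waveforms."""
    win = torch.hann_window(n_fft)
    p = torch.stft(pred, n_fft, hop, window=win, return_complex=True).abs()
    r = torch.stft(ref, n_fft, hop, window=win, return_complex=True).abs()
    return torch.mean((torch.log(p + 1e-5) - torch.log(r + 1e-5)) ** 2)

f0 = torch.full((16000,), 220.0)               # dummy 1-second, constant-pitch contour
excitation = sine_excitation(f0)
loss = stft_distance(excitation, torch.randn(16000))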
This document provides an overview of end-to-end text-to-speech synthesis models including Char2Wav, Tacotron, Tacotron2, and the development of a Japanese Tacotron model. It describes the typical encoder-decoder architecture with attention, improvements made in subsequent models like Tacotron2 using larger models and more regularization, and the implementation of a Japanese Tacotron to model Japanese pitch accents using an accent embedding and self-attention.
These are slides used for an invited tutorial on "end-to-end text-to-speech synthesis", given at the IEICE SP workshop held on 27 January 2019.
Part 1: Neural waveform modeling
Presenters: Xin Wang, Yusuke Yasuda (National Institute of Informatics, Japan)
This document discusses end-to-end text-to-speech synthesis models and summarizes several key models:
- Char2Wav was one of the earliest end-to-end models using an encoder-decoder with attention and a neural vocoder. It helped prove the concept but had limitations in target features and architecture.
- Tacotron improved upon Char2Wav with its CBHG encoder, attention mechanisms, and predicting mel spectrograms as targets. However, training was slow and waveform generation was limited.
- Tacotron2 achieved near-human quality by extending Tacotron and generating waveforms with WaveNet conditioned on predicted mel spectrograms.
The document also describes a Japanese Tacotron model that incorporates an accent embedding and self-attention to model Japanese pitch accents.
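The encoder-decoder models listed above all rely on an attention step in which each decoder step summarizes the encoder outputs into a context vector. Below is a minimal sketch of plain additive (content-based) attention under assumed dimensions; it omits the location-sensitive terms that Tacotron-style models add.

import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, enc_dim, dec_dim, att_dim=128):
        super().__init__()
        self.q = nn.Linear(dec_dim, att_dim)
        self.k = nn.Linear(enc_dim, att_dim)
        self.v = nn.Linear(att_dim, 1)

    def forward(self, query, enc_out):
        # query: (B, dec_dim), enc_out: (B, T, enc_dim)
        score = self.v(torch.tanh(self.q(query).unsqueeze(1) + self.k(enc_out)))  # (B, T, 1)
        weights = torch.softmax(score, dim=1)            # alignment over encoder frames
        context = (weights * enc_out).sum(dim=1)         # (B, enc_dim) context vector
        return context, weights.squeeze(-1)

att = AdditiveAttention(enc_dim=256, dec_dim=512)
ctx, alignment = att(torch.randn(2, 512), torch.randn(2, 50, 256))   # 50 encoder frames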
3. 3
TTS & MIDI-to-Audio Synthesis
0 100 200 300 400 500
Frame index
256
513
Frequency
bins
Acoustic
model
Waveform
model
Wav
MIDI
Piano roll
Acoustic
features
MIDI
API
Acoustic
model
Waveform
model
Context
vectors
Acoustic
features
Front
end
Wav
.txt
<latexit sha1_base64="AisTKh5Vue6nwtVsE4yuaaGZKEs=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTTctLsRJwDOYO5prW1iPQdI0NkK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRffuP3j4aO9x/8nTZ89f7B+8PDOq1hTHVHGlLwowyJnEsWWW40WlEUTB8bxYfm785z9QG6bkqV1XmAuYSzZjFGygsqwwbuWnLj3+6qf7g2SYbCy+DdIdGJCdjaYHvT9ZqWgtUFrKwZhJmlQ2d6Atoxx9P6sNVkCXMMdJgBIEmtxtmvbx28CU8UzpcKSNN+z/CgfCmLUoQqQAuzBdX0Pe5ZvUdnaUOyar2qKk20KzmsdWxc0G4pJppJavAwCqWeg1pgvQQG3YUz+TeEmVECBLF7bjJ2kebsXLphfF3SD1rcG249iC38G2qbmGasHoyvfbRUanmgH3LqMLpEsBeuk7ASi14iHisOP4dqO0uLKbik5j2crUVdyk6koOffP+afe1b4Oz98P04zD5/mFwcrT7CXvkNXlD3pGUfCIn5AsZkTGhpCJX5Jr8iqroKvoZXW9Do95O84q0LPr9D8sl9ik=</latexit>
x1:M
<latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit>
a1:N
<latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit>
a1:N
<latexit sha1_base64="JElF7obLjklrepu+jo2EhDrW7Zk=">AAAC7nicbVLLahsxFJWnj6TuK2mX3Qw1hW5iZkpKQ1eBbroqLthJwDOYO5prW1iPQdI0MUK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRQ8ePnq8t/+k//TZ8xcvDw5fnRlVa4oTqrjSFwUY5EzixDLL8aLSCKLgeF6svjT+8x+oDVNybNcV5gIWks0ZBRuoLCuMU37m0s9jPzsYJMNkY/FdkO7AgOxsNDvs/clKRWuB0lIOxkzTpLK5A20Z5ej7WW2wArqCBU4DlCDQ5G7TtI/fBaaM50qHI228Yf9XOBDGrEURIgXYpen6GvI+37S285PcMVnVFiXdFprXPLYqbjYQl0wjtXwdAFDNQq8xXYIGasOe+pnES6qEAFm6sB0/TfNwK142vSjuBqlvDbYdxxb8HrZNLTRUS0avfL9dZDTWDLh3GV0iXQnQK98JQKkVDxFHHce3W6XFK7up6DSWrUxdxW2qruTIN++fdl/7Ljj7MEw/DpPvx4PTk91P2CdvyFvynqTkEzklX8mITAglFbkmN+RXVEXX0c/oZhsa9Xaa16Rl0e9/xe/2Jw==</latexit>
o1:T
<latexit sha1_base64="JElF7obLjklrepu+jo2EhDrW7Zk=">AAAC7nicbVLLahsxFJWnj6TuK2mX3Qw1hW5iZkpKQ1eBbroqLthJwDOYO5prW1iPQdI0MUK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRQ8ePnq8t/+k//TZ8xcvDw5fnRlVa4oTqrjSFwUY5EzixDLL8aLSCKLgeF6svjT+8x+oDVNybNcV5gIWks0ZBRuoLCuMU37m0s9jPzsYJMNkY/FdkO7AgOxsNDvs/clKRWuB0lIOxkzTpLK5A20Z5ej7WW2wArqCBU4DlCDQ5G7TtI/fBaaM50qHI228Yf9XOBDGrEURIgXYpen6GvI+37S285PcMVnVFiXdFprXPLYqbjYQl0wjtXwdAFDNQq8xXYIGasOe+pnES6qEAFm6sB0/TfNwK142vSjuBqlvDbYdxxb8HrZNLTRUS0avfL9dZDTWDLh3GV0iXQnQK98JQKkVDxFHHce3W6XFK7up6DSWrUxdxW2qruTIN++fdl/7Ljj7MEw/DpPvx4PTk91P2CdvyFvynqTkEzklX8mITAglFbkmN+RXVEXX0c/oZhsa9Xaa16Rl0e9/xe/2Jw==</latexit>
o1:T
<latexit sha1_base64="jJ70daV3Lv6VV/gW3FR+Rb1BrxU=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTVfBhTgJeAZzR3NtC+sxSJrGRug3sivpsh/TH+jfVGM70JnkgtDh3HvuSyoqzoxNkr+96MHDR4+f7D3tP3v+4uWr/YPX50bVmuKYKq70ZQEGOZM4tsxyvKw0gig4XhTLr43/4gdqw5Q8s+sKcwFzyWaMgg1UlhXGrfzUpcenfro/SIbJxuK7IN2BAdnZaHrQ+5OVitYCpaUcjJmkSWVzB9oyytH3s9pgBXQJc5wEKEGgyd2maR+/D0wZz5QOR9p4w/6vcCCMWYsiRAqwC9P1NeR9vkltZ0e5Y7KqLUq6LTSreWxV3GwgLplGavk6AKCahV5jugAN1IY99TOJV1QJAbJ0YTt+kubhVrxselHcDVLfGmw7ji34PWybmmuoFoyufL9dZHSmGXDvMrpAuhSgl74TgFIrHiIOO47TW6XFld1UdBrLVqau4jZVV3Lom/dPu699F5x/HKafh8n3T4OTo91P2CNvyTvygaTkCzkh38iIjAklFbkmN+RXVEXX0c/oZhsa9XaaN6Rl0e9/zaX2Kg==</latexit>
x1:N
<latexit sha1_base64="AisTKh5Vue6nwtVsE4yuaaGZKEs=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTTctLsRJwDOYO5prW1iPQdI0NkK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRffuP3j4aO9x/8nTZ89f7B+8PDOq1hTHVHGlLwowyJnEsWWW40WlEUTB8bxYfm785z9QG6bkqV1XmAuYSzZjFGygsqwwbuWnLj3+6qf7g2SYbCy+DdIdGJCdjaYHvT9ZqWgtUFrKwZhJmlQ2d6Atoxx9P6sNVkCXMMdJgBIEmtxtmvbx28CU8UzpcKSNN+z/CgfCmLUoQqQAuzBdX0Pe5ZvUdnaUOyar2qKk20KzmsdWxc0G4pJppJavAwCqWeg1pgvQQG3YUz+TeEmVECBLF7bjJ2kebsXLphfF3SD1rcG249iC38G2qbmGasHoyvfbRUanmgH3LqMLpEsBeuk7ASi14iHisOP4dqO0uLKbik5j2crUVdyk6koOffP+afe1b4Oz98P04zD5/mFwcrT7CXvkNXlD3pGUfCIn5AsZkTGhpCJX5Jr8iqroKvoZXW9Do95O84q0LPr9D8sl9ik=</latexit>
x1:M
or
Pitch is crucial for music
5. 5
Acoustic
model
Waveform
model
Wav
MIDI
Piano roll
Acoustic
features
MIDI
API
<latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit>
a1:N
<latexit sha1_base64="JElF7obLjklrepu+jo2EhDrW7Zk=">AAAC7nicbVLLahsxFJWnj6TuK2mX3Qw1hW5iZkpKQ1eBbroqLthJwDOYO5prW1iPQdI0MUK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRQ8ePnq8t/+k//TZ8xcvDw5fnRlVa4oTqrjSFwUY5EzixDLL8aLSCKLgeF6svjT+8x+oDVNybNcV5gIWks0ZBRuoLCuMU37m0s9jPzsYJMNkY/FdkO7AgOxsNDvs/clKRWuB0lIOxkzTpLK5A20Z5ej7WW2wArqCBU4DlCDQ5G7TtI/fBaaM50qHI228Yf9XOBDGrEURIgXYpen6GvI+37S285PcMVnVFiXdFprXPLYqbjYQl0wjtXwdAFDNQq8xXYIGasOe+pnES6qEAFm6sB0/TfNwK142vSjuBqlvDbYdxxb8HrZNLTRUS0avfL9dZDTWDLh3GV0iXQnQK98JQKkVDxFHHce3W6XFK7up6DSWrUxdxW2qruTIN++fdl/7Ljj7MEw/DpPvx4PTk91P2CdvyFvynqTkEzklX8mITAglFbkmN+RXVEXX0c/oZhsa9Xaa16Rl0e9/xe/2Jw==</latexit>
o1:T
<latexit sha1_base64="jJ70daV3Lv6VV/gW3FR+Rb1BrxU=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTVfBhTgJeAZzR3NtC+sxSJrGRug3sivpsh/TH+jfVGM70JnkgtDh3HvuSyoqzoxNkr+96MHDR4+f7D3tP3v+4uWr/YPX50bVmuKYKq70ZQEGOZM4tsxyvKw0gig4XhTLr43/4gdqw5Q8s+sKcwFzyWaMgg1UlhXGrfzUpcenfro/SIbJxuK7IN2BAdnZaHrQ+5OVitYCpaUcjJmkSWVzB9oyytH3s9pgBXQJc5wEKEGgyd2maR+/D0wZz5QOR9p4w/6vcCCMWYsiRAqwC9P1NeR9vkltZ0e5Y7KqLUq6LTSreWxV3GwgLplGavk6AKCahV5jugAN1IY99TOJV1QJAbJ0YTt+kubhVrxselHcDVLfGmw7ji34PWybmmuoFoyufL9dZHSmGXDvMrpAuhSgl74TgFIrHiIOO47TW6XFld1UdBrLVqau4jZVV3Lom/dPu699F5x/HKafh8n3T4OTo91P2CNvyTvygaTkCzkh38iIjAklFbkmN+RXVEXX0c/oZhsa9XaaN6Rl0e9/zaX2Kg==</latexit>
x1:N
<latexit sha1_base64="AisTKh5Vue6nwtVsE4yuaaGZKEs=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTTctLsRJwDOYO5prW1iPQdI0NkK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRffuP3j4aO9x/8nTZ89f7B+8PDOq1hTHVHGlLwowyJnEsWWW40WlEUTB8bxYfm785z9QG6bkqV1XmAuYSzZjFGygsqwwbuWnLj3+6qf7g2SYbCy+DdIdGJCdjaYHvT9ZqWgtUFrKwZhJmlQ2d6Atoxx9P6sNVkCXMMdJgBIEmtxtmvbx28CU8UzpcKSNN+z/CgfCmLUoQqQAuzBdX0Pe5ZvUdnaUOyar2qKk20KzmsdWxc0G4pJppJavAwCqWeg1pgvQQG3YUz+TeEmVECBLF7bjJ2kebsXLphfF3SD1rcG249iC38G2qbmGasHoyvfbRUanmgH3LqMLpEsBeuk7ASi14iHisOP4dqO0uLKbik5j2crUVdyk6koOffP+afe1b4Oz98P04zD5/mFwcrT7CXvkNXlD3pGUfCIn5AsZkTGhpCJX5Jr8iqroKvoZXW9Do95O84q0LPr9D8sl9ik=</latexit>
x1:M
or
TTS & MIDI-to-Audio Synthesis
0 100 200 300 400 500
Frame index
256
513
Frequency
bins
Acoustic
model
Waveform
model
Context
vectors
Acoustic
features
Front
end
Wav
.txt
<latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit>
a1:N
<latexit sha1_base64="JElF7obLjklrepu+jo2EhDrW7Zk=">AAAC7nicbVLLahsxFJWnj6TuK2mX3Qw1hW5iZkpKQ1eBbroqLthJwDOYO5prW1iPQdI0MUK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRQ8ePnq8t/+k//TZ8xcvDw5fnRlVa4oTqrjSFwUY5EzixDLL8aLSCKLgeF6svjT+8x+oDVNybNcV5gIWks0ZBRuoLCuMU37m0s9jPzsYJMNkY/FdkO7AgOxsNDvs/clKRWuB0lIOxkzTpLK5A20Z5ej7WW2wArqCBU4DlCDQ5G7TtI/fBaaM50qHI228Yf9XOBDGrEURIgXYpen6GvI+37S285PcMVnVFiXdFprXPLYqbjYQl0wjtXwdAFDNQq8xXYIGasOe+pnES6qEAFm6sB0/TfNwK142vSjuBqlvDbYdxxb8HrZNLTRUS0avfL9dZDTWDLh3GV0iXQnQK98JQKkVDxFHHce3W6XFK7up6DSWrUxdxW2qruTIN++fdl/7Ljj7MEw/DpPvx4PTk91P2CdvyFvynqTkEzklX8mITAglFbkmN+RXVEXX0c/oZhsa9Xaa16Rl0e9/xe/2Jw==</latexit>
o1:T
1
0
0
Phone ID (one-hot)
Syllable ID (one-hot)
#. Phone
#. Syllables
<latexit sha1_base64="YYtcP5HcZDv4wo3N4OOgsso9tf8=">AAAC6nicbVLLahsxFJWnr8R9Je0ym6Gm0E3MTElploFuugouxInBMxiN5toW1mOQ7jQ2Qj+RXUmW/Zz+QP+m8iPQmeSC0OHce+5LKirBLSbJ30705Omz5y/29rsvX71+8/bg8N2l1bVhMGRaaDMqqAXBFQyRo4BRZYDKQsBVsfi29l/9BGO5Vhe4qiCXdKb4lDOKgRplhXVLP1GTg17STzYWPwTpDvTIzgaTw86frNSslqCQCWrtOE0qzB01yJkA381qCxVlCzqDcYCKSrC52zTs44+BKeOpNuEojDfs/wpHpbUrWYRISXFu2741+ZhvXOP0NHdcVTWCYttC01rEqOP19HHJDTAUqwAoMzz0GrM5NZRh2FE3U3DNtJRUlS5sxo/TPNxalOtetHC91DcG246DhXiEbVIzQ6s5Z0vfbRYZXBhOhXcZmwNbSGoWvhUAymgRIo5bjvN7JcISNxWdgbKRqa24T9WWHHsf3j9tv/ZDcPm5n37pJz9Oemfp7ifskSPygXwiKflKzsh3MiBDwoggN+SW3EUiuol+Rbfb0Kiz07wnDYt+/wP+EPS4</latexit>
xn
<latexit sha1_base64="YYtcP5HcZDv4wo3N4OOgsso9tf8=">AAAC6nicbVLLahsxFJWnr8R9Je0ym6Gm0E3MTElploFuugouxInBMxiN5toW1mOQ7jQ2Qj+RXUmW/Zz+QP+m8iPQmeSC0OHce+5LKirBLSbJ30705Omz5y/29rsvX71+8/bg8N2l1bVhMGRaaDMqqAXBFQyRo4BRZYDKQsBVsfi29l/9BGO5Vhe4qiCXdKb4lDOKgRplhXVLP1GTg17STzYWPwTpDvTIzgaTw86frNSslqCQCWrtOE0qzB01yJkA381qCxVlCzqDcYCKSrC52zTs44+BKeOpNuEojDfs/wpHpbUrWYRISXFu2741+ZhvXOP0NHdcVTWCYttC01rEqOP19HHJDTAUqwAoMzz0GrM5NZRh2FE3U3DNtJRUlS5sxo/TPNxalOtetHC91DcG246DhXiEbVIzQ6s5Z0vfbRYZXBhOhXcZmwNbSGoWvhUAymgRIo5bjvN7JcISNxWdgbKRqa24T9WWHHsf3j9tv/ZDcPm5n37pJz9Oemfp7ifskSPygXwiKflKzsh3MiBDwoggN+SW3EUiuol+Rbfb0Kiz07wnDYt+/wP+EPS4</latexit>
xn
A4, velocity 0.4
0.9
0
D5, velocity 0.4
…
0.4
…
0
<latexit sha1_base64="YYtcP5HcZDv4wo3N4OOgsso9tf8=">AAAC6nicbVLLahsxFJWnr8R9Je0ym6Gm0E3MTElploFuugouxInBMxiN5toW1mOQ7jQ2Qj+RXUmW/Zz+QP+m8iPQmeSC0OHce+5LKirBLSbJ30705Omz5y/29rsvX71+8/bg8N2l1bVhMGRaaDMqqAXBFQyRo4BRZYDKQsBVsfi29l/9BGO5Vhe4qiCXdKb4lDOKgRplhXVLP1GTg17STzYWPwTpDvTIzgaTw86frNSslqCQCWrtOE0qzB01yJkA381qCxVlCzqDcYCKSrC52zTs44+BKeOpNuEojDfs/wpHpbUrWYRISXFu2741+ZhvXOP0NHdcVTWCYttC01rEqOP19HHJDTAUqwAoMzz0GrM5NZRh2FE3U3DNtJRUlS5sxo/TPNxalOtetHC91DcG246DhXiEbVIzQ6s5Z0vfbRYZXBhOhXcZmwNbSGoWvhUAymgRIo5bjvN7JcISNxWdgbKRqa24T9WWHHsf3j9tv/ZDcPm5n37pJz9Oemfp7ifskSPygXwiKflKzsh3MiBDwoggN+SW3EUiuol+Rbfb0Kiz07wnDYt+/wP+EPS4</latexit>
xn
<latexit sha1_base64="YYtcP5HcZDv4wo3N4OOgsso9tf8=">AAAC6nicbVLLahsxFJWnr8R9Je0ym6Gm0E3MTElploFuugouxInBMxiN5toW1mOQ7jQ2Qj+RXUmW/Zz+QP+m8iPQmeSC0OHce+5LKirBLSbJ30705Omz5y/29rsvX71+8/bg8N2l1bVhMGRaaDMqqAXBFQyRo4BRZYDKQsBVsfi29l/9BGO5Vhe4qiCXdKb4lDOKgRplhXVLP1GTg17STzYWPwTpDvTIzgaTw86frNSslqCQCWrtOE0qzB01yJkA381qCxVlCzqDcYCKSrC52zTs44+BKeOpNuEojDfs/wpHpbUrWYRISXFu2741+ZhvXOP0NHdcVTWCYttC01rEqOP19HHJDTAUqwAoMzz0GrM5NZRh2FE3U3DNtJRUlS5sxo/TPNxalOtetHC91DcG246DhXiEbVIzQ6s5Z0vfbRYZXBhOhXcZmwNbSGoWvhUAymgRIo5bjvN7JcISNxWdgbKRqa24T9WWHHsf3j9tv/ZDcPm5n37pJz9Oemfp7ifskSPygXwiKflKzsh3MiBDwoggN+SW3EUiuol+Rbfb0Kiz07wnDYt+/wP+EPS4</latexit>
xn
<latexit sha1_base64="AisTKh5Vue6nwtVsE4yuaaGZKEs=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTTctLsRJwDOYO5prW1iPQdI0NkK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRffuP3j4aO9x/8nTZ89f7B+8PDOq1hTHVHGlLwowyJnEsWWW40WlEUTB8bxYfm785z9QG6bkqV1XmAuYSzZjFGygsqwwbuWnLj3+6qf7g2SYbCy+DdIdGJCdjaYHvT9ZqWgtUFrKwZhJmlQ2d6Atoxx9P6sNVkCXMMdJgBIEmtxtmvbx28CU8UzpcKSNN+z/CgfCmLUoQqQAuzBdX0Pe5ZvUdnaUOyar2qKk20KzmsdWxc0G4pJppJavAwCqWeg1pgvQQG3YUz+TeEmVECBLF7bjJ2kebsXLphfF3SD1rcG249iC38G2qbmGasHoyvfbRUanmgH3LqMLpEsBeuk7ASi14iHisOP4dqO0uLKbik5j2crUVdyk6koOffP+afe1b4Oz98P04zD5/mFwcrT7CXvkNXlD3pGUfCIn5AsZkTGhpCJX5Jr8iqroKvoZXW9Do95O84q0LPr9D8sl9ik=</latexit>
x1:M
6. 6
<latexit sha1_base64="AisTKh5Vue6nwtVsE4yuaaGZKEs=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTTctLsRJwDOYO5prW1iPQdI0NkK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRffuP3j4aO9x/8nTZ89f7B+8PDOq1hTHVHGlLwowyJnEsWWW40WlEUTB8bxYfm785z9QG6bkqV1XmAuYSzZjFGygsqwwbuWnLj3+6qf7g2SYbCy+DdIdGJCdjaYHvT9ZqWgtUFrKwZhJmlQ2d6Atoxx9P6sNVkCXMMdJgBIEmtxtmvbx28CU8UzpcKSNN+z/CgfCmLUoQqQAuzBdX0Pe5ZvUdnaUOyar2qKk20KzmsdWxc0G4pJppJavAwCqWeg1pgvQQG3YUz+TeEmVECBLF7bjJ2kebsXLphfF3SD1rcG249iC38G2qbmGasHoyvfbRUanmgH3LqMLpEsBeuk7ASi14iHisOP4dqO0uLKbik5j2crUVdyk6koOffP+afe1b4Oz98P04zD5/mFwcrT7CXvkNXlD3pGUfCIn5AsZkTGhpCJX5Jr8iqroKvoZXW9Do95O84q0LPr9D8sl9ik=</latexit>
x1:M
TTS & MIDI-to-Audio Synthesis
0 100 200 300 400 500
Frame index
256
513
Frequency
bins
Acoustic
model
Waveform
model
Wav
MIDI
Piano roll
Acoustic
features
MIDI
API
Acoustic
model
Waveform
model
Context
vectors
Acoustic
features
Front
end
Wav
.txt
<latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit>
a1:N
[Figure: two MIDI-to-audio scenarios. "AI performer": score MIDI x1:M → acoustic features a1:N → waveform o1:T, where input and output are not completely aligned. "Silent instrument": performance MIDI x1:N → a1:N → o1:T, where input and output are aligned. The input is thus x1:M or x1:N.]
7. 7
TTS & MIDI-to-Audio Synthesis
[Figure: spectrogram excerpt (frame index vs. 513 frequency bins) and two parallel pipelines. TTS: .txt → front end → context vectors → acoustic model → acoustic features → waveform model → wav. MIDI-to-audio: MIDI → MIDI API → piano roll x1:N → acoustic model → acoustic features a1:N → waveform model → wav o1:T.]
Apply TTS techniques to MIDI-to-audio
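To make the MIDI-to-audio input concrete, here is a minimal sketch of the "MIDI API → piano roll" step; pretty_midi, the file name, and the frame rate are illustrative assumptions rather than details given in the slides.

    # Minimal sketch of the MIDI -> piano-roll step. pretty_midi, the file name, and
    # the frame rate are illustrative assumptions, not the authors' exact setup.
    import pretty_midi

    midi = pretty_midi.PrettyMIDI("performance.mid")   # hypothetical input file
    frame_rate = 200                                   # frames per second (5 ms hop)
    roll = midi.get_piano_roll(fs=frame_rate)          # (128 pitches, n_frames), velocity values
    x = (roll > 0).astype("float32").T                 # binarized piano roll x1:N, shape (N, 128)
    print(x.shape)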
8. 8
Methods
Wang, Y. et al. Tacotron: Towards End-to-End Speech Synthesis. in Proc. Interspeech 4006–4010 (2017).
Yasuda, Y., Wang, X., Takaki, S. & Yamagishi, J. Investigation of enhanced Tacotron text-to-speech synthesis systems with self-attention for pitch accent
language. in Proc. ICASSP 6905–6909 (2019).
q Acoustic model
1. TTS model: Tacotron (Wang 2017, Yasuda 2019)
[Pipeline figure: MIDI → API → piano roll x1:N → acoustic model → acoustic features a1:N → waveform model → wav o1:T]
[Figure: self-attention Tacotron acoustic model. Encoder: dense layer → CBH-LSTM → self-attention. Decoder: pre-net → attention RNN with additive and forward attention → decoder RNN → self-attention; linear projection and post-net produce acoustic features for the waveform model, and a sigmoid layer predicts the stop token.]
• Taco2: 800-frame segments; 4× downsampling for better alignments
• Taco3: warm-started from taco2; inputs the current piano-roll frame at the decoder pre-net (see the sketch below)
• Taco4: warm-started from taco2; no downsampling or piano-roll input to the pre-net
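As a rough illustration of the taco3-style conditioning, the sketch below concatenates the current piano-roll frame with the previous acoustic frame inside the decoder pre-net; the framework (PyTorch), module name, and dimensions are assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    class PianoRollPreNet(nn.Module):
        # Hypothetical decoder pre-net: previous acoustic frame + current piano-roll frame.
        def __init__(self, acoustic_dim=80, roll_dim=128, hidden_dim=256, dropout=0.5):
            super().__init__()
            self.fc1 = nn.Linear(acoustic_dim + roll_dim, hidden_dim)
            self.fc2 = nn.Linear(hidden_dim, hidden_dim)
            self.drop = nn.Dropout(dropout)

        def forward(self, prev_acoustic, roll_frame):
            # prev_acoustic: (batch, acoustic_dim); roll_frame: (batch, roll_dim)
            h = torch.cat([prev_acoustic, roll_frame], dim=-1)
            h = self.drop(torch.relu(self.fc1(h)))
            h = self.drop(torch.relu(self.fc2(h)))
            return h   # fed to the attention / decoder RNN

    prenet = PianoRollPreNet()
    out = prenet(torch.zeros(4, 80), torch.zeros(4, 128))   # -> shape (4, 256)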
9. 9
Methods
Wang, B. & Yang, Y.-H. PerformanceNet: Score-to-audio music generation with multi-band convolutional residual network. in Proceedings of the
AAAI Conference on Artificial Intelligence vol. 33 1174–1181 (2019).
q Acoustic model
2. Reference model from music field: PerformanceNet (Wang 2019)
[Pipeline figure: MIDI → API → piano roll x1:N → acoustic model → acoustic features a1:N → waveform model → wav o1:T]
[Figure: PerformanceNet architecture mapping piano roll x1:N to acoustic features a1:N, adapted from figure 3 in (Wang 2019)]
10. 10
Methods
Librosa midi_to_hz: https://librosa.org/doc/0.7.0/generated/librosa.core.midi_to_hz.html
q Acoustic features
1. Mel-spectrogram
2. MIDI-filterbank-spectrogram
[Pipeline figure: MIDI → API → piano roll x1:N → acoustic model → acoustic features a1:N → waveform model → wav o1:T]
[Figure: piano roll (frame index vs. piano-roll index 1-128); mel filter bank (80 filters over 0-12 kHz) vs. MIDI filter bank (128 filters centered at MIDI note frequencies, 0-12 kHz); resulting mel-spectrogram (80 dimensions) and MIDI-fb-spectrogram (128 dimensions) over the same frames.]
f = 2^{(k - 69)/12} × 440, where k is the MIDI note number and f is the corresponding frequency in Hz (librosa.midi_to_hz implements this mapping).
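A minimal sketch of this mapping and of a MIDI-centered triangular filterbank follows; the 24 kHz sampling rate and 2048-point FFT are assumptions, and the filter shape is only analogous to librosa's mel filterbank, not necessarily the exact design used here.

    import numpy as np
    import librosa

    sr, n_fft = 24000, 2048                                   # assumed analysis settings
    fft_freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)   # center frequency of each bin

    midi_notes = np.arange(1, 129)                            # MIDI note numbers k = 1..128
    midi_freqs = librosa.midi_to_hz(midi_notes)               # f = 2**((k - 69) / 12) * 440

    def midi_filterbank(centers, freqs):
        # Triangular filters centered at each MIDI note frequency (illustrative design).
        edges = np.concatenate([[centers[0] / 2], centers, [freqs[-1]]])
        fb = np.zeros((len(centers), len(freqs)))
        for i in range(len(centers)):
            lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
            rise = (freqs - lo) / max(mid - lo, 1e-8)
            fall = (hi - freqs) / max(hi - mid, 1e-8)
            fb[i] = np.maximum(0.0, np.minimum(rise, fall))
        return fb

    fb = midi_filterbank(midi_freqs, fft_freqs)               # shape (128, n_fft // 2 + 1)
    # A MIDI-fb-spectrogram is then fb @ np.abs(stft)**2, mirroring the mel pipeline.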
11. 11
Methods
Blank lines in MIDI-fb-spec. due to frequency resolution of FFT (see appendix)
q Acoustic features
1. Mel-spectrogram
2. MIDI-filterbank-spectrogram
[Pipeline figure: MIDI → API → piano roll x1:N → acoustic model → acoustic features a1:N → waveform model → wav o1:T]
[Figure repeated from the previous slide: piano roll, mel vs. MIDI filter banks, and the resulting mel-spectrogram and MIDI-fb-spectrogram.]
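The "blank lines" remark above can be checked with quick arithmetic: under the same assumed 24 kHz / 2048-point FFT settings as in the earlier sketch, the FFT bin spacing exceeds the spacing between neighbouring note frequencies in the low register, so some low MIDI filters contain no FFT bin.

    import librosa

    bin_spacing = 24000 / 2048                                   # about 11.7 Hz per FFT bin
    gap_low = librosa.midi_to_hz(41) - librosa.midi_to_hz(40)    # about 4.9 Hz around E2
    gap_mid = librosa.midi_to_hz(70) - librosa.midi_to_hz(69)    # about 26.2 Hz around A4
    print(bin_spacing, gap_low, gap_mid)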
12. 12
Methods
Zhao, Y., Wang, X., Juvela, L. & Yamagishi, J. Transferring neural speech waveform synthesizers to musical instrument sounds generation. in
Proc. ICASSP 6269–6273 (IEEE, 2020). doi:10.1109/ICASSP40776.2020.9053047
q Waveform model
§ Based on the music neural source-filter (NSF) model (Zhao 2020), but
• without the harmonic-plus-noise structure
[Pipeline figure: MIDI → API → piano roll x1:N → acoustic model → acoustic features a1:N → waveform model → wav o1:T]
[Figure: NSF waveform model. Condition module: acoustic features → Bi-LSTM → 1D CNN → up-sampling. Neural filter module: excitation signal → blocks 1-5, each containing dilated 1D convolutions and FC layers → wav.]
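For intuition about the sine vs. noise excitation contrasted later in the results, here is a rough sketch of generating each excitation from a per-frame pitch track; the frame shift, amplitudes, and helper function are assumptions, not the NSF code from (Zhao 2020).

    import numpy as np

    def make_excitation(f0, sr=24000, frame_shift=0.005, kind="sine", noise_std=0.003):
        # f0: per-frame fundamental frequency in Hz (0 for silent / unvoiced frames).
        hop = int(sr * frame_shift)
        f0_up = np.repeat(f0, hop)                       # upsample F0 to the sample rate
        if kind == "noise":
            return 0.1 * np.random.randn(len(f0_up))     # pure noise excitation
        phase = 2 * np.pi * np.cumsum(f0_up / sr)        # integrate instantaneous frequency
        sine = 0.1 * np.sin(phase)
        sine[f0_up <= 0] = 0.0                           # silence where there is no pitch
        return sine + noise_std * np.random.randn(len(f0_up))

    exc = make_excitation(np.full(200, 440.0))           # 1 s of A4 excitation (200 x 5 ms)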
13. 13
Experiments
Hawthorne, C. et al. Enabling factorized piano music modeling and generation with the MAESTRO dataset. in Proc. ICLR (2018).
https://piano-e-competition.com/
q Database: MAESTRO v2.0 (Hawthorne 2018)
§ Real piano performances from the International Piano-e-Competition
§ MIDI was recorded simultaneously during the performances
• Aligned audio & piano roll
§ For experiments:
• Follow the official data split
• 24 kHz, 16-bit PCM
[Table: data split of MAESTRO, from https://magenta.tensorflow.org/datasets/maestro#v200]
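For concreteness, a minimal preprocessing sketch that reads the official split from the MAESTRO v2.0.0 metadata CSV and resamples the audio to 24 kHz, 16-bit PCM; the paths and column names follow the public release and should be treated as assumptions, not the authors' preprocessing script.

    import pandas as pd
    import librosa
    import soundfile as sf

    meta = pd.read_csv("maestro-v2.0.0/maestro-v2.0.0.csv")      # official metadata file
    train = meta[meta["split"] == "train"]                       # official train split

    for _, row in train.iterrows():
        wav, _ = librosa.load("maestro-v2.0.0/" + row["audio_filename"], sr=24000)
        out = row["audio_filename"].replace("/", "_").replace(".wav", "_24k.wav")
        sf.write(out, wav, 24000, subtype="PCM_16")              # 24 kHz, 16-bit PCM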
21. 21
Messages
q TTS & MIDI-to-audio
§ Techniques can be shared: acoustic model, waveform model
§ Performance bottleneck lies in the acoustic model (and the waveform model)
q On waveform modeling
§ Physical-model synthesis performs well but lacks reverberation effects
§ Sample-based synthesis relies on the sample database
§ Non-autoregressive (non-AR) waveform models work well in copy-synthesis
• Reverberation is captured
• Noise excitation is acceptable
22. 22
Messages
q On acoustic model
§ Obtaining good alignments for longer input sequences is challenging
§ Feeding the current piano-roll frame to the decoder pre-net helps improve alignments
• Acceptable for perfectly aligned performance MIDI
• Other strategies are needed for non-aligned score MIDI
27. 28
Appendix
q On MIDI filterbank
[Audio samples: four short clips comparing natural audio against resynthesis from a mel-spectrogram, a MIDI-centered filter-bank representation, and a CQT representation.]
28. 29
Appendix
https://www.fluidsynth.org/
https://www.modartt.com/pianoteq
q Models in comparison
System ID     | Acoustic model | Acoustic feature | Excit. signal | Wave. model | Pitch mismatch (note) | Pitch mismatch (chord) | MOS (mean)
Natural       | -              | -                | -             | -           | -    | -    | 4.04
Fluidsynth    | Sample-based MIDI-to-audio software |          |       |     | 5.20 | 6.77 | 3.66
Pianoteq      | Physical-model MIDI-to-audio software |        |       |     | 4.82 | 6.50 | 4.25
abs-mfb-sin   | -     | midi-fb   | sine  | NSF | -    | -    | 3.87
abs-mfb-noi   | -     | midi-fb   | noise | NSF | -    | -    | 3.77
abs-mel-sin   | -     | mel-spec. | sine  | NSF | -    | -    | 2.72
abs-mel-noi   | -     | mel-spec. | noise | NSF | -    | -    | 3.81
taco2-mfb-sin | taco2 | midi-fb   | sine  | NSF | 4.61 | 6.34 | 2.97
taco2-mfb-noi | taco2 | midi-fb   | noise | NSF | 4.66 | 6.36 | 3.18
taco3-mfb-sin | taco3 | midi-fb   | sine  | NSF | 4.78 | 6.48 | 3.19
taco3-mfb-noi | taco3 | midi-fb   | noise | NSF | 4.89 | 6.53 | 3.19
taco4-mfb-sin | taco4 | midi-fb   | sine  | NSF | 4.86 | 6.39 | 2.98
taco4-mfb-noi | taco4 | midi-fb   | noise | NSF | 4.97 | 6.42 | 2.95
pfnet-mfb-sin | PFNet | midi-fb   | sine  | NSF | 5.59 | 7.14 | 3.10
pfnet-mfb-noi | PFNet | midi-fb   | noise | NSF | 5.78 | 7.26 | 3.05
pfnet-mel-sin | PFNet | mel-spec. | sine  | NSF | 5.66 | 7.17 | 1.82
pfnet-mel-noi | PFNet | mel-spec. | noise | NSF | 5.74 | 7.25 | 2.93
pfnet-spec-GL | PFNet | spec.     | -     | GL  | 5.43 | 6.98 | 1.62
midi-sin-nsf  | -     | -         | sine  | NSF | 4.32 | 6.40 | 2.88
midi-noi-nsf  | -     | -         | noise | NSF | 4.40 | 6.08 | 2.63
[Figure: the four NSF-based MIDI-to-audio configurations. With an acoustic model: MIDI → API → acoustic model → NSF (sine or noise excitation) → wav. Without an acoustic model (midi-sin-nsf / midi-noi-nsf): MIDI → API → NSF (sine or noise excitation) → wav.]
31. 32
Appendix
Caetano, Marcelo, and Xavier Rodet. "A source-filter model for musical instrument sound transformation." ICASSP. IEEE, 2012.
Klapuri, Anssi, Tuomas Virtanen, and Toni Heittola. "Sound source separation in monaural music signals using excitation-filter model and EM algorithm." ICASSP. IEEE, 2010.
q On music audio and speech waveform