1) The document describes research presented by Hyeong-Seok Choi and Juheon Lee on conditional generative models for audio.
2) It provides examples of conditional generative models, including vocoders for speech generation and singing voice synthesis models that generate singing from text and pitch inputs.
3) The researchers have worked on applications such as speech enhancement using generative models and audio-driven dance generation.
2. Presenters
Hyeong-Seok Choi (최형석)
kekepa15@snu.ac.kr
Affiliation: Seoul National University, Music & Audio Research Group
Research interests: Audio Source Separation; Speech Enhancement; Self-supervised Representation Learning & Generation; Singing Voice Synthesis

Juheon Lee (이주헌)
juheon2@snu.ac.kr
Affiliation: Seoul National University, Music & Audio Research Group
Research interests: Singing Voice Synthesis; Lyric-to-audio Alignment; Cover Song Identification; Abnormal Sound Detection; Choreography Generation
5. Generative models
Explicit models: infer the parameters of 𝑝(𝑿; 𝜽). (i.e., how likely is this cat?)
[Figure: a data sample 𝑿 and its modeled density 𝑝(𝑿; 𝜽)]
Examples: VAE, autoregressive models, …
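To make "infer the parameters" concrete, here is a minimal Python sketch of an explicit model: a diagonal Gaussian whose parameters 𝜽 = (μ, σ) are fit by maximum likelihood and can then be queried for how likely a sample is. This is an illustration only; VAEs and autoregressive models estimate far richer densities, but the interface (fit 𝜽, then evaluate 𝑝(𝑿; 𝜽)) is the same.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=0.5, size=(1000, 16))  # toy "audio feature" dataset

# Infer theta = (mu, sigma) by maximum likelihood (closed form for a Gaussian).
mu, sigma = data.mean(axis=0), data.std(axis=0)

def log_likelihood(x, mu, sigma):
    """log p(x; theta) under an independent (diagonal) Gaussian."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                  - (x - mu) ** 2 / (2 * sigma ** 2))

print(log_likelihood(data[0], mu, sigma))         # a typical sample: high log-density
print(log_likelihood(data[0] + 10.0, mu, sigma))  # an outlier: much lower
```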
6. Generative models
Implicit models: I don’t care about the parameters, just give me some nice cats when I roll the dice! (sampling)
[Figure: samples 𝑿 drawn without an explicit 𝑝(𝑿; 𝜽)]
Examples: GANs…
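By contrast, a minimal sketch of an implicit model: a generator function maps a latent "dice roll" z to a sample, so we can draw as many samples as we like, but there is no tractable 𝑝(𝑿; 𝜽) to query. The untrained random weights below are a stand-in; in a real GAN the generator would be trained adversarially against a discriminator.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 64))   # untrained stand-in weights
W2 = rng.normal(size=(64, 16))

def generator(z):
    """Latent 'dice roll' z -> sample; the output density is implicit."""
    return np.tanh(z @ W1) @ W2

z = rng.normal(size=(5, 8))  # roll the dice five times
samples = generator(z)       # five samples (16-dim toy vectors, not cats)
print(samples.shape)         # (5, 16); note: no p(X; theta) to evaluate
```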
8. Conditional generative models
Application-dependent modeling:
1. Given a piano roll, I want to generate an expressive piano performance
2. Given a mel-spectrogram, I want to generate a raw audio signal
3. Given a linguistic feature, I want to generate a speech signal
…
[Diagram: Condition → Generative Model → Output signal; the condition provides controllability]
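A minimal sketch of what the diagram adds on top of a plain generator: the model consumes a condition c alongside the latent z, so the same network can be steered, which is the "controllability" above. The one-hot instrument condition and the concatenation scheme are illustrative assumptions, not a specific system from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8 + 4, 64))  # input: latent z (8 dims) concatenated with condition c (4 dims)
W2 = rng.normal(size=(64, 16))

def conditional_generator(z, c):
    """[z; c] -> sample: the same z with a different c gives a different output."""
    return np.tanh(np.concatenate([z, c], axis=-1) @ W1) @ W2

z = rng.normal(size=(1, 8))
c_piano = np.eye(4)[[0]]   # one-hot condition, e.g. an instrument class
c_violin = np.eye(4)[[1]]
out_piano = conditional_generator(z, c_piano)
out_violin = conditional_generator(z, c_violin)
print(np.allclose(out_piano, out_violin))  # False: the condition steers the output
```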
9. Conditional generative models
What does a conditional generative model do?
It reconstructs a signal from given information (i.e., it fills in the missing information).
How much information is “missing”? (from a music & audio point of view)
Condition abstraction level, roughly from abstract (sparse) to realistic (dense):
- Instrument class / sound class
- Non-expressive score / linguistic feature
- MIDI score w/ velocity etc. / linguistic feature w/ pitch
- Audio features (mel-spectrogram)
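At the dense end of this axis sits the mel-spectrogram, which carries nearly everything about the signal except phase. A sketch of extracting one with librosa; the synthetic sine input and the parameter values (n_fft, hop_length, n_mels) are common vocoder-style settings, not values taken from the talk.

```python
import numpy as np
import librosa

sr = 22050
t = np.linspace(0, 1.0, sr, endpoint=False)
y = 0.5 * np.sin(2 * np.pi * 440.0 * t)  # 1 s of A4 as a stand-in for real audio

# A dense condition: almost all of the signal's information except phase.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=80)
log_mel = np.log(np.clip(mel, 1e-5, None))  # log-compress, as vocoders expect
print(log_mel.shape)  # (80, n_frames)
```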
15. 15
Conditional generative models: applications
Example of densely conditioned models: Vocoders
Practical/interesting application of vocoders: Generative speech enhancement
1. Parametric Resynthesis with Neural Vocoders (WASPAA 2019)
2. Generative Speech Enhancement Based on Cloned Networks (WASPAA 2019)
3. Speaker Independence of Neural Vocoders and Their Effect on Parametric Resynthesis Speech Enhancement (arXiv, 2019)
4. A Speech Synthesis Approach for High-Quality Speech Separation and Generation (IEEE Signal Processing Letters, 2019)
Key idea: combine the strengths of the discriminative & generative approaches!
Pros: almost no artifacts
Cons: inaccurate pronunciation in low-SNR conditions
[Pipeline: noisy mel-spectrogram → Separator (discriminative) → estimated clean mel-spectrogram → Synthesizer/vocoder (generative) → synthesized clean raw wave]
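A minimal sketch of this two-stage pipeline (module names, layer sizes, and the `vocoder` callable are illustrative assumptions, not the papers' implementations):

```python
import torch
import torch.nn as nn

class Separator(nn.Module):
    """Discriminative stage: map a noisy mel-spectrogram to a clean estimate."""
    def __init__(self, n_mels=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(256, n_mels, kernel_size=3, padding=1),
        )

    def forward(self, noisy_mel):      # (batch, n_mels, frames)
        return self.net(noisy_mel)     # estimated clean mel

# Generative stage: any pretrained neural vocoder mapping mel -> waveform
# (`vocoder` is a placeholder for e.g. a WaveNet/WaveRNN-style model).
def enhance(noisy_mel, separator, vocoder):
    clean_mel = separator(noisy_mel)   # discriminative denoising
    return vocoder(clean_mel)          # generative resynthesis to raw audio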
Some of my preliminary results… [audio samples: noisy input vs. generated output]
17. 17
Conditional generative models: applications
Example of densely conditioned models: Vocoders
• Another practical/interesting application: next-generation codecs
1. WaveNet Based Low Rate Speech Coding (ICASSP 2018)
2. Low Bit-Rate Speech Coding with VQ-VAE and a WaveNet Decoder (ICASSP 2019)
3. Improving Opus Low Bit Rate Quality with Neural Speech Synthesis (arXiv, 2019)
Key idea:
1. Deep learning is good at learning a compressed representation (Encoder).
2. Deep learning is good at synthesizing (Decoder).
Pros: low bit rate (bps)
Cons: ???
[Diagram: Encoder (server 1) → compressed representation → Decoder (server 2) → reconstructed signal (speech)]
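A rough sketch of the encode-then-decode idea with a VQ-VAE-style discrete bottleneck (layer sizes, codebook size, and the nearest-code lookup are assumptions for illustration, not any specific paper's codec):

```python
import torch
import torch.nn as nn

class SpeechCodec(nn.Module):
    """Sketch of a learned codec: encoder compresses, decoder resynthesizes."""
    def __init__(self, n_codes=256, code_dim=64):
        super().__init__()
        # Strided conv downsamples the waveform into frame-rate features.
        self.encoder = nn.Conv1d(1, code_dim, kernel_size=320, stride=160)
        self.codebook = nn.Embedding(n_codes, code_dim)  # VQ-style discrete codes
        self.decoder = nn.ConvTranspose1d(code_dim, 1, kernel_size=320, stride=160)

    def encode(self, wav):                      # (batch, 1, samples) -> indices
        z = self.encoder(wav).transpose(1, 2)   # (batch, frames, code_dim)
        # Nearest codebook entry per frame: these integers are the low-rate stream.
        dist = ((z.unsqueeze(2) - self.codebook.weight) ** 2).sum(-1)
        return dist.argmin(dim=-1)              # (batch, frames)

    def decode(self, indices):                  # indices -> reconstructed waveform
        z = self.codebook(indices).transpose(1, 2)
        return self.decoder(z)
```

Only the code indices cross the network between the two servers, which is where the bit-rate saving comes from.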
18. 18
Conditional generative models: applications
Singing Voice Generation – single singer
J Lee et al, Adversarially Trained End-to-end Korean Singing Voice Synthesis System (Interspeech 2019)
Training stage
19. 19
Conditional generative models: applications
Singing Voice Generation – single singer
J Lee et al, Adversarially Trained End-to-end Korean Singing Voice Synthesis System (Interspeech 2019)
Generation stage
[Diagram: TEXT + MIDI conditioning → conditioned wave]
20. 20
Conditional generative models: applications
Singing Voice Generation – single singer
J Lee et al, Adversarially Trained End-to-end Korean Singing Voice Synthesis System (Interspeech 2019)
Main idea: disentangling the formant mask & pitch skeleton
• We wanted pitch and text information to be modeled as independent acoustic features, and we designed the network to reflect that (a toy sketch of the factorization follows).
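A toy sketch of the factorization idea, assuming the generated mel-spectrogram is (roughly) an elementwise combination of a text-driven formant mask and a pitch-driven harmonic skeleton; the actual network is considerably more involved:

```python
import numpy as np

# Conceptual sketch (assumed factorization, not the paper's exact network):
# a "pitch skeleton" carries harmonic structure from the MIDI pitch, and a
# "formant mask" carries pronunciation/timbre from the text.
n_mels, n_frames = 80, 200
mel_axis = np.linspace(0, 1, n_mels)[:, None]

def pitch_skeleton(f0_track):        # harmonic ridges that follow f0
    return np.abs(np.sin(20 * np.pi * mel_axis * f0_track[None, :]))

def formant_mask(text_envelope):     # smooth spectral envelope per frame
    return np.clip(text_envelope, 0.0, 1.0)

f0 = np.linspace(0.3, 0.6, n_frames)                       # rising pitch (normalized)
envelope = np.random.default_rng(0).random((n_mels, n_frames))  # stand-in for text features

# The generated mel-spectrogram combines the two independent factors:
mel = formant_mask(envelope) * pitch_skeleton(f0)
print(mel.shape)  # (80, 200)
```

Because the two factors are modeled independently, changing the text while holding the pitch fixed (or vice versa) changes only one of them, which is exactly what the generation results on the following slides demonstrate.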
21. 21
Conditional generative models: applications
Singing Voice Generation – single singer
J Lee et al, Adversarially Trained End-to-end Korean Singing Voice Synthesis System (Interspeech 2019)
Generation Result
Input text: “do re mi fa sol ra ti do”
Input pitch: [C D E F G A B C]
Generated audio: [Figure: formant mask, pitch skeleton, generated mel-spectrogram]
22. 22
Conditional generative models: applications
Singing Voice Generation – single singer
J Lee et al, Adversarially Trained End-to-end Korean Singing Voice Synthesis System (Interspeech 2019)
Generation Result
Input text: “do do do do do do do do”
Input pitch: [C D E F G A B C]
Generated audio: [Figure: formant mask, pitch skeleton, generated mel-spectrogram]
23. 23
Conditional generative models: applications
Singing Voice Generation – single singer
J Lee et al, Adversarially Trained End-to-end Korean Singing Voice Synthesis System (Interspeech 2019)
Generation Result
Input text: “do re mi fa sol ra ti do”
Input pitch: [C C C C C C C C]
Generated audio: [Figure: formant mask, pitch skeleton, generated mel-spectrogram]
24. 24
Conditional generative models: applications
Singing Voice Generation – single singer
J Lee et al, Adversarially Trained End-to-end Korean Singing Voice Synthesis System (Interspeech 2019)
Generation Result
Input text (lyrics of the Korean folk song “Arirang”): “아리랑 아리랑 아라리오 아리랑 고개로 넘어간다 나를 버리고 가시는 님은 십리도 못 가서 발병 난다”
Romanized: “arirang arirang arario arirang go gae ro neom eo gan da na reul beo ri go ga shi neun nim eun sib ri do mot ga seo bal byung nan da”
[Figure: input pitch and generated result, with audio samples of the generated singing]
25. 25
Conditional generative models: applications
Singing Voice Generation – multi singer
J Lee et al, Disentangling Timbre and Singing Style with Multi-singer Singing Synthesis System (submitted to ICASSP 2020)
• Based on the single-singer model, with an added singer identity encoder.
• Disentangles singer identity into timbre and singing style.
26. 26
Conditional generative models: applications
Singing Voice Generation – multi singer
Generation Result
Singer A / Singer B
27. 27
Conditional generative models: applications
Singing Voice Generation – multi singer
Generation Result: swapping the disentangled factors
Timbre A + Style B / Timbre B + Style A
28. 28
Conditional generative models: applications
What is lacking? Randomness.
1. [Diagram: Condition (a. controllability) → Generative Model → Output (a. signal)] -- deterministic
2. [Diagram: Condition (a. controllability, b. a signal: image/audio) + Randomness (a. uncertainty, b. creativity) → Generative Model → Output (a. signal: audio/image)] -- some stochasticity; a multi-modal transform
Can be seen as a supervised way of disentangling representations.
29. 29
Conditional generative models: applications
Audio Driven Dance Generation – Listen to dance
J Lee et al, Automatic choreography generation with convolutional encoder-decoder network (ISMIR 2019)
30. 30
Conditional generative models: applications
Audio Driven Dance Generation – Listen to dance
J Lee et al, Automatic choreography generation with convolutional encoder-decoder network (ISMIR 2019)
32. 32
Conditional generative models: applications
Audio Driven Dance Generation – Listen to dance
J Lee et al, Automatic choreography generation with convolutional encoder-decoder network (ISMIR 2019)
Black Pink - 불장난 (Playing with Fire)
Red Velvet - Rookie
33. 33
Conditional generative models: applications
Audio Driven Dance Generation – Dance with melody
T Tang et al, Dance with Melody: An LSTM-autoencoder Approach to Music-oriented Dance Synthesis (ACMMM 2018)
34. 34
Conditional generative models: applications
Audio Driven Dance Generation – Dance with melody
T Tang et al, Dance with Melody: An LSTM-autoencoder Approach to Music-oriented Dance Synthesis (ACMMM 2018)
• By using an autoencoder, obtain reduced acoustic features.
• With a temporal-index mask, transform the frame-indexed acoustic features into beat-indexed acoustic features (a sketch follows).
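A minimal sketch of beat-indexed pooling using librosa's beat tracker (the paper's temporal-index mask may differ in detail):

```python
import numpy as np
import librosa

# Sketch: pool frame-indexed features into beat-indexed features by
# aggregating all frames that fall between consecutive beats.
y, sr = librosa.load(librosa.example("trumpet"))
mel = librosa.feature.melspectrogram(y=y, sr=sr)          # (n_mels, n_frames)
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)  # beat positions (frames)

# Mean-aggregate frames within each beat segment:
beat_mel = librosa.util.sync(mel, beat_frames, aggregate=np.mean)
print(mel.shape, "->", beat_mel.shape)  # frame axis collapses to beat segments
```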
35. 35
Conditional generative models: applications
Audio Driven Dance Generation – Dancing to music
Hsin-Ying Lee et al, Dancing to Music (NeurIPS 2019)
36. 36
Conditional generative models: applications
Audio Driven Dance Generation – Dancing to music
Hsin-Ying Lee et al, Dancing to Music (NeurIPS 2019)
Learning How to Move
Learning How to Compose
Generation
37. 37
Conditional generative models: applications
Audio Driven Dance Generation – Dancing to music
Hsin-Ying Lee et al, Dancing to Music (NeurIPS 2019)
Learning How to Move
• Decompose the dance sequence using kinematic beats
• With a VAE, disentangle each dance unit into an initial pose + movement (a sketch follows)
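A minimal VAE-style sketch of the initial-pose/movement split (module names, feature sizes, and the GRU encoder/decoder are assumptions, not the paper's architecture):

```python
import torch
import torch.nn as nn

class MovementVAE(nn.Module):
    """Sketch: encode a dance unit into (initial pose, movement latent z)."""
    def __init__(self, pose_dim=51, z_dim=32, hidden=128):
        super().__init__()
        self.encoder = nn.GRU(pose_dim, hidden, batch_first=True)
        self.to_mu = nn.Linear(hidden, z_dim)
        self.to_logvar = nn.Linear(hidden, z_dim)
        self.decoder = nn.GRU(z_dim + pose_dim, hidden, batch_first=True)
        self.to_pose = nn.Linear(hidden, pose_dim)

    def forward(self, poses):               # (batch, frames, pose_dim)
        init_pose = poses[:, 0]             # deterministic initial pose
        _, h = self.encoder(poses)          # summarize the movement
        mu, logvar = self.to_mu(h[-1]), self.to_logvar(h[-1])
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        # Decode: condition every output frame on (movement z, initial pose).
        T = poses.size(1)
        dec_in = torch.cat([z, init_pose], dim=-1).unsqueeze(1).expand(-1, T, -1)
        out, _ = self.decoder(dec_in)
        return self.to_pose(out), mu, logvar  # reconstruction + terms for the KL loss
```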
38. 38
Conditional generative models (multi-modal)
Audio Driven Dance Generation – Dancing to music
Hsin-Ying Lee et al, Dancing to Music (NeurIPS 2019)
Learning How to Compose
• Learns how to meaningfully compose a sequence of basic
movements into a dance conditioned on the input music.
• Conditional adversarial training enforces music–dance (M&D) correspondence
39. 39
Conditional generative models: applications
Audio Driven Dance Generation – Dancing to music
Hsin-Ying Lee et al, Dancing to Music (NeurIPS 2019)
[Figure annotation: where conditioning is applied]
40. 40
Conditional generative models: applications
Audio Driven Dance Generation – Dancing to music
Hsin-Ying Lee et al, Dancing to Music (NeurIPS 2019)
41. 41
Conditional generative models: applications
From Inference to Generation: End-to-end Fully Self-supervised Generation of Human Face
from Speech
Anonymous authors (ICLR 2020, OpenReview)
Stochastic part (uncertainty)
42. 42
Conditional generative models: applications
From Inference to Generation: End-to-end Fully Self-supervised Generation of Human Face
from Speech
Anonymous authors (ICLR 2020, OpenReview)
[Figure: generated face samples for four speakers, Spk1–Spk4]
43. 43
Conditional generative models: applications
From Inference to Generation: End-to-end Fully Self-supervised Generation of Human Face
from Speech
Anonymous authors (ICLR 2020, OpenReview)
Fix z & Change c (speech embedding)
44. 44
Conditional generative models: applications
From Inference to Generation: End-to-end Fully Self-supervised Generation of Human Face
from Speech
Anonymous authors (ICLR 2020, OpenReview)
Fix c & Change z (random sampling)
Generator
Architecture
A stack of transposed convolutional layers upsamples the input sequence; each transposed convolutional layer is followed by a stack of residual blocks.
Induced Receptive Field
The residual blocks use dilated convolutions, so temporally distant output activations of each layer have significantly overlapping inputs. The receptive field of a stack of dilated convolution layers increases exponentially with the number of layers. A sketch follows.
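A MelGAN-style sketch matching this description (channel counts, upsampling rates, and dilations are assumed for illustration): with kernel size 3 and dilations 1, 3, 9 per stack, each stack's receptive field spans 1 + 2·(1+3+9) = 27 activations at its resolution, and it compounds across upsampling stages.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Dilated residual block: far-apart outputs share overlapping inputs."""
    def __init__(self, ch, dilation):
        super().__init__()
        self.conv = nn.Sequential(
            nn.LeakyReLU(0.2),
            nn.Conv1d(ch, ch, kernel_size=3, dilation=dilation, padding=dilation),
            nn.LeakyReLU(0.2),
            nn.Conv1d(ch, ch, kernel_size=1),
        )

    def forward(self, x):
        return x + self.conv(x)

class Generator(nn.Module):
    """Upsample mel frames to a waveform: transposed convs + dilated res stacks."""
    def __init__(self, n_mels=80, ch=256):
        super().__init__()
        layers = [nn.Conv1d(n_mels, ch, kernel_size=7, padding=3)]
        for rate in (8, 8, 2, 2):   # total upsampling: 8*8*2*2 = 256x
            layers += [
                nn.LeakyReLU(0.2),
                nn.ConvTranspose1d(ch, ch // 2, kernel_size=2 * rate,
                                   stride=rate, padding=rate // 2),
            ]
            ch //= 2
            # Exponentially growing dilations after each upsampling step:
            layers += [ResBlock(ch, d) for d in (1, 3, 9)]
        layers += [nn.LeakyReLU(0.2),
                   nn.Conv1d(ch, 1, kernel_size=7, padding=3), nn.Tanh()]
        self.net = nn.Sequential(*layers)

    def forward(self, mel):          # (batch, n_mels, frames)
        return self.net(mel)         # (batch, 1, frames * 256)
```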
Discriminator
Multiscale Architecture
3 discriminators (identical structure) operate on different audio scales -- the original scale, and 2x and 4x downsampled. Each discriminator is biased to learn features for a different frequency range of the audio.
Window-based Objective
Each individual discriminator is a Markovian window-based discriminator (analogous to image patches; Isola et al. (2017)). It learns to classify between distributions of small audio chunks, and overlapping large windows maintain coherence across patches. A sketch follows.
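A matching sketch of the multiscale, window-based discriminator (layer hyperparameters are assumed; the per-window score map plays the role of patch scores):

```python
import torch
import torch.nn as nn

class WindowDiscriminator(nn.Module):
    """Markovian window-based discriminator: one score per audio chunk
    (analogous to image patches), not one score for the whole clip."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=15, padding=7), nn.LeakyReLU(0.2),
            nn.Conv1d(64, 256, kernel_size=41, stride=4, padding=20), nn.LeakyReLU(0.2),
            nn.Conv1d(256, 256, kernel_size=41, stride=4, padding=20), nn.LeakyReLU(0.2),
            nn.Conv1d(256, 1, kernel_size=3, padding=1),  # score map over windows
        )

    def forward(self, wav):          # (batch, 1, samples)
        return self.net(wav)

class MultiScaleDiscriminator(nn.Module):
    """Three identical discriminators on 1x, 2x, and 4x downsampled audio,
    so each is biased toward a different frequency range."""
    def __init__(self):
        super().__init__()
        self.discs = nn.ModuleList([WindowDiscriminator() for _ in range(3)])
        self.pool = nn.AvgPool1d(kernel_size=4, stride=2, padding=1)  # 2x downsample

    def forward(self, wav):
        scores = []
        for d in self.discs:
            scores.append(d(wav))
            wav = self.pool(wav)     # next discriminator sees coarser audio
        return scores
```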
1. Dance (춤)
2. Audio signal generation
3. Aumon (reflecting stochasticity)
4. Future work, with the example of image generation with stochasticity