The document discusses machine listening and music information retrieval. It introduces common techniques in music auto-tagging like extracting features from audio spectrograms and training classifiers. Deep learning approaches that learn features directly from data are showing promise. Recurrent neural networks are discussed for modeling temporal dependencies in music, with an example of applying them to onset detection. The talk concludes with an example of live drum transcription using drum modeling, onset detection, spectrogram slicing and non-negative source separation.
This presentation shares some basic concepts about audio processing and image processing using MATLAB and was used as teaching material for an introductory workshop.
The following resources come from the 2009/10 B.Sc in Media Technology and Digital Broadcast (course number 2ELE0073) from the University of Hertfordshire. All the mini projects are designed as level two modules of the undergraduate programmes.
This document provides an overview of a mini project on audio processing and digital signal processing. The objectives are to demonstrate digital filtering of audio signals, introduce echoes into audio, and develop knowledge of artificial reverberation. Students will develop MATLAB systems for generating artificial echoes and reverberation for audio files. The document outlines the relevant digital signal processing principles and introduces the digital simulation of audio reverberation.
The following resources come from the 2009/10 B.Sc in Media Technology and Digital Broadcast (course number 2ELE0076) from the University of Hertfordshire. All the mini projects are designed as level two modules of the undergraduate programmes.
This document discusses homomorphic speech processing and techniques for speech enhancement. It provides an overview of modeling speech production as the excitation of a linear time-invariant system. Homomorphic filtering is introduced as a way to deconvolve speech into excitation and system response using logarithmic transformations. The complex cepstrum is discussed as a representation of speech that can be used to estimate pitch, voicing and formant frequencies. Homomorphic vocoding is described as a speech coding technique that quantizes the low-time part of the cepstrum at regular intervals to encode speech. Common techniques for speech enhancement like spectral subtraction and adaptive noise cancellation are also mentioned.
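The homomorphic idea described above can be sketched in a few lines: taking the log of the magnitude spectrum turns the convolution of excitation and vocal-tract response into a sum, so keeping only the low-quefrency cepstral coefficients recovers a smoothed spectral envelope. A minimal NumPy illustration (not code from the document; the 1024-sample frame and cutoff of 30 are arbitrary illustrative choices):

```python
import numpy as np

def real_cepstrum(frame):
    """Real cepstrum: inverse FFT of the log magnitude spectrum."""
    spectrum = np.fft.fft(frame)
    log_mag = np.log(np.abs(spectrum) + 1e-12)  # guard against log(0)
    return np.fft.ifft(log_mag).real

def lifter_envelope(frame, cutoff=30):
    """Keep only the low-quefrency part (vocal-tract envelope estimate)."""
    c = real_cepstrum(frame)
    lifter = np.zeros_like(c)
    lifter[:cutoff] = 1.0
    lifter[-cutoff + 1:] = 1.0          # symmetric half for a real signal
    return np.fft.fft(c * lifter).real  # smoothed log spectrum

# Synthetic voiced-like frame: two harmonics, Hann-windowed
fs = 8000
t = np.arange(1024) / fs
frame = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 400 * t)
env = lifter_envelope(frame * np.hanning(1024))
print(env.shape)  # (1024,)
```

In a real homomorphic vocoder the high-quefrency part (discarded here) carries the pitch information, which is why the cepstrum supports both envelope and pitch estimation.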
Distance Coding And Performance Of The Mark 5 And St350 Soundfield Microphone... - Bruce Wiggins
A paper presented at the Institute of Acoustics Reproduced Sound 25 conference in 2009 looking at the response of two SoundField Ambisonic microphones to sound sources at different distances from the microphone.
MPEG 4 is an object-based multimedia format that allows for low bitrate compression and transmission of audio-visual content such as video, speech, audio, and graphics. It supports scalable compression from mobile phone quality up to HDTV, as well as interactive capabilities. MPEG 4 compression uses various coding methods tailored to different object types and allows objects to be combined into scenes.
Introductory Lecture to Audio Signal Processing - Angelo Salatino
The document provides an introduction to audio signal processing and related topics. It discusses analog and digital audio signals, the waveform audio file format (WAV) specification including its header structure, and tools for audio processing like FFmpeg and MATLAB. Example code is given to read header metadata and audio samples from a WAV file in C++. While useful for understanding audio formats and processing, the solution contains an error and FFmpeg is noted as a better library for audio tasks.
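For comparison with the C++ example mentioned above, the same idea can be sketched in Python using only the standard library: unpack the canonical 44-byte RIFF/WAVE header fields with `struct`. This assumes the common layout in which the `fmt ` chunk immediately follows `WAVE`, which not every WAV file follows; it is an illustration, not the document's code:

```python
import struct
import wave
import io

def parse_wav_header(data: bytes) -> dict:
    """Parse the canonical RIFF/WAVE header fields from raw bytes."""
    riff, size, wave_id = struct.unpack('<4sI4s', data[:12])
    assert riff == b'RIFF' and wave_id == b'WAVE'
    # Canonical layout: the 'fmt ' chunk starts right after 'WAVE'
    (fmt_id, fmt_size, audio_format, channels,
     sample_rate, byte_rate, block_align, bits) = struct.unpack(
        '<4sIHHIIHH', data[12:36])
    assert fmt_id == b'fmt '
    return {'channels': channels, 'sample_rate': sample_rate,
            'bits_per_sample': bits, 'byte_rate': byte_rate}

# Build a 1-second mono 16-bit WAV in memory, then parse it back
buf = io.BytesIO()
with wave.open(buf, 'wb') as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(44100)
    w.writeframes(b'\x00\x00' * 44100)
header = parse_wav_header(buf.getvalue())
print(header['sample_rate'])  # 44100
```

A robust reader would instead walk the chunk list until it finds `fmt ` and `data`, since extra chunks (e.g. `LIST`) may precede them.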
Real Time Drum Augmentation with Physical Modeling - Ben Eyes
This document discusses augmenting acoustic drums with physical modeling to create new sounds and performances. It summarizes previous research that used convolution or spectral processing to digitally process drum sounds. The author then describes his own project that uses a physical model of strings as a VST plugin to process drum sounds from a snare drum and rototoms in real time. An interview with the percussionist discusses the collaborative composition process and how playing with the system required experimenting with extended techniques. The author concludes that future work will involve developing their own drum models and exploring new interfaces like facial recognition to control sound parameters.
Mono recording uses a single audio channel to record sound, resulting in smaller file sizes but lacking sound perspective. Stereo recording uses two audio channels to provide sound perspective but results in larger file sizes. Uncompressed file formats like .WAV files maintain high quality but are very large, while lossy compressed formats like MP3 sacrifice some quality for much smaller file sizes and portability.
[DL Reading Group] IMPROVING VOICE SEPARATION BY INCORPORATING END-TO-END SPEECH RECOGNITION - Deep Learning JP
1) The document proposes improving voice separation performance by incorporating end-to-end speech recognition neural networks trained on large speech datasets.
2) It finds that the phonetic and linguistic information learned by the end-to-end speech recognition model is beneficial for voice separation tasks, even when the quality of training data differs.
3) Evaluation on voice separation from noisy mixtures and singing voice separation with limited data finds the proposed method outperforms baselines and is robust to adverse noise conditions.
PHOENIX AUDIO TECHNOLOGIES - A Large Audio Signal Algorithm Portfolio - HTCS LLC
Phoenix Audio Technologies has made the attached publication available, which lists their Audio Signal Algorithm Portfolio, e.g. Multi Sensor Processing, Blind Source Separation, Echo and Reference Channel Canceling, Single Sensor Processing, Multi Resolution Analysis, Single Power Compression, Direction Finding, Data Tracking, Data Fusion, and more.
Machine listening is a field that encompasses research on a wide range of tasks, including speech recognition, audio content recognition, audio-based search, and content-based music analysis. In this talk, I will start by introducing some of the ways in which machine learning enables computers to process and understand audio in a meaningful way. Then I will draw on some specific examples from my dissertation showing techniques for automated analysis of live drum performances. Specifically, I will focus on my work on drum detection, which uses gamma mixture models and a variant of non-negative matrix factorization, and drum pattern analysis, which uses deep neural networks to infer high-level rhythmic and stylistic information about a performance.
This presentation gives a basic overview of surround sound systems, the various aspects to consider when designing one, and its various formats.
Digital audio was created in the late 1960s when Dr. Thomas Stockham began experimenting with digital tape recording using analog to digital converters. The key aspects of digital audio are:
1) Analog audio is converted to digital form through analog to digital conversion which samples the analog signal at regular intervals determined by the sample rate.
2) Higher sample rates and bit depths produce more accurate digital representations of the original analog signal but result in larger file sizes.
3) Quantization error, in the form of quantization noise, occurs when sample values are rounded to binary numbers during digitization and can be reduced by dithering and increasing bit depth.
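Points 1-3 above can be demonstrated with a small NumPy sketch: sample a sine, round it to a given bit depth, and observe that the quantization error shrinks as bit depth grows. The TPDF dither option is a textbook choice for decorrelating the error, not something taken from the document:

```python
import numpy as np

def quantize(signal, bits, dither=False):
    """Round a [-1, 1] signal to 2**bits levels, optionally with TPDF dither."""
    step = 2.0 / (2 ** bits)                # quantization step size
    x = signal.copy()
    if dither:
        rng = np.random.default_rng(0)
        # Triangular (TPDF) dither of +/- one step, added before rounding
        x = x + (rng.random(x.shape) - rng.random(x.shape)) * step
    return np.clip(np.round(x / step) * step, -1.0, 1.0)

fs = 8000
t = np.arange(fs) / fs
sine = 0.5 * np.sin(2 * np.pi * 440 * t)

for bits in (4, 8, 16):
    err = quantize(sine, bits) - sine
    print(bits, round(float(np.max(np.abs(err))), 6))
# The maximum error is bounded by half a step, so it halves per extra bit
```

With `dither=True` the peak error grows slightly but the error loses its correlation with the signal, which is what makes dithered quantization noise sound benign.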
This document summarizes digital modeling techniques for speech signals. It describes the vocal source and vocal tract that produce speech. It then discusses using sampling and techniques like PCM to digitally represent speech signals. Linear predictive coding is presented as a simple method to analyze speech that approximates samples as combinations of past signals. The summary concludes that linear prediction can be used for spectrum estimation by representing the vocal tract transfer function, pitch detection, and speech synthesis.
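The linear-prediction idea the summary describes, approximating each sample as a combination of past samples, reduces to solving the autocorrelation normal equations for the predictor coefficients. A minimal NumPy sketch (the AR(2) test signal and model order are illustrative assumptions, not the document's example):

```python
import numpy as np

def lpc(signal, order):
    """Linear prediction coefficients via the autocorrelation method."""
    n = len(signal)
    r = np.array([signal[:n - k] @ signal[k:] for k in range(order + 1)])
    # Solve the Toeplitz normal equations R a = r[1:] for the predictor a
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])

# A decaying resonance: x[n] = 1.3 x[n-1] - 0.9 x[n-2], excited by an impulse
x = np.zeros(2000)
x[0] = 1.0
for n in range(1, 2000):
    x[n] = 1.3 * x[n - 1] - 0.9 * (x[n - 2] if n >= 2 else 0.0)

a = lpc(x, order=2)
print(np.round(a, 3))  # recovers coefficients close to [1.3, -0.9]
```

The recovered coefficients define an all-pole filter that models the vocal-tract transfer function, which is exactly why LPC supports the spectrum estimation and synthesis uses mentioned above.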
This document provides guidance on proper sound reinforcement techniques for DJs and discusses common mistakes. It is presented as a manual with multiple chapters. The first chapter explains why placing speakers in four corners, the so-called "4 corner setup", is improper and will produce cancellations and standing waves as the sound waves cross and interfere with each other; models and diagrams illustrate this. The second chapter discusses proper gain structure and avoiding distortion, noting that overdriving the system sends distorted signals through the amplifiers and damages equipment. It also cautions against running monitors and the PA system too loudly and without time alignment. Overall, the document stresses that sound propagation follows the laws of physics and that techniques like 4 corner setups work against them.
Audio compression reduces the bandwidth and file size of digital audio streams and files. Lossy compression provides greater compression rates than lossless compression and is used in consumer devices, but introduces irreversible changes. Lossless compression produces an exact digital duplicate upon decompression. Common lossless formats are FLAC and Apple Lossless. Lossy compression is used widely and achieves much greater compression ratios, discarding less important data, but recompressing causes quality loss making it unsuitable for professional audio editing.
This document summarizes a presentation about real-time pitch detection and speech recognition in Python. It introduces a system that can perform multilingual lyric transcription and pitch detection on a singing voice in real-time. Key aspects of the system include acquiring audio data, processing it using a spectrogram for pitch detection, performing speech recognition using an API, and displaying the results of lyric transcription and pitch tracking. The presentation demonstrates the system recognizing and tracking the pitch of songs sung in different languages.
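The spectrogram-based pitch detection described above can be reduced to its simplest form: window a frame, take the FFT, and convert the strongest bin to a frequency. A naive single-frame sketch (real systems add peak interpolation and octave-error handling; none of this is the presenter's code):

```python
import numpy as np

def detect_pitch(frame, fs):
    """Naive pitch estimate: frequency of the largest spectral peak."""
    windowed = frame * np.hanning(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed))
    peak_bin = int(np.argmax(spectrum[1:]) + 1)   # skip the DC bin
    return peak_bin * fs / len(frame)

fs = 44100
t = np.arange(4096) / fs
a4 = np.sin(2 * np.pi * 440.0 * t)   # the note A4
print(round(detect_pitch(a4, fs), 1))
```

Note the resolution limit: with a 4096-sample frame at 44.1 kHz each bin is about 10.8 Hz wide, which is why streaming systems interpolate between bins or use autocorrelation-based trackers for singing voice.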
Spatial Sound 3: Audio Rendering and Ambisonics - Richard Elen
Part 3 of a series of four presentations that formed the basis of my short course on spatial audio for artists at the Music Department, Ionian University, Corfu, July 2008
This document discusses speech signal processing and speech recognition. It begins by defining speech processing and its relationship to digital signal processing. It then outlines several disciplines related to speech processing including signal processing, physics, pattern recognition, and computer science. The document discusses aspects of speech signals including phonemes, the speech waveform, and spectral envelope. It covers various aspects of speech processing including pre-processing, feature extraction, and recognition. It provides details on techniques for pre-processing, feature extraction including filtering, linear predictive coding, and cepstrum. Finally, it summarizes the main steps in a speech recognition procedure including endpoint detection, framing and windowing, feature extraction, and distortion measure calculations for recognition.
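The framing and windowing steps listed above can be sketched directly. The 25 ms frame / 10 ms hop values below are conventional speech-processing defaults, not values taken from the document:

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Slice a signal into overlapping frames and apply a Hamming window."""
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([x[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    return frames  # shape: (n_frames, frame_len)

# 1 second at 16 kHz -> 25 ms frames every 10 ms
x = np.random.default_rng(0).standard_normal(16000)
frames = frame_signal(x)
print(frames.shape)  # (98, 400)
```

Each windowed frame then feeds the feature-extraction stage (filterbanks, LPC, or cepstrum) described in the document.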
The document discusses the history and technology of sound and audio, including how sound is digitized, common sound file formats like MP3 and WAV, how MIDI works for synthesized sound, software for editing and sequencing sound, and examples of sound hardware like Creative sound cards. It provides an overview of key concepts in digitized sound and music production for multimedia.
Surround sound uses multiple audio channels to immerse the listener in the sound environment. It originated in movie theaters using many speakers to diffuse sound throughout the cinema. Modern home theater systems use 5.1 or 7.1 channel configurations with speakers in front, center, and surround positions. Surround sound formats have evolved from mono to stereo to surround formats like Dolby Digital and DTS that add discrete channels for an immersive 3D soundfield.
This document contains definitions of common digital video and audio terms. It includes explanations of technical terms like codecs, bit rates, frame rates, and video file formats. Compression methods and video hardware are also defined, such as capture cards, character generators, and video connectors. Basic editing terms are listed, such as clips, dissolves, and timecode. The glossary provides concise descriptions of over 100 important concepts in digital video technology.
IRJET - Music Genre Recognition using Convolution Neural Network - IRJET Journal
1. The document describes a study that uses a Convolutional Neural Network (CNN) model to classify music genres based on labeled Mel spectrograms of audio clips.
2. A CNN model is trained on a dataset of 1000 audio clips across 10 genres. The trained model is then used to classify new, unlabeled audio clips by genre based on their Mel spectrogram representation.
3. CNNs are well-suited for this task as their convolutional layers can extract hierarchical features from the Mel spectrogram images that are indicative of different genres. The study aims to develop an automated music genre classification system using deep learning techniques.
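The convolutional feature extraction described in point 3 can be illustrated without any deep-learning framework: slide a small filter over a mel spectrogram, apply a ReLU, and max-pool. This is a toy sketch of the building blocks on a random stand-in spectrogram, not the study's model:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D cross-correlation of a single-channel image."""
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

# Stand-in 128-band mel spectrogram, 100 frames; one edge-detecting filter
mel = np.random.default_rng(0).random((128, 100))
kernel = np.array([[1.0, 1.0, 1.0],
                   [-1.0, -1.0, -1.0]])   # responds to spectral edges
feat = max_pool(relu(conv2d(mel, kernel)))
print(feat.shape)  # (63, 49)
```

A trained CNN stacks many such filters, learning them from data; later layers combine the early edge-like responses into the genre-indicative patterns the study relies on.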
Environmental Sound Detection Using MFCC Technique - Pankaj Kumar
This document describes a project to develop an environmental sound detection and classification technique using Mel Frequency Cepstral Coefficients (MFCC) and Content Based Retrieval (CBR). The methodology involves extracting features from input sounds, clustering similar sounds, and finding matches for query sounds from the clusters. The technique was able to accurately recognize sounds already in the database, but had difficulty rejecting sounds not present. Potential applications include environmental monitoring, speaker recognition, and robot awareness. The technique shows promise but could be improved by using additional sound features and clustering.
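The MFCC front end that the project relies on follows a standard recipe: power spectrum, triangular mel filterbank, log, then a DCT of the filterbank energies. A compact NumPy sketch under those textbook conventions (filter counts and frame length are illustrative, not the project's parameters):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, fs=16000, n_filters=26, n_coeffs=13):
    """MFCCs of one frame: power spectrum -> mel filterbank -> log -> DCT-II."""
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame)) ** 2
    # Triangular filters spaced evenly on the mel scale
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, len(power)))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_e = np.log(fbank @ power + 1e-10)
    # DCT-II of the log filterbank energies decorrelates them
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs), (n + 0.5)) / n_filters)
    return dct @ log_e

frame = np.sin(2 * np.pi * 440 * np.arange(512) / 16000) * np.hanning(512)
print(mfcc_frame(frame).shape)  # (13,)
```

The resulting 13-coefficient vectors are what get clustered and matched in the content-based retrieval step the document describes.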
MUSZIC GENERATION USING DEEP LEARNING PPT.pptx - life45165
To create a Streamlit application for music generation using deep learning, you need to ensure that every element of your Python script is set up correctly and that file paths are handled properly, especially given the specific paths on your system.
This document discusses the use of artificial intelligence in organized sound as surveyed in the journal Organised Sound. It provides an overview of key AI technologies like Auto-Tune audio processing that can correct pitch and organize sound. Applications discussed include general sound classification, open sound control for music networking, and time-frequency representations for sound analysis and resynthesis. The document also outlines recent research on intelligent composer assistants, responsive instruments, and recognition of musical sounds. Finally, it discusses the future of AI in organizing sound through planning and machine learning.
Audio Signal Identification and Search Approach for Minimizing the Search Tim... - aciijournal
Audio or music fingerprints can be used to implement an economical music identification system over a million-song library, but such a system needs a great deal of memory to hold the fingerprints and their indexes. For a large-scale audio library, memory therefore limits the speed of music identification. We propose an efficient music identification system that uses space-saving audio fingerprints: the original fingerprint representations are sub-sampled so that only one quarter of the original data is retained. This greatly reduces the memory demand and therefore sharply increases search speed, while robustness and reliability are well preserved. Mapping audio information to the time and frequency domains for classification, retrieval, or identification tasks presents four principal challenges: the dimension of the input must be considerably reduced; the resulting features must be robust to possible distortions of the input; the features must be informative for the task at hand; and the representation must remain simple. We propose a distortion-free system that fulfils all four of these requirements. An extensive study comparing our system with existing ones shows that it requires less memory, returns results faster, and achieves comparable accuracy on a large-scale database.
AUDIO SIGNAL IDENTIFICATION AND SEARCH APPROACH FOR MINIMIZING THE SEARCH TIM... - aciijournal
This document describes an approach for improving the speed of audio fingerprint searches in large audio databases. It proposes using a more compact representation of audio fingerprints that reduces the memory requirements, while still maintaining accuracy. The key steps are: 1) extracting fingerprints from audio clips by transforming them into spectrograms and filtering specific frequency bands, 2) further compressing the fingerprints using wavelet decomposition and selecting the most informative components, and 3) indexing the compressed fingerprints using min-hash to allow fast retrieval of similar fingerprints from the database. The approach aims to significantly reduce search time compared to existing audio fingerprinting systems, while achieving comparable accuracy.
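Step 3, min-hash indexing of the compressed fingerprints, can be sketched in pure Python. The hash family and set sizes below are illustrative assumptions rather than the paper's parameters; the key property is that two fingerprint sets agree on a min-hash entry with probability roughly equal to their Jaccard similarity, which is what makes bucketed lookup of similar fingerprints fast:

```python
import random

PRIME = 10007  # modulus for the universal-style hash family

def minhash_signature(feature_set, n_hashes=16, seed=1):
    """Min-hash signature: for each hash function, keep the smallest value.
    Two sets agree on an entry with probability ~ their Jaccard similarity."""
    rng = random.Random(seed)
    sig = []
    for _ in range(n_hashes):
        a = rng.randrange(1, PRIME)
        b = rng.randrange(PRIME)
        sig.append(min((a * x + b) % PRIME for x in feature_set))
    return tuple(sig)

def estimate_similarity(sig1, sig2):
    return sum(s1 == s2 for s1, s2 in zip(sig1, sig2)) / len(sig1)

# Toy "fingerprints": sets of active spectrogram components
fp_a = set(range(0, 80))
fp_b = set(range(20, 100))    # Jaccard similarity with fp_a: 60/100 = 0.6
fp_c = set(range(500, 580))   # disjoint from fp_a
sa, sb, sc = (minhash_signature(s) for s in (fp_a, fp_b, fp_c))
print(estimate_similarity(sa, sb), estimate_similarity(sa, sc))
```

In a full system the signatures are split into bands and hashed into buckets, so a query only has to be compared against the handful of database fingerprints that share a bucket.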
Large-Scale Capture of Producer-Defined Musical Semantics - Ryan Stables (Sem...sebastianewert
This document proposes a project to gather large amounts of semantic data from music producers during the music creation process. It involves developing plugins that extract low-level audio features and allow producers to annotate them with semantic descriptors. The goals are to (1) collect semantic data, (2) identify correlations in the data, and (3) use these correlations to aid music production tasks. As a proof of concept, the document outlines a mini-project to collect semantic data from musicians and evaluate suitable systems for future research.
Deep learning for music classification, 2016-05-24Keunwoo Choi
This document describes a presentation on deep learning for music classification. It discusses using deep convolutional neural networks (CNNs) for music classification tasks like genre classification, instrument identification, and automatic music tagging. CNNs can learn hierarchical music features from raw audio or time-frequency representations directly from data without requiring designed features. The presentation provides examples of applying CNNs to automatically tag music with descriptive keywords using a multi-label classification approach.
The document is a glossary of terms related to sound design and production for computer games. It contains definitions for terms like Foley artistry, sound libraries, audio file formats like .wav and .mp3, audio limitations involving hardware, recording systems, sampling, and more. For each term, it provides a short definition from an online source as well as any relevance to the author's own production practice.
This document contains a glossary of terms related to sound design and production for computer games. It includes definitions that Adam Copeland researched for terms like Foley artistry, sound libraries, common audio file formats like .wav and .mp3, audio limitations of hardware like the Sound Processor Unit and Digital Signal Processor, and audio techniques like mono, stereo, and surround sound. For each term, Copeland provides a short definition from an online source and describes how the term relates to his own production practice.
This document contains a glossary of terms related to sound design and production. It defines terms such as Foley artistry, sound libraries, audio file formats like .wav and .mp3, audio limitations including mono and stereo audio, audio recording systems like analog and digital, MIDI, software sequencers, plugins, and MIDI keyboard instruments. For each term, it provides a short definition and URL source, as well as describing the relevance of the term to the author's own production practice, such as using .wav files and mono audio when editing sounds, and using MIDI for creating custom music tracks.
Automatic Music Generation Using Deep LearningIRJET Journal
This document discusses automatic music generation using deep learning. It begins with an abstract describing how music is generated as a sequence of ABC-notation characters using deep learning concepts. LSTMs and GRUs, recurrent neural networks that can model sequences efficiently, are commonly used for music generation. The main purpose of the project described is to generate melodious and rhythmic music automatically using a recurrent neural network. It reviews approaches like WaveNet and LSTM for music generation and tools like Magenta and DeepJazz. The design uses a character RNN with an LSTM network to predict the next character in an ABC-notation sequence to generate music.
The document is a glossary of terms related to sound design and production for computer games created by Phillip Norris Wynne. It contains definitions for over 20 key terms sourced from online references. The terms cover areas such as sound design methodology, file formats, limitations on audio hardware, recording systems, MIDI and software tools. For each term, Phillip provides the online definition and assesses how the term relates to his own production practice. The glossary demonstrates Phillip's research into fundamental concepts and technologies that inform sound design for games.
Presentation of the Allosphere project at UCSB. Imagine a 3 story high sphere suspended in a cube where 3D video and audio are used for scientific discovery and exploration.
1. TEACHING COMPUTERS TO LISTEN TO MUSIC
Eric Battenberg
ebattenberg@gracenote.com
http://ericbattenberg.com
Gracenote, Inc.
Previously:
UC Berkeley, Dept. of EECS
Parallel Computing Laboratory
CNMAT (Center for New Music and Audio Technologies)
2. Business Verticals
Rich, Diverse Multimedia Metadata
Cloud Music Services, Mobile Apps and Devices
Smart TVs, Second Screen Apps, Targeted Ads
Connected Infotainment Systems on the Road
3. SOME OF GRACENOTE’S PRODUCTS
Original product: CDDB (1998)
Recognize CDs and retrieve associated metadata
Scan & Match to precisely identify digital music quickly in large catalogs
ACR (Automatic Content Recognition)
Audio/Video Fingerprinting, enabling second screen experiences
Real-time audio classification systems for mobile devices
Contextually aware cross-modal personalization
Taste profiling for Music, TV, Movies, Social networks
Music Mood descriptors that combine machine learning and editorial training data
6. MACHINE LISTENING
Speech processing
Speech processing makes up the vast majority of funded machine listening research.
Just as there's more to Computer Vision than OCR, there's more to machine listening than speech recognition!
Audio content analysis
Audio classification (music, speech, noise, laughter, cheering)
Audio fingerprinting (e.g. Gracenote, Shazam)
Audio event detection (new song, channel change, hotword)
Content-based Music Information Retrieval (MIR)
Today's topic
7. GETTING COMPUTERS TO "LISTEN" TO MUSIC
Not trying to get computers to "listen" for enjoyment.
More accurate: analyzing music with computers.
What kind of information to get out of the analysis?
What instruments are playing?
What is the mood?
How fast/slow is it (tempo)?
What does the singer sound like?
How can I play this song on my instrument?
(Images: Ben Harper, James Brown)
8. CONTENT-BASED MUSIC INFORMATION RETRIEVAL
Many tasks:
Genre, mood classification, auto-tagging
Beat tracking, tempo detection
Music similarity, playlisting, recommendation (heavily aided by collaborative filtering)
Automatic music transcription
Source separation (voice, drums, bass…)
Music segmentation (verse, chorus)
9. TALK SUMMARY
Introduction to Music Information Retrieval (MIR)
Some common techniques
Exciting new research directions
Auto-tagging, onset detection
Deep learning!
Example: Live Drum Understanding
Drum detection/transcription
10. TALK SUMMARY
Introduction to Music Information Retrieval (MIR)
Some common techniques
Exciting new research directions
Auto-tagging, onset detection
Deep learning!
Example: Live Drum Understanding
Drum detection/transcription
11. QUICK LESSON: THE SPECTROGRAM
The spectrogram: a very common feature used in audio analysis.
Time-frequency representation of audio.
Take the FFT of adjacent frames of audio samples and put them in a matrix.
Each column shows the frequency content at a particular time.
(Spectrogram image: frequency vs. time)
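The steps above can be sketched in a few lines of NumPy. The frame length, hop size, and windowing choice are illustrative, not values from the talk:

```python
import numpy as np

def spectrogram(x, frame_len=512, hop=256):
    """Magnitude spectrogram: FFT of adjacent (overlapping) windowed
    frames, stacked into a matrix. Each column holds the frequency
    content of the signal at one point in time."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft keeps only the non-negative frequencies of a real signal
    return np.abs(np.fft.rfft(frames, axis=1)).T   # shape: (freq bins, time)

# A 440 Hz tone sampled at 8 kHz should concentrate its energy in the
# frequency bin nearest 440 Hz (8000 / 512 = 15.625 Hz per bin).
sr = 8000
t = np.arange(sr) / sr
S = spectrogram(np.sin(2 * np.pi * 440 * t))
```

With a 512-sample frame the matrix has 257 frequency rows (non-negative FFT bins), and the tone's energy lands in bin 28 (437.5 Hz), the bin closest to 440 Hz.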
12. MUSIC AUTO-TAGGING: TYPICAL APPROACHES
Typical approach:
Extract a bunch of hand-designed features describing small windows of the signal (e.g., spectral centroid, kurtosis, harmonicity, percussiveness, MFCCs, 100's more…).
Train a GMM or SVM to predict genre/mood/tags.
Pros:
Works fairly well; was state of the art for a while.
Well-understood models; implementations widely available.
Cons:
Bag-of-frames style approach lacks the ability to describe rhythm and temporal dynamics.
Getting further improvements requires hand-designing more features.
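The pipeline above can be made concrete with one hand-designed feature (the spectral centroid) and a toy classifier. This is a hedged sketch: a nearest-class-mean rule stands in for the GMM/SVM stage, and the synthetic "dark" vs. "bright" clips are invented for the example.

```python
import numpy as np

def bag_of_frames(x, frame_len=512, hop=256):
    """Summarize a clip by the mean and std of one hand-designed per-frame
    feature (the spectral centroid, in bins). Temporal order is discarded
    entirely, which is exactly the 'bag of frames' weakness noted above."""
    window = np.hanning(frame_len)
    centroids = []
    for i in range(0, len(x) - frame_len, hop):
        spec = np.abs(np.fft.rfft(x[i:i + frame_len] * window))
        bins = np.arange(len(spec))
        centroids.append((bins * spec).sum() / (spec.sum() + 1e-12))
    centroids = np.array(centroids)
    return np.array([centroids.mean(), centroids.std()])

# Toy data: low-frequency ('dark') vs. high-frequency ('bright') clips.
rng = np.random.default_rng(0)
def tone(freq, sr=8000, n=4096):
    return (np.sin(2 * np.pi * freq * np.arange(n) / sr)
            + 0.01 * rng.standard_normal(n))

X = np.array([bag_of_frames(tone(f)) for f in (200, 250, 300, 2000, 2500, 3000)])
y = np.array([0, 0, 0, 1, 1, 1])

# Nearest-class-mean classifier standing in for the GMM/SVM stage.
means = np.stack([X[y == k].mean(axis=0) for k in (0, 1)])
def predict(clip):
    f = bag_of_frames(clip)
    return int(np.argmin(((means - f) ** 2).sum(axis=1)))
```

Real systems concatenate hundreds of such per-frame statistics before the classifier; the structure, extract-summarize-classify, is the same.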
13. LEARNING FEATURES: NEURAL NETWORKS
Each layer computes a non-linear transformation (e.g. a sigmoid non-linearity) of the previous layer.
Each hidden layer can be thought of as a set of features.
Train to minimize output error, using backpropagation.
Iterative steps:
Compute activations (feedforward)
Compute output error
Backpropagate the error signal
Compute gradients
Update all weights
Resurgence of neural networks:
More compute (GPUs, Google, etc.)
More data
A few new tricks…
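The iterative steps on this slide can be written out directly in NumPy. This is a minimal sketch, assuming a tiny 2-4-1 sigmoid network on the XOR toy task; network size, learning rate, and iteration count are illustrative choices, not values from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# XOR toy task for a 2-4-1 network with sigmoid non-linearities.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])
W1, b1 = rng.normal(0.0, 1.0, (2, 4)), np.zeros(4)
W2, b2 = rng.normal(0.0, 1.0, (4, 1)), np.zeros(1)

lr, losses = 2.0, []
for _ in range(5000):
    # 1) compute activations (feedforward)
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # 2) compute output error
    losses.append(float(((out - y) ** 2).mean()))
    # 3) backpropagate the error signal and 4) compute gradients;
    #    out*(1-out) and h*(1-h) are the sigmoid derivatives
    d_out = (out - y) * out * (1.0 - out)
    d_h = (d_out @ W2.T) * h * (1.0 - h)
    # 5) update all weights
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(axis=0)
```

The mean squared error recorded in `losses` falls as training proceeds, which is all "train to minimize output error" means here.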
14. DEEP NEURAL NETWORKS
Problem: vanishing error signal; the weights of lower layers do not change much.
Solutions:
Train for a really long time.
Pre-train each hidden layer as an autoencoder. [Hinton, 2006]
Rectified Linear Units (replace the sigmoid with a rectifier non-linearity). [Krizhevsky, 2012]
Millions to billions of parameters; many layers of "features".
Achieving state-of-the-art performance in vision and speech tasks.
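The vanishing-signal problem can be seen numerically: during backpropagation the error is multiplied by the non-linearity's derivative at every layer, the sigmoid derivative never exceeds 0.25, while a ReLU contributes a factor of exactly 1 wherever the unit is active. A small illustrative calculation (depth and pre-activations are made up for the example):

```python
import numpy as np

depth = 20
rng = np.random.default_rng(1)

# Derivative factor contributed by each layer along one path through the net.
z = rng.normal(0.0, 1.0, depth)              # pre-activations at each layer
sig = 1.0 / (1.0 + np.exp(-z))
sigmoid_factors = sig * (1.0 - sig)          # each factor is at most 0.25
sigmoid_path = float(np.prod(sigmoid_factors))

# For a ReLU, active units contribute a derivative of exactly 1, so the
# error signal along an all-active path survives at full strength.
relu_path = 1.0 ** depth
```

After 20 sigmoid layers the surviving gradient factor is below 0.25^20 (about 1e-12), which is why lower-layer weights barely move without pre-training or rectifier units.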
15. AUTOENCODERS AND UNSUPERVISED FEATURE LEARNING
Many ways to learn features in an unsupervised way:
Autoencoders: train a network to reconstruct the input; the hidden units become the features.
Denoising Autoencoders [Vincent, 2008]
Restricted Boltzmann Machines (RBMs) [Hinton, 2006]
Sparse Autoencoders
Clustering: K-Means, mixture models, etc.
Sparse Coding: learn an overcomplete dictionary of features with a sparsity constraint.
(Diagram: data → hiddens (features) → reconstructed data)
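A linear autoencoder makes the "train to reconstruct the input" idea concrete: push the data through a narrow hidden layer and do gradient descent on the reconstruction error. The data, sizes, and learning rate below are illustrative assumptions, not details from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 20-dimensional data that actually lies on a 3-dimensional subspace.
basis = rng.normal(0.0, 1.0, (3, 20)) / np.sqrt(20)
data = rng.normal(0.0, 1.0, (200, 3)) @ basis

# Linear autoencoder with a 3-unit bottleneck, trained to reconstruct input.
W_enc = rng.normal(0.0, 0.1, (20, 3))
W_dec = rng.normal(0.0, 0.1, (3, 20))
errs = []
for _ in range(300):
    code = data @ W_enc            # hidden layer = the learned features
    recon = code @ W_dec           # reconstruction of the input
    err = recon - data
    errs.append(float((err ** 2).mean()))
    # gradient descent on the mean squared reconstruction error
    W_dec -= 0.1 * code.T @ err / len(data)
    W_enc -= 0.1 * data.T @ (err @ W_dec.T) / len(data)
```

No labels are used anywhere: the reconstruction objective alone forces the bottleneck to find the informative directions in the data, which is the sense in which the hidden layer is a learned feature set.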
16. MUSIC AUTO-TAGGING: NEWER APPROACHES
Newer approaches to feature extraction:
Learn spectral features using Restricted Boltzmann Machines (RBMs) and Deep Neural Networks (DNNs) [Hamel, 2010]: good genre performance.
Learn sparse features using Predictive Sparse Decomposition (PSD) [Henaff, 2011]: good genre performance.
Learn beat-synchronous rhythm and timbre features with RBMs and DNNs [Schmidt, 2013]: improved mood performance.
Pros:
Hand-designing individual features is not required.
Computers can learn complex high-order features that humans cannot hand code.
(Image: some learned frequency features, from [Henaff, 2011])
Missing pieces:
More work on incorporating time: context, rhythm, and long-term structure into feature learning.
17. RECURRENT NEURAL NETWORKS
Non-linear sequence model.
Hidden units have connections to the previous time step.
Unlike HMMs, can model long-term dependencies using a distributed hidden state.
Recent developments (plus more compute) have made them much more feasible to train.
(Diagram: standard recurrent neural network with input units, distributed hidden state, and output units, from [Sutskever, 2013])
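The recurrence amounts to one extra weight matrix: the hidden state at each step depends on the current input and on the previous hidden state. A minimal forward pass, with sizes and random weights chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 8, 16
W_xh = rng.normal(0.0, 0.1, (n_in, n_hid))   # input -> hidden
W_hh = rng.normal(0.0, 0.1, (n_hid, n_hid))  # hidden -> hidden (recurrence)
b_h = np.zeros(n_hid)

def rnn_forward(inputs):
    """Run a vanilla RNN over a sequence. The hidden state is a
    distributed summary of everything seen so far, unlike an HMM's
    single discrete state."""
    h = np.zeros(n_hid)
    states = []
    for x_t in inputs:                        # e.g. one spectrogram frame per step
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)
        states.append(h)
    return np.stack(states)

states = rnn_forward(rng.normal(0.0, 1.0, (50, n_in)))
```

Training unrolls this loop through time and backpropagates through every step, which is where the recent algorithmic and hardware improvements matter.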
18. APPLYING RNNS TO ONSET DETECTION
Onset detection: detect the beginning of notes.
Important for music transcription, beat tracking, etc.
Onset detection can be hard for certain instruments with ambiguous attacks or in the presence of background interference.
(Figure: audio signal and derived onset envelope, from [Bello, 2005])
19. ONSET DETECTION: STATE-OF-THE-ART
Using Recurrent Neural Networks (RNNs) [Eyben, 2010], [Böck, 2012]
Input: spectrogram + Δ's at 3 time-scales, fed to a 3-layer RNN.
The RNN output is trained to predict onset locations, i.e. it serves as an onset detection function.
80-90% accurate (state of the art), compared to 60-80% for earlier approaches.
Can improve with more labeled training data, or possibly more unsupervised training.
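For comparison, a classical non-learned onset detection function can be computed directly from the spectrogram as spectral flux: the summed positive magnitude change between consecutive frames. This is a simpler baseline than the RNN approach above, sketched with illustrative parameter values:

```python
import numpy as np

def onset_strength(x, frame_len=512, hop=128):
    """Spectral flux onset detection function: sum of the positive
    magnitude increases between consecutive spectrogram frames.
    New energy appearing in any band pushes the curve up."""
    window = np.hanning(frame_len)
    frames = np.stack([x[i:i + frame_len] * window
                       for i in range(0, len(x) - frame_len, hop)])
    mags = np.abs(np.fft.rfft(frames, axis=1))
    flux = np.maximum(mags[1:] - mags[:-1], 0.0).sum(axis=1)
    return np.concatenate([[0.0], flux])     # pad so index matches frames

# Silence followed by a noise burst: the flux should spike at the burst.
sr = 8000
x = np.zeros(sr)
x[4000:4400] = np.random.default_rng(0).normal(0.0, 1.0, 400)
odf = onset_strength(x)
peak = int(np.argmax(odf))
```

Onsets are then read off by peak-picking this curve; the RNN systems cited above are trained to output a far more robust version of the same kind of curve.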
20. OTHER EXAMPLES OF RNNS IN MUSIC
Examples:
Classical music generation and transcription [Boulanger-Lewandowski, 2012]
Polyphonic piano note transcription [Böck, 2012]
NMF-based source separation with predictive constraints from an RNN [ICASSP 2014, "Deep Learning in Music"]
RNNs are a promising way to model the long-term contextual and temporal dependencies present in music.
21. TALK SUMMARY
Introduction to Music Information Retrieval (MIR)
Some common techniques
Exciting new research directions
Auto-tagging, onset detection
Deep learning!
Example: Live Drum Understanding
Drum detection/transcription
22. LIVE DRUM TRANSCRIPTION
Real-time/live operation.
Useful with any percussion setup:
Before a performance, we can quickly train the system for a particular percussion setup,
or train a more general model for common drum types.
Amplitude (dynamics) information: very important for musical understanding.
23. DRUM DETECTION SYSTEM
Training (Drum Modeling): training data (drum-wise audio) → Onset Detection → Spectrogram Slice Extraction → Gamma Mixture Model → cluster parameters (drum templates)
Performance (Feature Extraction + Source Separation): performance (raw audio) → Onset Detection → Spectrogram Slice Extraction → Non-negative Vector Decomposition (against the drum templates) → drum activations
24. DRUM DETECTION SYSTEM
[System diagram repeated from slide 23]
26. DRUM DETECTION SYSTEM
[System diagram repeated from slide 23]
27. MODELING DRUM SOUNDS WITH
GAMMA MIXTURE MODELS
Instead of taking an “average” of all training slices
for a single drum…
Model each drum using means from Gamma
Mixture Model.
Captures range of sounds produced by each drum.
Cheaper to train than GMM (no covariance matrix)
More stable than GMM (no covariance matrix)
Cluster according to human auditory perception, not Euclidean distance.
Keep audio spectrogram in linear domain (not log domain).
Important for linear source separation.
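To make the "not Euclidean distance" point concrete, here is a hypothetical k-means-style sketch that clusters linear-magnitude spectrogram slices under the Itakura-Saito divergence, a perceptually motivated measure used in the related drum-separation work. It is a much-simplified stand-in for the talk's gamma mixture model training, and the arithmetic-mean centroid update is itself a simplification; all data here is synthetic.

```python
import numpy as np

def is_div(x, y):
    """Itakura-Saito divergence between linear-magnitude spectra.
    Unlike Euclidean distance, it penalizes relative (ratio) errors,
    which better matches how loudness differences are perceived."""
    r = x / y
    return float(np.sum(r - np.log(r) - 1.0))

def cluster_templates(slices, k=2, iters=20, seed=0):
    """k-means-style clustering under the IS divergence; the centroids
    serve as per-drum spectral templates. (Simplified stand-in for
    gamma mixture model training.)"""
    rng = np.random.default_rng(seed)
    centroids = slices[rng.choice(len(slices), size=k, replace=False)].copy()
    labels = np.zeros(len(slices), dtype=int)
    for _ in range(iters):
        labels = np.array([min(range(k), key=lambda j: is_div(s, centroids[j]))
                           for s in slices])
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = slices[labels == j].mean(axis=0)
    return centroids, labels

# two synthetic 'drums' with energy concentrated in different bands
rng = np.random.default_rng(1)
low = rng.gamma(2.0, 1.0, size=(20, 8)) + [8, 8, 8, 1, 1, 1, 1, 1]
high = rng.gamma(2.0, 1.0, size=(20, 8)) + [1, 1, 1, 1, 1, 8, 8, 8]
templates, labels = cluster_templates(np.vstack([low, high]))
```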
28. DRUM DETECTION SYSTEM
[System diagram repeated from slide 23]
29. DECOMPOSING ONSETS ONTO TEMPLATES
Non-negative Vector Decomposition (NVD)
A simplification of Non-negative Matrix Factorization (NMF)
W matrix contains drum templates in its columns.
Adding a sparsity penalty (L1) on h improves NVD.
Solve: min_{h ≥ 0} ‖x − W h‖² + λ‖h‖₁
[Diagram: input x = W × h, where the columns of W are drum templates (bass, snare, closed hi-hat, open hi-hat, ride) and h is the output activation vector]
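With W fixed, the objective above can be minimized with the standard multiplicative updates from NMF; the L1 sparsity term simply adds λ to the denominator. A minimal NumPy sketch, with toy templates rather than real drum spectra:

```python
import numpy as np

def nvd(x, W, sparsity=0.1, iters=500):
    """Non-negative vector decomposition: find h >= 0 minimizing
    ||x - W h||^2 + sparsity * ||h||_1, with the columns of W holding
    fixed drum templates. Multiplicative updates keep h non-negative."""
    h = np.ones(W.shape[1])
    for _ in range(iters):
        h *= (W.T @ x) / (W.T @ W @ h + sparsity + 1e-12)
    return h

# toy setup: 4 frequency bins, 3 templates in the columns of W
W = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
x = 2.0 * W[:, 0] + 0.5 * W[:, 2]   # input built from templates 0 and 2
h = nvd(x, W)
print(np.round(h, 2))  # activations concentrate on templates 0 and 2
```

Note how the sparsity penalty drives the activation of the unused template toward zero instead of smearing energy across all three.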
30. EVALUATION
Test Data:
Recorded to stereo using multiple
microphones.
8 different drums/cymbals
50-100 training hits per drum.
Vary maximum number of templates per drum.
31. DETECTION RESULTS
Varying maximum templates per drum: more templates = better accuracy.
[Plots: Detection Accuracy (F-score) and Amplitude Accuracy (Cosine Similarity) vs. templates per drum]
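For reference, the two metrics in these plots can be sketched as follows; this is a hypothetical minimal implementation, not the talk's evaluation code. The tolerance window value is an assumption.

```python
import numpy as np

def f_score(detected, reference, tol=0.05):
    """Onset F-score: a detection within `tol` seconds of a reference
    onset counts as a hit (each reference matched at most once)."""
    ref = list(reference)
    tp = 0
    for d in detected:
        for r in ref:
            if abs(d - r) <= tol:
                tp += 1
                ref.remove(r)
                break
    precision = tp / len(detected) if detected else 0.0
    recall = tp / len(reference) if reference else 0.0
    return 2 * precision * recall / (precision + recall) if tp else 0.0

def cosine_similarity(a, b):
    """Amplitude accuracy: cosine similarity between detected and
    reference amplitude vectors (1.0 = same relative dynamics)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

det = [0.50, 1.01, 1.52, 2.30]      # detected onset times (s)
ref = [0.50, 1.00, 1.50, 2.00]      # ground-truth onset times (s)
print(round(f_score(det, ref), 3))  # 3 hits, 1 false positive, 1 miss -> 0.75
```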
32. DRUM DETECTION: MAIN POINTS
Gamma Mixture Model
Cheaper to train than GMM
More stable than GMM
For learning spectral drum templates.
Keep audio spectrogram in linear domain (not log-domain)
Non-negative Vector Decomposition (NVD)
For computing template activations from drum onsets.
Learning multiple templates per drum improves source
separation and detection.
33. AUDIO EXAMPLES
100 BPM, rock, syncopated snare drum, fills/flams.
Note the messy fills in the single template version.
Original Performance
Multiple Templates per Drum
Single Template per Drum
34. AUDIO EXAMPLES
181 BPM, fast rock, open hi-hat
Note the extra hi-hat notes in the single template version.
Original Performance
Multiple Templates per Drum
Single Template per Drum
35. AUDIO EXAMPLES
94 BPM, snare drum march, accented notes
Note the extra bass drum notes and many extra cymbals in the single template version.
Original Performance
Multiple Templates per Drum
Single Template per Drum
36. SUMMARY
Content-Based Music Information Retrieval
Mood, genre, onsets, beats, transcription, recommendation, and
much, much more!
Exciting directions include deep/feature learning and RNNs.
Drum Detection
Gamma mixture model template learning.
Learning multiple templates = better source separation
performance
37. TEACHING COMPUTERS TO LISTEN TO MUSIC
Eric Battenberg
ebattenberg@gracenote.com
http://ericbattenberg.com
Gracenote, Inc.
Previously:
UC Berkeley, Dept. of EECS
Parallel Computing Laboratory
CNMAT (Center for New Music and Audio Technologies)
38. GETTING INVOLVED IN MUSIC INFORMATION
RETRIEVAL
Check out the proceedings of ISMIR (free online): http://www.ismir.net/
Participate in MIREX (annual MIR evaluation): http://www.music-ir.org/mirex/wiki/MIREX_HOME
Join the Music-IR mailing list: http://listes.ircam.fr/wws/info/music-ir
Join the Music Information Retrieval Google Plus community (just started): https://plus.google.com/communities/109771668656894350107
39. REFERENCES
Genre/Mood Classification:
[1] P. Hamel and D. Eck, “Learning features from music audio with deep belief networks,” Proc. of the
11th International Society for Music Information Retrieval Conference (ISMIR 2010), pp. 339–344,
2010.
[2] M. Henaff, K. Jarrett, K. Kavukcuoglu, and Y. LeCun, “Unsupervised learning of sparse features
for scalable audio classification,” Proceedings of International Symposium on Music Information
Retrieval (ISMIR’11), 2011.
[3] J. Anden and S. Mallat, “Deep Scattering Spectrum,” arXiv.org. 2013.
[4] E. Schmidt and Y. Kim, “Learning rhythm and melody features with deep belief networks,” ISMIR,
2013.
40. REFERENCES
Onset Detection:
[1] J. Bello, L. Daudet, S. A. Abdallah, C. Duxbury, M. Davies, and M. Sandler, “A tutorial on onset
detection in music signals,” IEEE Transactions on Speech and Audio Processing, vol. 13, no. 5, p.
1035, 2005.
[2] A. Klapuri, A. Eronen, and J. Astola, “Analysis of the meter of acoustic musical signals,” IEEE
Transactions on Speech and Audio Processing, vol. 14, no. 1, p. 342, 2006.
[3] J. Bello, C. Duxbury, M. Davies, and M. Sandler, “On the use of phase and energy for musical
onset detection in the complex domain,” IEEE Signal Processing Letters, vol. 11, no. 6, pp. 553–556,
2004.
[4] S. Böck, A. Arzt, F. Krebs, and M. Schedl, “Online realtime onset detection with recurrent neural
networks,” presented at the International Conference on Digital Audio Effects (DAFx-12), 2012.
[5] F. Eyben, S. Böck, B. Schuller, and A. Graves, “Universal onset detection with bidirectional long short-term memory neural networks,” Proc. of ISMIR, 2010.
41. REFERENCES
Neural Networks :
[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional
neural networks,” Advances in neural information processing systems, 2012.
Unsupervised Feature Learning:
[2] G. E. Hinton and S. Osindero, “A fast learning algorithm for deep belief nets,” Neural
Computation, 2006.
[3] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust
features with denoising autoencoders,” pp. 1096–1103, 2008.
Recurrent Neural Networks:
[4] I. Sutskever, “Training Recurrent Neural Networks,” 2013.
[6] D. Eck and J. Schmidhuber, “Finding temporal structure in music: Blues improvisation with LSTM
recurrent networks,” pp. 747–756, 2002.
[7] S. Böck and M. Schedl, “Polyphonic piano note transcription with recurrent neural networks,” pp. 121–124, 2012.
[8] N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent, “Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription,” arXiv preprint arXiv:1206.6392, 2012.
[9] G. Taylor and G. E. Hinton, “Two Distributed-State Models For Generating High-Dimensional Time
Series,” Journal of Machine Learning Research, 2011.
42. REFERENCES
Drum Understanding:
[1] E. Battenberg, “Techniques for Machine Understanding of Live Drum Performances,” PhD Thesis,
University of California, Berkeley, 2012.
[2] E. Battenberg and D. Wessel, “Analyzing drum patterns using conditional deep belief networks,”
presented at the International Society for Music Information Retrieval Conference, Porto, Portugal,
2012.
[3] E. Battenberg, V. Huang, and D. Wessel, “Toward live drum separation using probabilistic spectral
clustering based on the Itakura-Saito divergence,” presented at the Audio Engineering Society
Conference: 45th International Conference: Applications of Time-Frequency Processing in Audio,
2012, vol. 3.
And ultimately, transform how people experience their favorite movies, TV shows, and music.
GMM = Gaussian Mixture Model; SVM = Support Vector Machine.
Snare, bass, hi-hat is not all there is to drumming.
Feature extraction -> machine learning.
80 Bark bands down from 513 positive FFT coefficients. Work in 2008 shows that this type of spectral dimensionality reduction speeds up NMF while retaining/improving separation performance.
\vec{h}_i are template activations.
Superior Drummer 2.0 is highly multi-sampled, so it is a good approximation of a real-world signal (though it's probably a little too professional-sounding for most people).
Actual optimal number of templates per drum (bass, snare, closed hi-hat, open hi-hat, ride): head {3,4,3,2,2}, tail {2,4,3,2,2}.