All you need is AI and music
DS-GA 1011 F'21
Keunwoo Choi, 2021-12-08
Keunwoo Choi
keunwoochoi.github.io
@keunwoochoi
• 🎶 Research scientist at ByteDance/TikTok (2020 - now)

• 🎵 Research scientist at Spotify (2018 - 2020)

• 🎼 PhD program, Queen Mary University of London (2014 - 2018)

• 🔈 Acoustic research engineer, ETRI, Korea (2011 - 2014)

• 🎧 Applied Acoustics (Master's program), Seoul National Univ. (2009 - 2011)

• 🎸 EECS (BS), Seoul National Univ. (until 2009)
https://www.youtube.com/channel/UC6WGQvwwM3M7sX98zJ14XPA
Honor Code on paying attention to Keunwoo Choi’s music
As a student of "Natural Language Processing with Representation Learning", I
- listened to all the music (0:00 to the end) Keunwoo Choi uploaded on his YouTube channel,
- clicked "like" an odd-number times,
- clicked "subscribe" button an odd-number times,
- turned on the notification, and
- shared the channel and your top-30 favorite tracks.
Signature __________
Name __________
Date __________
Abstract 🍃
"..What is AI, and music AI? In this talk, we review the trend in music AI in four
categories - Analysis / Creation / Signal Synthesis / Signal Processing.
We put a special focus on Analysis; of timbre, notes, and lyrics.
Our goal is to understand what music AI researchers aim at, assume, develop, overlook,
and misunderstand."
Content
• Music AI [35 min]

• Analysis / Creation / Signal Synthesis / Signal Processing

• Analysis: [30 min]
• Timbral Understanding [15 min]

• Note-level Understanding [10 min]

• Lyric Understanding [ 5 min]
All you need is AI and music
DS-GA 1011 F'21
Keunwoo Choi, 2021-12-08
Music AI
• Machines doing something musical in response to some musical inputs (ICMC '84)
Music AI

| Input \ Output | Information | Signal |
| Signal | Analysis (e.g., genre classification, music similarity) | Audio Signal Processing (e.g., automatic mixing, source separation) |
| Information | Creation (e.g., automatic composition, lyric generation) | Audio Synthesis (e.g., singing voice generation, instrument sound synthesis) |
Music AI - Synthesis
Background: Synthesizer
[Photo of a synthesizer] Volume knob + keys to control the pitch + many knobs for timbre control → Synthesizer → THE SOUND YOU WANT
Three Components of Sound
Loudness, Pitch, and the rest (=Timbre)
• Timbre: "that attribute of auditory sensation which enables a listener to judge that two nonidentical sounds, similarly presented and having the same loudness and pitch, are dissimilar" (Acoustical Society of America)

• Sound (that we perceive) := Loudness, Pitch, and the rest (Timbre)
[Recap] Volume knob → loudness; keys → pitch; knobs → timbre; Synthesizer → the sound
Music AI - Synthesis
Background: Autoencoder
[Diagram] Input (28 x 28 = 784 pixels) → Module 1 (Encoder): compress the input in some sense → 16D → Module 2 (Decoder): decompress → Output (28 x 28 = 784 pixels)
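Below is a minimal sketch of the 784 → 16 → 784 autoencoder described above, assuming PyTorch; the hidden width (128) is illustrative.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """784-dim input -> 16-dim bottleneck -> 784-dim reconstruction."""
    def __init__(self, dim_in=784, dim_z=16):
        super().__init__()
        # Module 1 (Encoder): compress the input "in some sense"
        self.encoder = nn.Sequential(
            nn.Linear(dim_in, 128), nn.ReLU(),
            nn.Linear(128, dim_z),
        )
        # Module 2 (Decoder): decompress back to the input size
        self.decoder = nn.Sequential(
            nn.Linear(dim_z, 128), nn.ReLU(),
            nn.Linear(128, dim_in), nn.Sigmoid(),  # pixel values in [0, 1]
        )

    def forward(self, x):
        z = self.encoder(x)      # (batch, 16)
        return self.decoder(z)   # (batch, 784)

x = torch.rand(32, 784)                      # a batch of flattened 28x28 images
model = AutoEncoder()
loss = nn.functional.mse_loss(model(x), x)   # trained to reconstruct its input
```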
Music AI - Synthesis
DDSP - Differentiable Digital Signal Processing (Engel et al., 2019)
Music AI - Synthesis
DDSP Explained
[Diagram] Input sound (instrument, monophonic) → Pitch recognition (F0 estimation) + Loudness recognition → Synthesized sound (tonal: fundamental and harmonics) + Synthesized sound (noise) → Postprocessing (reverb) → Output sound
(The synthesizer analogy again: volume knob + keys to control the pitch + knobs for timbre control → Synthesizer → Sound)
• Decoder == Synthesizer
• Encoder == Listens to the input sound and figures out how to set the knobs 🎛, via (Z), to mimic the input
• Pitch and loudness have nothing to do with the core of DDSP.
• DDSP focuses on what pitch/loudness don't describe (=timbre)
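As a toy illustration of the harmonic-plus-noise idea (not the actual DDSP implementation, which predicts time-varying harmonic amplitudes and a noise filter with a neural network, end-to-end differentiably), here is a non-trainable NumPy sketch:

```python
import numpy as np

def synthesize(f0, amp, harm_weights, noise_gain, sr=16000):
    """Toy harmonic-plus-noise synth: per-sample f0 and amplitude,
    fixed harmonic weights (the 'timbre knobs')."""
    t_phase = 2 * np.pi * np.cumsum(f0 / sr)          # instantaneous phase
    tonal = sum(w * np.sin(k * t_phase)               # k-th harmonic of f0
                for k, w in enumerate(harm_weights, start=1))
    noise = noise_gain * np.random.randn(len(f0))     # (real DDSP: filtered noise)
    return amp * tonal + noise                        # (+ reverb as postprocessing)

sr = 16000
f0 = np.full(sr, 220.0)                 # one second of A3
amp = np.linspace(1.0, 0.0, sr)         # fade out
audio = synthesize(f0, amp, harm_weights=[0.6, 0.3, 0.1], noise_gain=0.01)
```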
[Diagram] The same DDSP pipeline as before: input sound (instrument, monophonic) → pitch (F0 estimation) / loudness recognition → tonal + noise synthesis → reverb / postprocessing → output sound
Music AI - Synthesis
Tone Transfer (https://magenta.tensorflow.org/tone-transfer)
• 1. Use a saxophone dataset to train a simplified DDSP.

• Now, the module became a synth saxophone that takes pitch/loudness as input

• 2. Mimic some saxophone playing with your voice, estimate the pitch and loudness, and put
it into the trained model.
[Diagram annotations: the trained module = "Saxophone Synthesizer"; the pitch/loudness recognition is connected to it during training only]
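A sketch of the inference path, assuming librosa for the pitch/loudness features; `saxophone_decoder` stands for the DDSP decoder trained on the saxophone dataset and is hypothetical here, as is the file name:

```python
import librosa
import numpy as np

# Load a voice recording that imitates a saxophone line (placeholder path).
voice, sr = librosa.load("my_voice.wav", sr=16000, mono=True)

# Pitch: frame-wise F0 estimation (pYIN); NaNs mark unvoiced frames.
f0, voiced, _ = librosa.pyin(voice, fmin=65.0, fmax=1000.0, sr=sr)
f0 = np.nan_to_num(f0)

# Loudness: frame-wise RMS energy in dB as a simple proxy.
loudness = librosa.amplitude_to_db(librosa.feature.rms(y=voice)[0])

# Feed only pitch and loudness to the decoder trained on saxophone;
# all the timbre comes from the trained model, not from the voice.
sax_audio = saxophone_decoder(f0, loudness)  # hypothetical trained module
```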
Music AI - Synthesis
Other Applications
• Synthesis

• Drums, Piano, etc.

• Singing voice synthesis, rapping synthesis

• With target voices

• At the right tempo / beat
Music AI
(The four-quadrant map again: Signal → Information = Analysis; Signal → Signal = Audio Signal Processing; Information → Information = Creation; Information → Signal = Audio Synthesis)
Music AI - Creation
..by a narrow definition
• Dumping the results ain't fun; we want to steer AI to get what we want.
• Composition model: "Genre: Jazz" → 🎵 (Jazz music)
• Chord generation: chord progression starting with "Dm7 G7" → "CM7 C7 FM7 F7 Em7 EbM7 .."
• Accompaniment generation: (some melody) → accompaniment (chords, rhythm, ..)
Music AI - Creation
Background: Language models
• Word models: "cat", "dog", "deep learning" → (0.2, 0.3), (0.23, 0.31), (-1.2, -3.2)
• Summarization: "Stock market news articles published by leading companies are read by every trader to carry out their trading activities as they provide real time and reliable information about the organization. These news articles …" → "News article can be summarized effectively using the proposed NLP model"
• Translation: "온라인으로 열리는 이 학회에서는 음악 지각 및 인지와 관련된 광범위한 주제로 여러 논문 발표가 있을 예정이다." → "This online conference will feature several papers on a wide range of topics related to music perception and cognition."
Music AI - Creation
Proposition: Similarity between Language and Music
• Language = Sequence of words (or we say so)

• We ignore all the other aspects that this definition doesn't include

• Music := Sequence of notes (we dare say so again)

• Let's also ignore timbre, lyrics, and culture / social aspects
word1 word2 word3 .. 🎵 🎶 🎶 🎵
• Text-based LSTM networks for Automatic Music Composition, Choi et al., 2016

• https://soundcloud.com/kchoi-research/sets/lstm-realbook-1-5

• Let the model "read" chord progressions. OK, now write some chords?

• DEMO: "LSTM Realbook 4.mp3"
Music AI - Creation
Language models, but with music data
Result: ..G:7(b9) C:maj C:maj A:min A:min D:min7 D:min7 G:7(b9) G:7(b9) C:maj C:maj C:maj C:maj A:min7 A:min7
A:min7 A:min7 D:9 D:9 D:9 G:7(b9) | C:maj C:maj A:min A:min | D:min7 D:min7 G:7(b9) G:7(b9) | C:maj C:maj C:maj
C:maj |A:min7 A:min7 A:min7 A:min7 | D:9 D:9 D:9 D:9 | D:9 D:9 D:9 D:9 | D:7 D:7 D:7 D:7 | D:min7 D:min7 D:min7
D:min7 | G:7 G:7 G:7 G:7 | C:maj C:maj C:maj C:maj | C:7 C:7 C:7 C:7 | F:maj F:maj F:maj F:maj | F:min F:min F:min
F:min | C:maj C:maj C:maj C:maj | C:maj C:maj C:maj C:maj D:7 D:7 D:7 D:7 G:7 G:7 G:7 G:7 C:maj C:maj C:maj C:maj
C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj G:7 G:7 G:7 G:7 G:7 G:7 G:7 G:7 C:maj
C:maj C:maj C:maj C:maj C:maj C:maj C:maj G:7 G:7 G:7 G:7 G:7 G:7 G:7 G:7 C:maj C:maj C:maj C:maj C:maj C:maj
C:maj C:maj G:7 G:7 G:7 G:7 G:7 G:7 G:7 G:7 C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj
C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj
Model only generates the chords.
I made the rest for demo purposes.
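A minimal sketch of such a chord language model, assuming PyTorch; the vocabulary and sizes are illustrative, not the ones from the paper:

```python
import torch
import torch.nn as nn

class ChordLSTM(nn.Module):
    """Next-chord prediction over a vocabulary of chord symbols."""
    def __init__(self, n_chords, dim=64):
        super().__init__()
        self.embed = nn.Embedding(n_chords, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, n_chords)

    def forward(self, tokens):
        h, _ = self.lstm(self.embed(tokens))
        return self.head(h)   # (batch, seq, n_chords): logits for the next chord

vocab = ["C:maj", "A:min", "D:min7", "G:7", "G:7(b9)", "A:min7", "D:9"]  # toy vocab
model = ChordLSTM(n_chords=len(vocab))

# Greedy generation from the seed "Dm7 G7":
seq = torch.tensor([[vocab.index("D:min7"), vocab.index("G:7")]])
for _ in range(8):
    logits = model(seq)
    seq = torch.cat([seq, logits[:, -1:].argmax(-1)], dim=1)
print([vocab[i] for i in seq[0]])   # untrained here, so the output is arbitrary
```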
Music AI - Creation
Language models, but with music data
• "Music Transformer: Generating Music with Long-Term Structure"

• https://magenta.tensorflow.org/music-transformer, Huang et al., ICLR 2019

• RNN → Transformer → Transformer with relative attention
Beyond adopting language models
• Music Transformer

• + relative attention

• MIDI-VAE: https://arxiv.org/abs/1809.07600 

• Accent, instrumentation, ..

• Pop Music Transformer: https://arxiv.org/abs/2002.00212 

• Information such as Beat / Downbeat / Bars is encoded in a "word" (see the token sketch below)
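For illustration, a hypothetical REMI-style token stream in the spirit of the Pop Music Transformer; the exact vocabulary in the paper differs:

```python
# Metric positions become "words" too, so a language model can learn
# rhythm and bar structure the way an LM learns syntax.
tokens = [
    "Bar",
    "Position_1/16", "Tempo_120", "Chord_C:maj",
    "Position_1/16", "Note-On_60", "Note-Duration_4", "Note-Velocity_20",
    "Position_5/16", "Note-On_64", "Note-Duration_4", "Note-Velocity_18",
    "Bar",
    "Position_1/16", "Chord_G:7",
]
```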
Music AI - Creation
Discussion 🔥
• Why do we make music?
• How do we consume music?
• To what extent should / could / would an AI assist human composers?
• Is it a fair use of music when a model "listens" to millions of songs?
• Who owns the copyright of an AI creation?
Music AI
(The four-quadrant map again: Signal → Information = Analysis; Signal → Signal = Audio Signal Processing; Information → Information = Creation; Information → Signal = Audio Synthesis)
Music AI - Audio Signal Processing
Background: Music Source Separation
MSS Model
Music AI - Audio Signal Processing
DEMO: Vocal Source Separation
Traditional MSS had many assumptions
- Vocals are mixed at the center.
- Percussive instruments == flat over freq axis
- The lowest pitched sound == bass
- Different instruments are at different locations (=angle and distance).
- Each instrument has a characteristic frequency energy distribution,
- which is invariant to pitch change.
- ..
Music AI - Audio Signal Processing
Recent MSS models only have high-level assumptions
- Human auditory systems are insensitive to the
absolute phase of sound.
- i) So let's not use it.
- ii) But we can still use it.
- Instruments have unique sound,
- which can be recognized within ? seconds.
- Instruments are discrete; and distinguishable.
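A sketch of the common masking recipe that follows from assumption i) ("let's not use the phase"), assuming librosa/NumPy; the file name is a placeholder, and the mask here is a dummy standing in for a neural network's prediction:

```python
import numpy as np
import librosa

# "Don't use the phase": separate with a mask on the magnitude spectrogram,
# then reuse the phase of the mixture for reconstruction.
mix, sr = librosa.load("mixture.wav", sr=None, mono=True)   # placeholder path
stft = librosa.stft(mix)
mag, phase = np.abs(stft), np.angle(stft)

# In a real model, `mask` is predicted by a neural network from `mag`;
# here a dummy all-0.5 mask stands in for it.
mask = np.full_like(mag, 0.5)

vocal_mag = mask * mag
vocal = librosa.istft(vocal_mag * np.exp(1j * phase))       # mixture phase reused
```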
Applications
• Enhance target signals

• Source separation

• Speech enhancement

• De-reverberation

• Automatic mixing, mastering, effects (DEMO; next slide - "Steerable discovery of neural audio effects", 2021, Steinmetz and Reiss)

• Voice conversion
Music AI - Audio Signal Processing
Music AI
(The four-quadrant map again: Signal → Information = Analysis; Signal → Signal = Audio Signal Processing; Information → Information = Creation; Information → Signal = Audio Synthesis)
Music AI - Analysis
MIR (Music Information Retrieval), Machine Listening:
all different kinds of classification, recognition, detection, ..
• This music is "Jazz"
• The mood of this music is "calm"
• There are drums, piano and bass
• Tempo = 75 BPM
• It's instrumental
• Intro: 0:00 - 0:45, Bridge: 0:45 - 1:27, ..
Music AI

| Input \ Output | Information | Signal |
| Signal | Analysis (e.g., genre classification, music similarity) | Audio Signal Processing (e.g., automatic mixing, source separation) |
| Information | Creation (e.g., automatic composition, lyric generation) | Audio Synthesis (e.g., singing voice generation, instrument sound synthesis) |

Focus of Analysis: 1. Timbre, 2. Notes, 3. Lyrics
Timbre Understanding
Timbre Understanding
MFCC, The Classic (1970s - Now)
• Mel-Frequency Cepstral Coefficients
→ Represent human's perceptual frequency sensing with some numbers (=a vector)
👂 Auditory modeling using some formula:
0.1 0.9 0.2 0.8 (MFCC20 of the first frame)
-1.0 1.7 0.3 0.8 (MFCC20 of the second frame)
Timbre Understanding
MFCC, The Classic (1970s - Now)

| Notes | Why? |
| Designed to be pitch-invariant | So that (speech recognition) works regardless of the pitch range of speakers |
| The first value of MFCC represents the energy of the sound and is often omitted | So that it works regardless of how loud the speech is |
| Designed for speech signals, widely used for music as well | MFCC (often) has the property we need in music analysis! (e.g., genre / mood of music remains the same even if the key / volume of music changes) |
Timbre Understanding
MFCC, The Classic (1970s - Now)
• Designed to be pitch-invariant

• Remove the loudness-related part

• Therefore, MFCC should be about timbre!
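A minimal sketch of computing such a timbre descriptor, assuming librosa; the 0th (energy) coefficient is dropped as described above:

```python
import librosa

y, sr = librosa.load(librosa.ex("trumpet"))         # any mono audio works
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)  # (20, n_frames)

# The 0th coefficient tracks overall energy (loudness), so it is often
# dropped when we want a loudness-invariant timbre descriptor.
timbre = mfcc[1:]                                   # (19, n_frames)
print(timbre[:, 0])                                 # "MFCC of the first frame"
```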
Timbre Understanding
Convolutional Neural Networks
Text / object recognition? 1. Even if the text is blurry, 2. regardless of where it is (as long as it looks so), the model should do the job.
Music genre classification? 1. Even if the volume is low, 2. regardless of the key/pitch (as long as it sounds so), the model should do the job.
Texture ↔ Texture (=timbre)
Timbre Understanding
Convolutional Neural Networks
• Unlike popular misunderstandings,

• Neural networks =/= human nerve systems

• Convnets =/= How we see

• Convnets =/= Vision
• Convnets:

• Designed to be sensitive to some aspects of the input data; while invariant to some others
(small local changes)

• They are somewhat similar to how we recognize music

• In particular, they're good at capturing timbre.
Timbre Understanding
Convolutional Neural Networks
• Automatic tagging using deep convolutional neural networks, Choi et al.,
ISMIR, 2016

• Borrowed the VGGNet for music as it is.
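A minimal VGG-ish sketch of such a model, assuming PyTorch; the depths, widths, and input size are illustrative, not the exact ISMIR 2016 architecture:

```python
import torch
import torch.nn as nn

class TaggerCNN(nn.Module):
    """Stacked 3x3 conv + 2x2 max-pool blocks over a mel-spectrogram."""
    def __init__(self, n_tags=50):
        super().__init__()
        chans = [1, 32, 64, 64]
        self.blocks = nn.Sequential(*[
            nn.Sequential(
                nn.Conv2d(cin, cout, kernel_size=3, padding=1),
                nn.BatchNorm2d(cout), nn.ReLU(), nn.MaxPool2d(2),
            )
            for cin, cout in zip(chans[:-1], chans[1:])
        ])
        self.head = nn.Linear(chans[-1], n_tags)

    def forward(self, mel):                  # mel: (batch, 1, n_mels, n_frames)
        h = self.blocks(mel)
        h = h.mean(dim=(2, 3))               # global average pool
        return torch.sigmoid(self.head(h))   # multi-label tag probabilities

probs = TaggerCNN()(torch.randn(4, 1, 96, 128))   # (4, 50)
```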
Timbre Understanding
Convolutional Neural Networks
• Music tagging
• Genre classification
• Mood recognition
• Instrument recognition
• Similarity learning
No one has declared these are about timbre understanding, but many people propose models as if they were.
The limit of this perspective has been overlooked because convnets worked so well.
Also, it's a good reminder of the importance of timbre in our musical perception.
Note-level Understanding
Note-level Understanding
F0 (Fundamental Frequency) Estimation:
- Monophonic. Voice / single instrument
Melody Extraction:
- The definition of melody is subjective
- Based on mixture music
Transcription:
- Defined by target instrument / recording environment / mono or polyphonic / ..
Note-level Understanding
Transcription research before deep learning (-2015)
• Endless tuning of models..
• Method 1 → Method 1' → 1'se → 1'se Max → 1'se Max Plus → ..
• by adding assumptions on and on (distribution of notes, property of sound, ..)
• with more complicated / specialized models
• with reported performance improved
• as a result of people focusing on improving on a certain dataset
• The practicality went up? down? 🤔
Note-level Understanding
Transcription after deep learning
• CNNs and RNNs were already there; so were transcription models based on them. They were doing well.
• Then we had a breakthrough!
• Onsets-and-frames (Hawthorne et al., 2018)
Note-level Understanding
Transcription after deep learning
[Diagram] Module 1: Onset model (duration is ignored) → Module 2 → Module 3: Frame model (to estimate duration; conditioned on the onset prediction)
1. It's beneficial to teach onsets and frames separately.
2. It helps to predict onsets first, and then frames conditioned on onsets.
3. Melspectrograms are good enough.
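A sketch of the two-head structure, assuming PyTorch; the actual Onsets-and-Frames uses conv stacks and bidirectional LSTMs, for which plain GRUs stand in here:

```python
import torch
import torch.nn as nn

class OnsetsAndFramesSketch(nn.Module):
    """Two heads over melspectrogram frames: onsets first, then frames
    conditioned on the (detached) onset predictions."""
    def __init__(self, n_mels=229, n_pitches=88, dim=128):
        super().__init__()
        self.onset_rnn = nn.GRU(n_mels, dim, batch_first=True)
        self.onset_head = nn.Linear(dim, n_pitches)
        self.frame_rnn = nn.GRU(n_mels + n_pitches, dim, batch_first=True)
        self.frame_head = nn.Linear(dim, n_pitches)

    def forward(self, mel):                        # (batch, time, n_mels)
        h_on, _ = self.onset_rnn(mel)
        onsets = torch.sigmoid(self.onset_head(h_on))
        # Condition the frame head on onset predictions (gradients stopped).
        h_fr, _ = self.frame_rnn(torch.cat([mel, onsets.detach()], dim=-1))
        frames = torch.sigmoid(self.frame_head(h_fr))
        return onsets, frames   # trained with separate BCE losses

onsets, frames = OnsetsAndFramesSketch()(torch.randn(2, 100, 229))
```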
Note-level Understanding
Transcription after deep learning
🚨 DISCLAIMER: Some numbers are probably incorrect. The metrics / datasets are complicated 🤯

| Model \ Dataset | MAPS (2010) | MAPS w/ diff config/metric | Maestro (2018) |
| FitzGerald et al. 2008 | 58 | | |
| Vincent et al. 2010 | 67 | | |
| Ewert et al. 2016 | | 95 | |
| Kelz et al. 2016 (deep learning) | 79 | 51 | 81 |
| Hawthorne et al. 2018 (better deep learning: Onsets-and-frames) | 83 | | 82 |
| Hawthorne et al. 2019 | 83 | | 86 - 95 |
| Kong et al. 2020 | | | 97 |
DEMO: Real-time Piano Transcription (Kwon et al., 2020)
1. Recording
2. Transcription
3. Result
Note-level Understanding
Next Paradigm: Analysis-and-Synthesis; jointly
• 🎹 wave2midi2wave (Hawthorne et al., 2019)

• Utilized a paired MIDI-audio dataset

• 🥁 DrummerNet (Choi and Cho, 2019)

• Audio-only; unsupervised learning

• Followed by: guitar (Wiggins and Kim, 2020) and piano (Cheuk et al., 2021,
Benetos et al., 2021)
Note-level Understanding
Note-level Understanding
DrummerNet
[Diagram] Module 1: Transcription → Synthesize drum signals (not trainable; not deep learning)
1. If the transcription works well,
2. the synthesized audio based on the transcription should be..
3. similar to the input audio!
(i.e., an autoencoder)
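The training signal as a sketch, assuming PyTorch; `transcriber` is the trainable Module 1, and the renderer is a fixed convolution with drum one-shot samples (DrummerNet compares time-frequency representations rather than raw waveforms):

```python
import torch
import torch.nn.functional as F

def render_drums(activations, one_shots):
    """A fixed (not trainable) synthesizer: convolve each drum's estimated
    onset activations with that drum's one-shot sample, then mix to mono."""
    # activations: (batch, n_drums, time); one_shots: (n_drums, length)
    n_drums, length = one_shots.shape
    kernels = one_shots.flip(-1).unsqueeze(1)        # (n_drums, 1, length)
    hits = F.conv1d(activations, kernels, padding=length - 1, groups=n_drums)
    return hits.sum(dim=1)                           # (batch, time + length - 1)

def unsupervised_loss(transcriber, mix, one_shots):
    activations = transcriber(mix)                   # Module 1: transcription
    recon = render_drums(activations, one_shots)[..., :mix.shape[-1]]
    # If the transcription is right, the re-synthesized audio ≈ the input.
    return F.l1_loss(recon, mix)
```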
Discussion: Why always piano/drums? [1/3]
• Models work well for 🎹 piano and 🥁 drums

• Piano and drums; both have great virtual instruments.

• Large datasets are all MIDI-based synthetic ones
Note-level Understanding
Discussion [2/3]
• Models don't work well for instruments with time-varying nature.

(🎷🎺 horns and woodwind, 🎻 strings)

• Reason: lack of training data. Why? Because MIDI sucks for those instruments

• The "time-varying" nature is partly inherent to the instrument, but also comes from the information players add to the score

• This is because of the limit of the information represented in "scores 📄"

• Do scores and notes really matter in pop music? Do DJs care? Electric guitarists?
Note-level Understanding
Discussion [3/3]
Note-level Understanding
• The need for special models for various instruments → not great

• Are there clear boundaries between instruments? → not always
Lyric Understanding
Lyric Alignment
• Sequence alignment is a VERY popular problem.

• Text, DNA, speech, music, ..

• Methods are waiting to be imported to music.

• If needed, vocal separation works so well.

• Public datasets are relatively small.

• Seems like it's a lot more advanced in industry, where karaoke is a thing.
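As an example of an alignment method "waiting to be imported", a generic monotonic DTW sketch in NumPy; the cost matrix (how badly each frame matches each lyric token) is a placeholder to be filled by an acoustic model:

```python
import numpy as np

def dtw_align(cost):
    """Monotonic alignment of T frames to N lyric tokens by dynamic programming.
    cost[t, n] = how badly frame t matches token n (placeholder metric)."""
    T, N = cost.shape
    acc = np.full((T, N), np.inf)
    acc[0, 0] = cost[0, 0]
    for t in range(1, T):
        for n in range(N):
            stay = acc[t - 1, n]                          # keep the same token
            advance = acc[t - 1, n - 1] if n else np.inf  # move to the next token
            acc[t, n] = cost[t, n] + min(stay, advance)
    # Backtrack: recover the token index for every frame.
    path, n = [], N - 1
    for t in range(T - 1, 0, -1):
        path.append(n)
        if n and acc[t - 1, n - 1] <= acc[t - 1, n]:
            n -= 1
    path.append(0)
    return path[::-1]

frames_to_tokens = dtw_align(np.random.rand(100, 12))  # 100 frames, 12 tokens
```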
Lyric Understanding
Lyric transcription
• Methods are all there already.

• Pre-trained speech recognition models don't work.

• We just need to train a model for this task but..

• Data 😭...😭😭😭

• Copyright 😭😭......
Conclusion
Music AI

| Input \ Output | Information | Signal |
| Signal | Analysis | Audio Signal Processing |
| Information | Creation | Audio Synthesis |

Three Components of Sound: Loudness, Pitch, and the rest (=Timbre)
Remark
• We've tried speech / language models quite enough. Let's focus on the difference!

• Unlike speech, music is polyphonic; a lot more poly-timbral.

• Compared to language, the information in the score is a lot more limited
• Music datasets are "tiny"; but that's a part of the problem we should solve.
• Unlike language / speech / images / videos,

• The music creation process is heavily, and nicely digitized.

• → While they crawl, we can synthesize (see the sketch below).
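A sketch of that synthesize-instead-of-crawl idea, assuming pretty_midi with FluidSynth installed; the MIDI path and soundfont are placeholders:

```python
import pretty_midi

# Render aligned (audio, label) training pairs from MIDI: the score gives us
# perfect note annotations, and a synthesizer gives us the matching audio.
pm = pretty_midi.PrettyMIDI("song.mid")                     # placeholder path
audio = pm.fluidsynth(fs=16000, sf2_path="soundfont.sf2")   # placeholder SF2

labels = [(note.start, note.end, note.pitch)
          for inst in pm.instruments if not inst.is_drum
          for note in inst.notes]
```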
Like music AI?
• ISMIR ♥ - International Society for Music Information Retrieval
• Creativity Workshops in NeurIPS / ICML
• ICASSP, SMC
• Lab showcase at ISMIR2021: 

https://ismir2021.ismir.net/labshowcase/
All you need is AI and music
DS-GA 1011 F'21
Keunwoo Choi, 2021-12-08
- The End -
All you need is AI and music
DS-GA 1011 F'21
Keunwoo Choi, , 2021-12-08

More Related Content

What's hot

Voice Recognition
Voice RecognitionVoice Recognition
Voice Recognition
Amrita More
 
Anvita Audio Classification Presentation
Anvita Audio Classification PresentationAnvita Audio Classification Presentation
Anvita Audio Classification Presentation
guest6e7a1b1
 

What's hot (20)

Speech Recognition Technology
Speech Recognition TechnologySpeech Recognition Technology
Speech Recognition Technology
 
Automatic Music Transcription
Automatic Music TranscriptionAutomatic Music Transcription
Automatic Music Transcription
 
Speaker Recognition
Speaker RecognitionSpeaker Recognition
Speaker Recognition
 
Speech Signal Processing
Speech Signal ProcessingSpeech Signal Processing
Speech Signal Processing
 
Deep Learning For Speech Recognition
Deep Learning For Speech RecognitionDeep Learning For Speech Recognition
Deep Learning For Speech Recognition
 
Voice Recognition
Voice RecognitionVoice Recognition
Voice Recognition
 
Anvita Audio Classification Presentation
Anvita Audio Classification PresentationAnvita Audio Classification Presentation
Anvita Audio Classification Presentation
 
Speech recognition final presentation
Speech recognition final presentationSpeech recognition final presentation
Speech recognition final presentation
 
Sentiwordnet [IIT-Bombay]
Sentiwordnet [IIT-Bombay]Sentiwordnet [IIT-Bombay]
Sentiwordnet [IIT-Bombay]
 
audio
audioaudio
audio
 
Machine Learning and Big Data for Music Discovery at Spotify
Machine Learning and Big Data for Music Discovery at SpotifyMachine Learning and Big Data for Music Discovery at Spotify
Machine Learning and Big Data for Music Discovery at Spotify
 
Speech synthesis technology
Speech synthesis technologySpeech synthesis technology
Speech synthesis technology
 
Speech Synthesis.pptx
Speech Synthesis.pptxSpeech Synthesis.pptx
Speech Synthesis.pptx
 
Lip Reading.pptx
Lip Reading.pptxLip Reading.pptx
Lip Reading.pptx
 
Vectorscope
VectorscopeVectorscope
Vectorscope
 
Audio Fundamentals
Audio Fundamentals Audio Fundamentals
Audio Fundamentals
 
Fingerprint Recognition Technique(PPT)
Fingerprint Recognition Technique(PPT)Fingerprint Recognition Technique(PPT)
Fingerprint Recognition Technique(PPT)
 
Linear Predictive Coding
Linear Predictive CodingLinear Predictive Coding
Linear Predictive Coding
 
Gabor Filtering for Fingerprint Image Enhancement
Gabor Filtering for Fingerprint Image EnhancementGabor Filtering for Fingerprint Image Enhancement
Gabor Filtering for Fingerprint Image Enhancement
 
speech processing and recognition basic in data mining
speech processing and recognition basic in  data miningspeech processing and recognition basic in  data mining
speech processing and recognition basic in data mining
 

Similar to "All you need is AI and music" by Keunwoo Choi

Alpan Aytekin-Game Audio Essentials
Alpan Aytekin-Game Audio EssentialsAlpan Aytekin-Game Audio Essentials
Alpan Aytekin-Game Audio Essentials
gamedevelopersturkey
 
Sound Effects Presentation
Sound Effects PresentationSound Effects Presentation
Sound Effects Presentation
phele1994
 
Pokemon Black and White vs N Final Battle Analysis
Pokemon Black and White vs N Final Battle AnalysisPokemon Black and White vs N Final Battle Analysis
Pokemon Black and White vs N Final Battle Analysis
Jason
 
The role of a sound designer
The role of a sound designerThe role of a sound designer
The role of a sound designer
joemountain1
 

Similar to "All you need is AI and music" by Keunwoo Choi (20)

20211026 taicca 1 intro to mir
20211026 taicca 1 intro to mir20211026 taicca 1 intro to mir
20211026 taicca 1 intro to mir
 
Alpan Aytekin-Game Audio Essentials
Alpan Aytekin-Game Audio EssentialsAlpan Aytekin-Game Audio Essentials
Alpan Aytekin-Game Audio Essentials
 
Surround Sound
Surround SoundSurround Sound
Surround Sound
 
Sound Effects Presentation
Sound Effects PresentationSound Effects Presentation
Sound Effects Presentation
 
audio-production-1231352387673755-2.ppt
audio-production-1231352387673755-2.pptaudio-production-1231352387673755-2.ppt
audio-production-1231352387673755-2.ppt
 
Тарас Терлецький "Як організувати роботу з вашим дизайнером по звуку" GameDe...
Тарас Терлецький "Як організувати роботу з вашим дизайнером по звуку"  GameDe...Тарас Терлецький "Як організувати роботу з вашим дизайнером по звуку"  GameDe...
Тарас Терлецький "Як організувати роботу з вашим дизайнером по звуку" GameDe...
 
Game Audio Post-Production
Game Audio Post-ProductionGame Audio Post-Production
Game Audio Post-Production
 
Machine Learning for Creative AI Applications in Music (2018 May)
Machine Learning for Creative AI Applications in Music (2018 May)Machine Learning for Creative AI Applications in Music (2018 May)
Machine Learning for Creative AI Applications in Music (2018 May)
 
Audio Production
Audio ProductionAudio Production
Audio Production
 
楊奕軒/音樂資料檢索
楊奕軒/音樂資料檢索楊奕軒/音樂資料檢索
楊奕軒/音樂資料檢索
 
Multi media unit-2.doc
Multi media unit-2.docMulti media unit-2.doc
Multi media unit-2.doc
 
Video Game Music Overview
Video Game Music OverviewVideo Game Music Overview
Video Game Music Overview
 
Joe P Audio Donation Fund
Joe P Audio Donation FundJoe P Audio Donation Fund
Joe P Audio Donation Fund
 
MIR
MIRMIR
MIR
 
Pokemon Black and White vs N Final Battle Analysis
Pokemon Black and White vs N Final Battle AnalysisPokemon Black and White vs N Final Battle Analysis
Pokemon Black and White vs N Final Battle Analysis
 
The role of a sound designer
The role of a sound designerThe role of a sound designer
The role of a sound designer
 
Emex 2013 Chillout Production Masterclass
Emex 2013 Chillout Production MasterclassEmex 2013 Chillout Production Masterclass
Emex 2013 Chillout Production Masterclass
 
Ism2011
Ism2011Ism2011
Ism2011
 
Adaptive Music and Interactive Audio
Adaptive Music and Interactive AudioAdaptive Music and Interactive Audio
Adaptive Music and Interactive Audio
 
Surround sount system
Surround sount systemSurround sount system
Surround sount system
 

More from Keunwoo Choi

More from Keunwoo Choi (11)

가상현실을 위한 오디오 기술
가상현실을 위한 오디오 기술가상현실을 위한 오디오 기술
가상현실을 위한 오디오 기술
 
Conditional generative model for audio
Conditional generative model for audioConditional generative model for audio
Conditional generative model for audio
 
Deep Learning with Audio Signals: Prepare, Process, Design, Expect
Deep Learning with Audio Signals: Prepare, Process, Design, ExpectDeep Learning with Audio Signals: Prepare, Process, Design, Expect
Deep Learning with Audio Signals: Prepare, Process, Design, Expect
 
Convolutional recurrent neural networks for music classification
Convolutional recurrent neural networks for music classificationConvolutional recurrent neural networks for music classification
Convolutional recurrent neural networks for music classification
 
The effects of noisy labels on deep convolutional neural networks for music t...
The effects of noisy labels on deep convolutional neural networks for music t...The effects of noisy labels on deep convolutional neural networks for music t...
The effects of noisy labels on deep convolutional neural networks for music t...
 
dl4mir tutorial at ETRI, Korea
dl4mir tutorial at ETRI, Koreadl4mir tutorial at ETRI, Korea
dl4mir tutorial at ETRI, Korea
 
Automatic Tagging using Deep Convolutional Neural Networks - ISMIR 2016
Automatic Tagging using Deep Convolutional Neural Networks - ISMIR 2016Automatic Tagging using Deep Convolutional Neural Networks - ISMIR 2016
Automatic Tagging using Deep Convolutional Neural Networks - ISMIR 2016
 
Deep Convolutional Neural Networks - Overview
Deep Convolutional Neural Networks - OverviewDeep Convolutional Neural Networks - Overview
Deep Convolutional Neural Networks - Overview
 
Deep learning for music classification, 2016-05-24
Deep learning for music classification, 2016-05-24Deep learning for music classification, 2016-05-24
Deep learning for music classification, 2016-05-24
 
딥러닝 개요 (2015-05-09 KISTEP)
딥러닝 개요 (2015-05-09 KISTEP)딥러닝 개요 (2015-05-09 KISTEP)
딥러닝 개요 (2015-05-09 KISTEP)
 
Understanding Music Playlists
Understanding Music PlaylistsUnderstanding Music Playlists
Understanding Music Playlists
 

Recently uploaded

Recently uploaded (20)

Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya HalderCustom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
 
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone KomSalesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Demystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyDemystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John Staveley
 
Powerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaPowerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara Laskowska
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
Introduction to Open Source RAG and RAG Evaluation
Introduction to Open Source RAG and RAG EvaluationIntroduction to Open Source RAG and RAG Evaluation
Introduction to Open Source RAG and RAG Evaluation
 
IESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIESVE for Early Stage Design and Planning
IESVE for Early Stage Design and Planning
 
IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Optimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityOptimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through Observability
 
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
 
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
 

"All you need is AI and music" by Keunwoo Choi

  • 1. All you need is AI and music DS-GA 1011 F'21 Keunwoo Choi, , 2021-12-08
  • 2. Keunwoo Choi keunwoochoi.github.io @keunwoochoi • 🎶 Research scientist at ByteDance/TikTok • 🎵 Research scientist at Spotify • 🎼 PhD program, Queen Mary University of London • 🔈 Acoustic research engineer, ETRI, Korea • 🎧 Applied Acoustics (Master's program), Seoul National Univ. • 🎸 EECS (Bs), Seoul National Univ. Until Now 2020 2018 2014 2011 2009
  • 3. https://www.youtube.com/channel/UC6WGQvwwM3M7sX98zJ14XPA Honor Code on paying attention to Keunwoo Choi’s music As a student of "Natural Language Processing with Representation Learning", I - listened to all the music (0:00 to the end) Keunwoo Choi uploaded on his YouTube channel, - clicked "like" an odd-number times, - clicked "subscribe" button an odd-number times, - turned on the notification, and - shared the channel and your top-30 favorite tracks. Signature __________ Name __________ Date __________
  • 4. Abstract 🍃 "..What is AI, and music AI? In this talk, we review the trend in music AI in four categories - Analysis / Creation / Signal Synthesis / Signal Processing. We put a special focus on Analysis; of timbre, notes, and lyrics. Our goal is to understand what music AI researchers aim, assume, develop, overlook, and misunderstand."
  • 5. Abstract 🍃 "..What is AI, and music AI? In this talk, we review the trend in music AI in four categories - Analysis / Creation / Signal Synthesis / Signal Processing. We put a special focus on Analysis; of timbre, notes, and lyrics. Our goal is to understand what music AI researchers aim, assume, develop, overlook, and misunderstand."
  • 6. Content • Music AI [35 min] • Analysis / Creation / Signal Synthesis / Signal Processing • Analysis: [30 min] • Timbral Understanding [15 min] • Note-level Understanding [10 min] • Lyric Understanding [ 5 min]
  • 7. Content • Music AI [35 min] • Analysis / Creation / Signal Synthesis / Signal Processing • Analysis: [30 min] • Timbral Understanding [15 min] • Note-level Understanding [10 min] • Lyric Understanding [ 5 min]
  • 8. All you need is AI and music DS-GA 1011 F'21 Keunwoo Choi, , 2021-12-08
  • 9. All you need is AI and music DS-GA 1011 F'21 Keunwoo Choi, , 2021-12-08
  • 12. Music AI • Machines doing something musical as a response of some musical inputs (ICMC '84)
  • 13. Music AI • Machines doing something musical as a response of some musical inputs (ICMC '84)
  • 14. Music AI • Machines doing something musical as a response of some musical inputs (ICMC '84)
  • 15. Music AI Input Output Information Signal Signal Analysis (e.g., genre classification, music similarity) Audio Signal Processing (e.g., automatic mixing, source separation) Information Creation (e.g., automatic composition, lyric generation) Audio Synthesis (e.g., singing voice generation, instrument sound synthesis )
  • 16. Music AI Input Output Information Signal Signal Analysis (e.g., genre classification, music similarity) Audio Signal Processing (e.g., automatic mixing, source separation) Information Creation (e.g., automatic composition, lyric generation) Audio Synthesis (e.g., singing voice generation, instrument sound synthesis )
  • 17. Music AI - Synthesis Background: Synthesizer
  • 18. Music AI - Synthesis Background: Synthesizer
  • 19. Music AI - Synthesis Background: Synthesizer Volume knob Keys to control the pitch Many knobs to control the timbre
  • 20. Music AI - Synthesis Background: Synthesizer
  • 21. Music AI - Synthesis Background: Synthesizer
  • 22. Music AI - Synthesis Background: Synthesizer
  • 23. Music AI - Synthesis Background: Synthesizer Volume knob Keys to control the pitch Many knobs for timbre control Synthesizer THE SOUND YOU WANT
  • 24. Three Components of Sound Loudness, Pitch, and the rest (=Timbre) • Timbre: "that attribute of auditory sensation which enables a listener to judge that two nonidentical sounds, similarly presented and having the same loudness and pitch, are dissimilar" (Acoustical Society of America) • Sound (that we perceive) := Loudness, Pitch, and the rest (Timbre) Volume knob Keys to pitch Knobs to timbre Synthesizer The sound
  • 25. Three Components of Sound Loudness, Pitch, and the rest (=Timbre) • Timbre: "that attribute of auditory sensation which enables a listener to judge that two nonidentical sounds, similarly presented and having the same loudness and pitch, are dissimilar" (Acoustical Society of America) • Sound (that we perceive) := Loudness, Pitch, and the rest (Timbre) Volume knob Keys to pitch Knobs to timbre Synthesizer The sound
  • 26. Autoencoder Music AI - Synthesis Background: Autoencoder Input (28 x 28 = 784pixel) Output (28 x 28 = 784 pixel) Module 1 (Encoder) Module 2 (Decoder) 16D
  • 27. Autoencoder Music AI - Synthesis Background: Autoencoder Input (28 x 28 = 784pixel) Output (28 x 28 = 784 pixel) Module 1 (Encoder) Module 2 (Decoder) 16D Compress the input in some sense Decompress
  • 28. Music AI - Synthesis DDSP - Differential Digital Signal Processing (2019, Engel et al.)
  • 29. Music AI - Synthesis DDSP Explained Input Sound (instrument, monophonic) Output Sound
  • 30. Music AI - Synthesis DDSP Explained Input Sound (instrument, monophonic) Synthesized sound (tonal) Synthesized sound (noise) Reverb Output Sound
  • 31. Music AI - Synthesis DDSP Explained Input Sound (instrument, monophonic) Synthesized sound (tonal) Synthesized sound (noise) Reverb Output Sound F0 Estimation
  • 32. Music AI - Synthesis DDSP Explained Input Sound (instrument, monophonic) Synthesized sound (tonal) Synthesized sound (noise) Reverb Output Sound F0 Estimation Pitch Recognition
  • 33. Music AI - Synthesis DDSP Explained Input Sound (instrument, monophonic) Synthesized sound (tonal) Synthesized sound (noise) Reverb Output Sound F0 Estimation Pitch Recognition Loudness Recognition
  • 34. Music AI - Synthesis DDSP Explained Input Sound (instrument, monophonic) Synthesized sound (tonal) Synthesized sound (noise) Reverb Output Sound F0 Estimation Postprocessing Pitch Recognition Loudness Recognition
  • 35. Music AI - Synthesis DDSP Explained Input Sound (instrument, monophonic) Synthesized sound (tonal) Synthesized sound (noise) Reverb Output Sound F0 Estimation Postprocessing Pitch Recognition Loudness Recognition Postprocessing
  • 36. Music AI - Synthesis DDSP Explained Input Sound (instrument, monophonic) Synthesized sound (tonal) Synthesized sound (noise) Reverb Output Sound F0 Estimation Postprocessing Pitch Recognition Loudness Recognition Postprocessing Volume knob Keys to control Knobs for timbre control Synthesizer Sound
  • 37. Music AI - Synthesis DDSP Explained Input Sound (instrument, monophonic) Synthesized sound (tonal) Synthesized sound (noise) Reverb Output Sound F0 Estimation Postprocessing Pitch Recognition Loudness Recognition Postprocessing Volume knob Keys to control Knobs for timbre control Synthesizer Sound
  • 38. Music AI - Synthesis Input Sound (Instrument, Monophonic) 음량 생성된 신호 1 (기본음, 배음) 생성된 신호 1 (잡음) 잔향 처리 • Decoder == Synthesizer • Encoder == Listens to the input sound and figure out how to set the knobs 🎛, by (Z), to mimic the input • Pitch and loudness have nothing to do with the core of DDSP. • DDSP focuses on what pitch/loudness don't describe (=timbre) 후처리 Loudness Recognition Postprocess F0 estimation Pitch Recognition Output Sound DDSP Explained timbre
  • 41. Music AI - Synthesis Tone Transfer (https://magenta.tensorflow.org/tone-transfer) [Diagram: the DDSP pipeline above - F0 estimation / pitch recognition, loudness recognition, postprocessing, generated tonal + noise signals, reverb - with the encoder connected during training only] • 1. Use a saxophone dataset to train a simplified DDSP. Now the module has become a synth saxophone that takes pitch/loudness as input. • 2. Mimic some saxophone playing with your voice, estimate the pitch and loudness, and feed them into the trained model. (A sketch of step 2 follows.)
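A minimal sketch of step 2, assuming librosa for the pitch (pYIN) and a rough RMS loudness proxy. "my_voice.wav" and `sax_model` are hypothetical placeholders for your recording and the trained module; they are not part of the Tone Transfer API.

```python
# Estimate pitch and loudness from a voice recording, then hand them to a
# trained DDSP-like saxophone model.
import numpy as np
import librosa

y, sr = librosa.load("my_voice.wav")                     # hypothetical file
f0, voiced_flag, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                                  fmax=librosa.note_to_hz("C6"), sr=sr)
f0 = np.nan_to_num(f0)                                   # unvoiced frames -> 0 Hz
loudness = librosa.feature.rms(y=y)[0]                   # per-frame loudness proxy

# sax_model was trained on saxophone audio only (encoder used during training):
# audio_out = sax_model(f0, loudness)   # -> your melody, played as a saxophone
```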
  • 42. Music AI - Synthesis
  • 49. Music AI - Synthesis Other Applications • Synthesis: drums, piano, etc. • Singing voice synthesis, rapping synthesis 🗣 • With target voices • At the right tempo / beat 👉 🎻
  • 50. Music AI
  Input \ Output | Information | Signal
  Signal | Analysis (e.g., genre classification, music similarity) | Audio Signal Processing (e.g., automatic mixing, source separation)
  Information | Creation (e.g., automatic composition, lyric generation) | Audio Synthesis (e.g., singing voice generation, instrument sound synthesis)
  • 55. Music AI - Creation (..by a narrow definition) • Just dumping model output ain't fun; we want to steer the AI to get what we want. • Composition model: "Genre: Jazz" → 🎵 (jazz music) • Chord generation: a chord progression starting with "Dm7 G7" → CM7 C7 FM7 F7 Em7 EbM7 .. • Accompaniment generation: (some melody) → accompaniment (chords, rhythm, ..)
  • 59. Background: Language models - Music AI - Creation • Word models: "cat", "dog", "deep learning" → (0.2, 0.3), (0.23, 0.31), (-1.2, -3.2) • Summarization: "Stock market news articles published by leading companies are read by every trader to carry out their trading activities, as they provide real-time and reliable information about the organization. These news articles ..." → "News articles can be summarized effectively using the proposed NLP model" • Translation: "온라인으로 열리는 이 학회에서는 음악 지각 및 인지와 관련된 광범위한 주제로 여러 논문 발표가 있을 예정이다." → "This online conference will feature several papers on a wide range of topics related to music perception and cognition."
  • 60. Music AI - Creation Proposition: the similarity between language and music • Language = a sequence of words (or so we say) • We ignore all the other aspects this definition doesn't include • Music := a sequence of notes (we dare say so again) • Let's also ignore timbre, lyrics, and cultural / social aspects. word1 word2 word3 .. ↔ 🎵 🎶 🎶 🎵
  • 61. Music AI - Creation Language models, but with music data • Text-based LSTM networks for Automatic Music Composition, Choi et al., 2016 • https://soundcloud.com/kchoi-research/sets/lstm-realbook-1-5 • Let the model "read" chord progressions. OK, now write some chords? (A sketch of such a chord language model follows.) • DEMO: "LSTM Realbook 4.mp3" • Result: ..G:7(b9) C:maj C:maj A:min A:min D:min7 D:min7 G:7(b9) G:7(b9) C:maj C:maj C:maj C:maj A:min7 A:min7 A:min7 A:min7 D:9 D:9 D:9 G:7(b9) | C:maj C:maj A:min A:min | D:min7 D:min7 G:7(b9) G:7(b9) | C:maj C:maj C:maj C:maj | A:min7 A:min7 A:min7 A:min7 | D:9 D:9 D:9 D:9 | D:9 D:9 D:9 D:9 | D:7 D:7 D:7 D:7 | D:min7 D:min7 D:min7 D:min7 | G:7 G:7 G:7 G:7 | C:maj C:maj C:maj C:maj | C:7 C:7 C:7 C:7 | F:maj F:maj F:maj F:maj | F:min F:min F:min F:min | C:maj C:maj C:maj C:maj | C:maj C:maj C:maj C:maj D:7 D:7 D:7 D:7 G:7 G:7 G:7 G:7 C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj G:7 G:7 G:7 G:7 G:7 G:7 G:7 G:7 C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj G:7 G:7 G:7 G:7 G:7 G:7 G:7 G:7 C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj G:7 G:7 G:7 G:7 G:7 G:7 G:7 G:7 C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj C:maj • The model only generates the chords; I made the rest for demo purposes.
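A minimal sketch of the idea, assuming PyTorch: treat chord symbols as "words" and sample from an LSTM next-token model. The vocabulary, sizes, and sampling loop are illustrative, not the configuration of Choi et al. 2016, and this untrained toy will of course emit random chords.

```python
# A chord "language model": an LSTM predicts the next chord symbol.
import torch
import torch.nn as nn

vocab = ["C:maj", "C:7", "D:min7", "D:7", "D:9", "G:7", "G:7(b9)",
         "A:min", "A:min7", "F:maj", "F:min", "E:min7", "Eb:maj7"]
tok2id = {c: i for i, c in enumerate(vocab)}

class ChordLSTM(nn.Module):
    def __init__(self, n_vocab, dim=64):
        super().__init__()
        self.emb = nn.Embedding(n_vocab, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, n_vocab)     # logits over the next chord

    def forward(self, ids, state=None):
        h, state = self.lstm(self.emb(ids), state)
        return self.head(h), state

model = ChordLSTM(len(vocab))
ids = torch.tensor([[tok2id["D:min7"], tok2id["G:7"]]])  # seed: "Dm7 G7"
state = None
for _ in range(8):                              # sample one chord at a time
    logits, state = model(ids[:, -1:], state) if state else model(ids)
    next_id = torch.multinomial(logits[0, -1].softmax(-1), 1)
    ids = torch.cat([ids, next_id[None]], dim=1)
print([vocab[i] for i in ids[0].tolist()])
```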
  • 63. Music AI - Creation Language models, but with music data • "Music Transformer: Generating Music with Long-Term Structure" • https://magenta.tensorflow.org/music-transformer, Huang et al., ICLR 2019 • RNN → Transformer → Transformer with relative attention
  • 65. Beyond adopting language models - Music AI - Creation • Music Transformer: + relative attention • MIDI-VAE: https://arxiv.org/abs/1809.07600 (accent, instrumentation, ..) • Pop Music Transformer: https://arxiv.org/abs/2002.00212 (information such as beat / downbeat / bar is encoded in a "word"; see the sketch below)
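A toy illustration of that last point in Python: metrical structure (bars, positions) becomes explicit tokens in the sequence. The token names and event list here are made up for illustration; the real Pop Music Transformer (REMI) vocabulary is richer.

```python
# Encoding beat/bar structure into the token stream, REMI-style.
events = [
    ("Bar", None), ("Position", "1/16"), ("Note-On", 60), ("Duration", 4),
    ("Position", "9/16"), ("Note-On", 64), ("Duration", 4),
    ("Bar", None), ("Position", "1/16"), ("Note-On", 67), ("Duration", 8),
]

def to_tokens(events):
    """Flatten (type, value) events into 'words' a language model can read."""
    return [t if v is None else f"{t}_{v}" for t, v in events]

print(to_tokens(events))
# ['Bar', 'Position_1/16', 'Note-On_60', ...] - bars and positions are
# explicit in the sequence, unlike a plain note-by-note MIDI dump.
```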
  • 71. Discussion 🔥 - Music AI - Creation • How do we consume music? • Why do we make music? • To what extent should / could / would an AI assist human composers? • Is it fair use when a model "listens" to millions of songs? • Who owns the copyright of an AI's creation?
  • 72. Music AI
  Input \ Output | Information | Signal
  Signal | Analysis (e.g., genre classification, music similarity) | Audio Signal Processing (e.g., automatic mixing, source separation)
  Information | Creation (e.g., automatic composition, lyric generation) | Audio Synthesis (e.g., singing voice generation, instrument sound synthesis)
  • 74. Music AI - Audio Signal Processing Background: Music Source Separation. [Diagram: mixed music → MSS model → separated sources]
  • 75. Music AI - Audio Signal Processing DEMO: Vocal Source Separation
  • 77. Traditional MSS had many assumptions - Music AI - Audio Signal Processing - Vocals are mixed at the center. - Percussive instruments are flat along the frequency axis. - The lowest-pitched sound is the bass. - Different instruments sit at different locations (= angle and distance). - Each instrument has a characteristic frequency-energy distribution, which is invariant to pitch changes. - .. (The first assumption alone gives a classic trick; see the sketch below.)
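A minimal sketch of the "vocals are mixed at the center" assumption in action, assuming NumPy and soundfile; "song.wav" is a hypothetical stereo file. Subtracting the channels cancels anything panned dead-center.

```python
# Classic center-channel cancellation for a rough "karaoke" track.
import numpy as np
import soundfile as sf

x, sr = sf.read("song.wav")         # x: (n_samples, 2) stereo, hypothetical
side = (x[:, 0] - x[:, 1]) / 2      # L - R: center-panned vocals cancel out
mid = (x[:, 0] + x[:, 1]) / 2       # L + R: keeps the vocals (for reference)
sf.write("karaoke.wav", side, sr)   # rough instrumental, vocals removed
```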
  • 79. Recent MSS models have only high-level assumptions - Music AI - Audio Signal Processing - Human auditory systems are insensitive to the absolute phase of sound. - i) So let's not use it. - ii) But we can still use it. - Instruments have unique sounds, which can be recognized within ? seconds. - Instruments are discrete and distinguishable. (Choice (i) is sketched below.)
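A minimal sketch of choice (i), assuming PyTorch: the model predicts a magnitude mask and the mixture's phase is simply reused at reconstruction. `mask_model` here is a dummy stand-in for a real separator network.

```python
# Magnitude masking with the mixture phase reused ("don't model phase").
import torch

def separate(mix, mask_model, n_fft=2048, hop=512):
    window = torch.hann_window(n_fft)
    spec = torch.stft(mix, n_fft, hop, window=window, return_complex=True)
    mag, phase = spec.abs(), spec.angle()
    mask = mask_model(mag)                        # (freq, time), values in [0, 1]
    est = torch.polar(mag * mask, phase)          # reuse the mixture's phase
    return torch.istft(est, n_fft, hop, window=window, length=mix.shape[-1])

mix = torch.randn(44100)                          # 1 s of fake audio
vocals = separate(mix, lambda m: torch.sigmoid(m - m.mean()))  # dummy mask
```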
  • 80. Applications - Music AI - Audio Signal Processing • Enhance target signals • Source separation • Speech enhancement • De-reverberation • Automatic mixing, mastering, effects (DEMO on the next slide - "Steerable discovery of neural audio effects", Steinmetz and Reiss, 2021) • Voice conversion
  • 83. Music AI
  Input \ Output | Information | Signal
  Signal | Analysis (e.g., genre classification, music similarity) | Audio Signal Processing (e.g., automatic mixing, source separation)
  Information | Creation (e.g., automatic composition, lyric generation) | Audio Synthesis (e.g., singing voice generation, instrument sound synthesis)
  • 86. Music AI - Analysis MIR (Music Information Retrieval), Machine Listening: all kinds of classification, recognition, detection, .. Example outputs: This music is "Jazz" / The mood of this music is "calm" / There are drums, piano, and bass / Tempo = 75 BPM / It's instrumental / Intro: 0:00 - 0:45, Bridge: 0:45 - 1:27, ..
  • 88. Music AI - the Input → Output map again, zooming into Analysis (Signal → Information): 1. Timbre 2. Notes 3. Lyrics
  • 93. Timbre Understanding MFCC, The Classic (1970s - Now) • Mel-Frequency Cepstral Coefficients → represent human perceptual frequency sensing with some numbers (= a vector) [Diagram: 👂 auditory modeling using some formula → e.g., (0.1, 0.9, 0.2, 0.8) for one sound and (-1.0, 1.7, 0.3, 0.8) for another]
  • 94. Timbre Understanding MFCC, The Classic (1970s - Now) [Scatter plot: MFCC-20 of the first frame vs. MFCC-20 of the second frame]
  • 98. Timbre Understanding MFCC, The Classic (1970s - Now)
  Notes | Why
  Designed to be pitch-invariant | So that (speech recognition) works regardless of the pitch range of speakers
  The first value of MFCC represents the energy of the sound and is often omitted | So that it works regardless of how loud the speech is
  Designed for speech signals, but widely used for music as well | MFCC (often) has the property we need in music analysis (e.g., the genre / mood of music remains the same even if the key / volume changes)
  • 99. Timbre Understanding MFCC, The Classic (1970s - Now) • Designed to be pitch-invariant • Drop the loudness-related part • Therefore, MFCC should be about timbre! (A sketch of the recipe follows.)
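A minimal sketch of that recipe, assuming librosa; "music.wav" is a hypothetical file.

```python
# Extract MFCCs and drop the loudness-related 0th coefficient.
import librosa

y, sr = librosa.load("music.wav")                   # mono audio
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)  # (20, n_frames)
timbre = mfcc[1:]    # drop coefficient 0 (overall energy) to keep "timbre"
print(timbre.shape)  # one 19-D timbre vector per frame
```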
  • 104. Timbre Understanding Convolutional Neural Networks • Text / object recognition? 1. Even if the text is blurry, 2. regardless of where it is, (as long as it looks so,) the model should do the job. • Music genre classification? 1. Even if the volume is low, 2. regardless of the key/pitch, (as long as it sounds so,) the model should do the job. • In both cases the model relies on texture - and in music, texture ≈ timbre.
  • 107. Timbre Understanding Convolutional Neural Networks • Contrary to popular misunderstandings: • Neural networks =/= human nerve systems • Convnets =/= how we see • Convnets =/= vision • Convnets are designed to be sensitive to some aspects of the input data while invariant to some others (small local changes) • They are somewhat similar to how we recognize music • In particular, they're good at capturing timbre.
  • 108. Timbre Understanding Convolutional Neural Networks • Automatic tagging using deep convolutional neural networks, Choi et al., ISMIR 2016 • Applied VGGNet to music more or less as-is. (A sketch of the idea follows.)
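A minimal sketch of a VGG-style convnet over a mel-spectrogram for music tagging, assuming PyTorch. The layer counts and sizes are illustrative, not the exact architecture of Choi et al. (ISMIR 2016).

```python
# VGG-style conv blocks over a mel-spectrogram, multi-label tag output.
import torch
import torch.nn as nn

def block(c_in, c_out):
    # conv -> batchnorm -> ReLU -> 2x2 max-pool, as in VGG-style stacks
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out), nn.ReLU(), nn.MaxPool2d(2),
    )

class Tagger(nn.Module):
    def __init__(self, n_tags=50):
        super().__init__()
        self.features = nn.Sequential(block(1, 32), block(32, 64), block(64, 128))
        self.pool = nn.AdaptiveAvgPool2d(1)      # pool over time & frequency
        self.head = nn.Linear(128, n_tags)

    def forward(self, x):                        # x: (batch, 1, mel, time)
        h = self.pool(self.features(x)).flatten(1)
        return torch.sigmoid(self.head(h))       # multi-label tag probabilities

model = Tagger()
melspec = torch.rand(4, 1, 96, 1366)             # e.g., 96 mel bins
print(model(melspec).shape)                      # (4, 50)
```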
  • 112. Timbre Understanding Convolutional Neural Networks • Music tagging • Genre classification • Mood recognition • Instrument recognition • Similarity learning • No one has declared these to be about timbre understanding, but many people propose models as if they were. • The limits of this perspective have been overlooked because convnets worked so well. • It's also a good reminder of the importance of timbre in our musical perception.
  • 117. Note-level Understanding • F0 (Fundamental Frequency) Estimation: monophonic; voice / single instrument • Melody Extraction: the definition of melody is subjective; works on mixed (full) music • Transcription: defined by target instrument / recording environment / mono- or polyphonic / ..
  • 118. Transcription research before deep learning (-2015) - Note-level Understanding • Endless tuning of models: Method 1 → Method 1' → 1'se → 1'se Max → 1'se Max Plus → .. • by adding assumptions on and on (distribution of notes, properties of sound, ..) • with ever more complicated / specialized models • with reported performance improving • as a result of people focusing on improving on a certain dataset • Did the practicality go up or down? 🤔
  • 119. Transcription after deep learning - Note-level Understanding • CNNs and RNNs were already there, and so were transcription models based on them. They were doing well. • Then we had a breakthrough: • Onsets-and-frames (Hawthorne et al., 2018)
  • 131. Note-level Understanding Transcription after deep learning. [Diagram] Module 1: onset model (duration is ignored) → onset predictions; Module 2; Module 3: frame model (estimates duration; conditioned on the onset predictions) → frame predictions. • 1. It's beneficial to teach onsets and frames separately. • 2. It helps to predict onsets first, and then frames conditioned on the onsets. • 3. Mel-spectrograms are good enough. (A sketch of the conditioning follows.)
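A minimal sketch of lessons 1-2, assuming PyTorch. The real onsets-and-frames model uses conv stacks and BiLSTMs; the linear stacks here are placeholders, but the two-head structure and the stop-gradient conditioning of frames on onset predictions follow the paper's idea.

```python
# Two heads: onsets, then frames conditioned on the (detached) onset output.
import torch
import torch.nn as nn

class OnsetsAndFrames(nn.Module):
    def __init__(self, n_mels=229, n_pitches=88, dim=256):
        super().__init__()
        self.onset_stack = nn.Sequential(
            nn.Linear(n_mels, dim), nn.ReLU(), nn.Linear(dim, n_pitches))
        self.frame_stack = nn.Sequential(
            nn.Linear(n_mels, dim), nn.ReLU(), nn.Linear(dim, n_pitches))
        # the frame head sees its own features AND the onset predictions
        self.frame_head = nn.Linear(2 * n_pitches, n_pitches)

    def forward(self, mel):                       # mel: (batch, time, n_mels)
        onset = torch.sigmoid(self.onset_stack(mel))
        frame_in = torch.cat([self.frame_stack(mel), onset.detach()], dim=-1)
        frame = torch.sigmoid(self.frame_head(frame_in))
        return onset, frame                       # both: (batch, time, 88)

model = OnsetsAndFrames()
onset, frame = model(torch.rand(2, 100, 229))
# Train with two binary cross-entropy losses, one per head.
```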
  • 134. Note-level Understanding 🚨 DISCLAIMER: Some numbers are probably incorrect. The metrics / datasets are complicated 🤯
  Model | Scores as listed (datasets: MAPS (2010) / MAPS w/ diff config/metric / Maestro (2018))
  FitzGerald et al. 2008 | 58
  Vincent et al. 2010 | 67
  Ewert et al. 2016 | 95
  Kelz et al. 2016 | 79 / 51 / 81 ← deep learning
  Hawthorne et al. 2018 | 83 / 82 ← better deep learning (onsets-and-frames)
  Hawthorne et al. 2019 | 83 / 86 - 95
  Kong et al. 2020 | 97
  • 135. DEMO: Real-time Piano Transcription (Kwon et al., 2020) 1. Recording 2. Transcription 3. Result Note-level Understanding
  • 137. Next Paradigm: Analysis-and-Synthesis, jointly - Note-level Understanding • 🎹 wave2midi2wave (Hawthorne et al., 2019): utilized a paired MIDI-audio dataset • 🥁 DrummerNet (Choi and Cho, 2019): audio-only; unsupervised learning • Followed by: guitar (Wiggins and Kim, 2020) and piano (Cheuk et al., 2021; Benetos et al., 2021)
  • 143. Note-level Understanding DrummerNet. [Diagram] Module 1: transcription → synthesize drum signals (not trainable; not deep learning). • 1. If the transcription works well, • 2. the synthesized audio based on the transcription should be.. • 3. similar to the input audio! (i.e., an autoencoder; see the sketch below)
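A minimal sketch of that training loop, assuming PyTorch. The toy transcriber and the random one-shot "drum kit" are placeholders, and the waveform MSE stands in for DrummerNet's actual spectrum-based similarity loss; the point is that only the transcriber receives gradients.

```python
# Unsupervised analysis-by-synthesis: trainable transcriber + fixed synth.
import torch
import torch.nn as nn

n_samples, n_drums, kit_len = 16000, 3, 2000
drum_kit = torch.randn(n_drums, kit_len)       # fixed one-shot drum samples

transcriber = nn.Sequential(                   # Module 1 (trainable), toy-sized
    nn.Conv1d(1, 16, 64, padding=32), nn.ReLU(),
    nn.Conv1d(16, n_drums, 64, padding=32), nn.Sigmoid(),
)

def synthesize(onsets):                        # fixed, differentiable, no params
    """Convolve onset activations with the drum one-shots."""
    kernels = drum_kit.flip(-1).unsqueeze(1)   # (n_drums, 1, kit_len)
    hits = nn.functional.conv1d(onsets, kernels, padding=kit_len - 1,
                                groups=n_drums)
    return hits.sum(dim=1, keepdim=True)       # mix the drums back together

x = torch.randn(1, 1, n_samples)               # input drum loop (placeholder)
onsets = transcriber(x)[..., :n_samples]       # (1, n_drums, time)
x_hat = synthesize(onsets)[..., :n_samples]
loss = nn.functional.mse_loss(x_hat, x)        # DrummerNet compares spectra;
loss.backward()                                # gradients only hit Module 1
```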
  • 144. Discussion: Why always piano/drums? [1/3] - Note-level Understanding • Models work well for 🎹 piano and 🥁 drums • Piano and drums both have great virtual instruments • So the large datasets are all MIDI-based synthetic ones
  • 145. Discussion [2/3] - Note-level Understanding • Models don't work well for instruments with a time-varying nature (🎷🎺 horns and woodwinds, 🎻 strings) • Reason: lack of training data. Why? Because MIDI sucks for those instruments. • The "time-varying" nature is partly inherent to the instrument, but it also comes from the information players add beyond the score • That is, it reflects the limits of the information represented in "scores 📄" • Do scores and notes really matter in pop music? Do DJs care? Electric guitarists?
  • 146. Discussion [3/3] - Note-level Understanding • Needing a special model for each instrument → not great • Are there clear boundaries between instruments? → not always
  • 149. Lyric Alignment - Lyric Understanding • Sequence alignment is a VERY popular problem: text, DNA, speech, music, .. • The methods are waiting to be imported to music (see the sketch below). • If needed, vocal separation works very well. • Public datasets are relatively small. • It seems much more advanced in industry, where karaoke is a thing.
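A minimal sketch of the classic alignment workhorse those fields share, dynamic time warping (DTW), in pure NumPy. The two toy 1-D sequences are illustrative stand-ins; real lyric alignment matches phoneme-level features against audio features.

```python
# Dynamic time warping over two toy feature sequences.
import numpy as np

def dtw(a, b):
    """Return the DTW cumulative cost matrix for 1-D sequences a and b."""
    D = np.full((len(a) + 1, len(b) + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[1:, 1:]

lyrics_feat = np.array([0.0, 1.0, 2.0, 3.0])      # e.g., a phoneme stream
audio_feat = np.array([0.0, 0.1, 1.1, 1.9, 3.0])  # e.g., frame features
print(dtw(lyrics_feat, audio_feat)[-1, -1])       # total alignment cost
```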
  • 150. Lyric transcription - Lyric Understanding • The methods are all there already. • Pre-trained speech recognition models don't work on singing. • We just need to train a model for this task, but.. • Data 😭...😭😭😭 • Copyright 😭😭......
  • 152. Music AI recap
  Input \ Output | Information | Signal
  Signal | Analysis | Audio Signal Processing
  Information | Creation | Audio Synthesis
  Three Components of Sound: Loudness, Pitch, and the rest (=Timbre)
  • 156. Remark • We've tried speech / language models quite enough; let's focus on the differences! • Unlike speech, music is polyphonic, and a lot more poly-timbral. • Compared to language, the information in a score is much more limited. • Music datasets are "tiny", but that's part of the problem we should solve. • Unlike language / speech / images / videos, the music creation process is heavily, and nicely, digitized. • → While they crawl, we can synthesize.
  • 162. Like music AI? • ISMIR ♥ - International Society for Music Information Retrieval • Creativity workshops at NeurIPS / ICML • ICASSP, SMC • Lab showcase at ISMIR 2021: https://ismir2021.ismir.net/labshowcase/
  • 163. All you need is AI and music - DS-GA 1011 F'21 - Keunwoo Choi, 2021-12-08
  • 164. - The End -