"All you need is AI and music" by Keunwoo Choi
1. All you need is AI and music
DS-GA 1011 F'21
Keunwoo Choi, 2021-12-08
2. Keunwoo Choi
keunwoochoi.github.io
@keunwoochoi
• 🎶 Research scientist at ByteDance/TikTok (2020 - now)
• 🎵 Research scientist at Spotify (2018 - 2020)
• 🎼 PhD program, Queen Mary University of London (2014 - 2018)
• 🔈 Acoustic research engineer, ETRI, Korea (2011 - 2014)
• 🎧 Applied Acoustics (Master's program), Seoul National Univ. (2009 - 2011)
• 🎸 EECS (BS), Seoul National Univ. (until 2009)
3. https://www.youtube.com/channel/UC6WGQvwwM3M7sX98zJ14XPA
Honor Code on paying attention to Keunwoo Choi’s music
As a student of "Natural Language Processing with Representation Learning", I
- listened to all the music (0:00 to the end) Keunwoo Choi uploaded on his YouTube channel,
- clicked "like" an odd number of times,
- clicked the "subscribe" button an odd number of times,
- turned on notifications, and
- shared the channel and my top-30 favorite tracks.
Signature __________
Name __________
Date __________
4. Abstract 🍃
"What is AI, and what is music AI? In this talk, we review the trends in music AI in four
categories: Analysis / Creation / Signal Synthesis / Signal Processing.
We put a special focus on Analysis: of timbre, notes, and lyrics.
Our goal is to understand what music AI researchers aim for, assume, develop, overlook,
and misunderstand."
6. Content
• Music AI [35 min]
• Analysis / Creation / Signal Synthesis / Signal Processing
• Analysis: [30 min]
• Timbral Understanding [15 min]
• Note-level Understanding [10 min]
• Lyric Understanding [ 5 min]
12. Music AI
• Machines doing something musical in response to some musical input
(ICMC '84)
15. Music AI
Input → Output:
• Signal → Information: Analysis (e.g., genre classification, music similarity)
• Signal → Signal: Audio Signal Processing (e.g., automatic mixing, source separation)
• Information → Information: Creation (e.g., automatic composition, lyric generation)
• Information → Signal: Audio Synthesis (e.g., singing voice generation, instrument sound synthesis)
23. Music AI - Synthesis
Background: Synthesizer
• Volume knob
• Keys to control the pitch
• Many knobs for timbre control
Synthesizer → THE SOUND YOU WANT
24. Three Components of Sound
Loudness, Pitch, and the rest (= Timbre)
• Timbre: "that attribute of auditory sensation which enables a listener to judge that two nonidentical sounds, similarly presented and having the same loudness and pitch, are dissimilar" (Acoustical Society of America)
• Sound (that we perceive) := Loudness, Pitch, and the rest (Timbre)
(Synthesizer analogy: volume knob → loudness, keys → pitch, knobs → timbre.)
27. Music AI - Synthesis
Background: Autoencoder
Input (28 × 28 = 784 pixels) → Module 1 (Encoder) → 16-D code → Module 2 (Decoder) → Output (28 × 28 = 784 pixels)
• Encoder: compress the input in some sense
• Decoder: decompress
(A minimal code sketch follows.)
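Below is a minimal PyTorch sketch of the autoencoder on this slide: a 784-pixel input squeezed through a 16-D bottleneck and reconstructed. The hidden width (128) and the reconstruction loss are illustrative choices, not from the talk.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(            # Module 1: compress 784 -> 16
            nn.Linear(784, 128), nn.ReLU(),
            nn.Linear(128, 16),
        )
        self.decoder = nn.Sequential(            # Module 2: decompress 16 -> 784
            nn.Linear(16, 128), nn.ReLU(),
            nn.Linear(128, 784),
        )

    def forward(self, x):
        z = self.encoder(x)                      # 16-D code: "compress the input in some sense"
        return self.decoder(z)                   # "decompress"

model = Autoencoder()
x = torch.rand(32, 784)                          # a batch of flattened 28 x 28 images
loss = nn.functional.mse_loss(model(x), x)       # trained to reproduce its own input
```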
28. Music AI - Synthesis
DDSP - Differentiable Digital Signal Processing (Engel et al., ICLR 2020)
36. Music AI - Synthesis
DDSP Explained
Input sound (instrument, monophonic)
→ Pitch recognition (F0 estimation + postprocessing) and loudness recognition (+ postprocessing)
→ Synthesizer (volume knob, keys to control pitch, knobs for timbre control)
→ Synthesized sound (tonal) + synthesized sound (noise) → Reverb → Output sound
38. Music AI - Synthesis
DDSP Explained
(Diagram, as above: input sound (instrument, monophonic) → loudness recognition / postprocess + F0 estimation / pitch recognition → generated signal 1 (fundamental + harmonics) + generated signal 2 (noise) → reverb → output sound)
• Decoder == Synthesizer
• Encoder == listens to the input sound and figures out how to set the knobs 🎛, via (Z), to mimic the input
• Pitch and loudness have nothing to do with the core of DDSP.
• DDSP focuses on what pitch/loudness don't describe (= timbre)
(A toy synthesizer sketch follows.)
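To make the decoder-as-synthesizer idea concrete, here is a toy, non-trainable harmonic-plus-noise sketch in the spirit of DDSP: a bank of sinusoids at integer multiples of F0 (the tonal path) plus noise, scaled by loudness. In DDSP proper, the per-harmonic amplitudes, F0, and noise filter are network outputs and everything is differentiable; all constants below are made up.

```python
import numpy as np

sr = 16000
t = np.arange(sr) / sr                            # 1 second of time stamps
f0 = 220.0                                        # from pitch recognition (F0 estimation)
loudness = 0.5                                    # from loudness recognition
harmonic_amps = np.array([1.0, 0.5, 0.25, 0.12])  # the "timbre knobs"; a network output in DDSP

# tonal path: sinusoids at integer multiples of F0
tonal = sum(a * np.sin(2 * np.pi * (k + 1) * f0 * t)
            for k, a in enumerate(harmonic_amps))

# noise path (DDSP shapes it with a learned filter; left unshaped here)
noise = 0.02 * np.random.randn(len(t))

output = loudness * (tonal / harmonic_amps.sum() + noise)  # mix; reverb omitted
```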
41. Music AI - Synthesis
Tone Transfer (https://magenta.tensorflow.org/tone-transfer)
• 1. Use a saxophone dataset to train a simplified DDSP. Now the module has become a saxophone synthesizer that takes pitch/loudness as input.
• 2. Mimic some saxophone playing with your voice, estimate the pitch and loudness (a sketch follows), and feed them into the trained model.
(Diagram note: connected during training only.)
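A sketch of the front end for step 2, assuming librosa's pYIN tracker and frame-wise RMS are acceptable stand-ins for the pitch and loudness features (the original DDSP implementation uses a dedicated F0 estimator and A-weighted loudness). The file name is hypothetical.

```python
import librosa

y, sr = librosa.load("my_voice.wav", sr=16000)            # hypothetical recording
f0, voiced_flag, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                                  fmax=librosa.note_to_hz("C6"), sr=sr)
loudness = librosa.feature.rms(y=y)[0]                    # frame-wise energy

# frame-wise f0 + loudness are the only controls the trained
# "saxophone synthesizer" needs; the timbre lives in the model.
```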
44. Music AI - Synthesis
Other Applications
• Synthesis
• Drums, piano, etc.
• Singing voice synthesis, rapping synthesis
• With target voices
• At the right tempo / beat
50. Music AI
Input → Output:
• Signal → Information: Analysis (e.g., genre classification, music similarity)
• Signal → Signal: Audio Signal Processing (e.g., automatic mixing, source separation)
• Information → Information: Creation (e.g., automatic composition, lyric generation)
• Information → Signal: Audio Synthesis (e.g., singing voice generation, instrument sound synthesis)
55. Music AI - Creation
..by a narrow definition
• Dumping the results ain't fun; we want to steer AI to get what we want.
• Composition model: "Genre: Jazz" → 🎵 (jazz music)
• Chord generation: "Chord progression starting with Dm7 G7" → CM7 C7 FM7 F7 Em7 EbM7 ..
• Accompaniment generation: (some melody) → accompaniment (chords, rhythm, ..)
59. Music AI - Creation
Background: Language models
• Word models: "cat", "dog", "deep learning" → (0.2, 0.3), (0.23, 0.31), (-1.2, -3.2)
• Summarization: "Stock market news articles published by leading companies are read by every trader to carry out their trading activities as they provide real-time and reliable information about the organization. These news articles .." → "News article can be summarized effectively using the proposed NLP model"
• Translation: "온라인으로 열리는 이 학회에서는 음악 지각 및 인지와 관련된 광범위한 주제로 여러 논문 발표가 있을 예정이다." → "This online conference will feature several papers on a wide range of topics related to music perception and cognition."
60. Music AI - Creation
Proposition: Similarity between Language and Music
• Language := Sequence of words (or we say so)
• We ignore all the other aspects that this definition doesn't include
• Music := Sequence of notes (we dare say so again)
• Let's also ignore timbre, lyrics, and cultural / social aspects
word1 word2 word3 .. ↔ 🎵 🎶 🎶 🎵
(A toy tokenization sketch follows.)
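If music is treated as a sequence of notes, the NLP machinery transfers almost verbatim. A toy sketch with a made-up (pitch, duration) vocabulary; the tokens and the bigram "model" are purely illustrative.

```python
from collections import Counter

melody = ["C4_q", "E4_q", "G4_h", "E4_q", "C4_q", "G4_h"]  # hypothetical note tokens

# build a vocabulary exactly as NLP does for words
vocab = {tok: i for i, tok in enumerate(sorted(set(melody)))}
ids = [vocab[tok] for tok in melody]

# n-gram counts: the simplest "language model" over notes
bigrams = Counter(zip(ids, ids[1:]))
print(vocab, bigrams)
```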
63. Music AI - Creation
Language models, but with music data
• "Music Transformer: Generating Music with Long-Term Structure"
• https://magenta.tensorflow.org/music-transformer, Huang et al., ICLR 2019
• RNN → Transformer → Transformer with relative attention
65. Music AI - Creation
Beyond adopting language models
• Music Transformer: + relative attention
• MIDI-VAE (https://arxiv.org/abs/1809.07600): accent, instrumentation, ..
• Pop Music Transformer (https://arxiv.org/abs/2002.00212): information such as beat / downbeat / bars is encoded in a "word" (see the token sketch below)
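As a concrete picture of the Pop Music Transformer idea, here is what a token stream with metrical information might look like. Token names are illustrative, loosely following the paper's REMI representation rather than copied from it.

```python
# Metrical structure (bars, positions within a bar, tempo) is serialized
# into the token stream itself, so a plain language model sees the beat grid.
tokens = [
    "Bar",                      # a downbeat starts a new bar
    "Position_1/16", "Tempo_120", "Note_Pitch_60", "Note_Duration_4",
    "Position_5/16", "Note_Pitch_64", "Note_Duration_4",
    "Position_9/16", "Note_Pitch_67", "Note_Duration_8",
    "Bar",                      # next bar
    "Position_1/16", "Note_Pitch_65", "Note_Duration_8",
]
# Any next-token model (RNN / Transformer) can now be trained on `tokens`
# exactly as on text.
```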
71. Music AI - Creation
Discussion 🔥
• Why do we make music?
• How do we consume music?
• To what extent should / could / would an AI assist human composers?
• Is it a fair use of music when a model "listens" to millions of songs?
• Who owns the copyright of AI creations?
72. Music AI
Input → Output:
• Signal → Information: Analysis (e.g., genre classification, music similarity)
• Signal → Signal: Audio Signal Processing (e.g., automatic mixing, source separation)
• Information → Information: Creation (e.g., automatic composition, lyric generation)
• Information → Signal: Audio Synthesis (e.g., singing voice generation, instrument sound synthesis)
74. Music AI - Audio Signal Processing
Background: Music Source Separation
Mixture → MSS Model → separated sources
75. Music AI - Audio Signal Processing
DEMO: Vocal Source Separation
77. Music AI - Audio Signal Processing
Traditional MSS had many assumptions
- Vocals are mixed at the center.
- Percussive instruments == flat over the frequency axis
- The lowest-pitched sound == bass
- Different instruments are at different locations (= angle and distance).
- Each instrument has a characteristic frequency energy distribution,
- which is invariant to pitch changes.
- ..
79. Music AI - Audio Signal Processing
Recent MSS models only have high-level assumptions
- Human auditory systems are insensitive to the absolute phase of sound.
- i) So let's not use it. (A masking sketch follows.)
- ii) But we can still use it.
- Instruments have unique sounds,
- which can be recognized within ? seconds.
- Instruments are discrete and distinguishable.
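Here is what "let's not use the phase" typically looks like in code: separate on the magnitude spectrogram, then recycle the mixture's phase for resynthesis. The mask below is a dummy stand-in for a trained MSS model's output, and the file name is hypothetical.

```python
import numpy as np
import librosa

y, sr = librosa.load("mixture.wav", sr=None)        # hypothetical mixture file
stft = librosa.stft(y)
mag, phase = np.abs(stft), np.angle(stft)

mask = np.clip(mag / (mag.max() + 1e-8), 0.0, 1.0)  # dummy mask in [0, 1];
                                                    # a real model predicts this from mag
vocal_mag = mask * mag                              # estimated source magnitude
vocal = librosa.istft(vocal_mag * np.exp(1j * phase))  # reuse the mixture phase
```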
80. Music AI - Audio Signal Processing
Applications
• Enhance target signals
• Source separation
• Speech enhancement
• De-reverberation
• Automatic mixing, mastering, effects (DEMO; next slide: "Steerable discovery of neural audio effects", 2021, Steinmetz and Reiss)
• Voice conversion
83. Music AI
Input → Output:
• Signal → Information: Analysis (e.g., genre classification, music similarity)
• Signal → Signal: Audio Signal Processing (e.g., automatic mixing, source separation)
• Information → Information: Creation (e.g., automatic composition, lyric generation)
• Information → Signal: Audio Synthesis (e.g., singing voice generation, instrument sound synthesis)
86. Music AI - Analysis
MIR (Music Information Retrieval), Machine Listening: all different kinds of classification, recognition, detection, ..
Example outputs:
• This music is "Jazz"
• The mood of this music is "calm"
• There are drums, piano, and bass
• Tempo = 75 BPM
• It's instrumental
• Intro: 0:00 - 0:45; Bridge: 0:45 - 1:27, ..
88. Music AI
Input → Output:
• Signal → Information: Analysis (e.g., genre classification, music similarity)
• Signal → Signal: Audio Signal Processing (e.g., automatic mixing, source separation)
• Information → Information: Creation (e.g., automatic composition, lyric generation)
• Information → Signal: Audio Synthesis
Analysis, in three parts:
1. Timbre
2. Notes
3. Lyrics
93. Timbre Understanding
MFCC, The Classic (1970s - Now)
• Mel-Frequency Cepstral Coefficients
→ Represent human perceptual frequency sensing with some numbers (= a vector)
👂 Auditory modeling using some formula:
one sound → (0.1, 0.9, 0.2, 0.8); another sound → (-1.0, 1.7, 0.3, 0.8)
94. Timbre Understanding
MFCC, The Classic (1970s - Now)
(Plot: MFCC20 of the first frame and MFCC20 of the second frame. A code sketch follows.)
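A sketch of how the plotted frames could be computed with librosa; dropping coefficient 0 (overall energy) is the loudness trick discussed on the next slides.

```python
import librosa

y, sr = librosa.load(librosa.ex("trumpet"))           # a bundled example clip
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)    # shape: (20, n_frames)

first_frame = mfcc[:, 0]      # "MFCC20 of the first frame"
second_frame = mfcc[:, 1]     # "MFCC20 of the second frame"
timbre_only = mfcc[1:, :]     # drop coefficient 0 to discard loudness
```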
98. Timbre Understanding
MFCC, The Classic (1970s - Now)
Notes → Why?
• Designed to be pitch-invariant → so that (speech recognition) works regardless of the pitch range of speakers
• The first value of MFCC represents the energy of the sound and is often omitted → so that it works regardless of how loud the speech is
• Designed for speech signals, widely used for music as well → MFCC (often) has the properties we need in music analysis! (e.g., the genre / mood of music remains the same even if the key / volume changes)
99. Timbre Understanding
MFCC, The Classic (1970s - Now)
• Designed to be pitch-invariant
• Removes the loudness-related part
• Therefore, MFCC should be about timbre!
104. Timbre Understanding
Convolutional Neural Networks
• Text / object recognition? 1. Even if the text is blurry, 2. regardless of where it is (as long as it looks so), the model should do the job. → Texture
• Music genre classification? 1. Even if the volume is low, 2. regardless of the key/pitch (as long as it sounds so), the model should do the job. → Texture (timbre)
107. Timbre Understanding
Convolutional Neural Networks
• Contrary to popular misunderstandings,
• Neural networks =/= human nerve systems
• Convnets =/= how we see
• Convnets =/= vision
• Convnets:
• Designed to be sensitive to some aspects of the input data, while invariant to some others (small local changes)
• They are somewhat similar to how we recognize music
• In particular, they're good at capturing timbre.
108. Timbre Understanding
Convolutional Neural Networks
• Automatic tagging using deep convolutional neural networks, Choi et al., ISMIR, 2016
• Borrowed VGGNet for music as-is. (A condensed sketch follows.)
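A condensed PyTorch sketch of a VGG-style tagging network on mel-spectrograms, in the spirit of the paper; layer widths, the number of blocks, and the tag count are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

def block(c_in, c_out):
    # conv + pooling: sensitive to local spectro-temporal texture,
    # increasingly invariant to where (in time / frequency) it occurs
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out), nn.ReLU(), nn.MaxPool2d(2),
    )

class Tagger(nn.Module):
    def __init__(self, n_tags=50):
        super().__init__()
        self.features = nn.Sequential(block(1, 32), block(32, 64),
                                      block(64, 128), block(128, 128))
        self.head = nn.Linear(128, n_tags)

    def forward(self, mel):                      # mel: (batch, 1, 96, frames)
        h = self.features(mel).mean(dim=(2, 3))  # global average pooling
        return torch.sigmoid(self.head(h))       # multi-label tag probabilities

tags = Tagger()(torch.rand(4, 1, 96, 256))       # dummy mel-spectrogram batch
```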
112. Timbre Understanding
Convolutional Neural Networks
• Music tagging
• Genre classification
• Mood recognition
• Instrument recognition
• Similarity learning
No one has declared these tasks are about timbre understanding, but many people propose models as if they were.
The limit of this perspective has been overlooked because convnets worked so well.
It's also a good reminder of the importance of timbre in our musical perception.
117. Note-level Understanding
• F0 (fundamental frequency) estimation: monophonic; voice / single instrument
• Melody extraction: the definition of melody is subjective; based on mixture music
• Transcription: defined by target instrument / recording environment / mono- or polyphonic / ..
118. Note-level Understanding
Transcription research before deep learning (-2015)
• Endless tuning of models..
• Method 1 → Method 1' → 1'se → 1'se Max → 1'se Max Plus → ..
• by adding assumptions on and on (distribution of notes, properties of sound, ..)
• with more complicated / specialized models
• with reported performance improving
• as a result of people focusing on improving on a certain dataset
• Did the practicality go up? down? 🤔
119. Note-level Understanding
Transcription after deep learning
• CNNs and RNNs were already there, and so were transcription models based on them. They were doing well.
• Then we had a breakthrough:
• Onsets-and-frames (Hawthorne et al., 2018)
131. Note-level Understanding
Transcription after deep learning
• Module 1: Onset model (duration is ignored) → onset predictions
• Module 2: Frame model (to estimate duration; conditioned on the onset predictions) → frame predictions
1. It's beneficial to teach onsets and frames separately.
2. It helps to predict onsets first, and then frames conditioned on onsets.
3. Mel-spectrograms are good enough.
(A simplified sketch follows.)
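A simplified sketch of the two-head structure. The real onsets-and-frames model uses conv + BiLSTM stacks; plain linear layers stand in here, but the conditioning (the frame head sees detached onset predictions) follows the idea above.

```python
import torch
import torch.nn as nn

class OnsetsAndFrames(nn.Module):
    def __init__(self, n_mels=229, n_keys=88):
        super().__init__()
        self.onset_head = nn.Linear(n_mels, n_keys)           # Module 1: onsets
        self.frame_head = nn.Linear(n_mels + n_keys, n_keys)  # Module 2: frames

    def forward(self, mel):                      # mel: (batch, time, n_mels)
        onsets = torch.sigmoid(self.onset_head(mel))
        # condition frame prediction on onsets; gradients are stopped so the
        # frame loss doesn't corrupt the onset head (as in the paper)
        frames = torch.sigmoid(
            self.frame_head(torch.cat([mel, onsets.detach()], dim=-1)))
        return onsets, frames                    # each: (batch, time, 88 keys)

onsets, frames = OnsetsAndFrames()(torch.rand(2, 100, 229))
```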
132. Note-level Understanding
🚨 DISCLAIMER: Some numbers are probably incorrect. The metrics / datasets are complicated 🤯

Model                  | MAPS (2010) | MAPS w/ diff config/metric | Maestro (2018)
FitzGerald et al. 2008 | 58          |                            |
Vincent et al. 2010    | 67          |                            |
Ewert et al. 2016      |             | 95                         |
Kelz et al. 2016       | 79          | 51                         | 81
Hawthorne et al. 2018  | 83          |                            | 82
Hawthorne et al. 2019  | 83          |                            | 86 - 95
Kong et al. 2020       |             |                            | 97

Kelz et al. 2016 onward: deep learning. Hawthorne et al. 2018 onward: better deep learning (onsets-and-frames).
135. DEMO: Real-time Piano Transcription (Kwon et al., 2020)
1. Recording
2. Transcription
3. Result
Note-level Understanding
137. Note-level Understanding
Next Paradigm: Analysis-and-Synthesis, jointly
• 🎹 wave2midi2wave (Hawthorne et al., 2019)
• Utilized a paired MIDI-audio dataset
• 🥁 DrummerNet (Choi and Cho, 2019)
• Audio-only; unsupervised learning
• Followed by: guitar (Wiggins and Kim, 2020) and piano (Cheuk et al., 2021; Benetos et al., 2021)
143. Note-level Understanding
DrummerNet
• Module 1: Transcription (trainable)
• Module 2: Synthesize drum signals (not trainable; not deep learning)
1. If the transcription works well,
2. the synthesized audio based on the transcription should be..
3. similar to the input audio! (i.e., an autoencoder)
(A simplified sketch follows.)
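A simplified sketch of the training loop, assuming that mixing fixed one-shot samples triggered by the predicted activations is an acceptable stand-in for DrummerNet's fixed synthesizer (the paper uses a U-net transcriber and a multi-resolution spectral loss; the modules below are stand-ins).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_drums, kit_len = 3, 2048
drum_kit = torch.randn(n_drums, 1, kit_len)       # fixed one-shot samples (never trained)

transcriber = nn.Sequential(                       # Module 1: trainable transcriber
    nn.Conv1d(1, 16, 64, padding=32), nn.ReLU(),
    nn.Conv1d(16, n_drums, 64, padding=32), nn.Softplus(),  # activations >= 0
)

def synthesize(act):                               # Module 2: fixed, not deep learning
    # grouped conv: each drum's activation curve triggers its own sample
    mix = F.conv1d(act, drum_kit, padding=kit_len // 2, groups=n_drums)
    return mix.sum(dim=1, keepdim=True)            # mix down to one channel

x = torch.randn(1, 1, 16000)                       # dummy input drum loop
y = synthesize(transcriber(x))[..., : x.shape[-1]] # trim conv length overshoot
loss = F.mse_loss(y, x)                            # DrummerNet uses a spectral loss here
loss.backward()                                    # only the transcriber gets gradients
```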
144. Note-level Understanding
Discussion: Why always piano/drums? [1/3]
• Models work well for 🎹 piano and 🥁 drums
• Piano and drums both have great virtual instruments.
• Large datasets are all MIDI-based synthetic ones
145. Note-level Understanding
Discussion [2/3]
• Models don't work well for instruments with a time-varying nature (🎷🎺 horns and woodwinds, 🎻 strings).
• Reason: lack of training data. Why? Because MIDI sucks for those instruments.
• The "time-varying" nature is partly inherent to the instrument, but it also comes from the information players add to the score.
• This is due to the limits of the information represented in "scores 📄".
• Do scores and notes really matter in pop music? Do DJs care? Electric guitarists?
149. Lyric Understanding
Lyric Alignment
• Sequence alignment is a VERY popular problem.
• Text, DNA, speech, music, ..
• Methods are waiting to be imported to music.
• If needed, vocal separation works very well.
• Public datasets are relatively small.
• It seems a lot more advanced in industry, where karaoke is a thing.
150. Lyric Understanding
Lyric transcription
• The methods are all there already.
• Pre-trained speech recognition models don't work.
• We just need to train a model for this task, but..
• Data 😭...😭😭😭
• Copyright 😭😭......
152. Music AI
• Signal → Information: Analysis; Signal → Signal: Audio Signal Processing
• Information → Information: Creation; Information → Signal: Audio Synthesis
Three Components of Sound: Loudness, Pitch, and the rest (= Timbre)
156. Remark
• We've tried speech / language models quite enough. Let's focus on the differences!
• Unlike speech, music is polyphonic, and a lot more poly-timbral.
• Compared to language, the information in the score is a lot more limited.
• Music datasets are "tiny", but that's part of the problem we should solve.
• Unlike language / speech / images / videos, the music creation process is heavily, and nicely, digitized.
• → While they crawl, we can synthesize.
162. Like music AI?
• ISMIR ♥ - International Society for Music Information Retrieval
• Creativity Workshops in NeurIPS / ICML
• ICASSP, SMC
• Lab showcase at ISMIR2021: https://ismir2021.ismir.net/labshowcase/
164. - The End -
All you need is AI and music
DS-GA 1011 F'21
Keunwoo Choi, 2021-12-08