"All you need is AI and music" by Keunwoo Choi
1. All you need is AI and music
DS-GA 1011 F'21
Keunwoo Choi, 2021-12-08
2. Keunwoo Choi
keunwoochoi.github.io
@keunwoochoi
• 🎶 Research scientist at ByteDance/TikTok (2020 - now)
• 🎵 Research scientist at Spotify (2018 - 2020)
• 🎼 PhD program, Queen Mary University of London (2014 - 2018)
• 🔈 Acoustic research engineer, ETRI, Korea (2011 - 2014)
• 🎧 Applied Acoustics (Master's program), Seoul National Univ. (2009 - 2011)
• 🎸 EECS (BS), Seoul National Univ. (until 2009)
3. https://www.youtube.com/channel/UC6WGQvwwM3M7sX98zJ14XPA
Honor Code on paying attention to Keunwoo Choi’s music
As a student of "Natural Language Processing with Representation Learning", I
- listened to all the music (0:00 to the end) Keunwoo Choi uploaded on his YouTube channel,
- clicked "like" an odd number of times,
- clicked the "subscribe" button an odd number of times,
- turned on notifications, and
- shared the channel and my top-30 favorite tracks.
Signature __________
Name __________
Date __________
4. Abstract 🍃
"What is AI, and what is music AI? In this talk, we review the trends in music AI in four
categories: Analysis / Creation / Signal Synthesis / Signal Processing.
We put a special focus on Analysis: of timbre, notes, and lyrics.
Our goal is to understand what music AI researchers aim for, assume, develop, overlook,
and misunderstand."
6. Content
• Music AI [35 min]
• Analysis / Creation / Signal Synthesis / Signal Processing
• Analysis: [30 min]
• Timbral Understanding [15 min]
• Note-level Understanding [10 min]
• Lyric Understanding [ 5 min]
12. Music AI
• Machines doing something musical in response to some musical input
(ICMC '84)
15. Music AI
Input → Output:
• Signal → Information: Analysis (e.g., genre classification, music similarity)
• Signal → Signal: Audio Signal Processing (e.g., automatic mixing, source separation)
• Information → Information: Creation (e.g., automatic composition, lyric generation)
• Information → Signal: Audio Synthesis (e.g., singing voice generation, instrument sound synthesis)
23. Music AI - Synthesis
Background: Synthesizer
• Volume knob
• Keys to control the pitch
• Many knobs for timbre control
Synthesizer → THE SOUND YOU WANT
24. Three Components of Sound
Loudness, Pitch, and the rest (= Timbre)
• Timbre: "that attribute of auditory sensation which enables a listener to judge that two nonidentical sounds, similarly presented and having the same loudness and pitch, are dissimilar" (Acoustical Society of America)
• Sound (that we perceive) := Loudness, Pitch, and the rest (Timbre)
(Synthesizer analogy: volume knob → loudness, keys → pitch, knobs → timbre.)
27. Music AI - Synthesis
Background: Autoencoder
Input (28 × 28 = 784 pixels) → Module 1 (Encoder) → 16-D code → Module 2 (Decoder) → Output (28 × 28 = 784 pixels)
• Encoder: compress the input in some sense
• Decoder: decompress
(A minimal code sketch follows.)
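Below is a minimal PyTorch sketch of the autoencoder on this slide: a 784-pixel input squeezed through a 16-D bottleneck and reconstructed. The hidden width (128) and the reconstruction loss are illustrative choices, not from the talk.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(            # Module 1: compress 784 -> 16
            nn.Linear(784, 128), nn.ReLU(),
            nn.Linear(128, 16),
        )
        self.decoder = nn.Sequential(            # Module 2: decompress 16 -> 784
            nn.Linear(16, 128), nn.ReLU(),
            nn.Linear(128, 784),
        )

    def forward(self, x):
        z = self.encoder(x)                      # 16-D code: "compress the input in some sense"
        return self.decoder(z)                   # "decompress"

model = Autoencoder()
x = torch.rand(32, 784)                          # a batch of flattened 28 x 28 images
loss = nn.functional.mse_loss(model(x), x)       # trained to reproduce its own input
```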
28. Music AI - Synthesis
DDSP - Differentiable Digital Signal Processing (Engel et al., ICLR 2020)
36. Music AI - Synthesis
DDSP Explained
Input sound (instrument, monophonic)
→ Pitch recognition (F0 estimation + postprocessing) and loudness recognition (+ postprocessing)
→ Synthesizer (volume knob, keys to control pitch, knobs for timbre control)
→ Synthesized sound (tonal) + synthesized sound (noise) → Reverb → Output sound
38. Music AI - Synthesis
DDSP Explained
(Diagram, as above: input sound (instrument, monophonic) → loudness recognition / postprocess + F0 estimation / pitch recognition → generated signal 1 (fundamental + harmonics) + generated signal 2 (noise) → reverb → output sound)
• Decoder == Synthesizer
• Encoder == listens to the input sound and figures out how to set the knobs 🎛, via (Z), to mimic the input
• Pitch and loudness have nothing to do with the core of DDSP.
• DDSP focuses on what pitch/loudness don't describe (= timbre)
(A toy synthesizer sketch follows.)
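To make the decoder-as-synthesizer idea concrete, here is a toy, non-trainable harmonic-plus-noise sketch in the spirit of DDSP: a bank of sinusoids at integer multiples of F0 (the tonal path) plus noise, scaled by loudness. In DDSP proper, the per-harmonic amplitudes, F0, and noise filter are network outputs and everything is differentiable; all constants below are made up.

```python
import numpy as np

sr = 16000
t = np.arange(sr) / sr                            # 1 second of time stamps
f0 = 220.0                                        # from pitch recognition (F0 estimation)
loudness = 0.5                                    # from loudness recognition
harmonic_amps = np.array([1.0, 0.5, 0.25, 0.12])  # the "timbre knobs"; a network output in DDSP

# tonal path: sinusoids at integer multiples of F0
tonal = sum(a * np.sin(2 * np.pi * (k + 1) * f0 * t)
            for k, a in enumerate(harmonic_amps))

# noise path (DDSP shapes it with a learned filter; left unshaped here)
noise = 0.02 * np.random.randn(len(t))

output = loudness * (tonal / harmonic_amps.sum() + noise)  # mix; reverb omitted
```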
41. Music AI - Synthesis
Tone Transfer (https://magenta.tensorflow.org/tone-transfer)
• 1. Use a saxophone dataset to train a simplified DDSP. Now the module has become a saxophone synthesizer that takes pitch/loudness as input.
• 2. Mimic some saxophone playing with your voice, estimate the pitch and loudness (a sketch follows), and feed them into the trained model.
(Diagram note: connected during training only.)
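A sketch of the front end for step 2, assuming librosa's pYIN tracker and frame-wise RMS are acceptable stand-ins for the pitch and loudness features (the original DDSP implementation uses a dedicated F0 estimator and A-weighted loudness). The file name is hypothetical.

```python
import librosa

y, sr = librosa.load("my_voice.wav", sr=16000)            # hypothetical recording
f0, voiced_flag, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                                  fmax=librosa.note_to_hz("C6"), sr=sr)
loudness = librosa.feature.rms(y=y)[0]                    # frame-wise energy

# frame-wise f0 + loudness are the only controls the trained
# "saxophone synthesizer" needs; the timbre lives in the model.
```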
44. Music AI - Synthesis
Other Applications
• Synthesis
• Drums, piano, etc.
• Singing voice synthesis, rapping synthesis
• With target voices
• At the right tempo / beat
50. Music AI
Input → Output:
• Signal → Information: Analysis (e.g., genre classification, music similarity)
• Signal → Signal: Audio Signal Processing (e.g., automatic mixing, source separation)
• Information → Information: Creation (e.g., automatic composition, lyric generation)
• Information → Signal: Audio Synthesis (e.g., singing voice generation, instrument sound synthesis)
55. Music AI - Creation
..by a narrow definition
• Dumping the results ain't fun; we want to steer AI to get what we want.
• Composition model: "Genre: Jazz" → 🎵 (jazz music)
• Chord generation: "Chord progression starting with Dm7 G7" → CM7 C7 FM7 F7 Em7 EbM7 ..
• Accompaniment generation: (some melody) → accompaniment (chords, rhythm, ..)
59. Music AI - Creation
Background: Language models
• Word models: "cat", "dog", "deep learning" → (0.2, 0.3), (0.23, 0.31), (-1.2, -3.2)
• Summarization: "Stock market news articles published by leading companies are read by every trader to carry out their trading activities as they provide real-time and reliable information about the organization. These news articles .." → "News article can be summarized effectively using the proposed NLP model"
• Translation: "온라인으로 열리는 이 학회에서는 음악 지각 및 인지와 관련된 광범위한 주제로 여러 논문 발표가 있을 예정이다." → "This online conference will feature several papers on a wide range of topics related to music perception and cognition."
60. Music AI - Creation
Proposition: Similarity between Language and Music
• Language := Sequence of words (or we say so)
• We ignore all the other aspects that this definition doesn't include
• Music := Sequence of notes (we dare say so again)
• Let's also ignore timbre, lyrics, and cultural / social aspects
word1 word2 word3 .. ↔ 🎵 🎶 🎶 🎵
(A toy tokenization sketch follows.)
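If music is treated as a sequence of notes, the NLP machinery transfers almost verbatim. A toy sketch with a made-up (pitch, duration) vocabulary; the tokens and the bigram "model" are purely illustrative.

```python
from collections import Counter

melody = ["C4_q", "E4_q", "G4_h", "E4_q", "C4_q", "G4_h"]  # hypothetical note tokens

# build a vocabulary exactly as NLP does for words
vocab = {tok: i for i, tok in enumerate(sorted(set(melody)))}
ids = [vocab[tok] for tok in melody]

# n-gram counts: the simplest "language model" over notes
bigrams = Counter(zip(ids, ids[1:]))
print(vocab, bigrams)
```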
63. Music AI - Creation
Language models, but with music data
• "Music Transformer: Generating Music with Long-Term Structure"
• https://magenta.tensorflow.org/music-transformer, Huang et al., ICLR 2019
• RNN → Transformer → Transformer with relative attention
65. Music AI - Creation
Beyond adopting language models
• Music Transformer: + relative attention
• MIDI-VAE (https://arxiv.org/abs/1809.07600): accent, instrumentation, ..
• Pop Music Transformer (https://arxiv.org/abs/2002.00212): information such as beat / downbeat / bars is encoded in a "word" (see the token sketch below)
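As a concrete picture of the Pop Music Transformer idea, here is what a token stream with metrical information might look like. Token names are illustrative, loosely following the paper's REMI representation rather than copied from it.

```python
# Metrical structure (bars, positions within a bar, tempo) is serialized
# into the token stream itself, so a plain language model sees the beat grid.
tokens = [
    "Bar",                      # a downbeat starts a new bar
    "Position_1/16", "Tempo_120", "Note_Pitch_60", "Note_Duration_4",
    "Position_5/16", "Note_Pitch_64", "Note_Duration_4",
    "Position_9/16", "Note_Pitch_67", "Note_Duration_8",
    "Bar",                      # next bar
    "Position_1/16", "Note_Pitch_65", "Note_Duration_8",
]
# Any next-token model (RNN / Transformer) can now be trained on `tokens`
# exactly as on text.
```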
71. Music AI - Creation
Discussion 🔥
• Why do we make music?
• How do we consume music?
• To what extent should / could / would an AI assist human composers?
• Is it a fair use of music when a model "listens" to millions of songs?
• Who owns the copyright of AI creations?
72. Music AI
Input → Output:
• Signal → Information: Analysis (e.g., genre classification, music similarity)
• Signal → Signal: Audio Signal Processing (e.g., automatic mixing, source separation)
• Information → Information: Creation (e.g., automatic composition, lyric generation)
• Information → Signal: Audio Synthesis (e.g., singing voice generation, instrument sound synthesis)
74. Music AI - Audio Signal Processing
Background: Music Source Separation
Mixture → MSS Model → separated sources
75. Music AI - Audio Signal Processing
DEMO: Vocal Source Separation
77. Music AI - Audio Signal Processing
Traditional MSS had many assumptions
- Vocals are mixed at the center.
- Percussive instruments == flat over the frequency axis
- The lowest-pitched sound == bass
- Different instruments are at different locations (= angle and distance).
- Each instrument has a characteristic frequency energy distribution,
- which is invariant to pitch changes.
- ..
79. Music AI - Audio Signal Processing
Recent MSS models only have high-level assumptions
- Human auditory systems are insensitive to the absolute phase of sound.
- i) So let's not use it. (A masking sketch follows.)
- ii) But we can still use it.
- Instruments have unique sounds,
- which can be recognized within ? seconds.
- Instruments are discrete and distinguishable.
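Here is what "let's not use the phase" typically looks like in code: separate on the magnitude spectrogram, then recycle the mixture's phase for resynthesis. The mask below is a dummy stand-in for a trained MSS model's output, and the file name is hypothetical.

```python
import numpy as np
import librosa

y, sr = librosa.load("mixture.wav", sr=None)        # hypothetical mixture file
stft = librosa.stft(y)
mag, phase = np.abs(stft), np.angle(stft)

mask = np.clip(mag / (mag.max() + 1e-8), 0.0, 1.0)  # dummy mask in [0, 1];
                                                    # a real model predicts this from mag
vocal_mag = mask * mag                              # estimated source magnitude
vocal = librosa.istft(vocal_mag * np.exp(1j * phase))  # reuse the mixture phase
```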
80. Music AI - Audio Signal Processing
Applications
• Enhance target signals
• Source separation
• Speech enhancement
• De-reverberation
• Automatic mixing, mastering, effects (DEMO; next slide: "Steerable discovery of neural audio effects", 2021, Steinmetz and Reiss)
• Voice conversion
83. Music AI
Input → Output:
• Signal → Information: Analysis (e.g., genre classification, music similarity)
• Signal → Signal: Audio Signal Processing (e.g., automatic mixing, source separation)
• Information → Information: Creation (e.g., automatic composition, lyric generation)
• Information → Signal: Audio Synthesis (e.g., singing voice generation, instrument sound synthesis)
86. Music AI - Analysis
MIR (Music Information Retrieval), Machine Listening: all different kinds of classification, recognition, detection, ..
Example outputs:
• This music is "Jazz"
• The mood of this music is "calm"
• There are drums, piano, and bass
• Tempo = 75 BPM
• It's instrumental
• Intro: 0:00 - 0:45; Bridge: 0:45 - 1:27, ..
88. Music AI
Input → Output:
• Signal → Information: Analysis (e.g., genre classification, music similarity)
• Signal → Signal: Audio Signal Processing (e.g., automatic mixing, source separation)
• Information → Information: Creation (e.g., automatic composition, lyric generation)
• Information → Signal: Audio Synthesis
Analysis, in three parts:
1. Timbre
2. Notes
3. Lyrics
93. Timbre Understanding
MFCC, The Classic (1970s - Now)
• Mel-Frequency Cepstral Coefficients
→ Represent human perceptual frequency sensing with some numbers (= a vector)
👂 Auditory modeling using some formula:
one sound → (0.1, 0.9, 0.2, 0.8); another sound → (-1.0, 1.7, 0.3, 0.8)
94. Timbre Understanding
MFCC, The Classic (1970s - Now)
(Plot: MFCC20 of the first frame and MFCC20 of the second frame. A code sketch follows.)
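A sketch of how the plotted frames could be computed with librosa; dropping coefficient 0 (overall energy) is the loudness trick discussed on the next slides.

```python
import librosa

y, sr = librosa.load(librosa.ex("trumpet"))           # a bundled example clip
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)    # shape: (20, n_frames)

first_frame = mfcc[:, 0]      # "MFCC20 of the first frame"
second_frame = mfcc[:, 1]     # "MFCC20 of the second frame"
timbre_only = mfcc[1:, :]     # drop coefficient 0 to discard loudness
```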
98. Timbre Understanding
MFCC, The Classic (1970s - Now)
Notes → Why?
• Designed to be pitch-invariant → so that (speech recognition) works regardless of the pitch range of speakers
• The first value of MFCC represents the energy of the sound and is often omitted → so that it works regardless of how loud the speech is
• Designed for speech signals, widely used for music as well → MFCC (often) has the properties we need in music analysis! (e.g., the genre / mood of music remains the same even if the key / volume changes)
99. Timbre Understanding
MFCC, The Classic (1970s - Now)
• Designed to be pitch-invariant
• Removes the loudness-related part
• Therefore, MFCC should be about timbre!
104. Timbre Understanding
Convolutional Neural Networks
• Text / object recognition? 1. Even if the text is blurry, 2. regardless of where it is (as long as it looks so), the model should do the job. → Texture
• Music genre classification? 1. Even if the volume is low, 2. regardless of the key/pitch (as long as it sounds so), the model should do the job. → Texture (timbre)
107. Timbre Understanding
Convolutional Neural Networks
• Contrary to popular misunderstandings,
• Neural networks =/= human nerve systems
• Convnets =/= how we see
• Convnets =/= vision
• Convnets:
• Designed to be sensitive to some aspects of the input data, while invariant to some others (small local changes)
• They are somewhat similar to how we recognize music
• In particular, they're good at capturing timbre.
108. Timbre Understanding
Convolutional Neural Networks
• Automatic tagging using deep convolutional neural networks, Choi et al., ISMIR, 2016
• Borrowed VGGNet for music as-is. (A condensed sketch follows.)
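A condensed PyTorch sketch of a VGG-style tagging network on mel-spectrograms, in the spirit of the paper; layer widths, the number of blocks, and the tag count are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

def block(c_in, c_out):
    # conv + pooling: sensitive to local spectro-temporal texture,
    # increasingly invariant to where (in time / frequency) it occurs
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out), nn.ReLU(), nn.MaxPool2d(2),
    )

class Tagger(nn.Module):
    def __init__(self, n_tags=50):
        super().__init__()
        self.features = nn.Sequential(block(1, 32), block(32, 64),
                                      block(64, 128), block(128, 128))
        self.head = nn.Linear(128, n_tags)

    def forward(self, mel):                      # mel: (batch, 1, 96, frames)
        h = self.features(mel).mean(dim=(2, 3))  # global average pooling
        return torch.sigmoid(self.head(h))       # multi-label tag probabilities

tags = Tagger()(torch.rand(4, 1, 96, 256))       # dummy mel-spectrogram batch
```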
112. Timbre Understanding
Convolutional Neural Networks
• Music tagging
• Genre classification
• Mood recognition
• Instrument recognition
• Similarity learning
No one has declared these tasks are about timbre understanding, but many people propose models as if they were.
The limit of this perspective has been overlooked because convnets worked so well.
It's also a good reminder of the importance of timbre in our musical perception.
117. Note-level Understanding
• F0 (fundamental frequency) estimation: monophonic; voice / single instrument
• Melody extraction: the definition of melody is subjective; based on mixture music
• Transcription: defined by target instrument / recording environment / mono- or polyphonic / ..
118. Note-level Understanding
Transcription research before deep learning (-2015)
• Endless tuning of models..
• Method 1 → Method 1' → 1'se → 1'se Max → 1'se Max Plus → ..
• by adding assumptions on and on (distribution of notes, properties of sound, ..)
• with more complicated / specialized models
• with reported performance improving
• as a result of people focusing on improving on a certain dataset
• Did the practicality go up? down? 🤔
119. Note-level Understanding
Transcription after deep learning
• CNNs and RNNs were already there, and so were transcription models based on them. They were doing well.
• Then we had a breakthrough:
• Onsets-and-frames (Hawthorne et al., 2018)
131. Note-level Understanding
Transcription after deep learning
• Module 1: Onset model (duration is ignored) → onset predictions
• Module 2: Frame model (to estimate duration; conditioned on the onset predictions) → frame predictions
1. It's beneficial to teach onsets and frames separately.
2. It helps to predict onsets first, and then frames conditioned on onsets.
3. Mel-spectrograms are good enough.
(A simplified sketch follows.)
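A simplified sketch of the two-head structure. The real onsets-and-frames model uses conv + BiLSTM stacks; plain linear layers stand in here, but the conditioning (the frame head sees detached onset predictions) follows the idea above.

```python
import torch
import torch.nn as nn

class OnsetsAndFrames(nn.Module):
    def __init__(self, n_mels=229, n_keys=88):
        super().__init__()
        self.onset_head = nn.Linear(n_mels, n_keys)           # Module 1: onsets
        self.frame_head = nn.Linear(n_mels + n_keys, n_keys)  # Module 2: frames

    def forward(self, mel):                      # mel: (batch, time, n_mels)
        onsets = torch.sigmoid(self.onset_head(mel))
        # condition frame prediction on onsets; gradients are stopped so the
        # frame loss doesn't corrupt the onset head (as in the paper)
        frames = torch.sigmoid(
            self.frame_head(torch.cat([mel, onsets.detach()], dim=-1)))
        return onsets, frames                    # each: (batch, time, 88 keys)

onsets, frames = OnsetsAndFrames()(torch.rand(2, 100, 229))
```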
132. Note-level Understanding
🚨 DISCLAIMER: Some numbers are probably incorrect. The metrics / datasets are complicated 🤯

Model                  | MAPS (2010) | MAPS w/ diff config/metric | Maestro (2018)
FitzGerald et al. 2008 | 58          |                            |
Vincent et al. 2010    | 67          |                            |
Ewert et al. 2016      |             | 95                         |
Kelz et al. 2016       | 79          | 51                         | 81
Hawthorne et al. 2018  | 83          |                            | 82
Hawthorne et al. 2019  | 83          |                            | 86 - 95
Kong et al. 2020       |             |                            | 97

Kelz et al. 2016 onward: deep learning. Hawthorne et al. 2018 onward: better deep learning (onsets-and-frames).
135. DEMO: Real-time Piano Transcription (Kwon et al., 2020)
1. Recording
2. Transcription
3. Result
Note-level Understanding
137. Note-level Understanding
Next Paradigm: Analysis-and-Synthesis, jointly
• 🎹 wave2midi2wave (Hawthorne et al., 2019)
• Utilized a paired MIDI-audio dataset
• 🥁 DrummerNet (Choi and Cho, 2019)
• Audio-only; unsupervised learning
• Followed by: guitar (Wiggins and Kim, 2020) and piano (Cheuk et al., 2021; Benetos et al., 2021)
143. Note-level Understanding
DrummerNet
• Module 1: Transcription (trainable)
• Module 2: Synthesize drum signals (not trainable; not deep learning)
1. If the transcription works well,
2. the synthesized audio based on the transcription should be..
3. similar to the input audio! (i.e., an autoencoder)
(A simplified sketch follows.)
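A simplified sketch of the training loop, assuming that mixing fixed one-shot samples triggered by the predicted activations is an acceptable stand-in for DrummerNet's fixed synthesizer (the paper uses a U-net transcriber and a multi-resolution spectral loss; the modules below are stand-ins).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_drums, kit_len = 3, 2048
drum_kit = torch.randn(n_drums, 1, kit_len)       # fixed one-shot samples (never trained)

transcriber = nn.Sequential(                       # Module 1: trainable transcriber
    nn.Conv1d(1, 16, 64, padding=32), nn.ReLU(),
    nn.Conv1d(16, n_drums, 64, padding=32), nn.Softplus(),  # activations >= 0
)

def synthesize(act):                               # Module 2: fixed, not deep learning
    # grouped conv: each drum's activation curve triggers its own sample
    mix = F.conv1d(act, drum_kit, padding=kit_len // 2, groups=n_drums)
    return mix.sum(dim=1, keepdim=True)            # mix down to one channel

x = torch.randn(1, 1, 16000)                       # dummy input drum loop
y = synthesize(transcriber(x))[..., : x.shape[-1]] # trim conv length overshoot
loss = F.mse_loss(y, x)                            # DrummerNet uses a spectral loss here
loss.backward()                                    # only the transcriber gets gradients
```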
144. Note-level Understanding
Discussion: Why always piano/drums? [1/3]
• Models work well for 🎹 piano and 🥁 drums
• Piano and drums both have great virtual instruments.
• Large datasets are all MIDI-based synthetic ones
145. Note-level Understanding
Discussion [2/3]
• Models don't work well for instruments with a time-varying nature (🎷🎺 horns and woodwinds, 🎻 strings).
• Reason: lack of training data. Why? Because MIDI sucks for those instruments.
• The "time-varying" nature is partly inherent to the instrument, but it also comes from the information players add to the score.
• This is due to the limits of the information represented in "scores 📄".
• Do scores and notes really matter in pop music? Do DJs care? Electric guitarists?
149. Lyric Understanding
Lyric Alignment
• Sequence alignment is a VERY popular problem.
• Text, DNA, speech, music, ..
• Methods are waiting to be imported to music.
• If needed, vocal separation works very well.
• Public datasets are relatively small.
• It seems a lot more advanced in industry, where karaoke is a thing.
150. Lyric Understanding
Lyric transcription
• The methods are all there already.
• Pre-trained speech recognition models don't work.
• We just need to train a model for this task, but..
• Data 😭...😭😭😭
• Copyright 😭😭......
152. Music AI
• Signal → Information: Analysis; Signal → Signal: Audio Signal Processing
• Information → Information: Creation; Information → Signal: Audio Synthesis
Three Components of Sound: Loudness, Pitch, and the rest (= Timbre)
156. Remark
• We've tried speech / language models quite enough. Let's focus on the differences!
• Unlike speech, music is polyphonic, and a lot more poly-timbral.
• Compared to language, the information in the score is a lot more limited.
• Music datasets are "tiny", but that's part of the problem we should solve.
• Unlike language / speech / images / videos, the music creation process is heavily, and nicely, digitized.
• → While they crawl, we can synthesize.
162. Like music AI?
• ISMIR ♥ - International Society for Music Information Retrieval
• Creativity Workshops in NeurIPS / ICML
• ICASSP, SMC
• Lab showcase at ISMIR2021: https://ismir2021.ismir.net/labshowcase/
164. - The End -
All you need is AI and music
DS-GA 1011 F'21
Keunwoo Choi, 2021-12-08