20211026 taicca 1 intro to mir

http://mac.citi.sinica.edu.tw/~yang/
yhyang@ailabs.tw
yang@citi.sinica.edu.tw
Yi-Hsuan Yang Ph.D. 1,2
1 Taiwan AI Labs
2 Research Center for IT Innovation, Academia Sinica
October 26, 2021

About Me
• Ph.D. National Taiwan University 2010
• Research Professor, Music & AI Lab, Academia Sinica, since 2011
• Chief Music Scientist, Taiwan AI Labs, 2019/3‒2023/2
• Over 200 publications

About the Music and AI Lab @ Sinica
About Academia Sinica
• National academy of Taiwan, founded in 1928
• About 1,000 Full/Associate/Assistant Researchers
About Music and AI Lab (musicai)
• Since Sep 2011
• Members
‒ PI [me]
‒ research assistants
‒ PhD/master students

About the Music AI Team @ Taiwan AI Labs
About Taiwan AI Labs
• Privately-funded research organization, founded by Ethan Tu (PTT) in 2017
• Three main research areas: 1) HCI, 2) medicine, 3) smart city
About the Music AI team
• Members
‒ scientist [me; since March 2019]
‒ ML engineers (for models)
‒ musicians
‒ program manager
‒ software engineers (for frontend/backend)
(an image of our musicians)

Outline
• Types of music related research/products
• Fundamentals of music signal processing
• Types of data

Types of Music Related Research/Products
• Intelligent ways to analyze, retrieve, and create music
1. Music information analysis: music → features
2. Music information retrieval: query → music
3. Music generation: X → music

1. Music information “analysis”
• automatic page turner
• automatic Karaoke scoring
• interactive concert

1. Music information “analysis”
• chord recognizer
• music browsing assistant

2. Music information “retrieval”
• Search
‒ through keywords/labels (genre, instrument, emotion)

• Search
‒ musical event localization
J.-Y. Liu and Y.-H. Yang, "Event localization in music auto-tagging," MM 2016

• Search
‒ through audio examples (humming, audio recording)

• Match
‒ to match 1) a video clip, 2) a photo slideshow, 3) song lyrics, or 4) a given context
‒ cross-domain retrieval

• Discover
‒ recommendation: diversity, serendipity, explanations

• Discover
‒ Context-aware Music Recommendation
(diagram: Context, Music, User)
Context
• Activity: driving, studying, working, walking
• Mood: happy, sad, angry, relaxed
• Location: home, work, public place
• Social company: alone, with friends, with strangers
User
• age
• gender
• personality
• cultural background
• musical background

3. Music creation
https://www.youtube.com/watch?v=k1DgNfz1g_s
http://www.inside.com.tw/2016/05/04/positive-grid-bias-head
https://youtu.be/rL5YKZ9ecpg?t=50m

1. Music information analysis
• Education, data visualization
2. Music information retrieval
• Search: through keywords (genre, instrument, emotion) or audio examples (humming or audio recording)
• Match: cross-domain retrieval
• Discover: recommendation
3. Music creation
• Google Magenta, Smule AutoRap, Samsung Hum-On, Positive Grid, Yamaha Vocaloid

ML in Music: “Music Info Retrieval/Analysis”
Music transcription (audio2score): audio → score
• audio → note (pitch, onset, offset)
• audio → instrument (flute, cello)
• audio → meter (4/4)
• audio → key (E-flat major)
Music semantic labeling: audio → labels
• audio → genre (classical)
• audio → emotion (yearning)
• audio → other attributes (slow/fast)
AI listener: audio (existing song) → score, labels
Applications in music retrieval, education, archival, etc.

ML in Music: “Music Generation/Synthesis”
AI composer: random seed, labels → score
AI performer (score2audio): score → audio (new song)

ML in Music: “Music Generation/Synthesis”
AI listener: audio (existing songs) → features, labels, score
AI DJ: features, labels, score → audio (a new song): remix, mashup, etc.
(image from the Internet)

Music AI Research
• Four broad topics
‒ audio → audio: signal processing
‒ audio → score: transcription
‒ score → score: composition
‒ score → audio: synthesis

Outline
• Types of data

Fundamentals of Music Signal Processing
• Pitch: which notes are played?
• Rhythm: how fast?
• Timbre: which instrument(s)?
Mozart’s Variationen (1st phrase)

Music Information Analysis
• music → features
melody
1. pitch
2. onset, offset
3. tempo
accompaniment
4. chord
5. instruments (timbre)
6. source separation
7. key, beat, downbeat, meter
8. Semantic description
─ genre: pop, classical, jazz, rock, R&B, “Tai,” “aboriginal”
─ emotion: happy, angry, sad, relaxed
─ usage: at party, working, driving, reading, sleeping, romance
─ theme: lonely, breakup, celebration, in love, friend, battle
─ vocal timbre: aggressive, breathy, duet, emotional, rapping, screaming
(figures: genre, listening context, emotion)

Pitch ♪♪♪ ♪♪♪ ♪♪♪
Rhythm
Timbre ♪♪
Karaoke scorer, chord recognizer, page turner

Pitch ♪♪♪ ♪
Rhythm ♪♪♪
Timbre ♪♪♪ ♪
instrument classifier, content ID, Spotify running

Pitch ♪♪♪ ♪♪♪ ♪♪♪
Rhythm ♪♪♪ ♪♪♪ ♪♪♪
Timbre ♪♪♪ ♪♪♪ ♪♪♪
similarity search or recommendation, music emotion or genre recognizer, automatic music video generation

• Listens to music: tempo, instrumentation, key, time signature, energy, harmonic & timbral structures
• Reads about music: lyrics, blog posts, reviews, playlists and discussion forums
• Learns about trends: online music behavior (who's talking about which artists this week, what songs are being streamed or downloaded)
• Not everything is in audio

• Let’s have a look at what we can extract from audio anyway
• Time-domain waveform

• Frequency domain representation
• Spectrogram (obtained by Short-Time Fourier Transform)
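As a minimal sketch of how a spectrogram comes out of the Short-Time Fourier Transform, the code below slices a waveform into overlapping windowed frames and takes the magnitude of a naive DFT of each frame. The function names, the frame length of 256, the hop of 128, and the 1 kHz test tone are illustrative assumptions, not values from the slides. A real system would use an FFT routine rather than this slow pure-Python DFT.

```python
import cmath
import math

def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]

def stft_magnitude(signal, frame_len=256, hop=128):
    """Return spec[t][k]: magnitude of frequency bin k (0..frame_len//2) in frame t."""
    window = hann(frame_len)
    n_bins = frame_len // 2 + 1
    # Precompute the complex exponentials once (naive DFT, fine for a demo).
    twiddle = [cmath.exp(-2j * math.pi * k / frame_len) for k in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        mags = []
        for k in range(n_bins):
            acc = sum(frame[i] * twiddle[(k * i) % frame_len] for i in range(frame_len))
            mags.append(abs(acc))
        frames.append(mags)
    return frames

# Hypothetical test tone: a 1 kHz sine sampled at 8 kHz.
sr = 8000
tone = [math.sin(2 * math.pi * 1000 * n / sr) for n in range(1024)]
spec = stft_magnitude(tone)

# The strongest bin of the first frame, converted back to Hz.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sr / 256
```

Each row of `spec` is one time frame and each column one frequency bin, which is why the result can be drawn as an image with time on one axis and frequency on the other.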

• Pitch
• Simple for monophonic signals (almost table lookup)
• Challenging for polyphonic signals; known as multi-pitch estimation (MPE)
‒ overlapping partials
‒ missing fundamentals
L. Su and Y.-H. Yang, "Combining spectral and temporal representations for multipitch estimation of polyphonic music," TASLP 2015
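For the monophonic case, the "almost table lookup" remark can be illustrated with a small sketch: estimate the fundamental frequency from the autocorrelation peak, then map that frequency to the nearest note name. This is one simple approach chosen for illustration, not necessarily the method used in the cited work. The function names, the 80 to 1000 Hz search range, and the 22 kHz test tone are assumptions. It does not handle the polyphonic problems listed above (overlapping partials, missing fundamentals).

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def estimate_f0(signal, sr, fmin=80.0, fmax=1000.0):
    """Pick the lag with the largest autocorrelation within [sr/fmax, sr/fmin]."""
    min_lag = int(sr / fmax)
    max_lag = int(sr / fmin)
    best_lag, best_val = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        val = sum(signal[i] * signal[i + lag] for i in range(len(signal) - lag))
        if val > best_val:
            best_lag, best_val = lag, val
    return sr / best_lag

def hz_to_note(freq, a4=440.0):
    """Map a frequency to (note name with octave, MIDI number), equal temperament."""
    midi = round(69 + 12 * math.log2(freq / a4))
    return f"{NOTE_NAMES[midi % 12]}{midi // 12 - 1}", midi

# Hypothetical monophonic input: a 440 Hz sine sampled at 22 kHz.
sr = 22000
tone = [math.sin(2 * math.pi * 440 * n / sr) for n in range(1024)]
f0 = estimate_f0(tone, sr)
note, midi = hz_to_note(f0)
```

Once the fundamental is known, the note name follows from a fixed formula, which is the sense in which the monophonic case is close to a lookup.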

• Tempo: beats per minute (bpm)
• Onset detection, downbeat estimation, tempo estimation, beat tracking, rhythm pattern extraction
(figure panels: energy-based, spectrum-based)
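A rough sketch of the energy-based route to tempo: compute an onset-strength curve from frame-to-frame energy increases, then look for the strongest periodicity in that curve within a plausible tempo range. The function names, the 60 to 180 bpm range, the low 1 kHz sample rate, and the synthetic click track are all assumptions made to keep the example small. Practical beat trackers are considerably more elaborate.

```python
def onset_envelope(signal, frame_len, hop):
    """Energy-based onset strength: the positive change in frame energy."""
    energies = [
        sum(x * x for x in signal[s:s + frame_len])
        for s in range(0, len(signal) - frame_len + 1, hop)
    ]
    return [max(0.0, energies[i] - energies[i - 1]) for i in range(1, len(energies))]

def estimate_tempo(envelope, frame_rate, bpm_min=60.0, bpm_max=180.0):
    """Pick the autocorrelation peak of the envelope inside the allowed tempo range."""
    min_lag = int(round(frame_rate * 60.0 / bpm_max))
    max_lag = int(round(frame_rate * 60.0 / bpm_min))
    best_lag, best_val = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        val = sum(envelope[i] * envelope[i + lag] for i in range(len(envelope) - lag))
        if val > best_val:
            best_lag, best_val = lag, val
    return 60.0 * frame_rate / best_lag

# Hypothetical click track: a short burst every 0.5 s (120 bpm), 8 s at 1 kHz.
sr, hop = 1000, 10
clicks = [1.0 if n % 500 < 20 else 0.0 for n in range(8000)]
envelope = onset_envelope(clicks, frame_len=10, hop=hop)
tempo = estimate_tempo(envelope, frame_rate=sr / hop)
```

Restricting the lag search to a tempo range is a design choice that avoids picking half or double the true tempo in this toy case; on real music that ambiguity is a well-known difficulty.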

• Timbre: difference in time-frequency distribution
‒ odd-to-even harmonic ratio, decay rate, vibrato, etc.
(figures: piano solo, human voice)
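To make one of these timbre descriptors concrete, here is a small sketch of an odd-to-even harmonic ratio: measure the amplitude of each harmonic of a known fundamental, then divide the energy in odd harmonics by the energy in even ones. The exact definition varies across the literature; this energy-ratio form, the function names, and the two synthetic tones are assumptions for illustration.

```python
import cmath
import math

def harmonic_amplitudes(signal, sr, f0, n_harmonics=6):
    """Amplitude of each harmonic k*f0, via projection onto a complex sinusoid."""
    n = len(signal)
    amps = []
    for k in range(1, n_harmonics + 1):
        w = 2 * math.pi * k * f0 / sr
        acc = sum(x * cmath.exp(-1j * w * i) for i, x in enumerate(signal))
        amps.append(2 * abs(acc) / n)
    return amps

def odd_to_even_ratio(amps):
    """Energy in harmonics 1, 3, 5, ... divided by energy in harmonics 2, 4, 6, ..."""
    odd = sum(a * a for a in amps[0::2])
    even = sum(a * a for a in amps[1::2])
    return odd / even if even > 0 else float("inf")

# Two hypothetical tones with the same pitch but different harmonic balance.
sr, f0 = 8000, 200

def tone(second_harmonic):
    return [
        math.sin(2 * math.pi * f0 * n / sr)
        + second_harmonic * math.sin(2 * math.pi * 2 * f0 * n / sr)
        + 0.5 * math.sin(2 * math.pi * 3 * f0 * n / sr)
        for n in range(400)  # exactly 10 periods of f0
    ]

hollow = odd_to_even_ratio(harmonic_amplitudes(tone(0.1), sr, f0))
full = odd_to_even_ratio(harmonic_amplitudes(tone(0.8), sr, f0))
```

Both tones have the same pitch, yet the ratio separates them clearly, which is the point of a timbre feature.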

• Spectrogram, or the reduced-dimension version “Mel-spectrogram,” is usually considered a “raw” feature representation of music
• Can be treated as an image and then processed by convolutional neural nets (CNN)
figure made by Sander Dieleman
http://benanne.github.io/2014/08/05/spotify-cnns.html
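The dimension reduction behind a Mel-spectrogram can be sketched as a bank of triangular filters whose centers are evenly spaced on the mel scale. Each spectrogram frame is multiplied by this bank, turning many linear-frequency bins into a few mel bands. The sketch below uses the common formula mel = 2595 log10(1 + f/700); the function names, the choice of 10 bands, and the flat test spectrum are assumptions, and libraries differ in normalization details.

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_bins, sr):
    """Triangular filters with edges evenly spaced on the mel scale up to sr/2."""
    max_mel = hz_to_mel(sr / 2)
    # n_mels + 2 edge frequencies in Hz, equally spaced in mel.
    edges = [mel_to_hz(max_mel * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * (sr / 2) / (n_bins - 1) for k in range(n_bins)]
    bank = []
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))   # rising slope
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def apply_filterbank(bank, spectrum):
    """Reduce one magnitude spectrum (n_bins values) to n_mels band values."""
    return [sum(w * s for w, s in zip(row, spectrum)) for row in bank]

# Hypothetical setup: 129 linear bins (0 to 4 kHz) reduced to 10 mel bands.
bank = mel_filterbank(n_mels=10, n_bins=129, sr=8000)
mel_frame = apply_filterbank(bank, [1.0] * 129)
```

The filters get wider toward high frequencies, so the reduced representation keeps more detail in the low range, where pitch differences matter most.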

• Chromagram: a better “timbre-invariant” feature representation for pitch related tasks (e.g. chord recognition, cover song identification)
‒ merge all the frequency bins with the same note name (C, C#, D, D#, …)
‒ 12-dim vector for each time frame
figure made by Meinard Müller
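The bin-merging step can be sketched as follows: convert each frequency bin to its nearest equal-tempered note, then add its magnitude to the entry for that note name, ignoring the octave. The function name, the 55 Hz lower cutoff, and the toy spectrum are assumptions. Real chroma extractors usually add tuning estimation and smoother bin-to-note weighting.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def chroma_from_spectrum(magnitudes, sr, n_fft, fmin=55.0, a4=440.0):
    """Sum the magnitude of every bin into the pitch class of its nearest note."""
    chroma = [0.0] * 12
    for k, mag in enumerate(magnitudes):
        freq = k * sr / n_fft
        if freq < fmin:
            continue  # skip DC and very low bins
        midi = round(69 + 12 * math.log2(freq / a4))
        chroma[midi % 12] += mag
    return chroma

# Hypothetical magnitude spectrum with energy at 220, 440 and 880 Hz
# (the note A in three octaves); bins are 2 Hz apart.
sr, n_fft = 8000, 4000
magnitudes = [0.0] * (n_fft // 2 + 1)
magnitudes[110] = 1.0    # 220 Hz
magnitudes[220] = 0.5    # 440 Hz
magnitudes[440] = 0.25   # 880 Hz

chroma = chroma_from_spectrum(magnitudes, sr, n_fft)
strongest = NOTE_NAMES[max(range(12), key=lambda i: chroma[i])]
```

Because octaves are folded together, a change in how energy is spread across octaves (a large part of timbre) leaves the strongest pitch class unchanged.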

• Source separation can sometimes be helpful
‒ harmonic/percussive separation: given a mixture, separate the percussive part from the harmonic part
‒ harmonic: pitch related info
‒ percussive: tempo related info
(a) original (b) harmonic (c) percussive
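One well-known way to do harmonic/percussive separation is median filtering on the spectrogram: sustained tones form horizontal ridges (smooth along time), drum hits form vertical ridges (smooth along frequency). The sketch below applies that idea with a binary mask. The slides do not say which method produced the figure, so treating it as median filtering is an assumption, as are the function names, the filter size of 5, and the toy spectrogram.

```python
from statistics import median

def median_filter_1d(values, size):
    """Median over a sliding window, shrinking the window at the edges."""
    half = size // 2
    n = len(values)
    return [median(values[max(0, i - half):min(n, i + half + 1)]) for i in range(n)]

def hpss(spec, size=5):
    """spec[t][f] is a magnitude spectrogram. Returns (harmonic, percussive)."""
    n_t, n_f = len(spec), len(spec[0])
    # Harmonic-enhanced: smooth each frequency bin along time.
    along_time = [median_filter_1d([spec[t][f] for t in range(n_t)], size)
                  for f in range(n_f)]
    # Percussive-enhanced: smooth each frame along frequency.
    along_freq = [median_filter_1d(spec[t], size) for t in range(n_t)]
    harmonic = [[0.0] * n_f for _ in range(n_t)]
    percussive = [[0.0] * n_f for _ in range(n_t)]
    for t in range(n_t):
        for f in range(n_f):
            # Binary mask: assign each cell to whichever view explains it better.
            if along_time[f][t] >= along_freq[t][f]:
                harmonic[t][f] = spec[t][f]
            else:
                percussive[t][f] = spec[t][f]
    return harmonic, percussive

# Toy spectrogram: a sustained tone in bin 2 plus a broadband hit at frame 4.
spec = [[1.0 if (f == 2 or t == 4) else 0.0 for f in range(9)] for t in range(9)]
harmonic, percussive = hpss(spec)
```

In the toy example the horizontal line ends up in `harmonic` and the vertical line in `percussive`, mirroring panels (b) and (c).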

• Source separation can sometimes be helpful
‒ singing voice separation: given a mixture, separate the singing voice from the accompaniment

• Pitch, tempo, and timbre play different roles in different tasks
• Spectrogram: a basic feature representation
• Multipitch estimation: for better pitch info
• Source separation: might improve the extraction of pitch, tempo, and also timbre
• Feature design (based on domain knowledge) versus feature learning (data-driven; deep learning)

Outline
• Types of data

Types of Data
• Music audio data
─ not sharable due to copyright issues and business interests
─ however, audio features can be shared
─ or, start with copyright-free music (e.g., Free Music Archive)
• Music listening data
‒ from social platforms via, e.g., the last.fm API or Spotify API
‒ from Twitter: #nowplaying dataset
• Big music text data
─ scores, lyrics, reviews, playlists, tags, Wikipedia, etc.
─ not everything is in audio
─ some information is easier to get from non-audio data
• Big sensor data?
─ sensors attached to “things” or “human beings”

Data Science in Music
• The missing “D” in Data Science: domain knowledge
• Music information retrieval = musicology + signal processing + machine learning + others

Resources
• Conference proceedings
‒ Int’l Soc. Music Information Retrieval Conf. (ISMIR)
‒ Int’l Conf. Acoustics, Speech, and Signal Processing (ICASSP)
‒ AAAI, IJCAI, ICML, NeurIPS, ICLR, ACM MM
• Transactions
‒ Transactions of the Int’l Soc. Music Information Retrieval (TISMIR)
‒ IEEE/ACM Trans. Audio, Speech, and Language Processing (TASLP)
‒ IEEE Trans. Multimedia (TMM)

Resources
• MIREX (MIR Evaluation eXchange)
‒ Part of ISMIR
‒ http://www.music-ir.org/mirex/wiki/MIREX_HOME
• Tasks
‒ Audio Onset Detection
‒ Audio Beat Tracking
‒ Audio Key Detection
‒ Audio Downbeat Detection
‒ Real-time Audio to Score Alignment (a.k.a. Score Following)
‒ Audio Cover Song Identification
‒ Discovery of Repeated Themes & Sections
‒ Audio Melody Extraction
‒ Query by Singing/Humming
‒ Audio Chord Estimation
‒ Singing Voice Separation
‒ Audio Fingerprinting
‒ Music/Speech Classification/Detection
‒ Audio Offset Detection

Resources
• Courses
‒ Juhan Nam @ KAIST
https://mac.kaist.ac.kr/~juhan/gct634/index.html
‒ Meinard Müller @ Universität Erlangen-Nürnberg
https://www.audiolabs-erlangen.de/fau/professor/mueller/teaching
‒ Juan Bello @ NYU
https://wp.nyu.edu/jpbello/teaching/mir/
‒ CCRMA summer school @ Stanford
https://ccrma.stanford.edu/workshops/music-information-retrieval-mir-2015
‒ Xavier Serra @ UPF, Spain
https://zh-tw.coursera.org/course/audio
