http://mac.citi.sinica.edu.tw/~yang/
yhyang@ailabs.tw
Yi-Hsuan Yang Ph.D. 1,2
1 Taiwan AI Labs
2 Research Center for IT Innovation, Academia Sinica
About the Music and AI Lab @ Sinica
• About Academia Sinica
 National academy of Taiwan, founded in 1928
 About 1,000 Full/Associate/Assistant Researchers
• About Music and AI Lab (musicai)
 Since Sep 2011
 Members
PI [me]
research assistants
PhD/master students
3 AAAI full papers + 3 IJCAI full papers in the last two years
 MidiNet, MuseGAN, PerformanceNet, etc.
About the Music AI Team @ Taiwan AI Labs
• About Taiwan AI Labs
 Privately funded research organization (like OpenAI),
founded by Ethan Tu (PTT) in 2017
 Three main research areas: 1) HCI, 2) medicine, 3) smart city
• About the Music AI team
 Members
scientist [me]
ML engineers (for models)
musicians
program manager
software engineers (for frontend/backend)
Examples of ML Models for Music
• YouTube Content ID
https://support.google.com/youtube/answer/2822002?hl=en
Figure source: https://www.researchgate.net/figure/State-of-the-art-audio-fingerprinting-algorithms_fig1_320856220
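Content ID rests on audio fingerprinting: picking spectrogram peaks and hashing pairs of nearby peaks, in the spirit of Shazam-style "constellation" matchers. A minimal sketch (the peak-picking and hashing parameters are illustrative, not those of any production system):

```python
import numpy as np

def spectral_peaks(mag, peaks_per_frame=5):
    """Keep the strongest bins per frame as a 'constellation' of peaks."""
    peaks = []
    for t in range(mag.shape[1]):
        for f in np.argsort(mag[:, t])[-peaks_per_frame:]:
            peaks.append((t, int(f)))
    return sorted(peaks)

def fingerprint(peaks, fan_out=3):
    """Hash pairs of nearby peaks: (f1, f2, dt) -> list of anchor times."""
    hashes = {}
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1 : i + 1 + fan_out]:
            hashes.setdefault((f1, f2, t2 - t1), []).append(t1)
    return hashes
```

Matching a query then amounts to intersecting its hash keys with the database and voting on the time offset; the same audio always yields the same hashes.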
Examples of ML Models for Music
• Recommendation
 user feedback based (collaborative filtering)
 audio content based
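The user-feedback branch can be sketched with item-item collaborative filtering: cosine similarity between the columns of a user-item listening matrix, then scoring unheard items against what the user already played. The matrix below is invented toy data:

```python
import numpy as np

def item_similarity(R):
    """Cosine similarity between the item columns of user-item matrix R."""
    norms = np.linalg.norm(R, axis=0, keepdims=True)
    norms[norms == 0] = 1.0          # avoid dividing by zero for unplayed items
    Rn = R / norms
    return Rn.T @ Rn

def recommend(R, user, k=2):
    """Score unheard items by their similarity to the user's listened items."""
    scores = item_similarity(R) @ R[user]
    scores[R[user] > 0] = -np.inf    # mask items the user already heard
    return list(np.argsort(scores)[::-1][:k])
```

Audio-content-based recommendation replaces the listening matrix with similarities computed from audio features, which helps with new songs that have no feedback yet (the cold-start case).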
Music Information “Retrieval”
• (i) Radio → (ii) CDs → (iii) downloading music
→ (iv) listening to music online
• Paradigm shift
(i→ii) DJ’s selection → your selection
(ii→iii) Internet, PC, walkman, phones
(iii→iv) a few thousand → millions of songs (anytime,
anywhere, and “any song”)
Music Information “Retrieval”
• Research focus
(iii: PC+Web) auto-classification, similarity search
(iv: million songs) recommendation
(Diagram: MIR draws on machine learning, signal processing, data mining, information retrieval, music psychology, musicology, and human-computer interaction)
Music Information “Retrieval”
• Relevant conferences
International Computer Music Conference (ICMC),
since 1975
IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), since 1976
ACM Multimedia (MM), since 1993
IEEE International Conference on Image Processing
(ICIP), since 1994
International Society for Music Information
Retrieval Conference (ISMIR), since 2000
ISMIR
• https://www.ismir.net/conferences/
• https://ismir2019.ewi.tudelft.nl/?q=accepted-papers
• http://www.terasoft.com.tw/conf/ismir2014/ConferenceProgram.html
MIREX
• https://www.music-ir.org/mirex/wiki/2019:MIREX2019_Results
ML in Music: “Music Info Retrieval/Analysis”
Music transcription (audio2score)
• audio → note (pitch, onset, offset)
• audio → instrument (flute, cello)
• audio → meter (4/4)
• audio → key (E-flat major)
(diagram: existing song → audio → score)
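As a taste of what pitch transcription involves, fundamental-frequency estimation for a monophonic signal can be sketched with autocorrelation peak picking. Real transcribers (e.g., for polyphonic piano) are far more involved; this is a toy illustration:

```python
import numpy as np

def estimate_pitch(x, sr, fmin=50.0, fmax=1000.0):
    """Estimate the fundamental frequency (Hz) of a monophonic frame
    by picking the autocorrelation peak in the plausible lag range."""
    x = x - x.mean()
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]  # lags 0..N-1
    lo, hi = int(sr / fmax), int(sr / fmin)            # lag search window
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag
```

Note that the answer is quantized to integer lags, so short frames at high sample rates give only approximate pitch; practical systems interpolate around the peak.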
ML in Music: “Music Info Retrieval/Analysis”
Music transcription (audio2score)
• audio → note (pitch, onset, offset)
• audio → instrument (flute, cello)
• audio → meter (4/4)
• audio → key (E-flat major)
Music semantic labeling
• audio → genre (classical)
• audio → emotion (yearning)
• audio → other attributes (slow/fast)
(diagram: existing song → AI listener → score + labels; applications in music retrieval, recommendation, education, etc.)
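Semantic labeling is ordinarily cast as supervised classification over audio features. A minimal nearest-centroid sketch (the two-dimensional "features" and the labels below are invented purely for illustration; real systems use learned features and stronger classifiers):

```python
import numpy as np

def fit_centroids(X, y):
    """One centroid per label, averaged over training feature vectors X."""
    labels = sorted(set(y))
    y = np.asarray(y)
    return labels, np.stack([X[y == lab].mean(axis=0) for lab in labels])

def predict(labels, centroids, x):
    """Assign the label of the centroid nearest to feature vector x."""
    return labels[int(np.argmin(np.linalg.norm(centroids - x, axis=1)))]
```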
AI Listener: Source Separation
• “Demix” the music signal
 input: audio mixture
 output: individual tracks
AI Listener: Source Separation
• “Demix” the music signal
• Applications
 As a preprocessing step for music transcription
(to create data for training AI composers or performers)
 AI DJ (e.g., mashup)
 Karaoke, soundtracks for movies
 Singing voice processing
 Smart headset, smart loudspeaker
 Education
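Much separation work is framed as time-frequency masking: estimate a per-bin mask and multiply it onto the mixture spectrogram. The oracle "ideal binary mask" below assumes the true source spectrograms are known, so it is an upper bound and an illustration of the idea, not a usable separator:

```python
import numpy as np

def ideal_binary_mask(mag_vocals, mag_accomp):
    """1 where the vocals dominate a time-frequency bin, else 0.
    (Oracle: assumes both source magnitude spectrograms are known.)"""
    return (mag_vocals >= mag_accomp).astype(float)

def apply_mask(mix_spec, mask):
    """Keep only the bins assigned to the target source."""
    return mix_spec * mask
```

Learned separators (e.g., deep networks) replace the oracle with a mask predicted from the mixture alone.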
Open Source Tool: “Open-unmix”
• https://open.unmix.app/
Stöter, Uhlich, Liutkus, and Mitsufuji, “Open-Unmix - A reference implementation
for audio source separation,” Journal of Open Source Software 2019
• 6.32 dB SDR in vocal separation on the SiSEC 2018 dataset
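The SDR figure above is, roughly, the energy ratio between the reference source and the residual error, in dB. A plain sketch of that ratio (SiSEC actually uses the BSS Eval definition, which allows certain distortions before computing the ratio):

```python
import numpy as np

def sdr(reference, estimate):
    """Signal-to-distortion ratio in dB (plain, non-BSS-Eval variant)."""
    err = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(err ** 2))
```

Higher is better; a perfect estimate has infinite SDR, and shrinking the error by 10x adds 20 dB.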
AI Listener: Music transcription
https://magenta.tensorflow.org/onsets-frames
• Applications
 Support the training
of AI composer and
AI performer
• SOTA
 “Onsets and Frames”
(Google Magenta)
 Piano only
ML in Music: “Music Info Retrieval/Analysis”
(recap of the transcription and semantic-labeling tasks above)
ML in Music: “Music Generation/Synthesis”
(diagram: random seed → AI composer → score → AI performer (score2audio) → audio → new song)
AI Composer
IBM Watson Beat
• Make musicians’ lives easier
 inspire ideas
 suggest continuations
 suggest accompaniments
• Empower everyone to make music
 democratization of music creation
• Create copyright-free music for videos or games
• Music education
Examples from Image Generation
Brock et al., “Large scale GAN training for high fidelity natural image synthesis,” ICLR 2019
(diagram: finding the “latent space” of data, mapping latent space to data space)
Latent Space Exploration
http://vibertthio.com/drum-vae-client/
Latent Space Interpolation
• latent space interpolation between Maroon 5’s “Payphone” and The Beatles’ “Hey Jude”
Liu and Yang, “Lead sheet generation and arrangement via a hybrid generative model,” ISMIR-LBD 2018
Lu et al., “Play as You Like: Timbre-enhanced multi-modal music style transfer,” AAAI 2019
• timbre style transfer between piano and guitar
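Interpolating between two songs amounts to decoding points on a path between their latent codes. Linear and spherical paths are the usual choices; the vectors below are toy stand-ins for real encoder outputs:

```python
import numpy as np

def lerp(z_a, z_b, steps):
    """Linear interpolation between two latent vectors."""
    return [(1 - t) * z_a + t * z_b for t in np.linspace(0.0, 1.0, steps)]

def slerp(z_a, z_b, steps):
    """Spherical interpolation, often preferred for Gaussian latents.
    Assumes z_a and z_b are not (anti)parallel."""
    cos = np.dot(z_a, z_b) / (np.linalg.norm(z_a) * np.linalg.norm(z_b))
    omega = np.arccos(np.clip(cos, -1.0, 1.0))
    return [(np.sin((1 - t) * omega) * z_a + np.sin(t * omega) * z_b)
            / np.sin(omega) for t in np.linspace(0.0, 1.0, steps)]
```

Each intermediate vector is then fed through the decoder to produce a musical fragment "between" the two endpoints.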
Demo: Magenta Studio
https://magenta.tensorflow.org/studio/
Music style transfer
• Alter the “style,” but keep the “content” fixed
• Three types of music style transfer
1. composition style transfer for score
2. performance style transfer for performance control
3. timbre style transfer for sound
Dai et al., “Music style transfer: A position paper,” MUME 2018
Singing Style Transfer
Wu et al., “Singing style transfer using cycle-consistent boundary equilibrium
generative adversarial networks,” ICML workshop 2018
• Transfer between female and male singing voices
http://mirlab.org/users/haley.wu/cybegan/
Music AI Research (Common Approach)
• Algorithmic composition
“MIDI in, MIDI out”
• Limitations
Lack expressivity: cannot be directly listened to
Some music genres are not “written language”
(analogy: music analysis is to NLU as music generation is to NLG)
(figure: the music encoding used by OpenAI’s MuseNet model)
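“MIDI in, MIDI out” systems serialize notes into an event vocabulary (NOTE_ON / NOTE_OFF / TIME_SHIFT and the like) that a language model can consume. A toy encoder in that spirit (not MuseNet's actual vocabulary; times are in abstract ticks):

```python
def encode(notes):
    """Encode (onset_tick, duration_ticks, pitch) notes as event tokens."""
    events = []
    for onset, dur, pitch in notes:
        events.append((onset, 1, f"NOTE_ON_{pitch}"))
        events.append((onset + dur, 0, f"NOTE_OFF_{pitch}"))
    events.sort()  # by time; note-offs (0) before note-ons (1) at the same tick
    tokens, now = [], 0
    for t, _, name in events:
        if t > now:
            tokens.append(f"TIME_SHIFT_{t - now}")
            now = t
        tokens.append(name)
    return tokens
```

The resulting token sequence is what a Transformer-style model is trained to continue; decoding reverses the mapping back into notes.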
Music AI Research (at Taiwan AI Labs)
• “audio in, audio out”
 audio → audio: source separation (SS) [denoising]
 audio → score: music transcription (MT) [ASR]
 score → score: composition [NLG]
 score → audio: synthesis [TTS]
The “MIR4generation” Pipeline
• The transcription model predicts the pitch, onset/offset timing
(in absolute time), and velocity (dynamics) of each note
 and the beat/downbeat model provides the underlying metrical grid
(in symbolic time) of the music
• So the composition model can learn to generate expressive music
that can be directly listened to
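The step from absolute to symbolic time can be sketched as snapping transcribed onset times onto a grid built from the tracked beats. The grid resolution and index-based representation below are illustrative choices, not the pipeline's actual format:

```python
def quantize_onsets(onsets_sec, beat_times, subdivisions=4):
    """Snap absolute onset times (seconds) to indices on a symbolic grid
    with `subdivisions` steps per tracked beat."""
    grid = []
    for b0, b1 in zip(beat_times, beat_times[1:]):
        step = (b1 - b0) / subdivisions
        grid.extend(b0 + i * step for i in range(subdivisions))
    grid.append(beat_times[-1])
    # each onset maps to its nearest grid position
    return [min(range(len(grid)), key=lambda i: abs(grid[i] - t))
            for t in onsets_sec]
```

Keeping the original timing deviations (onset minus grid time) alongside the grid index is what lets a model reproduce expressive, human-like timing.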
Demo
We currently focus on piano music
https://soundcloud.com/yating_ai/sets/ismir-2019-submission/
https://www.youtube.com/watch?v=9ZIJrr6lmHg
Interactivity: Jamming with Yating
• Yeh et al., “Learning to generate Jazz and Pop piano music from audio via MIR
techniques,” ISMIR-LBD 2019
• Hsiao et al., “Jamming with Yating: Interactive demonstration of a music
composition AI,” ISMIR-LBD 2019
The Many “Roles” of AI Musicians
• Music information retrieval/analysis
 AI listener
 Music source separation (audio → audio)
 Music transcription (audio → score)
 Music semantic labeling (audio → label)
 For analyzing and indexing existing songs
• Music generation/synthesis
 AI composer (random seed → score)
 AI performer (score → audio)
 For creating new music
machine learning x music