http://mac.citi.sinica.edu.tw/~yang/
yhyang@ailabs.tw
yang@citi.sinica.edu.tw
Yi-Hsuan Yang Ph.D. 1,2
1 Taiwan AI Labs
2 Research Center for IT Innovation, Academia Sinica
October 26, 2021
Well-Established Music Technology:
“Making” Sounds
2
• Music synthesizers that make realistic sounds (e.g., electric piano) and new sounds (e.g., electric guitar)
• Based on digital signal processing
Sources:
https://www.wardbrodt.com/blog/history-of-the-electronic-keyboard-infographic-madison-wisconsin
https://www.musicnexo.com/blog/en/history-of-the-electric-guitar-eternal-youth/
https://freesound.org/people/karolist/sounds/370934/
Well-Established Music Technology:
“Altering” Sounds
3
Well-Established Music Technology:
“Altering” Sounds
4
(image from https://www.youtube.com/watch?v=u2g2tH0yb_Q)
Well-Established Music Technology:
“Mixing” Sounds
5
(image from https://midination.com/daw/best-daw-software/)
6
velocity, duration,
articulation, playing
technique, etc
timbre
“what”
to play
“how”
to play
use “what”
to play
Music: Multiple aspects
7
(the same three-aspect diagram, annotated)
• use “what” to play (timbre) → well-accepted, good computer tools
• “what” to play and “how” to play (velocity, duration, articulation, playing technique, etc.) → human privilege
Emerging Music Technology:
Making Music
8
• Computers that can understand (existing) music and create
new music compositions & performances
• Based on machine learning
• Example: “Bach Doodle” by Google in 2019
(https://www.google.com/doodles/celebrating-johann-sebastian-bach)
https://youtu.be/gsUV0mGEGaY
(image is from Google Magenta’s website)
Emerging Music Technology:
Making Music
9
(image from https://www.youtube.com/watch?v=u2g2tH0yb_Q)
Increasing Interest
(images are from the internet)
AI Music: Examples
11
Original Jazz Lazy
By Sony CSL
AI Music: Examples
12
https://www.youtube.com/watch?v=Emidxpkyk6o
https://www.youtube.com/watch?v=eLc3Y0SShFY
AI Music: Examples
13
https://www.youtube.com/watch?v=rTuK4iqQtPI
AI Music: Examples
https://magenta.tensorflow.org/studio/
14
“Continue,” “Generate 4 bars,” “Drumify,” “Interpolate,” “Groove”
(image is from Google Magenta’s website)
AI Music: Examples
15
https://magenta.tensorflow.org/pianogenie
AI Music: Examples
16
AI Music: Examples
17
https://sites.research.google/tonetransfer
Ethical Concerns
18
(comments on AIVA’s “I am AI” album)
Ethical Concerns
19
• “As AI begins to reshape how music is made, our
legal systems are going to be confronted with some
messy questions regarding authorship.”
• “Do AI algorithms create their own work, or is it the
humans behind them?”
• “What happens if AI software trained solely on
Beyoncé creates a track that sounds just like her?”
Positive Views
https://www.youtube.com/watch?v=wYb3Wimn01s
20
(Pierre Barreau, the CEO of AIVA)
22
(the same three-aspect diagram, annotated)
• use “what” to play (timbre) → well-accepted, good computer tools
• “what” to play and “how” to play (velocity, duration, articulation, playing technique, etc.) → human privilege
How to Build Such an AI?
23
https://www.eslite.com/product/1001117242682044820004
How to Build Such an AI?
24
https://www.unite.ai/artificial-general-intelligence-agi/
Actually…
25
• Based on probabilities
• Given x_{t-1} (the last one), predict x_t (the current one)
https://towardsdatascience.com/markov-chain-for-music-generation-932ea8a88305
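The first-order case can be sketched in a few lines of Python. This is a toy illustration of “given x_{t-1}, predict x_t” in the spirit of the linked article, not its actual code; the training melody and note names below are made up for the example.

```python
import random
from collections import defaultdict

# Toy first-order Markov chain for melody generation: count how often each
# note follows each other note, then sample the next note from those counts.
melody = ["C", "D", "E", "C", "D", "G", "E", "C", "D", "E", "G", "E", "C"]

transitions = defaultdict(list)
for prev_note, next_note in zip(melody[:-1], melody[1:]):
    transitions[prev_note].append(next_note)        # P(x_t | x_{t-1}) as raw counts

def generate(start="C", length=16):
    out = [start]
    for _ in range(length - 1):
        out.append(random.choice(transitions[out[-1]]))  # sample the next note
    return out

print(generate())   # e.g. ['C', 'D', 'E', 'C', ...]
```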
Actually…
26
• Based on probabilities
• Given x_{t-1} (the last one), predict x_t (the current one)
• Given x_1, …, x_{t-2}, x_{t-1} (all the past), predict x_t (the current one)
→ deep neural networks
Solving Math
• A piece of music is treated as a sequence of “events”
• A model is trained to predict “what comes next”, given the history of previous events
P(x_t | x_{t-1}, x_{t-2}, x_{t-3}, …)
“As we listen to melodies, our brain also guesses what's next”
https://bigthink.com/surprising-science/music-brain-predict
27
Solving Math
• A piece of music is treated as a sequence of “events”
• A model is trained to predict “what comes next”, given the history of previous events
P(x_t | x_{t-1}, x_{t-2}, x_{t-3}, …),  x_t ∈ S, where S is a finite collection of possible events
28
Core Questions
• A piece of music is treated as a sequence of “events”
• A model is trained to predict “what comes next”, given the history of previous events
P(x_t | x_{t-1}, x_{t-2}, x_{t-3}, …),  x_t ∈ S
1. How to turn music into events in S
2. How to solve the equation (i.e., model this probability)
29
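To make the two questions concrete, here is a minimal sketch, assuming PyTorch and an illustrative toy vocabulary S (this is not any specific model from the talk): a network maps the history of event tokens to a probability distribution over S and is trained with cross-entropy against the event that actually came next.

```python
import torch
import torch.nn as nn

# Assumed toy vocabulary S of music events (real token sets are defined later in the talk).
S = ["Bar", "Subbeat_1", "Pitch_C4", "Pitch_E4", "Duration_8th", "Velocity_80"]
stoi = {tok: i for i, tok in enumerate(S)}

class NextEventModel(nn.Module):
    """Minimal autoregressive model: embed the history, summarize it, predict P(x_t | x_1..x_{t-1})."""
    def __init__(self, vocab_size, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)    # one logit per event in S

    def forward(self, history):                   # history: (batch, t-1) token ids
        h, _ = self.rnn(self.embed(history))
        return self.head(h[:, -1])                # distribution over the next event

model = NextEventModel(len(S))
history = torch.tensor([[stoi["Bar"], stoi["Subbeat_1"], stoi["Pitch_C4"]]])
target = torch.tensor([stoi["Duration_8th"]])     # "what actually came next"
loss = nn.functional.cross_entropy(model(history), target)
loss.backward()                                    # training = minimizing this loss over a corpus
```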
RNNs
30
(images are from the internet)
RNNs vs. Transformers
• RNNs summarize the history with a single latent vector
• Transformers represent each of the previous events with a separate latent vector
31
(images are from the internet)
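A small numerical sketch of the contrast, using NumPy with arbitrary toy shapes and weights (not code from the talk): the RNN folds the whole history into one vector h, while self-attention keeps one vector per past event and mixes them with attention weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                    # latent dimension (arbitrary)
events = rng.normal(size=(5, d))         # 5 past events, each already embedded

# RNN view: the whole history is folded into a single latent vector h.
W_h, W_x = rng.normal(size=(d, d)), rng.normal(size=(d, d))
h = np.zeros(d)
for x in events:
    h = np.tanh(W_h @ h + W_x @ x)       # everything the model "remembers" lives in h

# Transformer view: every past event keeps its own latent vector,
# and the current step attends over all of them.
def self_attention(X):
    scores = X @ X.T / np.sqrt(X.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)    # attention weights over the whole history
    return w @ X

context = self_attention(events)         # one context vector per position
print(h.shape, context.shape)            # (8,) vs (5, 8)
```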
Representing Music as Events
32
• Melody is easier
• But, how about…
• A music piece is viewed as a token sequence
• A musical note is described by three tokens
Representing Music as Events
33
(figure: each note is encoded by a pitch token, a duration token, and a velocity token)
Music Transformer: Generating Music with Long-Term Structure, ICLR 2019
• How to represent “time” in music using a finite set of
events?
Representing Music as Events
34
(figure: the same pitch, duration, and velocity tokens as before)
Music Transformer: Generating Music with Long-Term Structure, ICLR 2019
• Add one additional token to mark “∆T” (time interval)
Representing Music as Events
35
(figure: notes a and b start together, so ∆T = 0; each note still gets pitch, duration, and velocity tokens)
• And, one additional token to mark “∆T” (time interval)
Representing Music as Events
36
(figure: note c starts 50 ms after note a, marked by ∆T = 50 ms)
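A concrete sketch of this kind of event list (an illustration in the spirit of the Music Transformer-style encoding; the token names, time units, and the three-note example are assumptions): each note contributes pitch, duration, and velocity tokens, and a “∆T” token is inserted whenever the onset time advances.

```python
# Illustrative sketch (not the exact Music Transformer vocabulary): turn notes
# into a flat token sequence with pitch / duration / velocity tokens plus a
# "DeltaT" token whenever the onset time advances. Times are in milliseconds.
notes = [                                  # made-up example: a and b together, c 50 ms later
    {"name": "a", "onset": 0,  "pitch": 60, "dur": 500, "vel": 80},
    {"name": "b", "onset": 0,  "pitch": 64, "dur": 500, "vel": 80},
    {"name": "c", "onset": 50, "pitch": 67, "dur": 250, "vel": 72},
]

def to_events(notes):
    tokens, now = [], 0
    for n in sorted(notes, key=lambda n: n["onset"]):
        if n["onset"] > now:                        # time moved forward -> emit a DeltaT token
            tokens.append(f"DeltaT_{n['onset'] - now}ms")
            now = n["onset"]
        tokens += [f"Pitch_{n['pitch']}", f"Duration_{n['dur']}ms", f"Velocity_{n['vel']}"]
    return tokens

print(to_events(notes))
# ['Pitch_60', 'Duration_500ms', 'Velocity_80',
#  'Pitch_64', 'Duration_500ms', 'Velocity_80',
#  'DeltaT_50ms', 'Pitch_67', 'Duration_250ms', 'Velocity_72']
```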
A Problem
• “∆T” does not work well for pop music
1. The generated music lacks stable rhythmic structure
2. The model has a hard time “counting the beats”
3. Errors in “∆T” accumulate
37
Our Proposal
• Mark the bar lines
• Indicate the position of a note within a bar
38
(figure: each note’s tokens become bar, subbeat, pitch, duration, velocity, replacing the “∆T” token)
Our Proposal
• Mark the bar lines
• Indicate the position of a note within a bar
39
(figure: each note’s tokens are bar, subbeat, pitch, duration, velocity; notes a and b that sound together carry the same “subbeat” token)
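A simplified sketch of the proposed representation (in the spirit of the beat-based tokens in our Pop Music Transformer work; the 16-subbeat grid, token names, and the note list are illustrative assumptions): a “Bar” token marks each new bar, and a “Subbeat” token gives the position inside the bar, so notes that sound together carry the same subbeat value.

```python
# Simplified sketch of the bar/subbeat idea: quantize onsets to a grid of
# 16 subbeats per bar, emit "Bar" when a new bar begins and "Subbeat_k" for
# the position inside the bar. Notes a and b below sound together, so they
# carry the same Subbeat token. Numbers are illustrative.
notes = [   # (bar index, subbeat index, pitch, duration in subbeats, velocity)
    {"bar": 0, "subbeat": 0, "pitch": 60, "dur": 4, "vel": 80},   # note a
    {"bar": 0, "subbeat": 0, "pitch": 64, "dur": 4, "vel": 80},   # note b
    {"bar": 0, "subbeat": 8, "pitch": 67, "dur": 2, "vel": 72},   # note c
    {"bar": 1, "subbeat": 0, "pitch": 65, "dur": 4, "vel": 76},
]

def to_bar_subbeat_tokens(notes):
    tokens, current_bar = [], None
    for n in sorted(notes, key=lambda n: (n["bar"], n["subbeat"])):
        if n["bar"] != current_bar:                 # a new bar starts: mark the bar line
            tokens.append("Bar")
            current_bar = n["bar"]
        tokens += [f"Subbeat_{n['subbeat']}",       # position within the bar
                   f"Pitch_{n['pitch']}",
                   f"Duration_{n['dur']}",
                   f"Velocity_{n['vel']}"]
    return tokens

print(to_bar_subbeat_tokens(notes))
```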
Token Engineering
• The input representation of music matters
40
Representing Multi-instrument Music
• Add instrument-related tokens
41
(image from OpenAI’s blog on their MuseNet model)
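One simple way to do this can be sketched as follows (the token names are assumptions in the spirit of MuseNet-style instrument tokens, not OpenAI’s actual vocabulary): prepend an instrument token to each note’s token group.

```python
# Sketch: extend the per-note token group with an instrument token, so the
# same pitch/duration/velocity vocabulary can describe several instruments.
def note_tokens(instrument, subbeat, pitch, dur, vel):
    return [f"Instrument_{instrument}",
            f"Subbeat_{subbeat}", f"Pitch_{pitch}",
            f"Duration_{dur}", f"Velocity_{vel}"]

sequence = ["Bar"]
sequence += note_tokens("Piano", 0, 60, 4, 80)
sequence += note_tokens("Bass",  0, 36, 8, 90)
sequence += note_tokens("Drums", 4, 42, 1, 100)
print(sequence)
```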
Research at “Yating” at Taiwan AI Labs
▪ Generate scores
• Piano music
• Guitar music
• Piano/violin/bass/drum
▪ Generate sounds
• MIDI-free & lyrics-free sing to piano
• MIDI-free & lyrics-free improvisational singing
• MIDI-free & lyrics-cond. improvisational singing
• Guitar
• Violin, flute
• EDM Loops
42
2017 (MidiNet): https://soundcloud.com/affige/midinet-2017?si=40bc1caf82f143f4ae237c9702e3e2e7
2021 (Pop Music Transformer): https://soundcloud.com/affige/pop-music-transformer-2021?si=fe3ee3b2804d496b881f319cd4c10535
Research at “Yating” at Taiwan AI Labs
• Automatic generation of pop/jazz piano performances
‒ Unconditioned generation [MM 2020, AAAI 2021, ICML 2021]
‒ Lead-sheet conditioned generation [AAAI 2021]
‒ Emotion-conditioned generation [ISMIR 2021a] (arxiv: 2108.01374)
‒ Infilling [ISMIR 2021b] (arxiv: 2108.05064)
‒ Style transfer [arXiv 2021a] (arxiv: 2105.04090)
• Automatic generation of singing voices
‒ Unconditioned generation
‒ Lyrics conditioned generation
• Automatic generation of audio from other instruments
‒ Guitar, violin, flute, EDM loops
• Music understanding technology [arXiv 2021b] (arxiv: 2107.05223)
• User Interaction
• Image Generation
43
First Attempt: Convolutional Generative Adversarial Network (GAN)
▪ [ISMIR 2017] “MidiNet: A convolutional generative adversarial network for symbolic-domain music generation” (305 cites)
▪ [AAAI 2018] “MuseGAN: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment” (288 cites / 957 GitHub stars)
▪ [ISMIR 2019 tutorial] “Generating Music with GANs—An Overview and Case Studies” (3 hours)
(+) The model learns on its own what is real / fake, without human-imposed rules (sketched below)
(−) But the output is only 8 bars long (around 16 seconds) and lacks expression
44
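For readers who want the mechanics, here is a minimal GAN training step in PyTorch. It is a generic sketch of the real/fake game described above, with assumed piano-roll shapes and network sizes; it is not the MidiNet or MuseGAN architecture.

```python
import torch
import torch.nn as nn

# A piano-roll segment is treated here as a flat binary matrix: time steps x pitches.
T, P, Z = 64, 84, 100                     # assumed: 64 time steps, 84 pitches, 100-d noise

G = nn.Sequential(nn.Linear(Z, 256), nn.ReLU(), nn.Linear(256, T * P), nn.Sigmoid())
D = nn.Sequential(nn.Linear(T * P, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real_rolls):               # real_rolls: (batch, T*P) from the dataset
    b = real_rolls.size(0)
    fake = G(torch.randn(b, Z))

    # Discriminator learns to tell real segments from generated ones.
    d_loss = bce(D(real_rolls), torch.ones(b, 1)) + bce(D(fake.detach()), torch.zeros(b, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator learns to fool the discriminator: no hand-written rules about "good" music.
    g_loss = bce(D(fake), torch.ones(b, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

d_loss, g_loss = train_step((torch.rand(8, T * P) > 0.5).float())   # toy batch of random "piano rolls"
```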
45
Music understanding
technology: pitch, beat,
chord analysis
Music generation
technology: compose
original music scores
▪ [ISMIR-LBD 2019] “Learning to generate Jazz and Pop piano music from audio via music information retrieval techniques”
Know Music Better before Learning to Generate It
Second Attempt: Transformer + MIR Tech
47
▪ [ACM MM 2020] “Pop Music Transformer: Beat-based modeling and generation of expressive Pop piano compositions” (53 cites)
(+) Learns from audio recordings rather than MIDI files → generates expressive music
(+) Considers the beats / downbeats in music → rhythmically better output
(+) Outperforms the Music Transformer model [ICLR 2019] on pop music
(audio demo: given the same prompt, continuations generated by Google’s model vs. our model)
Third Attempt: Full-song Generation by a Multi-output Transformer
49
▪ [AAAI 2021] “Compound Word Transformer: Learning to compose full-song music over dynamic directed hypergraphs” (12 cites)
▪ [ICML 2021] “Relative positional encoding for Transformers with linear complexity” (top 9% paper)
• Models the fact that, unlike a word in NLP text, a musical note is associated with multiple attributes (e.g., pitch, duration, velocity); see the sketch below
(+) Able to generate expressive piano music 2-4 minutes long
(+) Highly efficient: the model can be trained in a day on a single GPU (2080 Ti, 11 GB memory)
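A simplified sketch of the multi-output idea (with illustrative vocabulary sizes and a GRU standing in for the Transformer backbone; this is not the paper’s actual architecture): the tokens describing one note are merged into a single timestep, and each attribute gets its own output head and loss.

```python
import torch
import torch.nn as nn

# Group a note's attributes into one "compound" timestep and predict each
# attribute with a separate output head. Vocabulary sizes are assumptions.
VOCAB = {"pitch": 88, "duration": 32, "velocity": 32}

class CompoundWordModel(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.embeds = nn.ModuleDict({k: nn.Embedding(v, dim) for k, v in VOCAB.items()})
        self.backbone = nn.GRU(dim, dim, batch_first=True)    # stand-in for the Transformer backbone
        self.heads = nn.ModuleDict({k: nn.Linear(dim, v) for k, v in VOCAB.items()})

    def forward(self, x):
        # x[k]: (batch, seq) ids for attribute k; one compound word per timestep.
        h = sum(self.embeds[k](x[k]) for k in VOCAB)           # merge the attribute embeddings
        h, _ = self.backbone(h)
        return {k: self.heads[k](h[:, -1]) for k in VOCAB}     # one distribution per attribute

model = CompoundWordModel()
x = {k: torch.randint(0, v, (1, 8)) for k, v in VOCAB.items()}   # 8 past notes (random toy data)
y = {k: torch.randint(0, v, (1,)) for k, v in VOCAB.items()}     # the next note's attributes
loss = sum(nn.functional.cross_entropy(out, y[k]) for k, out in model(x).items())
loss.backward()
```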
Demo:
KaraSinger
76
▪ [arXiv] “KaraSinger: Score-free singing voice synthesis with VQ-VAE using Mel-spectrograms”
https://jerrygood0703.github.io/KaraSinger/
https://soundcloud.com/affige/karasinger?si=88affc373ddd4cce91bdc04f68be314d
2021 (KaraSinger)
Research at “Yating” at Taiwan AI Labs
▪ Generate scores
• Piano music
• Guitar music
• Piano/violin/bass/drum
▪ Generate sounds
• MIDI-free & lyrics-free sing to piano
• MIDI-free & lyrics-free improvisational singing
• MIDI-free & lyrics-cond. improvisational singing
• Guitar
• Violin, flute
• EDM Loops
84
2021 (Pop Music Transformer + vocal + human editing):
https://soundcloud.com/affige/pop-music-transformer-vocal-2021?si=6bbda3390e7e413d87ef1d97071f3874
Sing to Piano
85
(pipeline: piano → music understanding technology (pitch, beat, chord analysis) → singing voice generation → vocal)
• Given piano, generate singing voice
• The piano and singing voice are aligned, so that they
can be played together
Exploring New Experiences with Music
86
(https://everylittled.com/article/143608)
Taiwan Creative Content Agency (TAICCA) “Taiwan Creative Content Fest” (TCCF) 2020
#01 Future Content | “Yating Music” human-machine interaction experience
“Sometimes an unexpected inspiration pops into your head, but the melody you hum off the cuff is usually just a fragment that flickers like a shooting star and vanishes in an instant. In the Future Content exhibition area, the ‘Yating Music human-machine interaction experience’ helps music creators generate music automatically, extending a flash of inspiration into a complete piece. Yes, Yating no longer just produces transcripts; it can now compose music, too.”
“Yating Music, developed by Taiwan AI Labs, uses artificial intelligence: visitors simply play four bars of notes on a piano, and the AI generates images and vocals in real time, letting them co-write a short passage of music with Yating. This music experience, built from multiple technologies, also pushes beyond conventional thinking about music creation and opens up more possibilities for innovation.”
https://arts.yating.tw/
87