3. 3
TTS & MIDI-to-Audio Synthesis
0 100 200 300 400 500
Frame index
256
513
Frequency
bins
Acoustic
model
Waveform
model
Wav
MIDI
Piano roll
Acoustic
features
MIDI
API
Acoustic
model
Waveform
model
Context
vectors
Acoustic
features
Front
end
Wav
.txt
<latexit sha1_base64="AisTKh5Vue6nwtVsE4yuaaGZKEs=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTTctLsRJwDOYO5prW1iPQdI0NkK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRffuP3j4aO9x/8nTZ89f7B+8PDOq1hTHVHGlLwowyJnEsWWW40WlEUTB8bxYfm785z9QG6bkqV1XmAuYSzZjFGygsqwwbuWnLj3+6qf7g2SYbCy+DdIdGJCdjaYHvT9ZqWgtUFrKwZhJmlQ2d6Atoxx9P6sNVkCXMMdJgBIEmtxtmvbx28CU8UzpcKSNN+z/CgfCmLUoQqQAuzBdX0Pe5ZvUdnaUOyar2qKk20KzmsdWxc0G4pJppJavAwCqWeg1pgvQQG3YUz+TeEmVECBLF7bjJ2kebsXLphfF3SD1rcG249iC38G2qbmGasHoyvfbRUanmgH3LqMLpEsBeuk7ASi14iHisOP4dqO0uLKbik5j2crUVdyk6koOffP+afe1b4Oz98P04zD5/mFwcrT7CXvkNXlD3pGUfCIn5AsZkTGhpCJX5Jr8iqroKvoZXW9Do95O84q0LPr9D8sl9ik=</latexit>
x1:M
<latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit>
a1:N
<latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit>
a1:N
<latexit sha1_base64="JElF7obLjklrepu+jo2EhDrW7Zk=">AAAC7nicbVLLahsxFJWnj6TuK2mX3Qw1hW5iZkpKQ1eBbroqLthJwDOYO5prW1iPQdI0MUK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRQ8ePnq8t/+k//TZ8xcvDw5fnRlVa4oTqrjSFwUY5EzixDLL8aLSCKLgeF6svjT+8x+oDVNybNcV5gIWks0ZBRuoLCuMU37m0s9jPzsYJMNkY/FdkO7AgOxsNDvs/clKRWuB0lIOxkzTpLK5A20Z5ej7WW2wArqCBU4DlCDQ5G7TtI/fBaaM50qHI228Yf9XOBDGrEURIgXYpen6GvI+37S285PcMVnVFiXdFprXPLYqbjYQl0wjtXwdAFDNQq8xXYIGasOe+pnES6qEAFm6sB0/TfNwK142vSjuBqlvDbYdxxb8HrZNLTRUS0avfL9dZDTWDLh3GV0iXQnQK98JQKkVDxFHHce3W6XFK7up6DSWrUxdxW2qruTIN++fdl/7Ljj7MEw/DpPvx4PTk91P2CdvyFvynqTkEzklX8mITAglFbkmN+RXVEXX0c/oZhsa9Xaa16Rl0e9/xe/2Jw==</latexit>
o1:T
<latexit sha1_base64="JElF7obLjklrepu+jo2EhDrW7Zk=">AAAC7nicbVLLahsxFJWnj6TuK2mX3Qw1hW5iZkpKQ1eBbroqLthJwDOYO5prW1iPQdI0MUK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRQ8ePnq8t/+k//TZ8xcvDw5fnRlVa4oTqrjSFwUY5EzixDLL8aLSCKLgeF6svjT+8x+oDVNybNcV5gIWks0ZBRuoLCuMU37m0s9jPzsYJMNkY/FdkO7AgOxsNDvs/clKRWuB0lIOxkzTpLK5A20Z5ej7WW2wArqCBU4DlCDQ5G7TtI/fBaaM50qHI228Yf9XOBDGrEURIgXYpen6GvI+37S285PcMVnVFiXdFprXPLYqbjYQl0wjtXwdAFDNQq8xXYIGasOe+pnES6qEAFm6sB0/TfNwK142vSjuBqlvDbYdxxb8HrZNLTRUS0avfL9dZDTWDLh3GV0iXQnQK98JQKkVDxFHHce3W6XFK7up6DSWrUxdxW2qruTIN++fdl/7Ljj7MEw/DpPvx4PTk91P2CdvyFvynqTkEzklX8mITAglFbkmN+RXVEXX0c/oZhsa9Xaa16Rl0e9/xe/2Jw==</latexit>
o1:T
<latexit sha1_base64="jJ70daV3Lv6VV/gW3FR+Rb1BrxU=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTVfBhTgJeAZzR3NtC+sxSJrGRug3sivpsh/TH+jfVGM70JnkgtDh3HvuSyoqzoxNkr+96MHDR4+f7D3tP3v+4uWr/YPX50bVmuKYKq70ZQEGOZM4tsxyvKw0gig4XhTLr43/4gdqw5Q8s+sKcwFzyWaMgg1UlhXGrfzUpcenfro/SIbJxuK7IN2BAdnZaHrQ+5OVitYCpaUcjJmkSWVzB9oyytH3s9pgBXQJc5wEKEGgyd2maR+/D0wZz5QOR9p4w/6vcCCMWYsiRAqwC9P1NeR9vkltZ0e5Y7KqLUq6LTSreWxV3GwgLplGavk6AKCahV5jugAN1IY99TOJV1QJAbJ0YTt+kubhVrxselHcDVLfGmw7ji34PWybmmuoFoyufL9dZHSmGXDvMrpAuhSgl74TgFIrHiIOO47TW6XFld1UdBrLVqau4jZVV3Lom/dPu699F5x/HKafh8n3T4OTo91P2CNvyTvygaTkCzkh38iIjAklFbkmN+RXVEXX0c/oZhsa9XaaN6Rl0e9/zaX2Kg==</latexit>
x1:N
<latexit sha1_base64="AisTKh5Vue6nwtVsE4yuaaGZKEs=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTTctLsRJwDOYO5prW1iPQdI0NkK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRffuP3j4aO9x/8nTZ89f7B+8PDOq1hTHVHGlLwowyJnEsWWW40WlEUTB8bxYfm785z9QG6bkqV1XmAuYSzZjFGygsqwwbuWnLj3+6qf7g2SYbCy+DdIdGJCdjaYHvT9ZqWgtUFrKwZhJmlQ2d6Atoxx9P6sNVkCXMMdJgBIEmtxtmvbx28CU8UzpcKSNN+z/CgfCmLUoQqQAuzBdX0Pe5ZvUdnaUOyar2qKk20KzmsdWxc0G4pJppJavAwCqWeg1pgvQQG3YUz+TeEmVECBLF7bjJ2kebsXLphfF3SD1rcG249iC38G2qbmGasHoyvfbRUanmgH3LqMLpEsBeuk7ASi14iHisOP4dqO0uLKbik5j2crUVdyk6koOffP+afe1b4Oz98P04zD5/mFwcrT7CXvkNXlD3pGUfCIn5AsZkTGhpCJX5Jr8iqroKvoZXW9Do95O84q0LPr9D8sl9ik=</latexit>
x1:M
or
Pitch is crucial for music
5. 5
Acoustic
model
Waveform
model
Wav
MIDI
Piano roll
Acoustic
features
MIDI
API
<latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit>
a1:N
<latexit sha1_base64="JElF7obLjklrepu+jo2EhDrW7Zk=">AAAC7nicbVLLahsxFJWnj6TuK2mX3Qw1hW5iZkpKQ1eBbroqLthJwDOYO5prW1iPQdI0MUK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRQ8ePnq8t/+k//TZ8xcvDw5fnRlVa4oTqrjSFwUY5EzixDLL8aLSCKLgeF6svjT+8x+oDVNybNcV5gIWks0ZBRuoLCuMU37m0s9jPzsYJMNkY/FdkO7AgOxsNDvs/clKRWuB0lIOxkzTpLK5A20Z5ej7WW2wArqCBU4DlCDQ5G7TtI/fBaaM50qHI228Yf9XOBDGrEURIgXYpen6GvI+37S285PcMVnVFiXdFprXPLYqbjYQl0wjtXwdAFDNQq8xXYIGasOe+pnES6qEAFm6sB0/TfNwK142vSjuBqlvDbYdxxb8HrZNLTRUS0avfL9dZDTWDLh3GV0iXQnQK98JQKkVDxFHHce3W6XFK7up6DSWrUxdxW2qruTIN++fdl/7Ljj7MEw/DpPvx4PTk91P2CdvyFvynqTkEzklX8mITAglFbkmN+RXVEXX0c/oZhsa9Xaa16Rl0e9/xe/2Jw==</latexit>
o1:T
<latexit sha1_base64="jJ70daV3Lv6VV/gW3FR+Rb1BrxU=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTVfBhTgJeAZzR3NtC+sxSJrGRug3sivpsh/TH+jfVGM70JnkgtDh3HvuSyoqzoxNkr+96MHDR4+f7D3tP3v+4uWr/YPX50bVmuKYKq70ZQEGOZM4tsxyvKw0gig4XhTLr43/4gdqw5Q8s+sKcwFzyWaMgg1UlhXGrfzUpcenfro/SIbJxuK7IN2BAdnZaHrQ+5OVitYCpaUcjJmkSWVzB9oyytH3s9pgBXQJc5wEKEGgyd2maR+/D0wZz5QOR9p4w/6vcCCMWYsiRAqwC9P1NeR9vkltZ0e5Y7KqLUq6LTSreWxV3GwgLplGavk6AKCahV5jugAN1IY99TOJV1QJAbJ0YTt+kubhVrxselHcDVLfGmw7ji34PWybmmuoFoyufL9dZHSmGXDvMrpAuhSgl74TgFIrHiIOO47TW6XFld1UdBrLVqau4jZVV3Lom/dPu699F5x/HKafh8n3T4OTo91P2CNvyTvygaTkCzkh38iIjAklFbkmN+RXVEXX0c/oZhsa9XaaN6Rl0e9/zaX2Kg==</latexit>
x1:N
<latexit sha1_base64="AisTKh5Vue6nwtVsE4yuaaGZKEs=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTTctLsRJwDOYO5prW1iPQdI0NkK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRffuP3j4aO9x/8nTZ89f7B+8PDOq1hTHVHGlLwowyJnEsWWW40WlEUTB8bxYfm785z9QG6bkqV1XmAuYSzZjFGygsqwwbuWnLj3+6qf7g2SYbCy+DdIdGJCdjaYHvT9ZqWgtUFrKwZhJmlQ2d6Atoxx9P6sNVkCXMMdJgBIEmtxtmvbx28CU8UzpcKSNN+z/CgfCmLUoQqQAuzBdX0Pe5ZvUdnaUOyar2qKk20KzmsdWxc0G4pJppJavAwCqWeg1pgvQQG3YUz+TeEmVECBLF7bjJ2kebsXLphfF3SD1rcG249iC38G2qbmGasHoyvfbRUanmgH3LqMLpEsBeuk7ASi14iHisOP4dqO0uLKbik5j2crUVdyk6koOffP+afe1b4Oz98P04zD5/mFwcrT7CXvkNXlD3pGUfCIn5AsZkTGhpCJX5Jr8iqroKvoZXW9Do95O84q0LPr9D8sl9ik=</latexit>
x1:M
or
TTS & MIDI-to-Audio Synthesis
0 100 200 300 400 500
Frame index
256
513
Frequency
bins
Acoustic
model
Waveform
model
Context
vectors
Acoustic
features
Front
end
Wav
.txt
<latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit>
a1:N
<latexit sha1_base64="JElF7obLjklrepu+jo2EhDrW7Zk=">AAAC7nicbVLLahsxFJWnj6TuK2mX3Qw1hW5iZkpKQ1eBbroqLthJwDOYO5prW1iPQdI0MUK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRQ8ePnq8t/+k//TZ8xcvDw5fnRlVa4oTqrjSFwUY5EzixDLL8aLSCKLgeF6svjT+8x+oDVNybNcV5gIWks0ZBRuoLCuMU37m0s9jPzsYJMNkY/FdkO7AgOxsNDvs/clKRWuB0lIOxkzTpLK5A20Z5ej7WW2wArqCBU4DlCDQ5G7TtI/fBaaM50qHI228Yf9XOBDGrEURIgXYpen6GvI+37S285PcMVnVFiXdFprXPLYqbjYQl0wjtXwdAFDNQq8xXYIGasOe+pnES6qEAFm6sB0/TfNwK142vSjuBqlvDbYdxxb8HrZNLTRUS0avfL9dZDTWDLh3GV0iXQnQK98JQKkVDxFHHce3W6XFK7up6DSWrUxdxW2qruTIN++fdl/7Ljj7MEw/DpPvx4PTk91P2CdvyFvynqTkEzklX8mITAglFbkmN+RXVEXX0c/oZhsa9Xaa16Rl0e9/xe/2Jw==</latexit>
o1:T
1
0
0
Phone ID (one-hot)
Syllable ID (one-hot)
#. Phone
#. Syllables
<latexit sha1_base64="YYtcP5HcZDv4wo3N4OOgsso9tf8=">AAAC6nicbVLLahsxFJWnr8R9Je0ym6Gm0E3MTElploFuugouxInBMxiN5toW1mOQ7jQ2Qj+RXUmW/Zz+QP+m8iPQmeSC0OHce+5LKirBLSbJ30705Omz5y/29rsvX71+8/bg8N2l1bVhMGRaaDMqqAXBFQyRo4BRZYDKQsBVsfi29l/9BGO5Vhe4qiCXdKb4lDOKgRplhXVLP1GTg17STzYWPwTpDvTIzgaTw86frNSslqCQCWrtOE0qzB01yJkA381qCxVlCzqDcYCKSrC52zTs44+BKeOpNuEojDfs/wpHpbUrWYRISXFu2741+ZhvXOP0NHdcVTWCYttC01rEqOP19HHJDTAUqwAoMzz0GrM5NZRh2FE3U3DNtJRUlS5sxo/TPNxalOtetHC91DcG246DhXiEbVIzQ6s5Z0vfbRYZXBhOhXcZmwNbSGoWvhUAymgRIo5bjvN7JcISNxWdgbKRqa24T9WWHHsf3j9tv/ZDcPm5n37pJz9Oemfp7ifskSPygXwiKflKzsh3MiBDwoggN+SW3EUiuol+Rbfb0Kiz07wnDYt+/wP+EPS4</latexit>
xn
<latexit sha1_base64="YYtcP5HcZDv4wo3N4OOgsso9tf8=">AAAC6nicbVLLahsxFJWnr8R9Je0ym6Gm0E3MTElploFuugouxInBMxiN5toW1mOQ7jQ2Qj+RXUmW/Zz+QP+m8iPQmeSC0OHce+5LKirBLSbJ30705Omz5y/29rsvX71+8/bg8N2l1bVhMGRaaDMqqAXBFQyRo4BRZYDKQsBVsfi29l/9BGO5Vhe4qiCXdKb4lDOKgRplhXVLP1GTg17STzYWPwTpDvTIzgaTw86frNSslqCQCWrtOE0qzB01yJkA381qCxVlCzqDcYCKSrC52zTs44+BKeOpNuEojDfs/wpHpbUrWYRISXFu2741+ZhvXOP0NHdcVTWCYttC01rEqOP19HHJDTAUqwAoMzz0GrM5NZRh2FE3U3DNtJRUlS5sxo/TPNxalOtetHC91DcG246DhXiEbVIzQ6s5Z0vfbRYZXBhOhXcZmwNbSGoWvhUAymgRIo5bjvN7JcISNxWdgbKRqa24T9WWHHsf3j9tv/ZDcPm5n37pJz9Oemfp7ifskSPygXwiKflKzsh3MiBDwoggN+SW3EUiuol+Rbfb0Kiz07wnDYt+/wP+EPS4</latexit>
xn
A4, velocity 0.4
0.9
0
D5, velocity 0.4
…
0.4
…
0
<latexit sha1_base64="YYtcP5HcZDv4wo3N4OOgsso9tf8=">AAAC6nicbVLLahsxFJWnr8R9Je0ym6Gm0E3MTElploFuugouxInBMxiN5toW1mOQ7jQ2Qj+RXUmW/Zz+QP+m8iPQmeSC0OHce+5LKirBLSbJ30705Omz5y/29rsvX71+8/bg8N2l1bVhMGRaaDMqqAXBFQyRo4BRZYDKQsBVsfi29l/9BGO5Vhe4qiCXdKb4lDOKgRplhXVLP1GTg17STzYWPwTpDvTIzgaTw86frNSslqCQCWrtOE0qzB01yJkA381qCxVlCzqDcYCKSrC52zTs44+BKeOpNuEojDfs/wpHpbUrWYRISXFu2741+ZhvXOP0NHdcVTWCYttC01rEqOP19HHJDTAUqwAoMzz0GrM5NZRh2FE3U3DNtJRUlS5sxo/TPNxalOtetHC91DcG246DhXiEbVIzQ6s5Z0vfbRYZXBhOhXcZmwNbSGoWvhUAymgRIo5bjvN7JcISNxWdgbKRqa24T9WWHHsf3j9tv/ZDcPm5n37pJz9Oemfp7ifskSPygXwiKflKzsh3MiBDwoggN+SW3EUiuol+Rbfb0Kiz07wnDYt+/wP+EPS4</latexit>
xn
<latexit sha1_base64="YYtcP5HcZDv4wo3N4OOgsso9tf8=">AAAC6nicbVLLahsxFJWnr8R9Je0ym6Gm0E3MTElploFuugouxInBMxiN5toW1mOQ7jQ2Qj+RXUmW/Zz+QP+m8iPQmeSC0OHce+5LKirBLSbJ30705Omz5y/29rsvX71+8/bg8N2l1bVhMGRaaDMqqAXBFQyRo4BRZYDKQsBVsfi29l/9BGO5Vhe4qiCXdKb4lDOKgRplhXVLP1GTg17STzYWPwTpDvTIzgaTw86frNSslqCQCWrtOE0qzB01yJkA381qCxVlCzqDcYCKSrC52zTs44+BKeOpNuEojDfs/wpHpbUrWYRISXFu2741+ZhvXOP0NHdcVTWCYttC01rEqOP19HHJDTAUqwAoMzz0GrM5NZRh2FE3U3DNtJRUlS5sxo/TPNxalOtetHC91DcG246DhXiEbVIzQ6s5Z0vfbRYZXBhOhXcZmwNbSGoWvhUAymgRIo5bjvN7JcISNxWdgbKRqa24T9WWHHsf3j9tv/ZDcPm5n37pJz9Oemfp7ifskSPygXwiKflKzsh3MiBDwoggN+SW3EUiuol+Rbfb0Kiz07wnDYt+/wP+EPS4</latexit>
xn
<latexit sha1_base64="AisTKh5Vue6nwtVsE4yuaaGZKEs=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTTctLsRJwDOYO5prW1iPQdI0NkK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRffuP3j4aO9x/8nTZ89f7B+8PDOq1hTHVHGlLwowyJnEsWWW40WlEUTB8bxYfm785z9QG6bkqV1XmAuYSzZjFGygsqwwbuWnLj3+6qf7g2SYbCy+DdIdGJCdjaYHvT9ZqWgtUFrKwZhJmlQ2d6Atoxx9P6sNVkCXMMdJgBIEmtxtmvbx28CU8UzpcKSNN+z/CgfCmLUoQqQAuzBdX0Pe5ZvUdnaUOyar2qKk20KzmsdWxc0G4pJppJavAwCqWeg1pgvQQG3YUz+TeEmVECBLF7bjJ2kebsXLphfF3SD1rcG249iC38G2qbmGasHoyvfbRUanmgH3LqMLpEsBeuk7ASi14iHisOP4dqO0uLKbik5j2crUVdyk6koOffP+afe1b4Oz98P04zD5/mFwcrT7CXvkNXlD3pGUfCIn5AsZkTGhpCJX5Jr8iqroKvoZXW9Do95O84q0LPr9D8sl9ik=</latexit>
x1:M
6. 6
<latexit sha1_base64="AisTKh5Vue6nwtVsE4yuaaGZKEs=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTTctLsRJwDOYO5prW1iPQdI0NkK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRffuP3j4aO9x/8nTZ89f7B+8PDOq1hTHVHGlLwowyJnEsWWW40WlEUTB8bxYfm785z9QG6bkqV1XmAuYSzZjFGygsqwwbuWnLj3+6qf7g2SYbCy+DdIdGJCdjaYHvT9ZqWgtUFrKwZhJmlQ2d6Atoxx9P6sNVkCXMMdJgBIEmtxtmvbx28CU8UzpcKSNN+z/CgfCmLUoQqQAuzBdX0Pe5ZvUdnaUOyar2qKk20KzmsdWxc0G4pJppJavAwCqWeg1pgvQQG3YUz+TeEmVECBLF7bjJ2kebsXLphfF3SD1rcG249iC38G2qbmGasHoyvfbRUanmgH3LqMLpEsBeuk7ASi14iHisOP4dqO0uLKbik5j2crUVdyk6koOffP+afe1b4Oz98P04zD5/mFwcrT7CXvkNXlD3pGUfCIn5AsZkTGhpCJX5Jr8iqroKvoZXW9Do95O84q0LPr9D8sl9ik=</latexit>
x1:M
TTS & MIDI-to-Audio Synthesis
0 100 200 300 400 500
Frame index
256
513
Frequency
bins
Acoustic
model
Waveform
model
Wav
MIDI
Piano roll
Acoustic
features
MIDI
API
Acoustic
model
Waveform
model
Context
vectors
Acoustic
features
Front
end
Wav
.txt
<latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit>
a1:N
<latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit>
a1:N
<latexit sha1_base64="JElF7obLjklrepu+jo2EhDrW7Zk=">AAAC7nicbVLLahsxFJWnj6TuK2mX3Qw1hW5iZkpKQ1eBbroqLthJwDOYO5prW1iPQdI0MUK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRQ8ePnq8t/+k//TZ8xcvDw5fnRlVa4oTqrjSFwUY5EzixDLL8aLSCKLgeF6svjT+8x+oDVNybNcV5gIWks0ZBRuoLCuMU37m0s9jPzsYJMNkY/FdkO7AgOxsNDvs/clKRWuB0lIOxkzTpLK5A20Z5ej7WW2wArqCBU4DlCDQ5G7TtI/fBaaM50qHI228Yf9XOBDGrEURIgXYpen6GvI+37S285PcMVnVFiXdFprXPLYqbjYQl0wjtXwdAFDNQq8xXYIGasOe+pnES6qEAFm6sB0/TfNwK142vSjuBqlvDbYdxxb8HrZNLTRUS0avfL9dZDTWDLh3GV0iXQnQK98JQKkVDxFHHce3W6XFK7up6DSWrUxdxW2qruTIN++fdl/7Ljj7MEw/DpPvx4PTk91P2CdvyFvynqTkEzklX8mITAglFbkmN+RXVEXX0c/oZhsa9Xaa16Rl0e9/xe/2Jw==</latexit>
o1:T
<latexit sha1_base64="JElF7obLjklrepu+jo2EhDrW7Zk=">AAAC7nicbVLLahsxFJWnj6TuK2mX3Qw1hW5iZkpKQ1eBbroqLthJwDOYO5prW1iPQdI0MUK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRQ8ePnq8t/+k//TZ8xcvDw5fnRlVa4oTqrjSFwUY5EzixDLL8aLSCKLgeF6svjT+8x+oDVNybNcV5gIWks0ZBRuoLCuMU37m0s9jPzsYJMNkY/FdkO7AgOxsNDvs/clKRWuB0lIOxkzTpLK5A20Z5ej7WW2wArqCBU4DlCDQ5G7TtI/fBaaM50qHI228Yf9XOBDGrEURIgXYpen6GvI+37S285PcMVnVFiXdFprXPLYqbjYQl0wjtXwdAFDNQq8xXYIGasOe+pnES6qEAFm6sB0/TfNwK142vSjuBqlvDbYdxxb8HrZNLTRUS0avfL9dZDTWDLh3GV0iXQnQK98JQKkVDxFHHce3W6XFK7up6DSWrUxdxW2qruTIN++fdl/7Ljj7MEw/DpPvx4PTk91P2CdvyFvynqTkEzklX8mITAglFbkmN+RXVEXX0c/oZhsa9Xaa16Rl0e9/xe/2Jw==</latexit>
o1:T
Not completely aligned
AI
performer
<latexit sha1_base64="Ce+Vsx7LGNLmrE4StAaBzOgsSt0=">AAADInicbVLLitswFFXc1zR9ZdplN6Kh0M0Eu7S0dFEGuummQwrJzEBswrWsxCJ6GEnuJBj9zXzN0E2Zbgr9mMpOBmpnLggdzrlHV/dKacGZsWH4pxfcuXvv/oODh/1Hj588fTY4fH5qVKkJnRLFlT5PwVDOJJ1aZjk9LzQFkXJ6lq6+1PrZD6oNU3JiNwVNBCwlWzAC1lPzwec4NdXazavo0zeHY82WuQWt1QWuBWiEk31BNcLEzQfDcBQ2gfdBtANDtIvx/LD3M84UKQWVlnAwZhaFhU0q0JYRTl0/Lg0tgKxgSWceShDUJFXTqMOvPZPhhdJ+SYsb9n9HBcKYjUh9pgCbm65Wk7dps9IuPiYVk0VpqSTbQouSY6twPTWcMU2J5RsPgGjm74pJDhqI9bPtx5JeECUEyKzy03GzKPG74ll9F8WrYeRajW3bsSm/hW1TSw1Fzsja9dtFxhPNgLsqJjklKwF65ToJVGrFfcZRRzi5cVq6tk3FStOsdVLXcXNU13Lk6vePuq+9D07fjqL3o/D7u+FxtPsJB+gleoXeoAh9QMfoKxqjKSLoEl2ha/Q7uAyugl/B9TY16O08L1Argr//ADF1C2g=</latexit>
x1:M ! a1:N ! o1:T
<latexit sha1_base64="uTpzxJX9ikqk62YuUY3zkv6CAV4=">AAADInicdVJda9swFFW8ry77SrfHvYiFwV4a7LGx0YdR2MueSgZJW4hNuJaVWESWjCS3CUL/pr+m7GV0L4P9mMlOCrPbXRA6nHOPru6V0pIzbcLwdy+4d//Bw0d7j/tPnj57/mKw//JEy0oROiWSS3WWgqacCTo1zHB6VioKRcrpabr6Wuun51RpJsXEbEqaFLAUbMEIGE/NB1/iVNu1m9vo8NjhWLFlbkApeYFrAf4nyEaYuPlgGI7CJvBtEO3AEO1iPN/v/YgzSaqCCkM4aD2LwtIkFpRhhFPXjytNSyArWNKZhwIKqhPbNOrwW89keCGVX8Lghv3XYaHQelOkPrMAk+uuVpN3abPKLD4nlomyMlSQbaFFxbGRuJ4azpiixPCNB0AU83fFJAcFxPjZ9mNBL4gsChCZ9dNxsyjxu+RZfRfJ7TByrca27ZiU38G2qaWCMmdk7frtIuOJYsCdjUlOyaoAtXKdBCqU5D7joCMc3zgNXZumolU0a53Uddwc1bUcuPr9o+5r3wYn70fRx1H4/cPwKNr9hD30Gr1B71CEPqEj9A2N0RQRdImu0DX6FVwGV8HP4HqbGvR2nleoFcGfvzQnC2k=</latexit>
x1:N ! a1:N ! o1:T
Aligned
Silent
instrument
<latexit sha1_base64="jJ70daV3Lv6VV/gW3FR+Rb1BrxU=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTVfBhTgJeAZzR3NtC+sxSJrGRug3sivpsh/TH+jfVGM70JnkgtDh3HvuSyoqzoxNkr+96MHDR4+f7D3tP3v+4uWr/YPX50bVmuKYKq70ZQEGOZM4tsxyvKw0gig4XhTLr43/4gdqw5Q8s+sKcwFzyWaMgg1UlhXGrfzUpcenfro/SIbJxuK7IN2BAdnZaHrQ+5OVitYCpaUcjJmkSWVzB9oyytH3s9pgBXQJc5wEKEGgyd2maR+/D0wZz5QOR9p4w/6vcCCMWYsiRAqwC9P1NeR9vkltZ0e5Y7KqLUq6LTSreWxV3GwgLplGavk6AKCahV5jugAN1IY99TOJV1QJAbJ0YTt+kubhVrxselHcDVLfGmw7ji34PWybmmuoFoyufL9dZHSmGXDvMrpAuhSgl74TgFIrHiIOO47TW6XFld1UdBrLVqau4jZVV3Lom/dPu699F5x/HKafh8n3T4OTo91P2CNvyTvygaTkCzkh38iIjAklFbkmN+RXVEXX0c/oZhsa9XaaN6Rl0e9/zaX2Kg==</latexit>
x1:N
<latexit sha1_base64="AisTKh5Vue6nwtVsE4yuaaGZKEs=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTTctLsRJwDOYO5prW1iPQdI0NkK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRffuP3j4aO9x/8nTZ89f7B+8PDOq1hTHVHGlLwowyJnEsWWW40WlEUTB8bxYfm785z9QG6bkqV1XmAuYSzZjFGygsqwwbuWnLj3+6qf7g2SYbCy+DdIdGJCdjaYHvT9ZqWgtUFrKwZhJmlQ2d6Atoxx9P6sNVkCXMMdJgBIEmtxtmvbx28CU8UzpcKSNN+z/CgfCmLUoQqQAuzBdX0Pe5ZvUdnaUOyar2qKk20KzmsdWxc0G4pJppJavAwCqWeg1pgvQQG3YUz+TeEmVECBLF7bjJ2kebsXLphfF3SD1rcG249iC38G2qbmGasHoyvfbRUanmgH3LqMLpEsBeuk7ASi14iHisOP4dqO0uLKbik5j2crUVdyk6koOffP+afe1b4Oz98P04zD5/mFwcrT7CXvkNXlD3pGUfCIn5AsZkTGhpCJX5Jr8iqroKvoZXW9Do95O84q0LPr9D8sl9ik=</latexit>
x1:M
or
7. 7
<latexit sha1_base64="AisTKh5Vue6nwtVsE4yuaaGZKEs=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTTctLsRJwDOYO5prW1iPQdI0NkK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRffuP3j4aO9x/8nTZ89f7B+8PDOq1hTHVHGlLwowyJnEsWWW40WlEUTB8bxYfm785z9QG6bkqV1XmAuYSzZjFGygsqwwbuWnLj3+6qf7g2SYbCy+DdIdGJCdjaYHvT9ZqWgtUFrKwZhJmlQ2d6Atoxx9P6sNVkCXMMdJgBIEmtxtmvbx28CU8UzpcKSNN+z/CgfCmLUoQqQAuzBdX0Pe5ZvUdnaUOyar2qKk20KzmsdWxc0G4pJppJavAwCqWeg1pgvQQG3YUz+TeEmVECBLF7bjJ2kebsXLphfF3SD1rcG249iC38G2qbmGasHoyvfbRUanmgH3LqMLpEsBeuk7ASi14iHisOP4dqO0uLKbik5j2crUVdyk6koOffP+afe1b4Oz98P04zD5/mFwcrT7CXvkNXlD3pGUfCIn5AsZkTGhpCJX5Jr8iqroKvoZXW9Do95O84q0LPr9D8sl9ik=</latexit>
x1:M
TTS & MIDI-to-Audio Synthesis
0 100 200 300 400 500
Frame index
256
513
Frequency
bins
Acoustic
model
Waveform
model
Wav
MIDI
Piano roll
Acoustic
features
MIDI
API
Acoustic
model
Waveform
model
Context
vectors
Acoustic
features
Front
end
Wav
.txt
<latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit>
a1:N
<latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit>
a1:N
<latexit sha1_base64="JElF7obLjklrepu+jo2EhDrW7Zk=">AAAC7nicbVLLahsxFJWnj6TuK2mX3Qw1hW5iZkpKQ1eBbroqLthJwDOYO5prW1iPQdI0MUK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRQ8ePnq8t/+k//TZ8xcvDw5fnRlVa4oTqrjSFwUY5EzixDLL8aLSCKLgeF6svjT+8x+oDVNybNcV5gIWks0ZBRuoLCuMU37m0s9jPzsYJMNkY/FdkO7AgOxsNDvs/clKRWuB0lIOxkzTpLK5A20Z5ej7WW2wArqCBU4DlCDQ5G7TtI/fBaaM50qHI228Yf9XOBDGrEURIgXYpen6GvI+37S285PcMVnVFiXdFprXPLYqbjYQl0wjtXwdAFDNQq8xXYIGasOe+pnES6qEAFm6sB0/TfNwK142vSjuBqlvDbYdxxb8HrZNLTRUS0avfL9dZDTWDLh3GV0iXQnQK98JQKkVDxFHHce3W6XFK7up6DSWrUxdxW2qruTIN++fdl/7Ljj7MEw/DpPvx4PTk91P2CdvyFvynqTkEzklX8mITAglFbkmN+RXVEXX0c/oZhsa9Xaa16Rl0e9/xe/2Jw==</latexit>
o1:T
<latexit sha1_base64="JElF7obLjklrepu+jo2EhDrW7Zk=">AAAC7nicbVLLahsxFJWnj6TuK2mX3Qw1hW5iZkpKQ1eBbroqLthJwDOYO5prW1iPQdI0MUK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRQ8ePnq8t/+k//TZ8xcvDw5fnRlVa4oTqrjSFwUY5EzixDLL8aLSCKLgeF6svjT+8x+oDVNybNcV5gIWks0ZBRuoLCuMU37m0s9jPzsYJMNkY/FdkO7AgOxsNDvs/clKRWuB0lIOxkzTpLK5A20Z5ej7WW2wArqCBU4DlCDQ5G7TtI/fBaaM50qHI228Yf9XOBDGrEURIgXYpen6GvI+37S285PcMVnVFiXdFprXPLYqbjYQl0wjtXwdAFDNQq8xXYIGasOe+pnES6qEAFm6sB0/TfNwK142vSjuBqlvDbYdxxb8HrZNLTRUS0avfL9dZDTWDLh3GV0iXQnQK98JQKkVDxFHHce3W6XFK7up6DSWrUxdxW2qruTIN++fdl/7Ljj7MEw/DpPvx4PTk91P2CdvyFvynqTkEzklX8mITAglFbkmN+RXVEXX0c/oZhsa9Xaa16Rl0e9/xe/2Jw==</latexit>
o1:T
<latexit sha1_base64="jJ70daV3Lv6VV/gW3FR+Rb1BrxU=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTVfBhTgJeAZzR3NtC+sxSJrGRug3sivpsh/TH+jfVGM70JnkgtDh3HvuSyoqzoxNkr+96MHDR4+f7D3tP3v+4uWr/YPX50bVmuKYKq70ZQEGOZM4tsxyvKw0gig4XhTLr43/4gdqw5Q8s+sKcwFzyWaMgg1UlhXGrfzUpcenfro/SIbJxuK7IN2BAdnZaHrQ+5OVitYCpaUcjJmkSWVzB9oyytH3s9pgBXQJc5wEKEGgyd2maR+/D0wZz5QOR9p4w/6vcCCMWYsiRAqwC9P1NeR9vkltZ0e5Y7KqLUq6LTSreWxV3GwgLplGavk6AKCahV5jugAN1IY99TOJV1QJAbJ0YTt+kubhVrxselHcDVLfGmw7ji34PWybmmuoFoyufL9dZHSmGXDvMrpAuhSgl74TgFIrHiIOO47TW6XFld1UdBrLVqau4jZVV3Lom/dPu699F5x/HKafh8n3T4OTo91P2CNvyTvygaTkCzkh38iIjAklFbkmN+RXVEXX0c/oZhsa9XaaN6Rl0e9/zaX2Kg==</latexit>
x1:N
Apply TTS techniques
to MIDI-to-Audio
8. 8
Methods
Wang, Y. et al. Tacotron: Towards End-to-End Speech Synthesis. in Proc. Interspeech 4006–4010 (2017).
Yasuda, Y., Wang, X., Takaki, S. & Yamagishi, J. Investigation of enhanced Tacotron text-to-speech synthesis systems with self-attention for pitch accent
language. in Proc. ICASSP 6905–6909 (2019).
q Acoustic model
1. TTS model: Tacotron (Wang 2017, Yasuda 2019)
Acoustic
model
Waveform
model
Wav
MIDI
API <latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit>
a1:N
<latexit sha1_base64="JElF7obLjklrepu+jo2EhDrW7Zk=">AAAC7nicbVLLahsxFJWnj6TuK2mX3Qw1hW5iZkpKQ1eBbroqLthJwDOYO5prW1iPQdI0MUK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRQ8ePnq8t/+k//TZ8xcvDw5fnRlVa4oTqrjSFwUY5EzixDLL8aLSCKLgeF6svjT+8x+oDVNybNcV5gIWks0ZBRuoLCuMU37m0s9jPzsYJMNkY/FdkO7AgOxsNDvs/clKRWuB0lIOxkzTpLK5A20Z5ej7WW2wArqCBU4DlCDQ5G7TtI/fBaaM50qHI228Yf9XOBDGrEURIgXYpen6GvI+37S285PcMVnVFiXdFprXPLYqbjYQl0wjtXwdAFDNQq8xXYIGasOe+pnES6qEAFm6sB0/TfNwK142vSjuBqlvDbYdxxb8HrZNLTRUS0avfL9dZDTWDLh3GV0iXQnQK98JQKkVDxFHHce3W6XFK7up6DSWrUxdxW2qruTIN++fdl/7Ljj7MEw/DpPvx4PTk91P2CdvyFvynqTkEzklX8mITAglFbkmN+RXVEXX0c/oZhsa9Xaa16Rl0e9/xe/2Jw==</latexit>
o1:T
<latexit sha1_base64="jJ70daV3Lv6VV/gW3FR+Rb1BrxU=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTVfBhTgJeAZzR3NtC+sxSJrGRug3sivpsh/TH+jfVGM70JnkgtDh3HvuSyoqzoxNkr+96MHDR4+f7D3tP3v+4uWr/YPX50bVmuKYKq70ZQEGOZM4tsxyvKw0gig4XhTLr43/4gdqw5Q8s+sKcwFzyWaMgg1UlhXGrfzUpcenfro/SIbJxuK7IN2BAdnZaHrQ+5OVitYCpaUcjJmkSWVzB9oyytH3s9pgBXQJc5wEKEGgyd2maR+/D0wZz5QOR9p4w/6vcCCMWYsiRAqwC9P1NeR9vkltZ0e5Y7KqLUq6LTSreWxV3GwgLplGavk6AKCahV5jugAN1IY99TOJV1QJAbJ0YTt+kubhVrxselHcDVLfGmw7ji34PWybmmuoFoyufL9dZHSmGXDvMrpAuhSgl74TgFIrHiIOO47TW6XFld1UdBrLVqau4jZVV3Lom/dPu699F5x/HKafh8n3T4OTo91P2CNvyTvygaTkCzkh38iIjAklFbkmN+RXVEXX0c/oZhsa9XaaN6Rl0e9/zaX2Kg==</latexit>
x1:N
Dense
CBH-
LSTM
Self
Attention
Attention
RNN
Additive
Attention
Forward
Attention
Concat
Concat
Stop
Token
Pre-Net
Decoder
RNN
Self
Attention
Sigmoid Linear Waveform
model
Post-Net
Wav
⊕
⊕
⊕
Encoder Decoder
• Taco2: 800-frame
segments; 4x
downsampling for
better alignments
• Taco3: Warm-
started from taco2;
input current piano-
roll frame at the
decoder pre-net
• Taco4: Warm-
started from taco2;
no downsampling or
piano-roll input to
pre-net
9. 9
Methods
Wang, B. & Yang, Y.-H. PerformanceNet: Score-to-audio music generation with multi-band convolutional residual network. in Proceedings of the
AAAI Conference on Artificial Intelligence vol. 33 1174–1181 (2019).
q Acoustic model
2. Reference model from music field: PerformanceNet (Wang 2019)
Acoustic
model
Waveform
model
Wav
MIDI
API <latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit>
a1:N
<latexit sha1_base64="JElF7obLjklrepu+jo2EhDrW7Zk=">AAAC7nicbVLLahsxFJWnj6TuK2mX3Qw1hW5iZkpKQ1eBbroqLthJwDOYO5prW1iPQdI0MUK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRQ8ePnq8t/+k//TZ8xcvDw5fnRlVa4oTqrjSFwUY5EzixDLL8aLSCKLgeF6svjT+8x+oDVNybNcV5gIWks0ZBRuoLCuMU37m0s9jPzsYJMNkY/FdkO7AgOxsNDvs/clKRWuB0lIOxkzTpLK5A20Z5ej7WW2wArqCBU4DlCDQ5G7TtI/fBaaM50qHI228Yf9XOBDGrEURIgXYpen6GvI+37S285PcMVnVFiXdFprXPLYqbjYQl0wjtXwdAFDNQq8xXYIGasOe+pnES6qEAFm6sB0/TfNwK142vSjuBqlvDbYdxxb8HrZNLTRUS0avfL9dZDTWDLh3GV0iXQnQK98JQKkVDxFHHce3W6XFK7up6DSWrUxdxW2qruTIN++fdl/7Ljj7MEw/DpPvx4PTk91P2CdvyFvynqTkEzklX8mITAglFbkmN+RXVEXX0c/oZhsa9Xaa16Rl0e9/xe/2Jw==</latexit>
o1:T
<latexit sha1_base64="jJ70daV3Lv6VV/gW3FR+Rb1BrxU=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTVfBhTgJeAZzR3NtC+sxSJrGRug3sivpsh/TH+jfVGM70JnkgtDh3HvuSyoqzoxNkr+96MHDR4+f7D3tP3v+4uWr/YPX50bVmuKYKq70ZQEGOZM4tsxyvKw0gig4XhTLr43/4gdqw5Q8s+sKcwFzyWaMgg1UlhXGrfzUpcenfro/SIbJxuK7IN2BAdnZaHrQ+5OVitYCpaUcjJmkSWVzB9oyytH3s9pgBXQJc5wEKEGgyd2maR+/D0wZz5QOR9p4w/6vcCCMWYsiRAqwC9P1NeR9vkltZ0e5Y7KqLUq6LTSreWxV3GwgLplGavk6AKCahV5jugAN1IY99TOJV1QJAbJ0YTt+kubhVrxselHcDVLfGmw7ji34PWybmmuoFoyufL9dZHSmGXDvMrpAuhSgl74TgFIrHiIOO47TW6XFld1UdBrLVqau4jZVV3Lom/dPu699F5x/HKafh8n3T4OTo91P2CNvyTvygaTkCzkh38iIjAklFbkmN+RXVEXX0c/oZhsa9XaaN6Rl0e9/zaX2Kg==</latexit>
x1:N
Adopted from figure 3 in (Wang 2019)
<latexit sha1_base64="jJ70daV3Lv6VV/gW3FR+Rb1BrxU=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTVfBhTgJeAZzR3NtC+sxSJrGRug3sivpsh/TH+jfVGM70JnkgtDh3HvuSyoqzoxNkr+96MHDR4+f7D3tP3v+4uWr/YPX50bVmuKYKq70ZQEGOZM4tsxyvKw0gig4XhTLr43/4gdqw5Q8s+sKcwFzyWaMgg1UlhXGrfzUpcenfro/SIbJxuK7IN2BAdnZaHrQ+5OVitYCpaUcjJmkSWVzB9oyytH3s9pgBXQJc5wEKEGgyd2maR+/D0wZz5QOR9p4w/6vcCCMWYsiRAqwC9P1NeR9vkltZ0e5Y7KqLUq6LTSreWxV3GwgLplGavk6AKCahV5jugAN1IY99TOJV1QJAbJ0YTt+kubhVrxselHcDVLfGmw7ji34PWybmmuoFoyufL9dZHSmGXDvMrpAuhSgl74TgFIrHiIOO47TW6XFld1UdBrLVqau4jZVV3Lom/dPu699F5x/HKafh8n3T4OTo91P2CNvyTvygaTkCzkh38iIjAklFbkmN+RXVEXX0c/oZhsa9XaaN6Rl0e9/zaX2Kg==</latexit>
x1:N
<latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit>
a1:N
10. 10
Methods
Librosa midi_to_hz: https://librosa.org/doc/0.7.0/generated/librosa.core.midi_to_hz.html
q Acoustic features
1. Mel-spectrogram
2. MIDI-filterbank-spectrogram
Acoustic
model
Waveform
model
Wav
MIDI
API <latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit>
a1:N
<latexit sha1_base64="JElF7obLjklrepu+jo2EhDrW7Zk=">AAAC7nicbVLLahsxFJWnj6TuK2mX3Qw1hW5iZkpKQ1eBbroqLthJwDOYO5prW1iPQdI0MUK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRQ8ePnq8t/+k//TZ8xcvDw5fnRlVa4oTqrjSFwUY5EzixDLL8aLSCKLgeF6svjT+8x+oDVNybNcV5gIWks0ZBRuoLCuMU37m0s9jPzsYJMNkY/FdkO7AgOxsNDvs/clKRWuB0lIOxkzTpLK5A20Z5ej7WW2wArqCBU4DlCDQ5G7TtI/fBaaM50qHI228Yf9XOBDGrEURIgXYpen6GvI+37S285PcMVnVFiXdFprXPLYqbjYQl0wjtXwdAFDNQq8xXYIGasOe+pnES6qEAFm6sB0/TfNwK142vSjuBqlvDbYdxxb8HrZNLTRUS0avfL9dZDTWDLh3GV0iXQnQK98JQKkVDxFHHce3W6XFK7up6DSWrUxdxW2qruTIN++fdl/7Ljj7MEw/DpPvx4PTk91P2CdvyFvynqTkEzklX8mITAglFbkmN+RXVEXX0c/oZhsa9Xaa16Rl0e9/xe/2Jw==</latexit>
o1:T
<latexit sha1_base64="jJ70daV3Lv6VV/gW3FR+Rb1BrxU=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTVfBhTgJeAZzR3NtC+sxSJrGRug3sivpsh/TH+jfVGM70JnkgtDh3HvuSyoqzoxNkr+96MHDR4+f7D3tP3v+4uWr/YPX50bVmuKYKq70ZQEGOZM4tsxyvKw0gig4XhTLr43/4gdqw5Q8s+sKcwFzyWaMgg1UlhXGrfzUpcenfro/SIbJxuK7IN2BAdnZaHrQ+5OVitYCpaUcjJmkSWVzB9oyytH3s9pgBXQJc5wEKEGgyd2maR+/D0wZz5QOR9p4w/6vcCCMWYsiRAqwC9P1NeR9vkltZ0e5Y7KqLUq6LTSreWxV3GwgLplGavk6AKCahV5jugAN1IY99TOJV1QJAbJ0YTt+kubhVrxselHcDVLfGmw7ji34PWybmmuoFoyufL9dZHSmGXDvMrpAuhSgl74TgFIrHiIOO47TW6XFld1UdBrLVqau4jZVV3Lom/dPu699F5x/HKafh8n3T4OTo91P2CNvyTvygaTkCzkh38iIjAklFbkmN+RXVEXX0c/oZhsa9XaaN6Rl0e9/zaX2Kg==</latexit>
x1:N
120 233 346 460
Frame index
16
32
48
64
80
96
112
128
Pinao
roll
index
(1-128)
Piano roll
Mel filter bank MIDI filter bank
E4
E6
E8
Frequency (0-12 kHz)
7
29
61
Mel
filter
index
(1-80)
E4
E6
E8
Frequency (0-12 kHz)
64
88
112
MIDI
filter
index
(1-128)
120 233 346 460
Frame index
7
29
61
Dimension
index
(1-80)
Mel-spectrum
120 233 346 460
Frame index
64
88
112
Dimension
index
(1-128)
MIDI-fb-spectrum
Mel-spec. MIDI-fb-spec.
<latexit sha1_base64="6X1VLiCNlG3P2EJtZGp3iOqzN64=">AAACl3icbZFbaxNBFMcn663GW6pPIsJgEOqDcTfEWh/EgiB9bMG0hewazszOJkPmssycbQ3LPvlpfNVP47dxcnlw2x4Y5s/vnMO5sVJJj3H8txPdun3n7r2d+90HDx89ftLbfXrqbeW4GHOrrDtn4IWSRoxRohLnpROgmRJnbPFl5T+7EM5La77hshSZhpmRheSAAU17Lwv6iQ6/13uLt/sf37yjybChKUotPB2N4mmvHw/itdHrItmKPtna8XS3w9Lc8koLg1yB95MkLjGrwaHkSjTdtPKiBL6AmZgEaSBUyur1HA19HUhOC+vCM0jX9P+MGrT3S81CpAac+6u+FbzJN6mwOMhqacoKheGbQkWlKFq6WgrNpRMc1TII4E6GXimfgwOOYXXd1IhLbrUGk9cp880kycJvVb7qxaq6nzStwTbjIFM30DaaOSjnkv9oU2btAiEUalFdKZTOXjbdcJXk6g2ui9PhIHk/iE9G/cOD7X12yAvyiuyRhHwgh+SIHJMx4eQn+UV+kz/R8+hz9DU62oRGnW3OM9Ky6OQfqXrMdw==</latexit>
f = 2(k 69)/12
⇥ 440
<latexit sha1_base64="86+t9jn0UkxzHxA2if7/PbZ7/JQ=">AAACeXicbZHNThtBDMcn2xbSUGhoj71sGyEhDtEuLSrHSL30CFIDSMkKeWa9ySjzsZrxlkarfYJe24fjWbh08nHoApas+etnW7bHvFTSU5Lcd6IXL1/t7HZf9/be7B+87R++u/K2cgLHwirrbjh4VNLgmCQpvCkdguYKr/ni2yp+/ROdl9b8oGWJmYaZkYUUQAFdFrf9QTJM1hY/FelWDNjWLm4PO3yaW1FpNCQUeD9Jk5KyGhxJobDpTSuPJYgFzHASpAGNPqvXkzbxUSB5XFgX3FC8pv9X1KC9X2oeMjXQ3D+OreBzsUlFxXlWS1NWhEZsGhWVisnGq7XjXDoUpJZBgHAyzBqLOTgQFD6nNzV4J6zWYPJ6yn0zSbPwWpWvZrGqHqRNa7HNOsTVM7SNZg7KuRS/2pRbuyAIjVpUV4qks3dNL1wlfXyDp+LqdJieDZPLL4PR+fY+XfaBfWLHLGVf2Yh9ZxdszARD9pv9YX87D9HH6Dg62aRGnW3Ne9ay6PM/WXrEyA==</latexit>
f
<latexit sha1_base64="JbTkpy7RgEzAoI7aVLnTtzC7Ygk=">AAACeXicbZHNThtBDMcn2xZoKBDaYy9LIyTEIdqFonJE4tIjSA0gJSvkmXWSUeZjNeMtjVb7BFzh4XgWLp18HFjAkjV//WzL9pgXSnpKkqdW9OHjp7X1jc/tzS9b2zud3a9X3pZOYF9YZd0NB49KGuyTJIU3hUPQXOE1n57P49d/0XlpzR+aFZhpGBs5kgIooMvpbaeb9JKFxW9FuhJdtrKL290WH+ZWlBoNCQXeD9KkoKwCR1IorNvD0mMBYgpjHARpQKPPqsWkdbwfSB6PrAtuKF7QlxUVaO9nmodMDTTxr2Nz+F5sUNLoNKukKUpCI5aNRqWKycbzteNcOhSkZkGAcDLMGosJOBAUPqc9NHgnrNZg8mrIfT1Is/Balc9nsarqpnVjseU6xNU7tInGDoqJFP+alFs7JQiNGlSXiqSzd3U7XCV9fYO34uqol570ksuf3bPT1X022Hf2gx2wlP1iZ+w3u2B9Jhiye/bAHlvP0V50EB0uU6PWquYba1h0/B9j48TN</latexit>
k
Librosa.midi_to_hz
11. 11
Methods
Blank lines in MIDI-fb-spec. due to frequency resolution of FFT (see appendix)
q Acoustic features
1. Mel-spectrogram
2. MIDI-filterbank-spectrogram
Acoustic
model
Waveform
model
Wav
MIDI
API <latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit>
a1:N
<latexit sha1_base64="JElF7obLjklrepu+jo2EhDrW7Zk=">AAAC7nicbVLLahsxFJWnj6TuK2mX3Qw1hW5iZkpKQ1eBbroqLthJwDOYO5prW1iPQdI0MUK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRQ8ePnq8t/+k//TZ8xcvDw5fnRlVa4oTqrjSFwUY5EzixDLL8aLSCKLgeF6svjT+8x+oDVNybNcV5gIWks0ZBRuoLCuMU37m0s9jPzsYJMNkY/FdkO7AgOxsNDvs/clKRWuB0lIOxkzTpLK5A20Z5ej7WW2wArqCBU4DlCDQ5G7TtI/fBaaM50qHI228Yf9XOBDGrEURIgXYpen6GvI+37S285PcMVnVFiXdFprXPLYqbjYQl0wjtXwdAFDNQq8xXYIGasOe+pnES6qEAFm6sB0/TfNwK142vSjuBqlvDbYdxxb8HrZNLTRUS0avfL9dZDTWDLh3GV0iXQnQK98JQKkVDxFHHce3W6XFK7up6DSWrUxdxW2qruTIN++fdl/7Ljj7MEw/DpPvx4PTk91P2CdvyFvynqTkEzklX8mITAglFbkmN+RXVEXX0c/oZhsa9Xaa16Rl0e9/xe/2Jw==</latexit>
o1:T
<latexit sha1_base64="jJ70daV3Lv6VV/gW3FR+Rb1BrxU=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTVfBhTgJeAZzR3NtC+sxSJrGRug3sivpsh/TH+jfVGM70JnkgtDh3HvuSyoqzoxNkr+96MHDR4+f7D3tP3v+4uWr/YPX50bVmuKYKq70ZQEGOZM4tsxyvKw0gig4XhTLr43/4gdqw5Q8s+sKcwFzyWaMgg1UlhXGrfzUpcenfro/SIbJxuK7IN2BAdnZaHrQ+5OVitYCpaUcjJmkSWVzB9oyytH3s9pgBXQJc5wEKEGgyd2maR+/D0wZz5QOR9p4w/6vcCCMWYsiRAqwC9P1NeR9vkltZ0e5Y7KqLUq6LTSreWxV3GwgLplGavk6AKCahV5jugAN1IY99TOJV1QJAbJ0YTt+kubhVrxselHcDVLfGmw7ji34PWybmmuoFoyufL9dZHSmGXDvMrpAuhSgl74TgFIrHiIOO47TW6XFld1UdBrLVqau4jZVV3Lom/dPu699F5x/HKafh8n3T4OTo91P2CNvyTvygaTkCzkh38iIjAklFbkmN+RXVEXX0c/oZhsa9XaaN6Rl0e9/zaX2Kg==</latexit>
x1:N
120 233 346 460
Frame index
16
32
48
64
80
96
112
128
Pinao
roll
index
(1-128)
Piano roll
Mel filter bank MIDI filter bank
E4
E6
E8
Frequency (0-12 kHz)
7
29
61
Mel
filter
index
(1-80)
E4
E6
E8
Frequency (0-12 kHz)
64
88
112
MIDI
filter
index
(1-128)
120 233 346 460
Frame index
7
29
61
Dimension
index
(1-80)
Mel-spectrum
120 233 346 460
Frame index
64
88
112
Dimension
index
(1-128)
MIDI-fb-spectrum
Mel-spec. MIDI-fb-spec.
12. 12
Methods
Zhao, Y., Wang, X., Juvela, L. & Yamagishi, J. Transferring neural speech waveform synthesizers to musical instrument sounds generation. in
Proc. ICASSP 6269–6273 (IEEE, 2020). doi:10.1109/ICASSP40776.2020.9053047
q Waveform model
§ Based on music neural source-filter (NSF) model (Zhao 2020) but
• No harmonic-plus-noise structure
Acoustic
model
Waveform
model
Wav
MIDI
API <latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit>
a1:N
<latexit sha1_base64="JElF7obLjklrepu+jo2EhDrW7Zk=">AAAC7nicbVLLahsxFJWnj6TuK2mX3Qw1hW5iZkpKQ1eBbroqLthJwDOYO5prW1iPQdI0MUK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRQ8ePnq8t/+k//TZ8xcvDw5fnRlVa4oTqrjSFwUY5EzixDLL8aLSCKLgeF6svjT+8x+oDVNybNcV5gIWks0ZBRuoLCuMU37m0s9jPzsYJMNkY/FdkO7AgOxsNDvs/clKRWuB0lIOxkzTpLK5A20Z5ej7WW2wArqCBU4DlCDQ5G7TtI/fBaaM50qHI228Yf9XOBDGrEURIgXYpen6GvI+37S285PcMVnVFiXdFprXPLYqbjYQl0wjtXwdAFDNQq8xXYIGasOe+pnES6qEAFm6sB0/TfNwK142vSjuBqlvDbYdxxb8HrZNLTRUS0avfL9dZDTWDLh3GV0iXQnQK98JQKkVDxFHHce3W6XFK7up6DSWrUxdxW2qruTIN++fdl/7Ljj7MEw/DpPvx4PTk91P2CdvyFvynqTkEzklX8mITAglFbkmN+RXVEXX0c/oZhsa9Xaa16Rl0e9/xe/2Jw==</latexit>
o1:T
<latexit sha1_base64="jJ70daV3Lv6VV/gW3FR+Rb1BrxU=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTVfBhTgJeAZzR3NtC+sxSJrGRug3sivpsh/TH+jfVGM70JnkgtDh3HvuSyoqzoxNkr+96MHDR4+f7D3tP3v+4uWr/YPX50bVmuKYKq70ZQEGOZM4tsxyvKw0gig4XhTLr43/4gdqw5Q8s+sKcwFzyWaMgg1UlhXGrfzUpcenfro/SIbJxuK7IN2BAdnZaHrQ+5OVitYCpaUcjJmkSWVzB9oyytH3s9pgBXQJc5wEKEGgyd2maR+/D0wZz5QOR9p4w/6vcCCMWYsiRAqwC9P1NeR9vkltZ0e5Y7KqLUq6LTSreWxV3GwgLplGavk6AKCahV5jugAN1IY99TOJV1QJAbJ0YTt+kubhVrxselHcDVLfGmw7ji34PWybmmuoFoyufL9dZHSmGXDvMrpAuhSgl74TgFIrHiIOO47TW6XFld1UdBrLVqau4jZVV3Lom/dPu699F5x/HKafh8n3T4OTo91P2CNvyTvygaTkCzkh38iIjAklFbkmN+RXVEXX0c/oZhsa9XaaN6Rl0e9/zaX2Kg==</latexit>
x1:N
Bi-LSTM 1D CNN Up-sampling
Acoustic
features
Excitation
signal
Block
5
Wav
…
FC
Dilated
1D conv
Dilated
1D conv FC
Block 1
…
Condition module
Neural filter module
13. 13
Experiments
Hawthorne, C. et al. Enabling factorized piano music modeling and generation with the MAESTRO dataset. in Proc. ICLR (2018).
https://piano-e-competition.com/
q Database: MAESTRO v2.0 (Hawthorne 2018)
§ Real piano performances in International Piano-e-Competition
§ MIDI was recorded simultaneously during performance
• Aligned audio & piano roll
§ For experiments:
• Follow official data split
• 24kHz, 16 bits PCM
Data split of MAESTRO
From https://magenta.tensorflow.org/datasets/maestro#v200
21. 21
Messages
q TTS & MIDI-to-audio
§ Techniques can be shared: acoustic model, waveform model
§ Performance bottleneck on acoustic model (and waveform model)
q On waveform modeling
§ Physical-model performs well but lacks reverberation effect
§ Sample-based model replies on the sample database
§ Non-AR waveform model is OK in copy-synthesis
• Reverberation is captured
• Noise excitation is OK
22. 22
Messages
q On acoustic model
§ Obtaining good alignments for longer input sequences is challenging
§ Inputting the piano-roll frame to the decoder prenet helps improve alignments
• Acceptable for perfectly-aligned performance-MIDI
• Have to consider other strategies for non-aligned score-MIDI
27. 28
Appendix
q On MIDI filterbank
Natural
Mel-
spectrogram
to wave
MIDI-
centered
filter-bank
CQT
Short
clip 1
Short
clip 2
Short
clip 3
Short
clip 4
28. 29
Appendix
https://www.fluidsynth.org/
https://www.modartt.com/pianoteq
q Models in comparison
<latexit sha1_base64="QvEzsRPQOsJO2uPs95Rt4gD1HG8=">AAAM5XiclVbdbuNEFPYuC9mGAFu45MaiYrUg2bId56d3i1i6RNotQaVppLqqxpNJYtX2BM9428jyI3CHkOCGV+EW3oC34cxMksY/SYVTy9PxOd985ztzzthfhAHjlvXvo8fvPXn/g8bTg+aHrY8+/uTZ4acjRtMEk3NMQ5qMfcRIGMTknAc8JONFQlDkh+TCv/lWvL94RxIW0PgnvlyQqwjN4mAaYMRh6vqw0fZ8MgvijCM/DVGSZ3h15U2P00WShqTpRWnIg4TeZk6efZ1nZ0vGSaQPXuX6c7380mNzmnDGEb7JvsE0ZTzAuufpEZ2QMP8/HlOCeJqQB32+u8MBN4UHC2YxeniRC/SOmHWcQM80ioUHzrNhwPFcjwIWIRg8hPn2hzOB+CIiKP4qzz2v6WGRlKxr9PKmXnM9lz/xB3dMOYEHBsCJWAnco2Ai1T8VKqAQZo0dt2tabsHFY0s2HozKcbl5FkL2ULQIiSF2zUR/O3g1MDg1UDoJqM7olN8ikFyw6piOJZ5ds9cTFNtmtytXEeA7sIfzJYO9FRpS2X3ortl3FHpHrJK5ptPJi0Egn8kYRIgwGRhTH0YMRIXH6dmJFE5poPj1exJA+A0qfjEN2Max6Nfb8ts4ktBgC2wWV9z2c8yes/Greu1Zr29XsjVSK3uc3HHOoRoxdfJdcR+AbiBY186Vgm1X0jnurdMzeghsm9yBzAbkVmF1JUW7v0m1XoJq7+aloHp9BeX2FdTxGmow3g9VZdU/Xu2RdglqXInQ3SsXQK0DPFZi3Qc4KEfoPkQLlFYROgqrcy88TAxPTgnfrZFUqWN2ZGg903ZVaNZGpVEtRoHFGkRJ3TMdlTXrnsg2htyUpLyXNyAq9cBERGWLulwzGQx2wJTIKC7uiktHidLeypWUeA20BhEl8fqN5AEIHdNtq7bYheTIKdvsOpVK2VSoUS5NSEx71VNcS1Vof5Plba9SaYKfu+p0Vl+6dRV3n3JOI7UyiSebM/L62ZFlWvLSqwN7NTjSVtfw+vDJ796E4jQiMcchYuzSthb8KkMJ9M+QwFGbMrKAYwTNyCUMYxQRdpXJ4z3Xv0xFp57SBO6Y63J22yNDEWPLyAdLOKrmrPxOTNa9u0z5tH+VBfEi5STGaqFpGuqc6uJbQZ8ECcE8XMIA4SQQxzKeowRhDl8UTS8mt5hGEQJlPJ/ll/YVPGk4EVxomB3ZeSEwFQ73w5rZ4tQsQYt5gO+Ksz6lN5ABVpxdH8l5s0hINGU4oVdVDf8ZYv9C5gWpsmHZcurvtKwBhU1YZzqqgtZYiq15byjLHZY2YjatsyybAuBO0y3bxTQmfJcAsuVUbOslEJ2lDrY+sFGVwk4RRtt85Ym1j8Og3nqHvuOibXsPcq1pPey4RMHdB1vm6+4FruZO9Ezj9Zta662MnIRpAPUXQ8HXWG4ZDgMkPjl/zkUzs8utqzoYOabdMa0f3aOX/VVbe6p9rn2hvdBsrae91L7Xhtq5hht/NP5q/N34pzVr/dL6tfWbMn38aOXzmVa4Wn/+B5czF8c=</latexit>
System ID
Acoustic
model
Acoustic
feature
Excit.
signal
Wave.
model
Pitch mismatch MOS
(mean)
note chord
Natural - - - - - - 4.04
Fluidsynth Sample-based MIDI-to-audio software 5.20 6.77 3.66
Pianoteq Physical-model MIDI-to-audio software 4.82 6.50 4.25
abs-mfb-sin - midi-fb sine NSF - - 3.87
abs-mfb-noi - midi-fb noise NSF - - 3.77
abs-mel-sin - mel-spc. sine NSF - - 2.72
abs-mel-noi - mel-spc. noise NSF - - 3.81
taco2-mfb-sin taco2 midi-fb sine NSF 4.61 6.34 2.97
taco2-mfb-noi taco2 midi-fb noise NSF 4.66 6.36 3.18
taco3-mfb-sin taco3 midi-fb sine NSF 4.78 6.48 3.19
taco3-mfb-noi taco3 midi-fb noise NSF 4.89 6.53 3.19
taco4-mfb-sin taco4 midi-fb sine NSF 4.86 6.39 2.98
taco4-mfb-noi taco4 midi-fb noise NSF 4.97 6.42 2.95
pfnet-mfb-sin PFNet midi-fb sine NSF 5.59 7.14 3.10
pfnet-mfb-noi PFNet midi-fb noise NSF 5.78 7.26 3.05
pfnet-mel-sin PFNet mel-spec. sine NSF 5.66 7.17 1.82
pfnet-mel-noi PFNet mel-spec. noise NSF 5.74 7.25 2.93
pfnet-spec-GL PFNet spec. - GL 5.43 6.98 1.62
midi-sin-nsf - - sine NSF 4.32 6.40 2.88
midi-noi-nsf - - noise NSF 4.40 6.08 2.63
MIDI-to-audio
Wav
Acoustic
model
NSF
MIDI
API
Sine
Wav
Acoustic
model
NSF
MIDI
API
Noise
MIDI-to-audio
Wav
NSF
MIDI
API
Sine
Wav
NSF
MIDI
API
Noise
31. 32
Appendix
Caetano, Marcelo, and Xavier Rodet. "A source-filter model for musical instrument sound transformation." ICASSP. IEEE, 2012.
Klapuri, Anssi, Tuomas Virtanen, and Toni Heittola. "Sound source separation in monaural music signals using excitation-filter model and em algorithm."
ICASSP. IEEE, 2010.
q On music audio and speech waveform
32. 33
Appendix
Caetano, Marcelo, and Xavier Rodet. "A source-filter model for musical instrument sound transformation." ICASSP. IEEE, 2012.
Klapuri, Anssi, Tuomas Virtanen, and Toni Heittola. "Sound source separation in monaural music signals using excitation-filter model and em algorithm."
ICASSP. IEEE, 2010.
q On music audio and speech waveform