SlideShare a Scribd company logo
1 of 33
Download to read offline
Text-to-Speech Synthesis Techniques for
MIDI-to-Audio Synthesis
Erica Cooper, Xin Wang , Junichi Yamagishi
National Institute of Informatics, Japan
2
Motivation
.txt
Speech recognition
Music transcription
Language
modeling
<latexit sha1_base64="/CH3PPEHB2uw/n9QUXfzgmYucMQ=">AAAChnicbZHNbtNAEMc3BkoIUFp65GIRVSqXyEaN2mMkLhyLRNpKsVvNrifJKvth7Y5pI8vvwRXeirdhneRQtx1pd//6zYxmZoeXSnpKkn+96MXLV3uv+28Gb9+93/9wcPjx0tvKCZwKq6y75uBRSYNTkqTwunQImiu84qtvrf/qFzovrflJ6xJzDQsj51IABXRTnmSE91S3V/Pl9mCYjJKNxU9FuhNDtrOL28MezworKo2GhALvZ2lSUl6DIykUNoOs8liCWMECZ0Ea0OjzetN2Ex8HUsRz68IxFG/ow4watPdrzUOkBlr6x74WPuebVTQ/z2tpyorQiG2heaVisnH7B3EhHQpS6yBAOBl6jcUSHAgKPzXIDN4JqzWYos64b2ZpHl6rirYXq+ph2nQG245DXD1Du2jhoFxKcd+l3NoVQSjUobpSJJ29awZhK+njHTwVl19H6XiU/DgdTs53++mzT+wzO2EpO2MT9p1dsCkTzLHf7A/7G/WjUTSOzrahUW+Xc8Q6Fk3+Ax6ryYQ=</latexit>
p(text)
<latexit sha1_base64="EAl9tgxIUnBZZ5Qql+WkJlo34lI=">AAAClnicbZHNahsxEMflbdqm7pfTXgLNYYkppBezG1qSUwmU0hxdiJOAdzGSdmwL6wtpNonZ7qVPk2v6Nn2bam0fukkGJP35zQwzo2FWCo9J8rcTPdl6+uz59ovuy1ev37zt7bw796Z0HEbcSOMuGfUghYYRCpRwaR1QxSRcsMW3xn9xBc4Lo89waSFXdKbFVHCKAU16e/YgQ7jBqrnqX2vtLQCf158mvX4ySFYWPxTpRvTJxoaTnQ7LCsNLBRq5pN6P08RiXlGHgkuou1npwVK+oDMYB6mpAp9XqzHq+GMgRTw1LhyN8Yr+n1FR5f1SsRCpKM79fV8DH/ONS5we55XQtkTQfF1oWsoYTdz8SVwIBxzlMgjKnQi9xnxOHeUYfq6babjmRimqiypjvh6neXiNLJpejKz6ad0abD0OMvkIbaOZo3Yu+E2bMmMWSEOhFlWlROHMdd0NW0nv7+ChOD8cpF8Gyc/P/ZPjzX62yQeyTw5ISo7ICTklQzIinPwmt+SO/Il2o6/R9+jHOjTqbHLek5ZFw3/MD9BO</latexit>
p(text|speech)
Music
language
modeling
<latexit sha1_base64="pHfXu2ZEYiI4q/eUfRR0x2p1KvU=">AAACiXicbZFLT9tAEMc3hhYa2hLKsReLqBK9RHbVqhEnJC49gtQAUmxFs+tJsso+rN1xIbL8SbjSD9Vvw+ZxqIGRVvvXb2Y0L14q6SlJ/nWind03b/f233UP3n/4eNg7+nTtbeUEjoRV1t1y8KikwRFJUnhbOgTNFd7wxcXKf/MHnZfW/KZlibmGmZFTKYACmvQOy9OM8J5qYwl983XS6yeDZG3xS5FuRZ9t7XJy1OFZYUWl0ZBQ4P04TUrKa3AkhcKmm1UeSxALmOE4SAMafV6vO2/iL4EU8dS68AzFa/p/Rg3a+6XmIVIDzf1z3wq+5htXNB3mtTRlRWjEptC0UjHZeLWGuJAOBallECCcDL3GYg4OBIVldTODd8JqDaaoM+6bcZqH36pi1YtVdT9tWoNtxiGuXqFtNHNQzqW4b1Nu7YIgFGpRXSmSzt413XCV9PkNXorrb4P0xyC5+t4/H27vs88+sxN2ylL2k52zX+ySjZhgFXtgj+xvdBCl0TA624RGnW3OMWtZdPEEo3HKIw==</latexit>
p(notes)
<latexit sha1_base64="i8J00zLt+oTu0HMgV0D/FkfOs04=">AAAClnicbZFLaxsxEMfl7St1X05zKbSHpaaQXsxuaElOJVBCenShTgLexYy0Y1tYj0WaTWK2e+mn6bX9Nv02kR+HbpIBob9+M8PMaHippKck+deJHjx89PjJztPus+cvXr7q7b4+87ZyAkfCKusuOHhU0uCIJCm8KB2C5grP+eLryn9+ic5La37QssRcw8zIqRRAAU1678r9jPCaamMJffNz84CqkLb5OOn1k0GytviuSLeiz7Y2nOx2eFZYUWk0JBR4P06TkvIaHEmhsOlmlccSxAJmOA7SgEaf1+sxmvhDIEU8tS4cQ/Ga/p9Rg/Z+qXmI1EBzf9u3gvf5xhVNj/JamrIiNGJTaFqpmGy8+pO4kA4FqWUQIJwMvcZiDg4EhZ/rZgavhNUaTFFn3DfjNA+3VcWqF6vqftq0BtuMQ1zdQ9to5qCcS3HdptzaBUEo1KK6UiSdvWq6YSvp7R3cFWcHg/TzIPn+qX98tN3PDnvL3rN9lrJDdsy+sSEbMcF+sd/sD/sbvYm+RCfR6SY06mxz9ljLouENxwjQTA==</latexit>
p(notes|audio)
TTS
<latexit sha1_base64="AdtZH2eD4foaVT6LJ+S3wTTUAPQ=">AAAClnicbZHNahsxEMflbdqm7pfTXgLNYYkppBezG1qSUwmU0hxdiJOAdzGSdmwL6wtpNonZ7qVPk2v6Nn2bam0fukkGJP31mxlGo2FWCo9J8rcTPdl6+uz59ovuy1ev37zt7bw796Z0HEbcSOMuGfUghYYRCpRwaR1QxSRcsMW3xn9xBc4Lo89waSFXdKbFVHCKAU16e/YgQ7jBylsAPq9/rW/NVn+a9PrJIFlZ/FCkG9EnGxtOdjosKwwvFWjkkno/ThOLeUUdCi6h7malB0v5gs5gHKSmCnxerdqo44+BFPHUuLA0xiv6f0ZFlfdLxUKkojj3930NfMw3LnF6nFdC2xJB83WhaSljNHHzJ3EhHHCUyyAodyK8NeZz6ijH8HPdTMM1N0pRXVQZ8/U4zcNpZNG8xciqn9atxtbtIJOP0DaaOWrngt+0KTNmgTQUalFVShTOXNfdMJX0/gweivPDQfplkPz83D853sxnm3wg++SApOSInJBTMiQjwslvckvuyJ9oN/oafY9+rEOjzibnPWlZNPwHyb/QTg==</latexit>
p(speech|text)
<latexit sha1_base64="W5X5+Q/S5r7xUsdooj/ITPBr7TI=">AAAClnicbZFLaxsxEMfl7St1X05zKbSHpaaQXsxuaElOJVBCenShTgLexYy0Y1tYj0WaTWK2e+mn6bX9Nv02kR+HbpIBob9+M8PMaHippKck+deJHjx89PjJztPus+cvXr7q7b4+87ZyAkfCKusuOHhU0uCIJCm8KB2C5grP+eLryn9+ic5La37QssRcw8zIqRRAAU1678r9jPCaaqgKaZufm4exhL75OOn1k0GytviuSLeiz7Y2nOx2eFZYUWk0JBR4P06TkvIaHEmhsOlmlccSxAJmOA7SgEaf1+sxmvhDIEU8tS4cQ/Ga/p9Rg/Z+qXmI1EBzf9u3gvf5xhVNj/JamrIiNGJTaFqpmGy8+pO4kA4FqWUQIJwMvcZiDg4EhZ/rZgavhNUaTFFn3DfjNA+3VcWqF6vqftq0BtuMQ1zdQ9to5qCcS3HdptzaBUEo1KK6UiSdvWq6YSvp7R3cFWcHg/TzIPn+qX98tN3PDnvL3rN9lrJDdsy+sSEbMcF+sd/sD/sbvYm+RCfR6SY06mxz9ljLouENxd3QTA==</latexit>
p(audio|notes)
Voice
conversion
<latexit sha1_base64="qCnWH7hHL5prGMhtqwPIklqtHAQ=">AAACnHicbZFLSxxBEMd7RxN1zWNNjgEZXARzWWYkokchOQRCQCG7CjvD0t1Tu9tsv+iuUZfJ3P00uepXybdJ7+OQUQua/vevqqiqLmal8Jgkf1vRxuar11vbO+3dN2/fve/sfRh4UzoOfW6kcdeMepBCQx8FSri2DqhiEq7Y7OvCf3UDzgujf+HcQq7oRIux4BQDGnUO7FGGcIeVtwB8Wo/S38338edRp5v0kqXFz0W6Fl2ytovRXotlheGlAo1cUu+HaWIxr6hDwSXU7az0YCmf0QkMg9RUgc+r5TB1fBhIEY+NC0djvKT/Z1RUeT9XLEQqilP/1LeAL/mGJY7P8kpoWyJovio0LmWMJl78TFwIBxzlPAjKnQi9xnxKHeUY/q+dabjlRimqiypjvh6mebiNLBa9GFl107ox2GocZPIF2kQTR+1U8LsmZcbMkIZCDapKicKZ27odtpI+3cFzMTjupSe95PJL9/xsvZ9t8okckCOSklNyTr6TC9InnNyTP+SBPEb70bfoR/RzFRq11jkfScOiwT/CitJe</latexit>
p(speech1|speech2)
<latexit sha1_base64="0cEBiVQeRCFX5q0A6TnvZPw0BTo=">AAACmnicbZFLbxMxEMed5VXCK4UjHCwipHKJdisQPVZwAXEpUtNWyq6isXeSWPFjZc/SRste+TRc4bvwbXAeB7btSJb//s2MZsYjKq0CpenfXnLn7r37D/Ye9h89fvL02WD/+VlwtZc4lk47fyEgoFYWx6RI40XlEYzQeC6Wn9b+8+/og3L2lFYVFgbmVs2UBIpoOuDVQU54RQ3UpXLtNPvReR6+nQ6G6SjdGL8psp0Ysp2dTPd7Ii+drA1akhpCmGRpRUUDnpTU2PbzOmAFcglznERpwWAoms0oLX8TSclnzsdjiW/o/xkNmBBWRsRIA7QI131reJtvUtPsqGiUrWpCK7eFZrXm5Pj6X3ipPErSqyhAehV75XIBHiTF3+vnFi+lMwZs2eQitJOsiLfT5boXp5th1nYG245DQt9Cu2juoVooedWlwrklQSzUoabWpLy7bPtxK9n1HdwUZ4ej7P0o/fZueHy0288ee8leswOWsQ/smH1mJ2zMJPvJfrHf7E/yKvmYfEm+bkOT3i7nBetYcvoPs03Rfg==</latexit>
p(audio1|audio2)
Timbre
conversion
MIDI-to-audio
3
TTS & MIDI-to-Audio Synthesis
0 100 200 300 400 500
Frame index
256
513
Frequency
bins
Acoustic
model
Waveform
model
Wav
MIDI
Piano roll
Acoustic
features
MIDI
API
Acoustic
model
Waveform
model
Context
vectors
Acoustic
features
Front
end
Wav
.txt
<latexit sha1_base64="AisTKh5Vue6nwtVsE4yuaaGZKEs=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTTctLsRJwDOYO5prW1iPQdI0NkK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRffuP3j4aO9x/8nTZ89f7B+8PDOq1hTHVHGlLwowyJnEsWWW40WlEUTB8bxYfm785z9QG6bkqV1XmAuYSzZjFGygsqwwbuWnLj3+6qf7g2SYbCy+DdIdGJCdjaYHvT9ZqWgtUFrKwZhJmlQ2d6Atoxx9P6sNVkCXMMdJgBIEmtxtmvbx28CU8UzpcKSNN+z/CgfCmLUoQqQAuzBdX0Pe5ZvUdnaUOyar2qKk20KzmsdWxc0G4pJppJavAwCqWeg1pgvQQG3YUz+TeEmVECBLF7bjJ2kebsXLphfF3SD1rcG249iC38G2qbmGasHoyvfbRUanmgH3LqMLpEsBeuk7ASi14iHisOP4dqO0uLKbik5j2crUVdyk6koOffP+afe1b4Oz98P04zD5/mFwcrT7CXvkNXlD3pGUfCIn5AsZkTGhpCJX5Jr8iqroKvoZXW9Do95O84q0LPr9D8sl9ik=</latexit>
x1:M
<latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit>
a1:N
<latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit>
a1:N
<latexit sha1_base64="JElF7obLjklrepu+jo2EhDrW7Zk=">AAAC7nicbVLLahsxFJWnj6TuK2mX3Qw1hW5iZkpKQ1eBbroqLthJwDOYO5prW1iPQdI0MUK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRQ8ePnq8t/+k//TZ8xcvDw5fnRlVa4oTqrjSFwUY5EzixDLL8aLSCKLgeF6svjT+8x+oDVNybNcV5gIWks0ZBRuoLCuMU37m0s9jPzsYJMNkY/FdkO7AgOxsNDvs/clKRWuB0lIOxkzTpLK5A20Z5ej7WW2wArqCBU4DlCDQ5G7TtI/fBaaM50qHI228Yf9XOBDGrEURIgXYpen6GvI+37S285PcMVnVFiXdFprXPLYqbjYQl0wjtXwdAFDNQq8xXYIGasOe+pnES6qEAFm6sB0/TfNwK142vSjuBqlvDbYdxxb8HrZNLTRUS0avfL9dZDTWDLh3GV0iXQnQK98JQKkVDxFHHce3W6XFK7up6DSWrUxdxW2qruTIN++fdl/7Ljj7MEw/DpPvx4PTk91P2CdvyFvynqTkEzklX8mITAglFbkmN+RXVEXX0c/oZhsa9Xaa16Rl0e9/xe/2Jw==</latexit>
o1:T
<latexit sha1_base64="JElF7obLjklrepu+jo2EhDrW7Zk=">AAAC7nicbVLLahsxFJWnj6TuK2mX3Qw1hW5iZkpKQ1eBbroqLthJwDOYO5prW1iPQdI0MUK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRQ8ePnq8t/+k//TZ8xcvDw5fnRlVa4oTqrjSFwUY5EzixDLL8aLSCKLgeF6svjT+8x+oDVNybNcV5gIWks0ZBRuoLCuMU37m0s9jPzsYJMNkY/FdkO7AgOxsNDvs/clKRWuB0lIOxkzTpLK5A20Z5ej7WW2wArqCBU4DlCDQ5G7TtI/fBaaM50qHI228Yf9XOBDGrEURIgXYpen6GvI+37S285PcMVnVFiXdFprXPLYqbjYQl0wjtXwdAFDNQq8xXYIGasOe+pnES6qEAFm6sB0/TfNwK142vSjuBqlvDbYdxxb8HrZNLTRUS0avfL9dZDTWDLh3GV0iXQnQK98JQKkVDxFHHce3W6XFK7up6DSWrUxdxW2qruTIN++fdl/7Ljj7MEw/DpPvx4PTk91P2CdvyFvynqTkEzklX8mITAglFbkmN+RXVEXX0c/oZhsa9Xaa16Rl0e9/xe/2Jw==</latexit>
o1:T
<latexit sha1_base64="jJ70daV3Lv6VV/gW3FR+Rb1BrxU=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTVfBhTgJeAZzR3NtC+sxSJrGRug3sivpsh/TH+jfVGM70JnkgtDh3HvuSyoqzoxNkr+96MHDR4+f7D3tP3v+4uWr/YPX50bVmuKYKq70ZQEGOZM4tsxyvKw0gig4XhTLr43/4gdqw5Q8s+sKcwFzyWaMgg1UlhXGrfzUpcenfro/SIbJxuK7IN2BAdnZaHrQ+5OVitYCpaUcjJmkSWVzB9oyytH3s9pgBXQJc5wEKEGgyd2maR+/D0wZz5QOR9p4w/6vcCCMWYsiRAqwC9P1NeR9vkltZ0e5Y7KqLUq6LTSreWxV3GwgLplGavk6AKCahV5jugAN1IY99TOJV1QJAbJ0YTt+kubhVrxselHcDVLfGmw7ji34PWybmmuoFoyufL9dZHSmGXDvMrpAuhSgl74TgFIrHiIOO47TW6XFld1UdBrLVqau4jZVV3Lom/dPu699F5x/HKafh8n3T4OTo91P2CNvyTvygaTkCzkh38iIjAklFbkmN+RXVEXX0c/oZhsa9XaaN6Rl0e9/zaX2Kg==</latexit>
x1:N
<latexit sha1_base64="AisTKh5Vue6nwtVsE4yuaaGZKEs=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTTctLsRJwDOYO5prW1iPQdI0NkK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRffuP3j4aO9x/8nTZ89f7B+8PDOq1hTHVHGlLwowyJnEsWWW40WlEUTB8bxYfm785z9QG6bkqV1XmAuYSzZjFGygsqwwbuWnLj3+6qf7g2SYbCy+DdIdGJCdjaYHvT9ZqWgtUFrKwZhJmlQ2d6Atoxx9P6sNVkCXMMdJgBIEmtxtmvbx28CU8UzpcKSNN+z/CgfCmLUoQqQAuzBdX0Pe5ZvUdnaUOyar2qKk20KzmsdWxc0G4pJppJavAwCqWeg1pgvQQG3YUz+TeEmVECBLF7bjJ2kebsXLphfF3SD1rcG249iC38G2qbmGasHoyvfbRUanmgH3LqMLpEsBeuk7ASi14iHisOP4dqO0uLKbik5j2crUVdyk6koOffP+afe1b4Oz98P04zD5/mFwcrT7CXvkNXlD3pGUfCIn5AsZkTGhpCJX5Jr8iqroKvoZXW9Do95O84q0LPr9D8sl9ik=</latexit>
x1:M
or
Pitch is crucial for music
4
<latexit sha1_base64="jJ70daV3Lv6VV/gW3FR+Rb1BrxU=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTVfBhTgJeAZzR3NtC+sxSJrGRug3sivpsh/TH+jfVGM70JnkgtDh3HvuSyoqzoxNkr+96MHDR4+f7D3tP3v+4uWr/YPX50bVmuKYKq70ZQEGOZM4tsxyvKw0gig4XhTLr43/4gdqw5Q8s+sKcwFzyWaMgg1UlhXGrfzUpcenfro/SIbJxuK7IN2BAdnZaHrQ+5OVitYCpaUcjJmkSWVzB9oyytH3s9pgBXQJc5wEKEGgyd2maR+/D0wZz5QOR9p4w/6vcCCMWYsiRAqwC9P1NeR9vkltZ0e5Y7KqLUq6LTSreWxV3GwgLplGavk6AKCahV5jugAN1IY99TOJV1QJAbJ0YTt+kubhVrxselHcDVLfGmw7ji34PWybmmuoFoyufL9dZHSmGXDvMrpAuhSgl74TgFIrHiIOO47TW6XFld1UdBrLVqau4jZVV3Lom/dPu699F5x/HKafh8n3T4OTo91P2CNvyTvygaTkCzkh38iIjAklFbkmN+RXVEXX0c/oZhsa9XaaN6Rl0e9/zaX2Kg==</latexit>
x1:N
<latexit sha1_base64="AisTKh5Vue6nwtVsE4yuaaGZKEs=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTTctLsRJwDOYO5prW1iPQdI0NkK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRffuP3j4aO9x/8nTZ89f7B+8PDOq1hTHVHGlLwowyJnEsWWW40WlEUTB8bxYfm785z9QG6bkqV1XmAuYSzZjFGygsqwwbuWnLj3+6qf7g2SYbCy+DdIdGJCdjaYHvT9ZqWgtUFrKwZhJmlQ2d6Atoxx9P6sNVkCXMMdJgBIEmtxtmvbx28CU8UzpcKSNN+z/CgfCmLUoQqQAuzBdX0Pe5ZvUdnaUOyar2qKk20KzmsdWxc0G4pJppJavAwCqWeg1pgvQQG3YUz+TeEmVECBLF7bjJ2kebsXLphfF3SD1rcG249iC38G2qbmGasHoyvfbRUanmgH3LqMLpEsBeuk7ASi14iHisOP4dqO0uLKbik5j2crUVdyk6koOffP+afe1b4Oz98P04zD5/mFwcrT7CXvkNXlD3pGUfCIn5AsZkTGhpCJX5Jr8iqroKvoZXW9Do95O84q0LPr9D8sl9ik=</latexit>
x1:M
or
TTS & MIDI-to-Audio Synthesis
https://commons.wikimedia.org/wiki/File:Computer_music_piano_roll.png
0 100 200 300 400 500
Frame index
256
513
Frequency
bins
Acoustic
model
Waveform
model
Wav
MIDI
Piano roll
Acoustic
features
MIDI
API
Acoustic
model
Waveform
model
Context
vectors
Acoustic
features
Front
end
Wav
.txt
<latexit sha1_base64="AisTKh5Vue6nwtVsE4yuaaGZKEs=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTTctLsRJwDOYO5prW1iPQdI0NkK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRffuP3j4aO9x/8nTZ89f7B+8PDOq1hTHVHGlLwowyJnEsWWW40WlEUTB8bxYfm785z9QG6bkqV1XmAuYSzZjFGygsqwwbuWnLj3+6qf7g2SYbCy+DdIdGJCdjaYHvT9ZqWgtUFrKwZhJmlQ2d6Atoxx9P6sNVkCXMMdJgBIEmtxtmvbx28CU8UzpcKSNN+z/CgfCmLUoQqQAuzBdX0Pe5ZvUdnaUOyar2qKk20KzmsdWxc0G4pJppJavAwCqWeg1pgvQQG3YUz+TeEmVECBLF7bjJ2kebsXLphfF3SD1rcG249iC38G2qbmGasHoyvfbRUanmgH3LqMLpEsBeuk7ASi14iHisOP4dqO0uLKbik5j2crUVdyk6koOffP+afe1b4Oz98P04zD5/mFwcrT7CXvkNXlD3pGUfCIn5AsZkTGhpCJX5Jr8iqroKvoZXW9Do95O84q0LPr9D8sl9ik=</latexit>
x1:M
<latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit>
a1:N
<latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit>
a1:N
<latexit sha1_base64="JElF7obLjklrepu+jo2EhDrW7Zk=">AAAC7nicbVLLahsxFJWnj6TuK2mX3Qw1hW5iZkpKQ1eBbroqLthJwDOYO5prW1iPQdI0MUK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRQ8ePnq8t/+k//TZ8xcvDw5fnRlVa4oTqrjSFwUY5EzixDLL8aLSCKLgeF6svjT+8x+oDVNybNcV5gIWks0ZBRuoLCuMU37m0s9jPzsYJMNkY/FdkO7AgOxsNDvs/clKRWuB0lIOxkzTpLK5A20Z5ej7WW2wArqCBU4DlCDQ5G7TtI/fBaaM50qHI228Yf9XOBDGrEURIgXYpen6GvI+37S285PcMVnVFiXdFprXPLYqbjYQl0wjtXwdAFDNQq8xXYIGasOe+pnES6qEAFm6sB0/TfNwK142vSjuBqlvDbYdxxb8HrZNLTRUS0avfL9dZDTWDLh3GV0iXQnQK98JQKkVDxFHHce3W6XFK7up6DSWrUxdxW2qruTIN++fdl/7Ljj7MEw/DpPvx4PTk91P2CdvyFvynqTkEzklX8mITAglFbkmN+RXVEXX0c/oZhsa9Xaa16Rl0e9/xe/2Jw==</latexit>
o1:T
<latexit sha1_base64="JElF7obLjklrepu+jo2EhDrW7Zk=">AAAC7nicbVLLahsxFJWnj6TuK2mX3Qw1hW5iZkpKQ1eBbroqLthJwDOYO5prW1iPQdI0MUK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRQ8ePnq8t/+k//TZ8xcvDw5fnRlVa4oTqrjSFwUY5EzixDLL8aLSCKLgeF6svjT+8x+oDVNybNcV5gIWks0ZBRuoLCuMU37m0s9jPzsYJMNkY/FdkO7AgOxsNDvs/clKRWuB0lIOxkzTpLK5A20Z5ej7WW2wArqCBU4DlCDQ5G7TtI/fBaaM50qHI228Yf9XOBDGrEURIgXYpen6GvI+37S285PcMVnVFiXdFprXPLYqbjYQl0wjtXwdAFDNQq8xXYIGasOe+pnES6qEAFm6sB0/TfNwK142vSjuBqlvDbYdxxb8HrZNLTRUS0avfL9dZDTWDLh3GV0iXQnQK98JQKkVDxFHHce3W6XFK7up6DSWrUxdxW2qruTIN++fdl/7Ljj7MEw/DpPvx4PTk91P2CdvyFvynqTkEzklX8mITAglFbkmN+RXVEXX0c/oZhsa9Xaa16Rl0e9/xe/2Jw==</latexit>
o1:T
.txt
Music score MIDI piano roll
MIDI
API
5
Acoustic
model
Waveform
model
Wav
MIDI
Piano roll
Acoustic
features
MIDI
API
<latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit>
a1:N
<latexit sha1_base64="JElF7obLjklrepu+jo2EhDrW7Zk=">AAAC7nicbVLLahsxFJWnj6TuK2mX3Qw1hW5iZkpKQ1eBbroqLthJwDOYO5prW1iPQdI0MUK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRQ8ePnq8t/+k//TZ8xcvDw5fnRlVa4oTqrjSFwUY5EzixDLL8aLSCKLgeF6svjT+8x+oDVNybNcV5gIWks0ZBRuoLCuMU37m0s9jPzsYJMNkY/FdkO7AgOxsNDvs/clKRWuB0lIOxkzTpLK5A20Z5ej7WW2wArqCBU4DlCDQ5G7TtI/fBaaM50qHI228Yf9XOBDGrEURIgXYpen6GvI+37S285PcMVnVFiXdFprXPLYqbjYQl0wjtXwdAFDNQq8xXYIGasOe+pnES6qEAFm6sB0/TfNwK142vSjuBqlvDbYdxxb8HrZNLTRUS0avfL9dZDTWDLh3GV0iXQnQK98JQKkVDxFHHce3W6XFK7up6DSWrUxdxW2qruTIN++fdl/7Ljj7MEw/DpPvx4PTk91P2CdvyFvynqTkEzklX8mITAglFbkmN+RXVEXX0c/oZhsa9Xaa16Rl0e9/xe/2Jw==</latexit>
o1:T
<latexit sha1_base64="jJ70daV3Lv6VV/gW3FR+Rb1BrxU=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTVfBhTgJeAZzR3NtC+sxSJrGRug3sivpsh/TH+jfVGM70JnkgtDh3HvuSyoqzoxNkr+96MHDR4+f7D3tP3v+4uWr/YPX50bVmuKYKq70ZQEGOZM4tsxyvKw0gig4XhTLr43/4gdqw5Q8s+sKcwFzyWaMgg1UlhXGrfzUpcenfro/SIbJxuK7IN2BAdnZaHrQ+5OVitYCpaUcjJmkSWVzB9oyytH3s9pgBXQJc5wEKEGgyd2maR+/D0wZz5QOR9p4w/6vcCCMWYsiRAqwC9P1NeR9vkltZ0e5Y7KqLUq6LTSreWxV3GwgLplGavk6AKCahV5jugAN1IY99TOJV1QJAbJ0YTt+kubhVrxselHcDVLfGmw7ji34PWybmmuoFoyufL9dZHSmGXDvMrpAuhSgl74TgFIrHiIOO47TW6XFld1UdBrLVqau4jZVV3Lom/dPu699F5x/HKafh8n3T4OTo91P2CNvyTvygaTkCzkh38iIjAklFbkmN+RXVEXX0c/oZhsa9XaaN6Rl0e9/zaX2Kg==</latexit>
x1:N
<latexit sha1_base64="AisTKh5Vue6nwtVsE4yuaaGZKEs=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTTctLsRJwDOYO5prW1iPQdI0NkK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRffuP3j4aO9x/8nTZ89f7B+8PDOq1hTHVHGlLwowyJnEsWWW40WlEUTB8bxYfm785z9QG6bkqV1XmAuYSzZjFGygsqwwbuWnLj3+6qf7g2SYbCy+DdIdGJCdjaYHvT9ZqWgtUFrKwZhJmlQ2d6Atoxx9P6sNVkCXMMdJgBIEmtxtmvbx28CU8UzpcKSNN+z/CgfCmLUoQqQAuzBdX0Pe5ZvUdnaUOyar2qKk20KzmsdWxc0G4pJppJavAwCqWeg1pgvQQG3YUz+TeEmVECBLF7bjJ2kebsXLphfF3SD1rcG249iC38G2qbmGasHoyvfbRUanmgH3LqMLpEsBeuk7ASi14iHisOP4dqO0uLKbik5j2crUVdyk6koOffP+afe1b4Oz98P04zD5/mFwcrT7CXvkNXlD3pGUfCIn5AsZkTGhpCJX5Jr8iqroKvoZXW9Do95O84q0LPr9D8sl9ik=</latexit>
x1:M
or
TTS & MIDI-to-Audio Synthesis
0 100 200 300 400 500
Frame index
256
513
Frequency
bins
Acoustic
model
Waveform
model
Context
vectors
Acoustic
features
Front
end
Wav
.txt
<latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit>
a1:N
<latexit sha1_base64="JElF7obLjklrepu+jo2EhDrW7Zk=">AAAC7nicbVLLahsxFJWnj6TuK2mX3Qw1hW5iZkpKQ1eBbroqLthJwDOYO5prW1iPQdI0MUK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRQ8ePnq8t/+k//TZ8xcvDw5fnRlVa4oTqrjSFwUY5EzixDLL8aLSCKLgeF6svjT+8x+oDVNybNcV5gIWks0ZBRuoLCuMU37m0s9jPzsYJMNkY/FdkO7AgOxsNDvs/clKRWuB0lIOxkzTpLK5A20Z5ej7WW2wArqCBU4DlCDQ5G7TtI/fBaaM50qHI228Yf9XOBDGrEURIgXYpen6GvI+37S285PcMVnVFiXdFprXPLYqbjYQl0wjtXwdAFDNQq8xXYIGasOe+pnES6qEAFm6sB0/TfNwK142vSjuBqlvDbYdxxb8HrZNLTRUS0avfL9dZDTWDLh3GV0iXQnQK98JQKkVDxFHHce3W6XFK7up6DSWrUxdxW2qruTIN++fdl/7Ljj7MEw/DpPvx4PTk91P2CdvyFvynqTkEzklX8mITAglFbkmN+RXVEXX0c/oZhsa9Xaa16Rl0e9/xe/2Jw==</latexit>
o1:T
1
0
0
Phone ID (one-hot)
Syllable ID (one-hot)
#. Phone
#. Syllables
<latexit sha1_base64="YYtcP5HcZDv4wo3N4OOgsso9tf8=">AAAC6nicbVLLahsxFJWnr8R9Je0ym6Gm0E3MTElploFuugouxInBMxiN5toW1mOQ7jQ2Qj+RXUmW/Zz+QP+m8iPQmeSC0OHce+5LKirBLSbJ30705Omz5y/29rsvX71+8/bg8N2l1bVhMGRaaDMqqAXBFQyRo4BRZYDKQsBVsfi29l/9BGO5Vhe4qiCXdKb4lDOKgRplhXVLP1GTg17STzYWPwTpDvTIzgaTw86frNSslqCQCWrtOE0qzB01yJkA381qCxVlCzqDcYCKSrC52zTs44+BKeOpNuEojDfs/wpHpbUrWYRISXFu2741+ZhvXOP0NHdcVTWCYttC01rEqOP19HHJDTAUqwAoMzz0GrM5NZRh2FE3U3DNtJRUlS5sxo/TPNxalOtetHC91DcG246DhXiEbVIzQ6s5Z0vfbRYZXBhOhXcZmwNbSGoWvhUAymgRIo5bjvN7JcISNxWdgbKRqa24T9WWHHsf3j9tv/ZDcPm5n37pJz9Oemfp7ifskSPygXwiKflKzsh3MiBDwoggN+SW3EUiuol+Rbfb0Kiz07wnDYt+/wP+EPS4</latexit>
xn
<latexit sha1_base64="YYtcP5HcZDv4wo3N4OOgsso9tf8=">AAAC6nicbVLLahsxFJWnr8R9Je0ym6Gm0E3MTElploFuugouxInBMxiN5toW1mOQ7jQ2Qj+RXUmW/Zz+QP+m8iPQmeSC0OHce+5LKirBLSbJ30705Omz5y/29rsvX71+8/bg8N2l1bVhMGRaaDMqqAXBFQyRo4BRZYDKQsBVsfi29l/9BGO5Vhe4qiCXdKb4lDOKgRplhXVLP1GTg17STzYWPwTpDvTIzgaTw86frNSslqCQCWrtOE0qzB01yJkA381qCxVlCzqDcYCKSrC52zTs44+BKeOpNuEojDfs/wpHpbUrWYRISXFu2741+ZhvXOP0NHdcVTWCYttC01rEqOP19HHJDTAUqwAoMzz0GrM5NZRh2FE3U3DNtJRUlS5sxo/TPNxalOtetHC91DcG246DhXiEbVIzQ6s5Z0vfbRYZXBhOhXcZmwNbSGoWvhUAymgRIo5bjvN7JcISNxWdgbKRqa24T9WWHHsf3j9tv/ZDcPm5n37pJz9Oemfp7ifskSPygXwiKflKzsh3MiBDwoggN+SW3EUiuol+Rbfb0Kiz07wnDYt+/wP+EPS4</latexit>
xn
A4, velocity 0.4
0.9
0
D5, velocity 0.4
…
0.4
…
0
<latexit sha1_base64="YYtcP5HcZDv4wo3N4OOgsso9tf8=">AAAC6nicbVLLahsxFJWnr8R9Je0ym6Gm0E3MTElploFuugouxInBMxiN5toW1mOQ7jQ2Qj+RXUmW/Zz+QP+m8iPQmeSC0OHce+5LKirBLSbJ30705Omz5y/29rsvX71+8/bg8N2l1bVhMGRaaDMqqAXBFQyRo4BRZYDKQsBVsfi29l/9BGO5Vhe4qiCXdKb4lDOKgRplhXVLP1GTg17STzYWPwTpDvTIzgaTw86frNSslqCQCWrtOE0qzB01yJkA381qCxVlCzqDcYCKSrC52zTs44+BKeOpNuEojDfs/wpHpbUrWYRISXFu2741+ZhvXOP0NHdcVTWCYttC01rEqOP19HHJDTAUqwAoMzz0GrM5NZRh2FE3U3DNtJRUlS5sxo/TPNxalOtetHC91DcG246DhXiEbVIzQ6s5Z0vfbRYZXBhOhXcZmwNbSGoWvhUAymgRIo5bjvN7JcISNxWdgbKRqa24T9WWHHsf3j9tv/ZDcPm5n37pJz9Oemfp7ifskSPygXwiKflKzsh3MiBDwoggN+SW3EUiuol+Rbfb0Kiz07wnDYt+/wP+EPS4</latexit>
xn
<latexit sha1_base64="YYtcP5HcZDv4wo3N4OOgsso9tf8=">AAAC6nicbVLLahsxFJWnr8R9Je0ym6Gm0E3MTElploFuugouxInBMxiN5toW1mOQ7jQ2Qj+RXUmW/Zz+QP+m8iPQmeSC0OHce+5LKirBLSbJ30705Omz5y/29rsvX71+8/bg8N2l1bVhMGRaaDMqqAXBFQyRo4BRZYDKQsBVsfi29l/9BGO5Vhe4qiCXdKb4lDOKgRplhXVLP1GTg17STzYWPwTpDvTIzgaTw86frNSslqCQCWrtOE0qzB01yJkA381qCxVlCzqDcYCKSrC52zTs44+BKeOpNuEojDfs/wpHpbUrWYRISXFu2741+ZhvXOP0NHdcVTWCYttC01rEqOP19HHJDTAUqwAoMzz0GrM5NZRh2FE3U3DNtJRUlS5sxo/TPNxalOtetHC91DcG246DhXiEbVIzQ6s5Z0vfbRYZXBhOhXcZmwNbSGoWvhUAymgRIo5bjvN7JcISNxWdgbKRqa24T9WWHHsf3j9tv/ZDcPm5n37pJz9Oemfp7ifskSPygXwiKflKzsh3MiBDwoggN+SW3EUiuol+Rbfb0Kiz07wnDYt+/wP+EPS4</latexit>
xn
<latexit sha1_base64="AisTKh5Vue6nwtVsE4yuaaGZKEs=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTTctLsRJwDOYO5prW1iPQdI0NkK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRffuP3j4aO9x/8nTZ89f7B+8PDOq1hTHVHGlLwowyJnEsWWW40WlEUTB8bxYfm785z9QG6bkqV1XmAuYSzZjFGygsqwwbuWnLj3+6qf7g2SYbCy+DdIdGJCdjaYHvT9ZqWgtUFrKwZhJmlQ2d6Atoxx9P6sNVkCXMMdJgBIEmtxtmvbx28CU8UzpcKSNN+z/CgfCmLUoQqQAuzBdX0Pe5ZvUdnaUOyar2qKk20KzmsdWxc0G4pJppJavAwCqWeg1pgvQQG3YUz+TeEmVECBLF7bjJ2kebsXLphfF3SD1rcG249iC38G2qbmGasHoyvfbRUanmgH3LqMLpEsBeuk7ASi14iHisOP4dqO0uLKbik5j2crUVdyk6koOffP+afe1b4Oz98P04zD5/mFwcrT7CXvkNXlD3pGUfCIn5AsZkTGhpCJX5Jr8iqroKvoZXW9Do95O84q0LPr9D8sl9ik=</latexit>
x1:M
6
<latexit sha1_base64="AisTKh5Vue6nwtVsE4yuaaGZKEs=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTTctLsRJwDOYO5prW1iPQdI0NkK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRffuP3j4aO9x/8nTZ89f7B+8PDOq1hTHVHGlLwowyJnEsWWW40WlEUTB8bxYfm785z9QG6bkqV1XmAuYSzZjFGygsqwwbuWnLj3+6qf7g2SYbCy+DdIdGJCdjaYHvT9ZqWgtUFrKwZhJmlQ2d6Atoxx9P6sNVkCXMMdJgBIEmtxtmvbx28CU8UzpcKSNN+z/CgfCmLUoQqQAuzBdX0Pe5ZvUdnaUOyar2qKk20KzmsdWxc0G4pJppJavAwCqWeg1pgvQQG3YUz+TeEmVECBLF7bjJ2kebsXLphfF3SD1rcG249iC38G2qbmGasHoyvfbRUanmgH3LqMLpEsBeuk7ASi14iHisOP4dqO0uLKbik5j2crUVdyk6koOffP+afe1b4Oz98P04zD5/mFwcrT7CXvkNXlD3pGUfCIn5AsZkTGhpCJX5Jr8iqroKvoZXW9Do95O84q0LPr9D8sl9ik=</latexit>
x1:M
TTS & MIDI-to-Audio Synthesis
0 100 200 300 400 500
Frame index
256
513
Frequency
bins
Acoustic
model
Waveform
model
Wav
MIDI
Piano roll
Acoustic
features
MIDI
API
Acoustic
model
Waveform
model
Context
vectors
Acoustic
features
Front
end
Wav
.txt
<latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit>
a1:N
<latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit>
a1:N
<latexit sha1_base64="JElF7obLjklrepu+jo2EhDrW7Zk=">AAAC7nicbVLLahsxFJWnj6TuK2mX3Qw1hW5iZkpKQ1eBbroqLthJwDOYO5prW1iPQdI0MUK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRQ8ePnq8t/+k//TZ8xcvDw5fnRlVa4oTqrjSFwUY5EzixDLL8aLSCKLgeF6svjT+8x+oDVNybNcV5gIWks0ZBRuoLCuMU37m0s9jPzsYJMNkY/FdkO7AgOxsNDvs/clKRWuB0lIOxkzTpLK5A20Z5ej7WW2wArqCBU4DlCDQ5G7TtI/fBaaM50qHI228Yf9XOBDGrEURIgXYpen6GvI+37S285PcMVnVFiXdFprXPLYqbjYQl0wjtXwdAFDNQq8xXYIGasOe+pnES6qEAFm6sB0/TfNwK142vSjuBqlvDbYdxxb8HrZNLTRUS0avfL9dZDTWDLh3GV0iXQnQK98JQKkVDxFHHce3W6XFK7up6DSWrUxdxW2qruTIN++fdl/7Ljj7MEw/DpPvx4PTk91P2CdvyFvynqTkEzklX8mITAglFbkmN+RXVEXX0c/oZhsa9Xaa16Rl0e9/xe/2Jw==</latexit>
o1:T
<latexit sha1_base64="JElF7obLjklrepu+jo2EhDrW7Zk=">AAAC7nicbVLLahsxFJWnj6TuK2mX3Qw1hW5iZkpKQ1eBbroqLthJwDOYO5prW1iPQdI0MUK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRQ8ePnq8t/+k//TZ8xcvDw5fnRlVa4oTqrjSFwUY5EzixDLL8aLSCKLgeF6svjT+8x+oDVNybNcV5gIWks0ZBRuoLCuMU37m0s9jPzsYJMNkY/FdkO7AgOxsNDvs/clKRWuB0lIOxkzTpLK5A20Z5ej7WW2wArqCBU4DlCDQ5G7TtI/fBaaM50qHI228Yf9XOBDGrEURIgXYpen6GvI+37S285PcMVnVFiXdFprXPLYqbjYQl0wjtXwdAFDNQq8xXYIGasOe+pnES6qEAFm6sB0/TfNwK142vSjuBqlvDbYdxxb8HrZNLTRUS0avfL9dZDTWDLh3GV0iXQnQK98JQKkVDxFHHce3W6XFK7up6DSWrUxdxW2qruTIN++fdl/7Ljj7MEw/DpPvx4PTk91P2CdvyFvynqTkEzklX8mITAglFbkmN+RXVEXX0c/oZhsa9Xaa16Rl0e9/xe/2Jw==</latexit>
o1:T
Not completely aligned
AI
performer
<latexit sha1_base64="Ce+Vsx7LGNLmrE4StAaBzOgsSt0=">AAADInicbVLLitswFFXc1zR9ZdplN6Kh0M0Eu7S0dFEGuummQwrJzEBswrWsxCJ6GEnuJBj9zXzN0E2Zbgr9mMpOBmpnLggdzrlHV/dKacGZsWH4pxfcuXvv/oODh/1Hj588fTY4fH5qVKkJnRLFlT5PwVDOJJ1aZjk9LzQFkXJ6lq6+1PrZD6oNU3JiNwVNBCwlWzAC1lPzwec4NdXazavo0zeHY82WuQWt1QWuBWiEk31BNcLEzQfDcBQ2gfdBtANDtIvx/LD3M84UKQWVlnAwZhaFhU0q0JYRTl0/Lg0tgKxgSWceShDUJFXTqMOvPZPhhdJ+SYsb9n9HBcKYjUh9pgCbm65Wk7dps9IuPiYVk0VpqSTbQouSY6twPTWcMU2J5RsPgGjm74pJDhqI9bPtx5JeECUEyKzy03GzKPG74ll9F8WrYeRajW3bsSm/hW1TSw1Fzsja9dtFxhPNgLsqJjklKwF65ToJVGrFfcZRRzi5cVq6tk3FStOsdVLXcXNU13Lk6vePuq+9D07fjqL3o/D7u+FxtPsJB+gleoXeoAh9QMfoKxqjKSLoEl2ha/Q7uAyugl/B9TY16O08L1Argr//ADF1C2g=</latexit>
x1:M ! a1:N ! o1:T
<latexit sha1_base64="uTpzxJX9ikqk62YuUY3zkv6CAV4=">AAADInicdVJda9swFFW8ry77SrfHvYiFwV4a7LGx0YdR2MueSgZJW4hNuJaVWESWjCS3CUL/pr+m7GV0L4P9mMlOCrPbXRA6nHOPru6V0pIzbcLwdy+4d//Bw0d7j/tPnj57/mKw//JEy0oROiWSS3WWgqacCTo1zHB6VioKRcrpabr6Wuun51RpJsXEbEqaFLAUbMEIGE/NB1/iVNu1m9vo8NjhWLFlbkApeYFrAf4nyEaYuPlgGI7CJvBtEO3AEO1iPN/v/YgzSaqCCkM4aD2LwtIkFpRhhFPXjytNSyArWNKZhwIKqhPbNOrwW89keCGVX8Lghv3XYaHQelOkPrMAk+uuVpN3abPKLD4nlomyMlSQbaFFxbGRuJ4azpiixPCNB0AU83fFJAcFxPjZ9mNBL4gsChCZ9dNxsyjxu+RZfRfJ7TByrca27ZiU38G2qaWCMmdk7frtIuOJYsCdjUlOyaoAtXKdBCqU5D7joCMc3zgNXZumolU0a53Uddwc1bUcuPr9o+5r3wYn70fRx1H4/cPwKNr9hD30Gr1B71CEPqEj9A2N0RQRdImu0DX6FVwGV8HP4HqbGvR2nleoFcGfvzQnC2k=</latexit>
x1:N ! a1:N ! o1:T
Aligned
Silent
instrument
<latexit sha1_base64="jJ70daV3Lv6VV/gW3FR+Rb1BrxU=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTVfBhTgJeAZzR3NtC+sxSJrGRug3sivpsh/TH+jfVGM70JnkgtDh3HvuSyoqzoxNkr+96MHDR4+f7D3tP3v+4uWr/YPX50bVmuKYKq70ZQEGOZM4tsxyvKw0gig4XhTLr43/4gdqw5Q8s+sKcwFzyWaMgg1UlhXGrfzUpcenfro/SIbJxuK7IN2BAdnZaHrQ+5OVitYCpaUcjJmkSWVzB9oyytH3s9pgBXQJc5wEKEGgyd2maR+/D0wZz5QOR9p4w/6vcCCMWYsiRAqwC9P1NeR9vkltZ0e5Y7KqLUq6LTSreWxV3GwgLplGavk6AKCahV5jugAN1IY99TOJV1QJAbJ0YTt+kubhVrxselHcDVLfGmw7ji34PWybmmuoFoyufL9dZHSmGXDvMrpAuhSgl74TgFIrHiIOO47TW6XFld1UdBrLVqau4jZVV3Lom/dPu699F5x/HKafh8n3T4OTo91P2CNvyTvygaTkCzkh38iIjAklFbkmN+RXVEXX0c/oZhsa9XaaN6Rl0e9/zaX2Kg==</latexit>
x1:N
<latexit sha1_base64="AisTKh5Vue6nwtVsE4yuaaGZKEs=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTTctLsRJwDOYO5prW1iPQdI0NkK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRffuP3j4aO9x/8nTZ89f7B+8PDOq1hTHVHGlLwowyJnEsWWW40WlEUTB8bxYfm785z9QG6bkqV1XmAuYSzZjFGygsqwwbuWnLj3+6qf7g2SYbCy+DdIdGJCdjaYHvT9ZqWgtUFrKwZhJmlQ2d6Atoxx9P6sNVkCXMMdJgBIEmtxtmvbx28CU8UzpcKSNN+z/CgfCmLUoQqQAuzBdX0Pe5ZvUdnaUOyar2qKk20KzmsdWxc0G4pJppJavAwCqWeg1pgvQQG3YUz+TeEmVECBLF7bjJ2kebsXLphfF3SD1rcG249iC38G2qbmGasHoyvfbRUanmgH3LqMLpEsBeuk7ASi14iHisOP4dqO0uLKbik5j2crUVdyk6koOffP+afe1b4Oz98P04zD5/mFwcrT7CXvkNXlD3pGUfCIn5AsZkTGhpCJX5Jr8iqroKvoZXW9Do95O84q0LPr9D8sl9ik=</latexit>
x1:M
or
7
<latexit sha1_base64="AisTKh5Vue6nwtVsE4yuaaGZKEs=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTTctLsRJwDOYO5prW1iPQdI0NkK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRffuP3j4aO9x/8nTZ89f7B+8PDOq1hTHVHGlLwowyJnEsWWW40WlEUTB8bxYfm785z9QG6bkqV1XmAuYSzZjFGygsqwwbuWnLj3+6qf7g2SYbCy+DdIdGJCdjaYHvT9ZqWgtUFrKwZhJmlQ2d6Atoxx9P6sNVkCXMMdJgBIEmtxtmvbx28CU8UzpcKSNN+z/CgfCmLUoQqQAuzBdX0Pe5ZvUdnaUOyar2qKk20KzmsdWxc0G4pJppJavAwCqWeg1pgvQQG3YUz+TeEmVECBLF7bjJ2kebsXLphfF3SD1rcG249iC38G2qbmGasHoyvfbRUanmgH3LqMLpEsBeuk7ASi14iHisOP4dqO0uLKbik5j2crUVdyk6koOffP+afe1b4Oz98P04zD5/mFwcrT7CXvkNXlD3pGUfCIn5AsZkTGhpCJX5Jr8iqroKvoZXW9Do95O84q0LPr9D8sl9ik=</latexit>
x1:M
TTS & MIDI-to-Audio Synthesis
0 100 200 300 400 500
Frame index
256
513
Frequency
bins
Acoustic
model
Waveform
model
Wav
MIDI
Piano roll
Acoustic
features
MIDI
API
Acoustic
model
Waveform
model
Context
vectors
Acoustic
features
Front
end
Wav
.txt
<latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit>
a1:N
<latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit>
a1:N
<latexit sha1_base64="JElF7obLjklrepu+jo2EhDrW7Zk=">AAAC7nicbVLLahsxFJWnj6TuK2mX3Qw1hW5iZkpKQ1eBbroqLthJwDOYO5prW1iPQdI0MUK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRQ8ePnq8t/+k//TZ8xcvDw5fnRlVa4oTqrjSFwUY5EzixDLL8aLSCKLgeF6svjT+8x+oDVNybNcV5gIWks0ZBRuoLCuMU37m0s9jPzsYJMNkY/FdkO7AgOxsNDvs/clKRWuB0lIOxkzTpLK5A20Z5ej7WW2wArqCBU4DlCDQ5G7TtI/fBaaM50qHI228Yf9XOBDGrEURIgXYpen6GvI+37S285PcMVnVFiXdFprXPLYqbjYQl0wjtXwdAFDNQq8xXYIGasOe+pnES6qEAFm6sB0/TfNwK142vSjuBqlvDbYdxxb8HrZNLTRUS0avfL9dZDTWDLh3GV0iXQnQK98JQKkVDxFHHce3W6XFK7up6DSWrUxdxW2qruTIN++fdl/7Ljj7MEw/DpPvx4PTk91P2CdvyFvynqTkEzklX8mITAglFbkmN+RXVEXX0c/oZhsa9Xaa16Rl0e9/xe/2Jw==</latexit>
o1:T
<latexit sha1_base64="JElF7obLjklrepu+jo2EhDrW7Zk=">AAAC7nicbVLLahsxFJWnj6TuK2mX3Qw1hW5iZkpKQ1eBbroqLthJwDOYO5prW1iPQdI0MUK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRQ8ePnq8t/+k//TZ8xcvDw5fnRlVa4oTqrjSFwUY5EzixDLL8aLSCKLgeF6svjT+8x+oDVNybNcV5gIWks0ZBRuoLCuMU37m0s9jPzsYJMNkY/FdkO7AgOxsNDvs/clKRWuB0lIOxkzTpLK5A20Z5ej7WW2wArqCBU4DlCDQ5G7TtI/fBaaM50qHI228Yf9XOBDGrEURIgXYpen6GvI+37S285PcMVnVFiXdFprXPLYqbjYQl0wjtXwdAFDNQq8xXYIGasOe+pnES6qEAFm6sB0/TfNwK142vSjuBqlvDbYdxxb8HrZNLTRUS0avfL9dZDTWDLh3GV0iXQnQK98JQKkVDxFHHce3W6XFK7up6DSWrUxdxW2qruTIN++fdl/7Ljj7MEw/DpPvx4PTk91P2CdvyFvynqTkEzklX8mITAglFbkmN+RXVEXX0c/oZhsa9Xaa16Rl0e9/xe/2Jw==</latexit>
o1:T
<latexit sha1_base64="jJ70daV3Lv6VV/gW3FR+Rb1BrxU=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTVfBhTgJeAZzR3NtC+sxSJrGRug3sivpsh/TH+jfVGM70JnkgtDh3HvuSyoqzoxNkr+96MHDR4+f7D3tP3v+4uWr/YPX50bVmuKYKq70ZQEGOZM4tsxyvKw0gig4XhTLr43/4gdqw5Q8s+sKcwFzyWaMgg1UlhXGrfzUpcenfro/SIbJxuK7IN2BAdnZaHrQ+5OVitYCpaUcjJmkSWVzB9oyytH3s9pgBXQJc5wEKEGgyd2maR+/D0wZz5QOR9p4w/6vcCCMWYsiRAqwC9P1NeR9vkltZ0e5Y7KqLUq6LTSreWxV3GwgLplGavk6AKCahV5jugAN1IY99TOJV1QJAbJ0YTt+kubhVrxselHcDVLfGmw7ji34PWybmmuoFoyufL9dZHSmGXDvMrpAuhSgl74TgFIrHiIOO47TW6XFld1UdBrLVqau4jZVV3Lom/dPu699F5x/HKafh8n3T4OTo91P2CNvyTvygaTkCzkh38iIjAklFbkmN+RXVEXX0c/oZhsa9XaaN6Rl0e9/zaX2Kg==</latexit>
x1:N
Apply TTS techniques
to MIDI-to-Audio
8
Methods
Wang, Y. et al. Tacotron: Towards End-to-End Speech Synthesis. in Proc. Interspeech 4006–4010 (2017).
Yasuda, Y., Wang, X., Takaki, S. & Yamagishi, J. Investigation of enhanced Tacotron text-to-speech synthesis systems with self-attention for pitch accent
language. in Proc. ICASSP 6905–6909 (2019).
q Acoustic model
1. TTS model: Tacotron (Wang 2017, Yasuda 2019)
Acoustic
model
Waveform
model
Wav
MIDI
API <latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit>
a1:N
<latexit sha1_base64="JElF7obLjklrepu+jo2EhDrW7Zk=">AAAC7nicbVLLahsxFJWnj6TuK2mX3Qw1hW5iZkpKQ1eBbroqLthJwDOYO5prW1iPQdI0MUK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRQ8ePnq8t/+k//TZ8xcvDw5fnRlVa4oTqrjSFwUY5EzixDLL8aLSCKLgeF6svjT+8x+oDVNybNcV5gIWks0ZBRuoLCuMU37m0s9jPzsYJMNkY/FdkO7AgOxsNDvs/clKRWuB0lIOxkzTpLK5A20Z5ej7WW2wArqCBU4DlCDQ5G7TtI/fBaaM50qHI228Yf9XOBDGrEURIgXYpen6GvI+37S285PcMVnVFiXdFprXPLYqbjYQl0wjtXwdAFDNQq8xXYIGasOe+pnES6qEAFm6sB0/TfNwK142vSjuBqlvDbYdxxb8HrZNLTRUS0avfL9dZDTWDLh3GV0iXQnQK98JQKkVDxFHHce3W6XFK7up6DSWrUxdxW2qruTIN++fdl/7Ljj7MEw/DpPvx4PTk91P2CdvyFvynqTkEzklX8mITAglFbkmN+RXVEXX0c/oZhsa9Xaa16Rl0e9/xe/2Jw==</latexit>
o1:T
<latexit sha1_base64="jJ70daV3Lv6VV/gW3FR+Rb1BrxU=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTVfBhTgJeAZzR3NtC+sxSJrGRug3sivpsh/TH+jfVGM70JnkgtDh3HvuSyoqzoxNkr+96MHDR4+f7D3tP3v+4uWr/YPX50bVmuKYKq70ZQEGOZM4tsxyvKw0gig4XhTLr43/4gdqw5Q8s+sKcwFzyWaMgg1UlhXGrfzUpcenfro/SIbJxuK7IN2BAdnZaHrQ+5OVitYCpaUcjJmkSWVzB9oyytH3s9pgBXQJc5wEKEGgyd2maR+/D0wZz5QOR9p4w/6vcCCMWYsiRAqwC9P1NeR9vkltZ0e5Y7KqLUq6LTSreWxV3GwgLplGavk6AKCahV5jugAN1IY99TOJV1QJAbJ0YTt+kubhVrxselHcDVLfGmw7ji34PWybmmuoFoyufL9dZHSmGXDvMrpAuhSgl74TgFIrHiIOO47TW6XFld1UdBrLVqau4jZVV3Lom/dPu699F5x/HKafh8n3T4OTo91P2CNvyTvygaTkCzkh38iIjAklFbkmN+RXVEXX0c/oZhsa9XaaN6Rl0e9/zaX2Kg==</latexit>
x1:N
Dense
CBH-
LSTM
Self
Attention
Attention
RNN
Additive
Attention
Forward
Attention
Concat
Concat
Stop
Token
Pre-Net
Decoder
RNN
Self
Attention
Sigmoid Linear Waveform
model
Post-Net
Wav
⊕
⊕
⊕
Encoder Decoder
• Taco2: 800-frame
segments; 4x
downsampling for
better alignments
• Taco3: Warm-
started from taco2;
input current piano-
roll frame at the
decoder pre-net
• Taco4: Warm-
started from taco2;
no downsampling or
piano-roll input to
pre-net
9
Methods
Wang, B. & Yang, Y.-H. PerformanceNet: Score-to-audio music generation with multi-band convolutional residual network. in Proceedings of the
AAAI Conference on Artificial Intelligence vol. 33 1174–1181 (2019).
q Acoustic model
2. Reference model from music field: PerformanceNet (Wang 2019)
Acoustic
model
Waveform
model
Wav
MIDI
API <latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit>
a1:N
<latexit sha1_base64="JElF7obLjklrepu+jo2EhDrW7Zk=">AAAC7nicbVLLahsxFJWnj6TuK2mX3Qw1hW5iZkpKQ1eBbroqLthJwDOYO5prW1iPQdI0MUK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRQ8ePnq8t/+k//TZ8xcvDw5fnRlVa4oTqrjSFwUY5EzixDLL8aLSCKLgeF6svjT+8x+oDVNybNcV5gIWks0ZBRuoLCuMU37m0s9jPzsYJMNkY/FdkO7AgOxsNDvs/clKRWuB0lIOxkzTpLK5A20Z5ej7WW2wArqCBU4DlCDQ5G7TtI/fBaaM50qHI228Yf9XOBDGrEURIgXYpen6GvI+37S285PcMVnVFiXdFprXPLYqbjYQl0wjtXwdAFDNQq8xXYIGasOe+pnES6qEAFm6sB0/TfNwK142vSjuBqlvDbYdxxb8HrZNLTRUS0avfL9dZDTWDLh3GV0iXQnQK98JQKkVDxFHHce3W6XFK7up6DSWrUxdxW2qruTIN++fdl/7Ljj7MEw/DpPvx4PTk91P2CdvyFvynqTkEzklX8mITAglFbkmN+RXVEXX0c/oZhsa9Xaa16Rl0e9/xe/2Jw==</latexit>
o1:T
<latexit sha1_base64="jJ70daV3Lv6VV/gW3FR+Rb1BrxU=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTVfBhTgJeAZzR3NtC+sxSJrGRug3sivpsh/TH+jfVGM70JnkgtDh3HvuSyoqzoxNkr+96MHDR4+f7D3tP3v+4uWr/YPX50bVmuKYKq70ZQEGOZM4tsxyvKw0gig4XhTLr43/4gdqw5Q8s+sKcwFzyWaMgg1UlhXGrfzUpcenfro/SIbJxuK7IN2BAdnZaHrQ+5OVitYCpaUcjJmkSWVzB9oyytH3s9pgBXQJc5wEKEGgyd2maR+/D0wZz5QOR9p4w/6vcCCMWYsiRAqwC9P1NeR9vkltZ0e5Y7KqLUq6LTSreWxV3GwgLplGavk6AKCahV5jugAN1IY99TOJV1QJAbJ0YTt+kubhVrxselHcDVLfGmw7ji34PWybmmuoFoyufL9dZHSmGXDvMrpAuhSgl74TgFIrHiIOO47TW6XFld1UdBrLVqau4jZVV3Lom/dPu699F5x/HKafh8n3T4OTo91P2CNvyTvygaTkCzkh38iIjAklFbkmN+RXVEXX0c/oZhsa9XaaN6Rl0e9/zaX2Kg==</latexit>
x1:N
Adopted from figure 3 in (Wang 2019)
<latexit sha1_base64="jJ70daV3Lv6VV/gW3FR+Rb1BrxU=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTVfBhTgJeAZzR3NtC+sxSJrGRug3sivpsh/TH+jfVGM70JnkgtDh3HvuSyoqzoxNkr+96MHDR4+f7D3tP3v+4uWr/YPX50bVmuKYKq70ZQEGOZM4tsxyvKw0gig4XhTLr43/4gdqw5Q8s+sKcwFzyWaMgg1UlhXGrfzUpcenfro/SIbJxuK7IN2BAdnZaHrQ+5OVitYCpaUcjJmkSWVzB9oyytH3s9pgBXQJc5wEKEGgyd2maR+/D0wZz5QOR9p4w/6vcCCMWYsiRAqwC9P1NeR9vkltZ0e5Y7KqLUq6LTSreWxV3GwgLplGavk6AKCahV5jugAN1IY99TOJV1QJAbJ0YTt+kubhVrxselHcDVLfGmw7ji34PWybmmuoFoyufL9dZHSmGXDvMrpAuhSgl74TgFIrHiIOO47TW6XFld1UdBrLVqau4jZVV3Lom/dPu699F5x/HKafh8n3T4OTo91P2CNvyTvygaTkCzkh38iIjAklFbkmN+RXVEXX0c/oZhsa9XaaN6Rl0e9/zaX2Kg==</latexit>
x1:N
<latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit>
a1:N
10
Methods
Librosa midi_to_hz: https://librosa.org/doc/0.7.0/generated/librosa.core.midi_to_hz.html
q Acoustic features
1. Mel-spectrogram
2. MIDI-filterbank-spectrogram
Acoustic
model
Waveform
model
Wav
MIDI
API <latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit>
a1:N
<latexit sha1_base64="JElF7obLjklrepu+jo2EhDrW7Zk=">AAAC7nicbVLLahsxFJWnj6TuK2mX3Qw1hW5iZkpKQ1eBbroqLthJwDOYO5prW1iPQdI0MUK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRQ8ePnq8t/+k//TZ8xcvDw5fnRlVa4oTqrjSFwUY5EzixDLL8aLSCKLgeF6svjT+8x+oDVNybNcV5gIWks0ZBRuoLCuMU37m0s9jPzsYJMNkY/FdkO7AgOxsNDvs/clKRWuB0lIOxkzTpLK5A20Z5ej7WW2wArqCBU4DlCDQ5G7TtI/fBaaM50qHI228Yf9XOBDGrEURIgXYpen6GvI+37S285PcMVnVFiXdFprXPLYqbjYQl0wjtXwdAFDNQq8xXYIGasOe+pnES6qEAFm6sB0/TfNwK142vSjuBqlvDbYdxxb8HrZNLTRUS0avfL9dZDTWDLh3GV0iXQnQK98JQKkVDxFHHce3W6XFK7up6DSWrUxdxW2qruTIN++fdl/7Ljj7MEw/DpPvx4PTk91P2CdvyFvynqTkEzklX8mITAglFbkmN+RXVEXX0c/oZhsa9Xaa16Rl0e9/xe/2Jw==</latexit>
o1:T
<latexit sha1_base64="jJ70daV3Lv6VV/gW3FR+Rb1BrxU=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTVfBhTgJeAZzR3NtC+sxSJrGRug3sivpsh/TH+jfVGM70JnkgtDh3HvuSyoqzoxNkr+96MHDR4+f7D3tP3v+4uWr/YPX50bVmuKYKq70ZQEGOZM4tsxyvKw0gig4XhTLr43/4gdqw5Q8s+sKcwFzyWaMgg1UlhXGrfzUpcenfro/SIbJxuK7IN2BAdnZaHrQ+5OVitYCpaUcjJmkSWVzB9oyytH3s9pgBXQJc5wEKEGgyd2maR+/D0wZz5QOR9p4w/6vcCCMWYsiRAqwC9P1NeR9vkltZ0e5Y7KqLUq6LTSreWxV3GwgLplGavk6AKCahV5jugAN1IY99TOJV1QJAbJ0YTt+kubhVrxselHcDVLfGmw7ji34PWybmmuoFoyufL9dZHSmGXDvMrpAuhSgl74TgFIrHiIOO47TW6XFld1UdBrLVqau4jZVV3Lom/dPu699F5x/HKafh8n3T4OTo91P2CNvyTvygaTkCzkh38iIjAklFbkmN+RXVEXX0c/oZhsa9XaaN6Rl0e9/zaX2Kg==</latexit>
x1:N
120 233 346 460
Frame index
16
32
48
64
80
96
112
128
Pinao
roll
index
(1-128)
Piano roll
Mel filter bank MIDI filter bank
E4
E6
E8
Frequency (0-12 kHz)
7
29
61
Mel
filter
index
(1-80)
E4
E6
E8
Frequency (0-12 kHz)
64
88
112
MIDI
filter
index
(1-128)
120 233 346 460
Frame index
7
29
61
Dimension
index
(1-80)
Mel-spectrum
120 233 346 460
Frame index
64
88
112
Dimension
index
(1-128)
MIDI-fb-spectrum
Mel-spec. MIDI-fb-spec.
<latexit sha1_base64="6X1VLiCNlG3P2EJtZGp3iOqzN64=">AAACl3icbZFbaxNBFMcn663GW6pPIsJgEOqDcTfEWh/EgiB9bMG0hewazszOJkPmssycbQ3LPvlpfNVP47dxcnlw2x4Y5s/vnMO5sVJJj3H8txPdun3n7r2d+90HDx89ftLbfXrqbeW4GHOrrDtn4IWSRoxRohLnpROgmRJnbPFl5T+7EM5La77hshSZhpmRheSAAU17Lwv6iQ6/13uLt/sf37yjybChKUotPB2N4mmvHw/itdHrItmKPtna8XS3w9Lc8koLg1yB95MkLjGrwaHkSjTdtPKiBL6AmZgEaSBUyur1HA19HUhOC+vCM0jX9P+MGrT3S81CpAac+6u+FbzJN6mwOMhqacoKheGbQkWlKFq6WgrNpRMc1TII4E6GXimfgwOOYXXd1IhLbrUGk9cp880kycJvVb7qxaq6nzStwTbjIFM30DaaOSjnkv9oU2btAiEUalFdKZTOXjbdcJXk6g2ui9PhIHk/iE9G/cOD7X12yAvyiuyRhHwgh+SIHJMx4eQn+UV+kz/R8+hz9DU62oRGnW3OM9Ky6OQfqXrMdw==</latexit>
f = 2(k 69)/12
⇥ 440
<latexit sha1_base64="86+t9jn0UkxzHxA2if7/PbZ7/JQ=">AAACeXicbZHNThtBDMcn2xbSUGhoj71sGyEhDtEuLSrHSL30CFIDSMkKeWa9ySjzsZrxlkarfYJe24fjWbh08nHoApas+etnW7bHvFTSU5Lcd6IXL1/t7HZf9/be7B+87R++u/K2cgLHwirrbjh4VNLgmCQpvCkdguYKr/ni2yp+/ROdl9b8oGWJmYaZkYUUQAFdFrf9QTJM1hY/FelWDNjWLm4PO3yaW1FpNCQUeD9Jk5KyGhxJobDpTSuPJYgFzHASpAGNPqvXkzbxUSB5XFgX3FC8pv9X1KC9X2oeMjXQ3D+OreBzsUlFxXlWS1NWhEZsGhWVisnGq7XjXDoUpJZBgHAyzBqLOTgQFD6nNzV4J6zWYPJ6yn0zSbPwWpWvZrGqHqRNa7HNOsTVM7SNZg7KuRS/2pRbuyAIjVpUV4qks3dNL1wlfXyDp+LqdJieDZPLL4PR+fY+XfaBfWLHLGVf2Yh9ZxdszARD9pv9YX87D9HH6Dg62aRGnW3Ne9ay6PM/WXrEyA==</latexit>
f
<latexit sha1_base64="JbTkpy7RgEzAoI7aVLnTtzC7Ygk=">AAACeXicbZHNThtBDMcn2xZoKBDaYy9LIyTEIdqFonJE4tIjSA0gJSvkmXWSUeZjNeMtjVb7BFzh4XgWLp18HFjAkjV//WzL9pgXSnpKkqdW9OHjp7X1jc/tzS9b2zud3a9X3pZOYF9YZd0NB49KGuyTJIU3hUPQXOE1n57P49d/0XlpzR+aFZhpGBs5kgIooMvpbaeb9JKFxW9FuhJdtrKL290WH+ZWlBoNCQXeD9KkoKwCR1IorNvD0mMBYgpjHARpQKPPqsWkdbwfSB6PrAtuKF7QlxUVaO9nmodMDTTxr2Nz+F5sUNLoNKukKUpCI5aNRqWKycbzteNcOhSkZkGAcDLMGosJOBAUPqc9NHgnrNZg8mrIfT1Is/Balc9nsarqpnVjseU6xNU7tInGDoqJFP+alFs7JQiNGlSXiqSzd3U7XCV9fYO34uqol570ksuf3bPT1X022Hf2gx2wlP1iZ+w3u2B9Jhiye/bAHlvP0V50EB0uU6PWquYba1h0/B9j48TN</latexit>
k
Librosa.midi_to_hz
11
Methods
Blank lines in MIDI-fb-spec. due to frequency resolution of FFT (see appendix)
q Acoustic features
1. Mel-spectrogram
2. MIDI-filterbank-spectrogram
Acoustic
model
Waveform
model
Wav
MIDI
API <latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit>
a1:N
<latexit sha1_base64="JElF7obLjklrepu+jo2EhDrW7Zk=">AAAC7nicbVLLahsxFJWnj6TuK2mX3Qw1hW5iZkpKQ1eBbroqLthJwDOYO5prW1iPQdI0MUK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRQ8ePnq8t/+k//TZ8xcvDw5fnRlVa4oTqrjSFwUY5EzixDLL8aLSCKLgeF6svjT+8x+oDVNybNcV5gIWks0ZBRuoLCuMU37m0s9jPzsYJMNkY/FdkO7AgOxsNDvs/clKRWuB0lIOxkzTpLK5A20Z5ej7WW2wArqCBU4DlCDQ5G7TtI/fBaaM50qHI228Yf9XOBDGrEURIgXYpen6GvI+37S285PcMVnVFiXdFprXPLYqbjYQl0wjtXwdAFDNQq8xXYIGasOe+pnES6qEAFm6sB0/TfNwK142vSjuBqlvDbYdxxb8HrZNLTRUS0avfL9dZDTWDLh3GV0iXQnQK98JQKkVDxFHHce3W6XFK7up6DSWrUxdxW2qruTIN++fdl/7Ljj7MEw/DpPvx4PTk91P2CdvyFvynqTkEzklX8mITAglFbkmN+RXVEXX0c/oZhsa9Xaa16Rl0e9/xe/2Jw==</latexit>
o1:T
<latexit sha1_base64="jJ70daV3Lv6VV/gW3FR+Rb1BrxU=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTVfBhTgJeAZzR3NtC+sxSJrGRug3sivpsh/TH+jfVGM70JnkgtDh3HvuSyoqzoxNkr+96MHDR4+f7D3tP3v+4uWr/YPX50bVmuKYKq70ZQEGOZM4tsxyvKw0gig4XhTLr43/4gdqw5Q8s+sKcwFzyWaMgg1UlhXGrfzUpcenfro/SIbJxuK7IN2BAdnZaHrQ+5OVitYCpaUcjJmkSWVzB9oyytH3s9pgBXQJc5wEKEGgyd2maR+/D0wZz5QOR9p4w/6vcCCMWYsiRAqwC9P1NeR9vkltZ0e5Y7KqLUq6LTSreWxV3GwgLplGavk6AKCahV5jugAN1IY99TOJV1QJAbJ0YTt+kubhVrxselHcDVLfGmw7ji34PWybmmuoFoyufL9dZHSmGXDvMrpAuhSgl74TgFIrHiIOO47TW6XFld1UdBrLVqau4jZVV3Lom/dPu699F5x/HKafh8n3T4OTo91P2CNvyTvygaTkCzkh38iIjAklFbkmN+RXVEXX0c/oZhsa9XaaN6Rl0e9/zaX2Kg==</latexit>
x1:N
120 233 346 460
Frame index
16
32
48
64
80
96
112
128
Pinao
roll
index
(1-128)
Piano roll
Mel filter bank MIDI filter bank
E4
E6
E8
Frequency (0-12 kHz)
7
29
61
Mel
filter
index
(1-80)
E4
E6
E8
Frequency (0-12 kHz)
64
88
112
MIDI
filter
index
(1-128)
120 233 346 460
Frame index
7
29
61
Dimension
index
(1-80)
Mel-spectrum
120 233 346 460
Frame index
64
88
112
Dimension
index
(1-128)
MIDI-fb-spectrum
Mel-spec. MIDI-fb-spec.
12
Methods
Zhao, Y., Wang, X., Juvela, L. & Yamagishi, J. Transferring neural speech waveform synthesizers to musical instrument sounds generation. in
Proc. ICASSP 6269–6273 (IEEE, 2020). doi:10.1109/ICASSP40776.2020.9053047
q Waveform model
§ Based on music neural source-filter (NSF) model (Zhao 2020) but
• No harmonic-plus-noise structure
Acoustic
model
Waveform
model
Wav
MIDI
API <latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit>
a1:N
<latexit sha1_base64="JElF7obLjklrepu+jo2EhDrW7Zk=">AAAC7nicbVLLahsxFJWnj6TuK2mX3Qw1hW5iZkpKQ1eBbroqLthJwDOYO5prW1iPQdI0MUK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRQ8ePnq8t/+k//TZ8xcvDw5fnRlVa4oTqrjSFwUY5EzixDLL8aLSCKLgeF6svjT+8x+oDVNybNcV5gIWks0ZBRuoLCuMU37m0s9jPzsYJMNkY/FdkO7AgOxsNDvs/clKRWuB0lIOxkzTpLK5A20Z5ej7WW2wArqCBU4DlCDQ5G7TtI/fBaaM50qHI228Yf9XOBDGrEURIgXYpen6GvI+37S285PcMVnVFiXdFprXPLYqbjYQl0wjtXwdAFDNQq8xXYIGasOe+pnES6qEAFm6sB0/TfNwK142vSjuBqlvDbYdxxb8HrZNLTRUS0avfL9dZDTWDLh3GV0iXQnQK98JQKkVDxFHHce3W6XFK7up6DSWrUxdxW2qruTIN++fdl/7Ljj7MEw/DpPvx4PTk91P2CdvyFvynqTkEzklX8mITAglFbkmN+RXVEXX0c/oZhsa9Xaa16Rl0e9/xe/2Jw==</latexit>
o1:T
<latexit sha1_base64="jJ70daV3Lv6VV/gW3FR+Rb1BrxU=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTVfBhTgJeAZzR3NtC+sxSJrGRug3sivpsh/TH+jfVGM70JnkgtDh3HvuSyoqzoxNkr+96MHDR4+f7D3tP3v+4uWr/YPX50bVmuKYKq70ZQEGOZM4tsxyvKw0gig4XhTLr43/4gdqw5Q8s+sKcwFzyWaMgg1UlhXGrfzUpcenfro/SIbJxuK7IN2BAdnZaHrQ+5OVitYCpaUcjJmkSWVzB9oyytH3s9pgBXQJc5wEKEGgyd2maR+/D0wZz5QOR9p4w/6vcCCMWYsiRAqwC9P1NeR9vkltZ0e5Y7KqLUq6LTSreWxV3GwgLplGavk6AKCahV5jugAN1IY99TOJV1QJAbJ0YTt+kubhVrxselHcDVLfGmw7ji34PWybmmuoFoyufL9dZHSmGXDvMrpAuhSgl74TgFIrHiIOO47TW6XFld1UdBrLVqau4jZVV3Lom/dPu699F5x/HKafh8n3T4OTo91P2CNvyTvygaTkCzkh38iIjAklFbkmN+RXVEXX0c/oZhsa9XaaN6Rl0e9/zaX2Kg==</latexit>
x1:N
Bi-LSTM 1D CNN Up-sampling
Acoustic
features
Excitation
signal
Block
5
Wav
…
FC
Dilated
1D conv
Dilated
1D conv FC
Block 1
…
Condition module
Neural filter module
13
Experiments
Hawthorne, C. et al. Enabling factorized piano music modeling and generation with the MAESTRO dataset. in Proc. ICLR (2018).
https://piano-e-competition.com/
q Database: MAESTRO v2.0 (Hawthorne 2018)
§ Real piano performances in International Piano-e-Competition
§ MIDI was recorded simultaneously during performance
• Aligned audio & piano roll
§ For experiments:
• Follow official data split
• 24kHz, 16 bits PCM
Data split of MAESTRO
From https://magenta.tensorflow.org/datasets/maestro#v200
14
Experiments
https://www.fluidsynth.org/
https://www.modartt.com/pianoteq
https://github.com/bwang514/PerformanceNet
q Models in comparison
Software
<latexit sha1_base64="QvEzsRPQOsJO2uPs95Rt4gD1HG8=">AAAM5XiclVbdbuNEFPYuC9mGAFu45MaiYrUg2bId56d3i1i6RNotQaVppLqqxpNJYtX2BM9428jyI3CHkOCGV+EW3oC34cxMksY/SYVTy9PxOd985ztzzthfhAHjlvXvo8fvPXn/g8bTg+aHrY8+/uTZ4acjRtMEk3NMQ5qMfcRIGMTknAc8JONFQlDkh+TCv/lWvL94RxIW0PgnvlyQqwjN4mAaYMRh6vqw0fZ8MgvijCM/DVGSZ3h15U2P00WShqTpRWnIg4TeZk6efZ1nZ0vGSaQPXuX6c7380mNzmnDGEb7JvsE0ZTzAuufpEZ2QMP8/HlOCeJqQB32+u8MBN4UHC2YxeniRC/SOmHWcQM80ioUHzrNhwPFcjwIWIRg8hPn2hzOB+CIiKP4qzz2v6WGRlKxr9PKmXnM9lz/xB3dMOYEHBsCJWAnco2Ai1T8VKqAQZo0dt2tabsHFY0s2HozKcbl5FkL2ULQIiSF2zUR/O3g1MDg1UDoJqM7olN8ikFyw6piOJZ5ds9cTFNtmtytXEeA7sIfzJYO9FRpS2X3ortl3FHpHrJK5ptPJi0Egn8kYRIgwGRhTH0YMRIXH6dmJFE5poPj1exJA+A0qfjEN2Max6Nfb8ts4ktBgC2wWV9z2c8yes/Greu1Zr29XsjVSK3uc3HHOoRoxdfJdcR+AbiBY186Vgm1X0jnurdMzeghsm9yBzAbkVmF1JUW7v0m1XoJq7+aloHp9BeX2FdTxGmow3g9VZdU/Xu2RdglqXInQ3SsXQK0DPFZi3Qc4KEfoPkQLlFYROgqrcy88TAxPTgnfrZFUqWN2ZGg903ZVaNZGpVEtRoHFGkRJ3TMdlTXrnsg2htyUpLyXNyAq9cBERGWLulwzGQx2wJTIKC7uiktHidLeypWUeA20BhEl8fqN5AEIHdNtq7bYheTIKdvsOpVK2VSoUS5NSEx71VNcS1Vof5Plba9SaYKfu+p0Vl+6dRV3n3JOI7UyiSebM/L62ZFlWvLSqwN7NTjSVtfw+vDJ796E4jQiMcchYuzSthb8KkMJ9M+QwFGbMrKAYwTNyCUMYxQRdpXJ4z3Xv0xFp57SBO6Y63J22yNDEWPLyAdLOKrmrPxOTNa9u0z5tH+VBfEi5STGaqFpGuqc6uJbQZ8ECcE8XMIA4SQQxzKeowRhDl8UTS8mt5hGEQJlPJ/ll/YVPGk4EVxomB3ZeSEwFQ73w5rZ4tQsQYt5gO+Ksz6lN5ABVpxdH8l5s0hINGU4oVdVDf8ZYv9C5gWpsmHZcurvtKwBhU1YZzqqgtZYiq15byjLHZY2YjatsyybAuBO0y3bxTQmfJcAsuVUbOslEJ2lDrY+sFGVwk4RRtt85Ym1j8Og3nqHvuOibXsPcq1pPey4RMHdB1vm6+4FruZO9Ezj9Zta662MnIRpAPUXQ8HXWG4ZDgMkPjl/zkUzs8utqzoYOabdMa0f3aOX/VVbe6p9rn2hvdBsrae91L7Xhtq5hht/NP5q/N34pzVr/dL6tfWbMn38aOXzmVa4Wn/+B5czF8c=</latexit>
System ID
Acoustic
model
Acoustic
feature
Excit.
signal
Wave.
model
Pitch mismatch MOS
(mean)
note chord
Natural - - - - - - 4.04
Fluidsynth Sample-based MIDI-to-audio software 5.20 6.77 3.66
Pianoteq Physical-model MIDI-to-audio software 4.82 6.50 4.25
abs-mfb-sin - midi-fb sine NSF - - 3.87
abs-mfb-noi - midi-fb noise NSF - - 3.77
abs-mel-sin - mel-spc. sine NSF - - 2.72
abs-mel-noi - mel-spc. noise NSF - - 3.81
taco2-mfb-sin taco2 midi-fb sine NSF 4.61 6.34 2.97
taco2-mfb-noi taco2 midi-fb noise NSF 4.66 6.36 3.18
taco3-mfb-sin taco3 midi-fb sine NSF 4.78 6.48 3.19
taco3-mfb-noi taco3 midi-fb noise NSF 4.89 6.53 3.19
taco4-mfb-sin taco4 midi-fb sine NSF 4.86 6.39 2.98
taco4-mfb-noi taco4 midi-fb noise NSF 4.97 6.42 2.95
pfnet-mfb-sin PFNet midi-fb sine NSF 5.59 7.14 3.10
pfnet-mfb-noi PFNet midi-fb noise NSF 5.78 7.26 3.05
pfnet-mel-sin PFNet mel-spec. sine NSF 5.66 7.17 1.82
pfnet-mel-noi PFNet mel-spec. noise NSF 5.74 7.25 2.93
pfnet-spec-GL PFNet spec. - GL 5.43 6.98 1.62
midi-sin-nsf - - sine NSF 4.32 6.40 2.88
midi-noi-nsf - - noise NSF 4.40 6.08 2.63
Software
Original PerformanceNet
Open-source
code
15
Experiments
https://www.fluidsynth.org/
https://www.modartt.com/pianoteq
q Models in comparison <latexit sha1_base64="QvEzsRPQOsJO2uPs95Rt4gD1HG8=">AAAM5XiclVbdbuNEFPYuC9mGAFu45MaiYrUg2bId56d3i1i6RNotQaVppLqqxpNJYtX2BM9428jyI3CHkOCGV+EW3oC34cxMksY/SYVTy9PxOd985ztzzthfhAHjlvXvo8fvPXn/g8bTg+aHrY8+/uTZ4acjRtMEk3NMQ5qMfcRIGMTknAc8JONFQlDkh+TCv/lWvL94RxIW0PgnvlyQqwjN4mAaYMRh6vqw0fZ8MgvijCM/DVGSZ3h15U2P00WShqTpRWnIg4TeZk6efZ1nZ0vGSaQPXuX6c7380mNzmnDGEb7JvsE0ZTzAuufpEZ2QMP8/HlOCeJqQB32+u8MBN4UHC2YxeniRC/SOmHWcQM80ioUHzrNhwPFcjwIWIRg8hPn2hzOB+CIiKP4qzz2v6WGRlKxr9PKmXnM9lz/xB3dMOYEHBsCJWAnco2Ai1T8VKqAQZo0dt2tabsHFY0s2HozKcbl5FkL2ULQIiSF2zUR/O3g1MDg1UDoJqM7olN8ikFyw6piOJZ5ds9cTFNtmtytXEeA7sIfzJYO9FRpS2X3ortl3FHpHrJK5ptPJi0Egn8kYRIgwGRhTH0YMRIXH6dmJFE5poPj1exJA+A0qfjEN2Max6Nfb8ts4ktBgC2wWV9z2c8yes/Greu1Zr29XsjVSK3uc3HHOoRoxdfJdcR+AbiBY186Vgm1X0jnurdMzeghsm9yBzAbkVmF1JUW7v0m1XoJq7+aloHp9BeX2FdTxGmow3g9VZdU/Xu2RdglqXInQ3SsXQK0DPFZi3Qc4KEfoPkQLlFYROgqrcy88TAxPTgnfrZFUqWN2ZGg903ZVaNZGpVEtRoHFGkRJ3TMdlTXrnsg2htyUpLyXNyAq9cBERGWLulwzGQx2wJTIKC7uiktHidLeypWUeA20BhEl8fqN5AEIHdNtq7bYheTIKdvsOpVK2VSoUS5NSEx71VNcS1Vof5Plba9SaYKfu+p0Vl+6dRV3n3JOI7UyiSebM/L62ZFlWvLSqwN7NTjSVtfw+vDJ796E4jQiMcchYuzSthb8KkMJ9M+QwFGbMrKAYwTNyCUMYxQRdpXJ4z3Xv0xFp57SBO6Y63J22yNDEWPLyAdLOKrmrPxOTNa9u0z5tH+VBfEi5STGaqFpGuqc6uJbQZ8ECcE8XMIA4SQQxzKeowRhDl8UTS8mt5hGEQJlPJ/ll/YVPGk4EVxomB3ZeSEwFQ73w5rZ4tQsQYt5gO+Ksz6lN5ABVpxdH8l5s0hINGU4oVdVDf8ZYv9C5gWpsmHZcurvtKwBhU1YZzqqgtZYiq15byjLHZY2YjatsyybAuBO0y3bxTQmfJcAsuVUbOslEJ2lDrY+sFGVwk4RRtt85Ym1j8Og3nqHvuOibXsPcq1pPey4RMHdB1vm6+4FruZO9Ezj9Zta662MnIRpAPUXQ8HXWG4ZDgMkPjl/zkUzs8utqzoYOabdMa0f3aOX/VVbe6p9rn2hvdBsrae91L7Xhtq5hht/NP5q/N34pzVr/dL6tfWbMn38aOXzmVa4Wn/+B5czF8c=</latexit>
System ID
Acoustic
model
Acoustic
feature
Excit.
signal
Wave.
model
Pitch mismatch MOS
(mean)
note chord
Natural - - - - - - 4.04
Fluidsynth Sample-based MIDI-to-audio software 5.20 6.77 3.66
Pianoteq Physical-model MIDI-to-audio software 4.82 6.50 4.25
abs-mfb-sin - midi-fb sine NSF - - 3.87
abs-mfb-noi - midi-fb noise NSF - - 3.77
abs-mel-sin - mel-spc. sine NSF - - 2.72
abs-mel-noi - mel-spc. noise NSF - - 3.81
taco2-mfb-sin taco2 midi-fb sine NSF 4.61 6.34 2.97
taco2-mfb-noi taco2 midi-fb noise NSF 4.66 6.36 3.18
taco3-mfb-sin taco3 midi-fb sine NSF 4.78 6.48 3.19
taco3-mfb-noi taco3 midi-fb noise NSF 4.89 6.53 3.19
taco4-mfb-sin taco4 midi-fb sine NSF 4.86 6.39 2.98
taco4-mfb-noi taco4 midi-fb noise NSF 4.97 6.42 2.95
pfnet-mfb-sin PFNet midi-fb sine NSF 5.59 7.14 3.10
pfnet-mfb-noi PFNet midi-fb noise NSF 5.78 7.26 3.05
pfnet-mel-sin PFNet mel-spec. sine NSF 5.66 7.17 1.82
pfnet-mel-noi PFNet mel-spec. noise NSF 5.74 7.25 2.93
pfnet-spec-GL PFNet spec. - GL 5.43 6.98 1.62
midi-sin-nsf - - sine NSF 4.32 6.40 2.88
midi-noi-nsf - - noise NSF 4.40 6.08 2.63
Copy-synthesis
MIDI-to-audio
Wav
NSF
MIDI
API
Sine
Wav
NSF
Noise
Natural
Natural
Wav
Acoustic
model
NSF
MIDI
API
Sine
Wav
Acoustic
model
NSF
MIDI
API
Noise
16
Experiments
q Listening test: crowd-sourcing, 224 amateur participants
Natural
Fluidsynth
Pianoteq
abs-mfb-sin
abs-mfb-noi
abs-mel-sin
abs-mel-noi
taco2-mfb-sin
taco2-mfb-noi
taco3-mfb-sin
taco3-mfb-noi
taco4-mfb-sin
taco4-mfb-noi
pfnet-mfb-sin
pfnet-mfb-noi
pfnet-mel-sin
pfnet-mel-noi
pfnet-spec-GL
midi-sin-nsf
midi-noi-nsf
1
2
3
4
5
Quality
(MOS)
MIDI spectrogram Mel spectrogram Other / N/A
17
Experiments
Mann-whitey-U, Holm-Boferroni correction: “*” statistical significance at alpha=0.05, “-” otherwise
q Listening test: crowd-sourcing, 224 amateur participants
Natural
Fluidsynth
Pianoteq
abs-mfb-sin
abs-mfb-noi
abs-mel-sin
abs-mel-noi
taco2-mfb-sin
taco2-mfb-noi
taco3-mfb-sin
taco3-mfb-noi
taco4-mfb-sin
taco4-mfb-noi
pfnet-mfb-sin
pfnet-mfb-noi
pfnet-mel-sin
pfnet-mel-noi
pfnet-spec-GL
midi-sin-nsf
midi-noi-nsf
1
2
3
4
5
Quality
(MOS)
MIDI spectrogram Mel spectrogram Other / N/A
*
-
<latexit sha1_base64="QvEzsRPQOsJO2uPs95Rt4gD1HG8=">AAAM5XiclVbdbuNEFPYuC9mGAFu45MaiYrUg2bId56d3i1i6RNotQaVppLqqxpNJYtX2BM9428jyI3CHkOCGV+EW3oC34cxMksY/SYVTy9PxOd985ztzzthfhAHjlvXvo8fvPXn/g8bTg+aHrY8+/uTZ4acjRtMEk3NMQ5qMfcRIGMTknAc8JONFQlDkh+TCv/lWvL94RxIW0PgnvlyQqwjN4mAaYMRh6vqw0fZ8MgvijCM/DVGSZ3h15U2P00WShqTpRWnIg4TeZk6efZ1nZ0vGSaQPXuX6c7380mNzmnDGEb7JvsE0ZTzAuufpEZ2QMP8/HlOCeJqQB32+u8MBN4UHC2YxeniRC/SOmHWcQM80ioUHzrNhwPFcjwIWIRg8hPn2hzOB+CIiKP4qzz2v6WGRlKxr9PKmXnM9lz/xB3dMOYEHBsCJWAnco2Ai1T8VKqAQZo0dt2tabsHFY0s2HozKcbl5FkL2ULQIiSF2zUR/O3g1MDg1UDoJqM7olN8ikFyw6piOJZ5ds9cTFNtmtytXEeA7sIfzJYO9FRpS2X3ortl3FHpHrJK5ptPJi0Egn8kYRIgwGRhTH0YMRIXH6dmJFE5poPj1exJA+A0qfjEN2Max6Nfb8ts4ktBgC2wWV9z2c8yes/Greu1Zr29XsjVSK3uc3HHOoRoxdfJdcR+AbiBY186Vgm1X0jnurdMzeghsm9yBzAbkVmF1JUW7v0m1XoJq7+aloHp9BeX2FdTxGmow3g9VZdU/Xu2RdglqXInQ3SsXQK0DPFZi3Qc4KEfoPkQLlFYROgqrcy88TAxPTgnfrZFUqWN2ZGg903ZVaNZGpVEtRoHFGkRJ3TMdlTXrnsg2htyUpLyXNyAq9cBERGWLulwzGQx2wJTIKC7uiktHidLeypWUeA20BhEl8fqN5AEIHdNtq7bYheTIKdvsOpVK2VSoUS5NSEx71VNcS1Vof5Plba9SaYKfu+p0Vl+6dRV3n3JOI7UyiSebM/L62ZFlWvLSqwN7NTjSVtfw+vDJ796E4jQiMcchYuzSthb8KkMJ9M+QwFGbMrKAYwTNyCUMYxQRdpXJ4z3Xv0xFp57SBO6Y63J22yNDEWPLyAdLOKrmrPxOTNa9u0z5tH+VBfEi5STGaqFpGuqc6uJbQZ8ECcE8XMIA4SQQxzKeowRhDl8UTS8mt5hGEQJlPJ/ll/YVPGk4EVxomB3ZeSEwFQ73w5rZ4tQsQYt5gO+Ksz6lN5ABVpxdH8l5s0hINGU4oVdVDf8ZYv9C5gWpsmHZcurvtKwBhU1YZzqqgtZYiq15byjLHZY2YjatsyybAuBO0y3bxTQmfJcAsuVUbOslEJ2lDrY+sFGVwk4RRtt85Ym1j8Og3nqHvuOibXsPcq1pPey4RMHdB1vm6+4FruZO9Ezj9Zta662MnIRpAPUXQ8HXWG4ZDgMkPjl/zkUzs8utqzoYOabdMa0f3aOX/VVbe6p9rn2hvdBsrae91L7Xhtq5hht/NP5q/N34pzVr/dL6tfWbMn38aOXzmVa4Wn/+B5czF8c=</latexit>
System ID
Acoustic
model
Acoustic
feature
Excit.
signal
Wave.
model
Pitch mismatch MOS
(mean
note chord
Natural - - - - - - 4.04
Fluidsynth Sample-based MIDI-to-audio software 5.20 6.77 3.66
Pianoteq Physical-model MIDI-to-audio software 4.82 6.50 4.25
abs-mfb-sin - midi-fb sine NSF - - 3.87
abs-mfb-noi - midi-fb noise NSF - - 3.77
abs-mel-sin - mel-spc. sine NSF - - 2.72
abs-mel-noi - mel-spc. noise NSF - - 3.81
taco2-mfb-sin taco2 midi-fb sine NSF 4.61 6.34 2.97
taco2-mfb-noi taco2 midi-fb noise NSF 4.66 6.36 3.18
taco3-mfb-sin taco3 midi-fb sine NSF 4.78 6.48 3.19
taco3-mfb-noi taco3 midi-fb noise NSF 4.89 6.53 3.19
taco4-mfb-sin taco4 midi-fb sine NSF 4.86 6.39 2.98
taco4-mfb-noi taco4 midi-fb noise NSF 4.97 6.42 2.95
pfnet-mfb-sin PFNet midi-fb sine NSF 5.59 7.14 3.10
pfnet-mfb-noi PFNet midi-fb noise NSF 5.78 7.26 3.05
pfnet-mel-sin PFNet mel-spec. sine NSF 5.66 7.17 1.82
pfnet-mel-noi PFNet mel-spec. noise NSF 5.74 7.25 2.93
pfnet-spec-GL PFNet spec. - GL 5.43 6.98 1.62
18
Experiments
Mann-whitey-U, Holm-Boferroni correction: “*” statistical significance at alpha=0.05, “-” otherwise
q Listening test: crowd-sourcing, 224 amateur participants
§ Neural waveform model works well in copy-synthesis condition
Natural
Fluidsynth
Pianoteq
abs-mfb-sin
abs-mfb-noi
abs-mel-sin
abs-mel-noi
taco2-mfb-sin
taco2-mfb-noi
taco3-mfb-sin
taco3-mfb-noi
taco4-mfb-sin
taco4-mfb-noi
pfnet-mfb-sin
pfnet-mfb-noi
pfnet-mel-sin
pfnet-mel-noi
pfnet-spec-GL
midi-sin-nsf
midi-noi-nsf
1
2
3
4
5
Quality
(MOS)
MIDI spectrogram Mel spectrogram Other / N/A
<latexit sha1_base64="QvEzsRPQOsJO2uPs95Rt4gD1HG8=">AAAM5XiclVbdbuNEFPYuC9mGAFu45MaiYrUg2bId56d3i1i6RNotQaVppLqqxpNJYtX2BM9428jyI3CHkOCGV+EW3oC34cxMksY/SYVTy9PxOd985ztzzthfhAHjlvXvo8fvPXn/g8bTg+aHrY8+/uTZ4acjRtMEk3NMQ5qMfcRIGMTknAc8JONFQlDkh+TCv/lWvL94RxIW0PgnvlyQqwjN4mAaYMRh6vqw0fZ8MgvijCM/DVGSZ3h15U2P00WShqTpRWnIg4TeZk6efZ1nZ0vGSaQPXuX6c7380mNzmnDGEb7JvsE0ZTzAuufpEZ2QMP8/HlOCeJqQB32+u8MBN4UHC2YxeniRC/SOmHWcQM80ioUHzrNhwPFcjwIWIRg8hPn2hzOB+CIiKP4qzz2v6WGRlKxr9PKmXnM9lz/xB3dMOYEHBsCJWAnco2Ai1T8VKqAQZo0dt2tabsHFY0s2HozKcbl5FkL2ULQIiSF2zUR/O3g1MDg1UDoJqM7olN8ikFyw6piOJZ5ds9cTFNtmtytXEeA7sIfzJYO9FRpS2X3ortl3FHpHrJK5ptPJi0Egn8kYRIgwGRhTH0YMRIXH6dmJFE5poPj1exJA+A0qfjEN2Max6Nfb8ts4ktBgC2wWV9z2c8yes/Greu1Zr29XsjVSK3uc3HHOoRoxdfJdcR+AbiBY186Vgm1X0jnurdMzeghsm9yBzAbkVmF1JUW7v0m1XoJq7+aloHp9BeX2FdTxGmow3g9VZdU/Xu2RdglqXInQ3SsXQK0DPFZi3Qc4KEfoPkQLlFYROgqrcy88TAxPTgnfrZFUqWN2ZGg903ZVaNZGpVEtRoHFGkRJ3TMdlTXrnsg2htyUpLyXNyAq9cBERGWLulwzGQx2wJTIKC7uiktHidLeypWUeA20BhEl8fqN5AEIHdNtq7bYheTIKdvsOpVK2VSoUS5NSEx71VNcS1Vof5Plba9SaYKfu+p0Vl+6dRV3n3JOI7UyiSebM/L62ZFlWvLSqwN7NTjSVtfw+vDJ796E4jQiMcchYuzSthb8KkMJ9M+QwFGbMrKAYwTNyCUMYxQRdpXJ4z3Xv0xFp57SBO6Y63J22yNDEWPLyAdLOKrmrPxOTNa9u0z5tH+VBfEi5STGaqFpGuqc6uJbQZ8ECcE8XMIA4SQQxzKeowRhDl8UTS8mt5hGEQJlPJ/ll/YVPGk4EVxomB3ZeSEwFQ73w5rZ4tQsQYt5gO+Ksz6lN5ABVpxdH8l5s0hINGU4oVdVDf8ZYv9C5gWpsmHZcurvtKwBhU1YZzqqgtZYiq15byjLHZY2YjatsyybAuBO0y3bxTQmfJcAsuVUbOslEJ2lDrY+sFGVwk4RRtt85Ym1j8Og3nqHvuOibXsPcq1pPey4RMHdB1vm6+4FruZO9Ezj9Zta662MnIRpAPUXQ8HXWG4ZDgMkPjl/zkUzs8utqzoYOabdMa0f3aOX/VVbe6p9rn2hvdBsrae91L7Xhtq5hht/NP5q/N34pzVr/dL6tfWbMn38aOXzmVa4Wn/+B5czF8c=</latexit>
System ID
Acoustic
model
Acoustic
feature
Excit.
signal
Wave.
model
Pitch mismatch MOS
(mean
note chord
Natural - - - - - - 4.04
Fluidsynth Sample-based MIDI-to-audio software 5.20 6.77 3.66
Pianoteq Physical-model MIDI-to-audio software 4.82 6.50 4.25
abs-mfb-sin - midi-fb sine NSF - - 3.87
abs-mfb-noi - midi-fb noise NSF - - 3.77
abs-mel-sin - mel-spc. sine NSF - - 2.72
abs-mel-noi - mel-spc. noise NSF - - 3.81
taco2-mfb-sin taco2 midi-fb sine NSF 4.61 6.34 2.97
taco2-mfb-noi taco2 midi-fb noise NSF 4.66 6.36 3.18
taco3-mfb-sin taco3 midi-fb sine NSF 4.78 6.48 3.19
taco3-mfb-noi taco3 midi-fb noise NSF 4.89 6.53 3.19
taco4-mfb-sin taco4 midi-fb sine NSF 4.86 6.39 2.98
taco4-mfb-noi taco4 midi-fb noise NSF 4.97 6.42 2.95
pfnet-mfb-sin PFNet midi-fb sine NSF 5.59 7.14 3.10
pfnet-mfb-noi PFNet midi-fb noise NSF 5.78 7.26 3.05
pfnet-mel-sin PFNet mel-spec. sine NSF 5.66 7.17 1.82
pfnet-mel-noi PFNet mel-spec. noise NSF 5.74 7.25 2.93
pfnet-spec-GL PFNet spec. - GL 5.43 6.98 1.62
*
*
19
Experiments
q Listening test: crowd-sourcing, 224 amateur participants
Natural
Fluidsynth
Pianoteq
abs-mfb-sin
abs-mfb-noi
abs-mel-sin
abs-mel-noi
taco2-mfb-sin
taco2-mfb-noi
taco3-mfb-sin
taco3-mfb-noi
taco4-mfb-sin
taco4-mfb-noi
pfnet-mfb-sin
pfnet-mfb-noi
pfnet-mel-sin
pfnet-mel-noi
pfnet-spec-GL
midi-sin-nsf
midi-noi-nsf
1
2
3
4
5
Quality
(MOS)
MIDI spectrogram Mel spectrogram Other / N/A
<latexit sha1_base64="QvEzsRPQOsJO2uPs95Rt4gD1HG8=">AAAM5XiclVbdbuNEFPYuC9mGAFu45MaiYrUg2bId56d3i1i6RNotQaVppLqqxpNJYtX2BM9428jyI3CHkOCGV+EW3oC34cxMksY/SYVTy9PxOd985ztzzthfhAHjlvXvo8fvPXn/g8bTg+aHrY8+/uTZ4acjRtMEk3NMQ5qMfcRIGMTknAc8JONFQlDkh+TCv/lWvL94RxIW0PgnvlyQqwjN4mAaYMRh6vqw0fZ8MgvijCM/DVGSZ3h15U2P00WShqTpRWnIg4TeZk6efZ1nZ0vGSaQPXuX6c7380mNzmnDGEb7JvsE0ZTzAuufpEZ2QMP8/HlOCeJqQB32+u8MBN4UHC2YxeniRC/SOmHWcQM80ioUHzrNhwPFcjwIWIRg8hPn2hzOB+CIiKP4qzz2v6WGRlKxr9PKmXnM9lz/xB3dMOYEHBsCJWAnco2Ai1T8VKqAQZo0dt2tabsHFY0s2HozKcbl5FkL2ULQIiSF2zUR/O3g1MDg1UDoJqM7olN8ikFyw6piOJZ5ds9cTFNtmtytXEeA7sIfzJYO9FRpS2X3ortl3FHpHrJK5ptPJi0Egn8kYRIgwGRhTH0YMRIXH6dmJFE5poPj1exJA+A0qfjEN2Max6Nfb8ts4ktBgC2wWV9z2c8yes/Greu1Zr29XsjVSK3uc3HHOoRoxdfJdcR+AbiBY186Vgm1X0jnurdMzeghsm9yBzAbkVmF1JUW7v0m1XoJq7+aloHp9BeX2FdTxGmow3g9VZdU/Xu2RdglqXInQ3SsXQK0DPFZi3Qc4KEfoPkQLlFYROgqrcy88TAxPTgnfrZFUqWN2ZGg903ZVaNZGpVEtRoHFGkRJ3TMdlTXrnsg2htyUpLyXNyAq9cBERGWLulwzGQx2wJTIKC7uiktHidLeypWUeA20BhEl8fqN5AEIHdNtq7bYheTIKdvsOpVK2VSoUS5NSEx71VNcS1Vof5Plba9SaYKfu+p0Vl+6dRV3n3JOI7UyiSebM/L62ZFlWvLSqwN7NTjSVtfw+vDJ796E4jQiMcchYuzSthb8KkMJ9M+QwFGbMrKAYwTNyCUMYxQRdpXJ4z3Xv0xFp57SBO6Y63J22yNDEWPLyAdLOKrmrPxOTNa9u0z5tH+VBfEi5STGaqFpGuqc6uJbQZ8ECcE8XMIA4SQQxzKeowRhDl8UTS8mt5hGEQJlPJ/ll/YVPGk4EVxomB3ZeSEwFQ73w5rZ4tQsQYt5gO+Ksz6lN5ABVpxdH8l5s0hINGU4oVdVDf8ZYv9C5gWpsmHZcurvtKwBhU1YZzqqgtZYiq15byjLHZY2YjatsyybAuBO0y3bxTQmfJcAsuVUbOslEJ2lDrY+sFGVwk4RRtt85Ym1j8Og3nqHvuOibXsPcq1pPey4RMHdB1vm6+4FruZO9Ezj9Zta662MnIRpAPUXQ8HXWG4ZDgMkPjl/zkUzs8utqzoYOabdMa0f3aOX/VVbe6p9rn2hvdBsrae91L7Xhtq5hht/NP5q/N34pzVr/dL6tfWbMn38aOXzmVa4Wn/+B5czF8c=</latexit>
System ID
Acoustic
model
Acoustic
feature
Excit.
signal
Wave.
model
P
n
Natural - - - -
Fluidsynth Sample-based MIDI-to-audio software 5
Pianoteq Physical-model MIDI-to-audio software 4
abs-mfb-sin - midi-fb sine NSF
abs-mfb-noi - midi-fb noise NSF
abs-mel-sin - mel-spc. sine NSF
abs-mel-noi - mel-spc. noise NSF
taco2-mfb-sin taco2 midi-fb sine NSF 4
taco2-mfb-noi taco2 midi-fb noise NSF 4
taco3-mfb-sin taco3 midi-fb sine NSF 4
taco3-mfb-noi taco3 midi-fb noise NSF 4
taco4-mfb-sin taco4 midi-fb sine NSF 4
taco4-mfb-noi taco4 midi-fb noise NSF 4
pfnet-mfb-sin PFNet midi-fb sine NSF 5
pfnet-mfb-noi PFNet midi-fb noise NSF 5
pfnet-mel-sin PFNet mel-spec. sine NSF 5
pfnet-mel-noi PFNet mel-spec. noise NSF 5
pfnet-spec-GL PFNet spec. - GL 5
• Taco2: 800-frame segments; 4x downsampling for better alignments
• Taco3: Warm-started from taco2; input current piano-roll frame at the decoder pre-net
• Taco4: Warm-started from taco2; no downsampling or piano-roll input to pre-net
*
*
20
Experiments
q Listening test: crowd-sourcing, 224 amateur participants
q Pitch distortion
§ Natural audio as target
§ Evaluated on single notes
§ The lower the better
Natural
Fluidsynth
Pianoteq
abs-mfb-sin
abs-mfb-noi
abs-mel-sin
abs-mel-noi
taco2-mfb-sin
taco2-mfb-noi
taco3-mfb-sin
taco3-mfb-noi
taco4-mfb-sin
taco4-mfb-noi
pfnet-mfb-sin
pfnet-mfb-noi
pfnet-mel-sin
pfnet-mel-noi
pfnet-spec-GL
midi-sin-nsf
midi-noi-nsf
1
2
3
4
5
Quality
(MOS)
MIDI spectrogram Mel spectrogram Other / N/A
4
4.2
4.4
4.6
4.8
5
5.2
5.4
5.6
5.8
6
21
Messages
q TTS & MIDI-to-audio
§ Techniques can be shared: acoustic model, waveform model
§ Performance bottleneck on acoustic model (and waveform model)
q On waveform modeling
§ Physical-model performs well but lacks reverberation effect
§ Sample-based model replies on the sample database
§ Non-AR waveform model is OK in copy-synthesis
• Reverberation is captured
• Noise excitation is OK
22
Messages
q On acoustic model
§ Obtaining good alignments for longer input sequences is challenging
§ Inputting the piano-roll frame to the decoder prenet helps improve alignments
• Acceptable for perfectly-aligned performance-MIDI
• Have to consider other strategies for non-aligned score-MIDI
23
Thank you!
Tacotron code: https://github.com/nii-yamagishilab/self-attention-tacotron
NSF code: https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts
Samples: https://nii-yamagishilab.github.io/samples-xin/main-midi2audio.html
25
Appendix
q On MIDI filterbank
Center of MIDI filter bank
Mel filter banks
26
Appendix
q On MIDI filterbank
Center of MIDI filter bank
Mel filter banks
27
Appendix
q On MIDI filterbank
Mel-spectrogram MIDI-centered filter-bank CQT
28
Appendix
q On MIDI filterbank
Natural
Mel-
spectrogram
to wave
MIDI-
centered
filter-bank
CQT
Short
clip 1
Short
clip 2
Short
clip 3
Short
clip 4
29
Appendix
https://www.fluidsynth.org/
https://www.modartt.com/pianoteq
q Models in comparison
<latexit sha1_base64="QvEzsRPQOsJO2uPs95Rt4gD1HG8=">AAAM5XiclVbdbuNEFPYuC9mGAFu45MaiYrUg2bId56d3i1i6RNotQaVppLqqxpNJYtX2BM9428jyI3CHkOCGV+EW3oC34cxMksY/SYVTy9PxOd985ztzzthfhAHjlvXvo8fvPXn/g8bTg+aHrY8+/uTZ4acjRtMEk3NMQ5qMfcRIGMTknAc8JONFQlDkh+TCv/lWvL94RxIW0PgnvlyQqwjN4mAaYMRh6vqw0fZ8MgvijCM/DVGSZ3h15U2P00WShqTpRWnIg4TeZk6efZ1nZ0vGSaQPXuX6c7380mNzmnDGEb7JvsE0ZTzAuufpEZ2QMP8/HlOCeJqQB32+u8MBN4UHC2YxeniRC/SOmHWcQM80ioUHzrNhwPFcjwIWIRg8hPn2hzOB+CIiKP4qzz2v6WGRlKxr9PKmXnM9lz/xB3dMOYEHBsCJWAnco2Ai1T8VKqAQZo0dt2tabsHFY0s2HozKcbl5FkL2ULQIiSF2zUR/O3g1MDg1UDoJqM7olN8ikFyw6piOJZ5ds9cTFNtmtytXEeA7sIfzJYO9FRpS2X3ortl3FHpHrJK5ptPJi0Egn8kYRIgwGRhTH0YMRIXH6dmJFE5poPj1exJA+A0qfjEN2Max6Nfb8ts4ktBgC2wWV9z2c8yes/Greu1Zr29XsjVSK3uc3HHOoRoxdfJdcR+AbiBY186Vgm1X0jnurdMzeghsm9yBzAbkVmF1JUW7v0m1XoJq7+aloHp9BeX2FdTxGmow3g9VZdU/Xu2RdglqXInQ3SsXQK0DPFZi3Qc4KEfoPkQLlFYROgqrcy88TAxPTgnfrZFUqWN2ZGg903ZVaNZGpVEtRoHFGkRJ3TMdlTXrnsg2htyUpLyXNyAq9cBERGWLulwzGQx2wJTIKC7uiktHidLeypWUeA20BhEl8fqN5AEIHdNtq7bYheTIKdvsOpVK2VSoUS5NSEx71VNcS1Vof5Plba9SaYKfu+p0Vl+6dRV3n3JOI7UyiSebM/L62ZFlWvLSqwN7NTjSVtfw+vDJ796E4jQiMcchYuzSthb8KkMJ9M+QwFGbMrKAYwTNyCUMYxQRdpXJ4z3Xv0xFp57SBO6Y63J22yNDEWPLyAdLOKrmrPxOTNa9u0z5tH+VBfEi5STGaqFpGuqc6uJbQZ8ECcE8XMIA4SQQxzKeowRhDl8UTS8mt5hGEQJlPJ/ll/YVPGk4EVxomB3ZeSEwFQ73w5rZ4tQsQYt5gO+Ksz6lN5ABVpxdH8l5s0hINGU4oVdVDf8ZYv9C5gWpsmHZcurvtKwBhU1YZzqqgtZYiq15byjLHZY2YjatsyybAuBO0y3bxTQmfJcAsuVUbOslEJ2lDrY+sFGVwk4RRtt85Ym1j8Og3nqHvuOibXsPcq1pPey4RMHdB1vm6+4FruZO9Ezj9Zta662MnIRpAPUXQ8HXWG4ZDgMkPjl/zkUzs8utqzoYOabdMa0f3aOX/VVbe6p9rn2hvdBsrae91L7Xhtq5hht/NP5q/N34pzVr/dL6tfWbMn38aOXzmVa4Wn/+B5czF8c=</latexit>
System ID
Acoustic
model
Acoustic
feature
Excit.
signal
Wave.
model
Pitch mismatch MOS
(mean)
note chord
Natural - - - - - - 4.04
Fluidsynth Sample-based MIDI-to-audio software 5.20 6.77 3.66
Pianoteq Physical-model MIDI-to-audio software 4.82 6.50 4.25
abs-mfb-sin - midi-fb sine NSF - - 3.87
abs-mfb-noi - midi-fb noise NSF - - 3.77
abs-mel-sin - mel-spc. sine NSF - - 2.72
abs-mel-noi - mel-spc. noise NSF - - 3.81
taco2-mfb-sin taco2 midi-fb sine NSF 4.61 6.34 2.97
taco2-mfb-noi taco2 midi-fb noise NSF 4.66 6.36 3.18
taco3-mfb-sin taco3 midi-fb sine NSF 4.78 6.48 3.19
taco3-mfb-noi taco3 midi-fb noise NSF 4.89 6.53 3.19
taco4-mfb-sin taco4 midi-fb sine NSF 4.86 6.39 2.98
taco4-mfb-noi taco4 midi-fb noise NSF 4.97 6.42 2.95
pfnet-mfb-sin PFNet midi-fb sine NSF 5.59 7.14 3.10
pfnet-mfb-noi PFNet midi-fb noise NSF 5.78 7.26 3.05
pfnet-mel-sin PFNet mel-spec. sine NSF 5.66 7.17 1.82
pfnet-mel-noi PFNet mel-spec. noise NSF 5.74 7.25 2.93
pfnet-spec-GL PFNet spec. - GL 5.43 6.98 1.62
midi-sin-nsf - - sine NSF 4.32 6.40 2.88
midi-noi-nsf - - noise NSF 4.40 6.08 2.63
MIDI-to-audio
Wav
Acoustic
model
NSF
MIDI
API
Sine
Wav
Acoustic
model
NSF
MIDI
API
Noise
MIDI-to-audio
Wav
NSF
MIDI
API
Sine
Wav
NSF
MIDI
API
Noise
30
Appendix
https://www.fluidsynth.org/
https://www.modartt.com/pianoteq
<latexit sha1_base64="QvEzsRPQOsJO2uPs95Rt4gD1HG8=">AAAM5XiclVbdbuNEFPYuC9mGAFu45MaiYrUg2bId56d3i1i6RNotQaVppLqqxpNJYtX2BM9428jyI3CHkOCGV+EW3oC34cxMksY/SYVTy9PxOd985ztzzthfhAHjlvXvo8fvPXn/g8bTg+aHrY8+/uTZ4acjRtMEk3NMQ5qMfcRIGMTknAc8JONFQlDkh+TCv/lWvL94RxIW0PgnvlyQqwjN4mAaYMRh6vqw0fZ8MgvijCM/DVGSZ3h15U2P00WShqTpRWnIg4TeZk6efZ1nZ0vGSaQPXuX6c7380mNzmnDGEb7JvsE0ZTzAuufpEZ2QMP8/HlOCeJqQB32+u8MBN4UHC2YxeniRC/SOmHWcQM80ioUHzrNhwPFcjwIWIRg8hPn2hzOB+CIiKP4qzz2v6WGRlKxr9PKmXnM9lz/xB3dMOYEHBsCJWAnco2Ai1T8VKqAQZo0dt2tabsHFY0s2HozKcbl5FkL2ULQIiSF2zUR/O3g1MDg1UDoJqM7olN8ikFyw6piOJZ5ds9cTFNtmtytXEeA7sIfzJYO9FRpS2X3ortl3FHpHrJK5ptPJi0Egn8kYRIgwGRhTH0YMRIXH6dmJFE5poPj1exJA+A0qfjEN2Max6Nfb8ts4ktBgC2wWV9z2c8yes/Greu1Zr29XsjVSK3uc3HHOoRoxdfJdcR+AbiBY186Vgm1X0jnurdMzeghsm9yBzAbkVmF1JUW7v0m1XoJq7+aloHp9BeX2FdTxGmow3g9VZdU/Xu2RdglqXInQ3SsXQK0DPFZi3Qc4KEfoPkQLlFYROgqrcy88TAxPTgnfrZFUqWN2ZGg903ZVaNZGpVEtRoHFGkRJ3TMdlTXrnsg2htyUpLyXNyAq9cBERGWLulwzGQx2wJTIKC7uiktHidLeypWUeA20BhEl8fqN5AEIHdNtq7bYheTIKdvsOpVK2VSoUS5NSEx71VNcS1Vof5Plba9SaYKfu+p0Vl+6dRV3n3JOI7UyiSebM/L62ZFlWvLSqwN7NTjSVtfw+vDJ796E4jQiMcchYuzSthb8KkMJ9M+QwFGbMrKAYwTNyCUMYxQRdpXJ4z3Xv0xFp57SBO6Y63J22yNDEWPLyAdLOKrmrPxOTNa9u0z5tH+VBfEi5STGaqFpGuqc6uJbQZ8ECcE8XMIA4SQQxzKeowRhDl8UTS8mt5hGEQJlPJ/ll/YVPGk4EVxomB3ZeSEwFQ73w5rZ4tQsQYt5gO+Ksz6lN5ABVpxdH8l5s0hINGU4oVdVDf8ZYv9C5gWpsmHZcurvtKwBhU1YZzqqgtZYiq15byjLHZY2YjatsyybAuBO0y3bxTQmfJcAsuVUbOslEJ2lDrY+sFGVwk4RRtt85Ym1j8Og3nqHvuOibXsPcq1pPey4RMHdB1vm6+4FruZO9Ezj9Zta662MnIRpAPUXQ8HXWG4ZDgMkPjl/zkUzs8utqzoYOabdMa0f3aOX/VVbe6p9rn2hvdBsrae91L7Xhtq5hht/NP5q/N34pzVr/dL6tfWbMn38aOXzmVa4Wn/+B5czF8c=</latexit>
System ID
Acoustic
model
Acoustic
feature
Excit.
signal
Wave.
model
Pitch mismatch MOS
(mean)
note chord
Natural - - - - - - 4.04
Fluidsynth Sample-based MIDI-to-audio software 5.20 6.77 3.66
Pianoteq Physical-model MIDI-to-audio software 4.82 6.50 4.25
abs-mfb-sin - midi-fb sine NSF - - 3.87
abs-mfb-noi - midi-fb noise NSF - - 3.77
abs-mel-sin - mel-spc. sine NSF - - 2.72
abs-mel-noi - mel-spc. noise NSF - - 3.81
taco2-mfb-sin taco2 midi-fb sine NSF 4.61 6.34 2.97
taco2-mfb-noi taco2 midi-fb noise NSF 4.66 6.36 3.18
taco3-mfb-sin taco3 midi-fb sine NSF 4.78 6.48 3.19
taco3-mfb-noi taco3 midi-fb noise NSF 4.89 6.53 3.19
taco4-mfb-sin taco4 midi-fb sine NSF 4.86 6.39 2.98
taco4-mfb-noi taco4 midi-fb noise NSF 4.97 6.42 2.95
pfnet-mfb-sin PFNet midi-fb sine NSF 5.59 7.14 3.10
pfnet-mfb-noi PFNet midi-fb noise NSF 5.78 7.26 3.05
pfnet-mel-sin PFNet mel-spec. sine NSF 5.66 7.17 1.82
pfnet-mel-noi PFNet mel-spec. noise NSF 5.74 7.25 2.93
pfnet-spec-GL PFNet spec. - GL 5.43 6.98 1.62
midi-sin-nsf - - sine NSF 4.32 6.40 2.88
midi-noi-nsf - - noise NSF 4.40 6.08 2.63
q Objective test
§ Lower the better
31
Appendix
q Significance test
§ Grey block: significant difference
§ White block: otherwise
Natural
Fluidsynth
Pianoteq
abs-mfb-sin
abs-mfb-noi
abs-mel-sin
abs-mel-noi
taco2-mfb-sin
taco2-mfb-noi
taco3-mfb-sin
taco3-mfb-noi
taco4-mfb-sin
taco4-mfb-noi
pfnet-mfb-sin
pfnet-mfb-noi
pfnet-mel-sin
pfnet-mel-noi
pfnet-spec-GL
midi-sin-nsf
midi-noi-nsf
Natural
Fluidsynth
Pianoteq
abs-mfb-sin
abs-mfb-noi
abs-mel-sin
abs-mel-noi
taco2-mfb-sin
taco2-mfb-noi
taco3-mfb-sin
taco3-mfb-noi
taco4-mfb-sin
taco4-mfb-noi
pfnet-mfb-sin
pfnet-mfb-noi
pfnet-mel-sin
pfnet-mel-noi
pfnet-spec-GL
midi-sin-nsf
midi-noi-nsf
32
Appendix
Caetano, Marcelo, and Xavier Rodet. "A source-filter model for musical instrument sound transformation." ICASSP. IEEE, 2012.
Klapuri, Anssi, Tuomas Virtanen, and Toni Heittola. "Sound source separation in monaural music signals using excitation-filter model and em algorithm."
ICASSP. IEEE, 2010.
q On music audio and speech waveform
33
Appendix
Caetano, Marcelo, and Xavier Rodet. "A source-filter model for musical instrument sound transformation." ICASSP. IEEE, 2012.
Klapuri, Anssi, Tuomas Virtanen, and Toni Heittola. "Sound source separation in monaural music signals using excitation-filter model and em algorithm."
ICASSP. IEEE, 2010.
q On music audio and speech waveform
34
Appendix
q On music audio and speech waveform

More Related Content

More from Yamagishi Laboratory, National Institute of Informatics, Japan

エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 2)~
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 2)~エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 2)~
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 2)~Yamagishi Laboratory, National Institute of Informatics, Japan
 
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 1)~
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 1)~エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 1)~
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 1)~Yamagishi Laboratory, National Institute of Informatics, Japan
 

More from Yamagishi Laboratory, National Institute of Informatics, Japan (10)

Attention Back-end for Automatic Speaker Verification with Multiple Enrollmen...
Attention Back-end for Automatic Speaker Verification with Multiple Enrollmen...Attention Back-end for Automatic Speaker Verification with Multiple Enrollmen...
Attention Back-end for Automatic Speaker Verification with Multiple Enrollmen...
 
How do Voices from Past Speech Synthesis Challenges Compare Today?
 How do Voices from Past Speech Synthesis Challenges Compare Today? How do Voices from Past Speech Synthesis Challenges Compare Today?
How do Voices from Past Speech Synthesis Challenges Compare Today?
 
Preliminary study on using vector quantization latent spaces for TTS/VC syste...
Preliminary study on using vector quantization latent spaces for TTS/VC syste...Preliminary study on using vector quantization latent spaces for TTS/VC syste...
Preliminary study on using vector quantization latent spaces for TTS/VC syste...
 
Advancements in Neural Vocoders
Advancements in Neural VocodersAdvancements in Neural Vocoders
Advancements in Neural Vocoders
 
Neural Waveform Modeling
Neural Waveform Modeling Neural Waveform Modeling
Neural Waveform Modeling
 
Neural source-filter waveform model
Neural source-filter waveform modelNeural source-filter waveform model
Neural source-filter waveform model
 
Tutorial on end-to-end text-to-speech synthesis: Part 2 – Tactron and related...
Tutorial on end-to-end text-to-speech synthesis: Part 2 – Tactron and related...Tutorial on end-to-end text-to-speech synthesis: Part 2 – Tactron and related...
Tutorial on end-to-end text-to-speech synthesis: Part 2 – Tactron and related...
 
Tutorial on end-to-end text-to-speech synthesis: Part 1 – Neural waveform mod...
Tutorial on end-to-end text-to-speech synthesis: Part 1 – Neural waveform mod...Tutorial on end-to-end text-to-speech synthesis: Part 1 – Neural waveform mod...
Tutorial on end-to-end text-to-speech synthesis: Part 1 – Neural waveform mod...
 
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 2)~
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 2)~エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 2)~
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 2)~
 
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 1)~
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 1)~エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 1)~
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 1)~
 

Recently uploaded

Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxRemote DBA Services
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 

Recently uploaded (20)

Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 

Text-to-Speech Synthesis Techniques for MIDI-to-Audio Synthesis

  • 1. Text-to-Speech Synthesis Techniques for MIDI-to-Audio Synthesis Erica Cooper, Xin Wang , Junichi Yamagishi National Institute of Informatics, Japan
  • 2. 2 Motivation .txt Speech recognition Music transcription Language modeling <latexit sha1_base64="/CH3PPEHB2uw/n9QUXfzgmYucMQ=">AAAChnicbZHNbtNAEMc3BkoIUFp65GIRVSqXyEaN2mMkLhyLRNpKsVvNrifJKvth7Y5pI8vvwRXeirdhneRQtx1pd//6zYxmZoeXSnpKkn+96MXLV3uv+28Gb9+93/9wcPjx0tvKCZwKq6y75uBRSYNTkqTwunQImiu84qtvrf/qFzovrflJ6xJzDQsj51IABXRTnmSE91S3V/Pl9mCYjJKNxU9FuhNDtrOL28MezworKo2GhALvZ2lSUl6DIykUNoOs8liCWMECZ0Ea0OjzetN2Ex8HUsRz68IxFG/ow4watPdrzUOkBlr6x74WPuebVTQ/z2tpyorQiG2heaVisnH7B3EhHQpS6yBAOBl6jcUSHAgKPzXIDN4JqzWYos64b2ZpHl6rirYXq+ph2nQG245DXD1Du2jhoFxKcd+l3NoVQSjUobpSJJ29awZhK+njHTwVl19H6XiU/DgdTs53++mzT+wzO2EpO2MT9p1dsCkTzLHf7A/7G/WjUTSOzrahUW+Xc8Q6Fk3+Ax6ryYQ=</latexit> p(text) <latexit sha1_base64="EAl9tgxIUnBZZ5Qql+WkJlo34lI=">AAAClnicbZHNahsxEMflbdqm7pfTXgLNYYkppBezG1qSUwmU0hxdiJOAdzGSdmwL6wtpNonZ7qVPk2v6Nn2bam0fukkGJP35zQwzo2FWCo9J8rcTPdl6+uz59ovuy1ev37zt7bw796Z0HEbcSOMuGfUghYYRCpRwaR1QxSRcsMW3xn9xBc4Lo89waSFXdKbFVHCKAU16e/YgQ7jBqrnqX2vtLQCf158mvX4ySFYWPxTpRvTJxoaTnQ7LCsNLBRq5pN6P08RiXlGHgkuou1npwVK+oDMYB6mpAp9XqzHq+GMgRTw1LhyN8Yr+n1FR5f1SsRCpKM79fV8DH/ONS5we55XQtkTQfF1oWsoYTdz8SVwIBxzlMgjKnQi9xnxOHeUYfq6babjmRimqiypjvh6neXiNLJpejKz6ad0abD0OMvkIbaOZo3Yu+E2bMmMWSEOhFlWlROHMdd0NW0nv7+ChOD8cpF8Gyc/P/ZPjzX62yQeyTw5ISo7ICTklQzIinPwmt+SO/Il2o6/R9+jHOjTqbHLek5ZFw3/MD9BO</latexit> p(text|speech) Music language modeling <latexit sha1_base64="pHfXu2ZEYiI4q/eUfRR0x2p1KvU=">AAACiXicbZFLT9tAEMc3hhYa2hLKsReLqBK9RHbVqhEnJC49gtQAUmxFs+tJsso+rN1xIbL8SbjSD9Vvw+ZxqIGRVvvXb2Y0L14q6SlJ/nWind03b/f233UP3n/4eNg7+nTtbeUEjoRV1t1y8KikwRFJUnhbOgTNFd7wxcXKf/MHnZfW/KZlibmGmZFTKYACmvQOy9OM8J5qYwl983XS6yeDZG3xS5FuRZ9t7XJy1OFZYUWl0ZBQ4P04TUrKa3AkhcKmm1UeSxALmOE4SAMafV6vO2/iL4EU8dS68AzFa/p/Rg3a+6XmIVIDzf1z3wq+5htXNB3mtTRlRWjEptC0UjHZeLWGuJAOBallECCcDL3GYg4OBIVldTODd8JqDaaoM+6bcZqH36pi1YtVdT9tWoNtxiGuXqFtNHNQzqW4b1Nu7YIgFGpRXSmSzt413XCV9PkNXorrb4P0xyC5+t4/H27vs88+sxN2ylL2k52zX+ySjZhgFXtgj+xvdBCl0TA624RGnW3OMWtZdPEEo3HKIw==</latexit> p(notes) <latexit sha1_base64="i8J00zLt+oTu0HMgV0D/FkfOs04=">AAAClnicbZFLaxsxEMfl7St1X05zKbSHpaaQXsxuaElOJVBCenShTgLexYy0Y1tYj0WaTWK2e+mn6bX9Nv02kR+HbpIBob9+M8PMaHippKck+deJHjx89PjJztPus+cvXr7q7b4+87ZyAkfCKusuOHhU0uCIJCm8KB2C5grP+eLryn9+ic5La37QssRcw8zIqRRAAU1678r9jPCaamMJffNz84CqkLb5OOn1k0GytviuSLeiz7Y2nOx2eFZYUWk0JBR4P06TkvIaHEmhsOlmlccSxAJmOA7SgEaf1+sxmvhDIEU8tS4cQ/Ga/p9Rg/Z+qXmI1EBzf9u3gvf5xhVNj/JamrIiNGJTaFqpmGy8+pO4kA4FqWUQIJwMvcZiDg4EhZ/rZgavhNUaTFFn3DfjNA+3VcWqF6vqftq0BtuMQ1zdQ9to5qCcS3HdptzaBUEo1KK6UiSdvWq6YSvp7R3cFWcHg/TzIPn+qX98tN3PDnvL3rN9lrJDdsy+sSEbMcF+sd/sD/sbvYm+RCfR6SY06mxz9ljLouENxwjQTA==</latexit> p(notes|audio) TTS <latexit sha1_base64="AdtZH2eD4foaVT6LJ+S3wTTUAPQ=">AAAClnicbZHNahsxEMflbdqm7pfTXgLNYYkppBezG1qSUwmU0hxdiJOAdzGSdmwL6wtpNonZ7qVPk2v6Nn2bam0fukkGJP31mxlGo2FWCo9J8rcTPdl6+uz59ovuy1ev37zt7bw796Z0HEbcSOMuGfUghYYRCpRwaR1QxSRcsMW3xn9xBc4Lo89waSFXdKbFVHCKAU16e/YgQ7jBylsAPq9/rW/NVn+a9PrJIFlZ/FCkG9EnGxtOdjosKwwvFWjkkno/ThOLeUUdCi6h7malB0v5gs5gHKSmCnxerdqo44+BFPHUuLA0xiv6f0ZFlfdLxUKkojj3930NfMw3LnF6nFdC2xJB83WhaSljNHHzJ3EhHHCUyyAodyK8NeZz6ijH8HPdTMM1N0pRXVQZ8/U4zcNpZNG8xciqn9atxtbtIJOP0DaaOWrngt+0KTNmgTQUalFVShTOXNfdMJX0/gweivPDQfplkPz83D853sxnm3wg++SApOSInJBTMiQjwslvckvuyJ9oN/oafY9+rEOjzibnPWlZNPwHyb/QTg==</latexit> p(speech|text) <latexit sha1_base64="W5X5+Q/S5r7xUsdooj/ITPBr7TI=">AAAClnicbZFLaxsxEMfl7St1X05zKbSHpaaQXsxuaElOJVBCenShTgLexYy0Y1tYj0WaTWK2e+mn6bX9Nv02kR+HbpIBob9+M8PMaHippKck+deJHjx89PjJztPus+cvXr7q7b4+87ZyAkfCKusuOHhU0uCIJCm8KB2C5grP+eLryn9+ic5La37QssRcw8zIqRRAAU1678r9jPCaaqgKaZufm4exhL75OOn1k0GytviuSLeiz7Y2nOx2eFZYUWk0JBR4P06TkvIaHEmhsOlmlccSxAJmOA7SgEaf1+sxmvhDIEU8tS4cQ/Ga/p9Rg/Z+qXmI1EBzf9u3gvf5xhVNj/JamrIiNGJTaFqpmGy8+pO4kA4FqWUQIJwMvcZiDg4EhZ/rZgavhNUaTFFn3DfjNA+3VcWqF6vqftq0BtuMQ1zdQ9to5qCcS3HdptzaBUEo1KK6UiSdvWq6YSvp7R3cFWcHg/TzIPn+qX98tN3PDnvL3rN9lrJDdsy+sSEbMcF+sd/sD/sbvYm+RCfR6SY06mxz9ljLouENxd3QTA==</latexit> p(audio|notes) Voice conversion <latexit sha1_base64="qCnWH7hHL5prGMhtqwPIklqtHAQ=">AAACnHicbZFLSxxBEMd7RxN1zWNNjgEZXARzWWYkokchOQRCQCG7CjvD0t1Tu9tsv+iuUZfJ3P00uepXybdJ7+OQUQua/vevqqiqLmal8Jgkf1vRxuar11vbO+3dN2/fve/sfRh4UzoOfW6kcdeMepBCQx8FSri2DqhiEq7Y7OvCf3UDzgujf+HcQq7oRIux4BQDGnUO7FGGcIeVtwB8Wo/S38338edRp5v0kqXFz0W6Fl2ytovRXotlheGlAo1cUu+HaWIxr6hDwSXU7az0YCmf0QkMg9RUgc+r5TB1fBhIEY+NC0djvKT/Z1RUeT9XLEQqilP/1LeAL/mGJY7P8kpoWyJovio0LmWMJl78TFwIBxzlPAjKnQi9xnxKHeUY/q+dabjlRimqiypjvh6mebiNLBa9GFl107ox2GocZPIF2kQTR+1U8LsmZcbMkIZCDapKicKZ27odtpI+3cFzMTjupSe95PJL9/xsvZ9t8okckCOSklNyTr6TC9InnNyTP+SBPEb70bfoR/RzFRq11jkfScOiwT/CitJe</latexit> p(speech1|speech2) <latexit sha1_base64="0cEBiVQeRCFX5q0A6TnvZPw0BTo=">AAACmnicbZFLbxMxEMed5VXCK4UjHCwipHKJdisQPVZwAXEpUtNWyq6isXeSWPFjZc/SRste+TRc4bvwbXAeB7btSJb//s2MZsYjKq0CpenfXnLn7r37D/Ye9h89fvL02WD/+VlwtZc4lk47fyEgoFYWx6RI40XlEYzQeC6Wn9b+8+/og3L2lFYVFgbmVs2UBIpoOuDVQU54RQ3UpXLtNPvReR6+nQ6G6SjdGL8psp0Ysp2dTPd7Ii+drA1akhpCmGRpRUUDnpTU2PbzOmAFcglznERpwWAoms0oLX8TSclnzsdjiW/o/xkNmBBWRsRIA7QI131reJtvUtPsqGiUrWpCK7eFZrXm5Pj6X3ipPErSqyhAehV75XIBHiTF3+vnFi+lMwZs2eQitJOsiLfT5boXp5th1nYG245DQt9Cu2juoVooedWlwrklQSzUoabWpLy7bPtxK9n1HdwUZ4ej7P0o/fZueHy0288ee8leswOWsQ/smH1mJ2zMJPvJfrHf7E/yKvmYfEm+bkOT3i7nBetYcvoPs03Rfg==</latexit> p(audio1|audio2) Timbre conversion MIDI-to-audio
  • 3. 3 TTS & MIDI-to-Audio Synthesis 0 100 200 300 400 500 Frame index 256 513 Frequency bins Acoustic model Waveform model Wav MIDI Piano roll Acoustic features MIDI API Acoustic model Waveform model Context vectors Acoustic features Front end Wav .txt <latexit sha1_base64="AisTKh5Vue6nwtVsE4yuaaGZKEs=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTTctLsRJwDOYO5prW1iPQdI0NkK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRffuP3j4aO9x/8nTZ89f7B+8PDOq1hTHVHGlLwowyJnEsWWW40WlEUTB8bxYfm785z9QG6bkqV1XmAuYSzZjFGygsqwwbuWnLj3+6qf7g2SYbCy+DdIdGJCdjaYHvT9ZqWgtUFrKwZhJmlQ2d6Atoxx9P6sNVkCXMMdJgBIEmtxtmvbx28CU8UzpcKSNN+z/CgfCmLUoQqQAuzBdX0Pe5ZvUdnaUOyar2qKk20KzmsdWxc0G4pJppJavAwCqWeg1pgvQQG3YUz+TeEmVECBLF7bjJ2kebsXLphfF3SD1rcG249iC38G2qbmGasHoyvfbRUanmgH3LqMLpEsBeuk7ASi14iHisOP4dqO0uLKbik5j2crUVdyk6koOffP+afe1b4Oz98P04zD5/mFwcrT7CXvkNXlD3pGUfCIn5AsZkTGhpCJX5Jr8iqroKvoZXW9Do95O84q0LPr9D8sl9ik=</latexit> x1:M <latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit> a1:N <latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit> a1:N <latexit sha1_base64="JElF7obLjklrepu+jo2EhDrW7Zk=">AAAC7nicbVLLahsxFJWnj6TuK2mX3Qw1hW5iZkpKQ1eBbroqLthJwDOYO5prW1iPQdI0MUK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRQ8ePnq8t/+k//TZ8xcvDw5fnRlVa4oTqrjSFwUY5EzixDLL8aLSCKLgeF6svjT+8x+oDVNybNcV5gIWks0ZBRuoLCuMU37m0s9jPzsYJMNkY/FdkO7AgOxsNDvs/clKRWuB0lIOxkzTpLK5A20Z5ej7WW2wArqCBU4DlCDQ5G7TtI/fBaaM50qHI228Yf9XOBDGrEURIgXYpen6GvI+37S285PcMVnVFiXdFprXPLYqbjYQl0wjtXwdAFDNQq8xXYIGasOe+pnES6qEAFm6sB0/TfNwK142vSjuBqlvDbYdxxb8HrZNLTRUS0avfL9dZDTWDLh3GV0iXQnQK98JQKkVDxFHHce3W6XFK7up6DSWrUxdxW2qruTIN++fdl/7Ljj7MEw/DpPvx4PTk91P2CdvyFvynqTkEzklX8mITAglFbkmN+RXVEXX0c/oZhsa9Xaa16Rl0e9/xe/2Jw==</latexit> o1:T <latexit sha1_base64="JElF7obLjklrepu+jo2EhDrW7Zk=">AAAC7nicbVLLahsxFJWnj6TuK2mX3Qw1hW5iZkpKQ1eBbroqLthJwDOYO5prW1iPQdI0MUK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRQ8ePnq8t/+k//TZ8xcvDw5fnRlVa4oTqrjSFwUY5EzixDLL8aLSCKLgeF6svjT+8x+oDVNybNcV5gIWks0ZBRuoLCuMU37m0s9jPzsYJMNkY/FdkO7AgOxsNDvs/clKRWuB0lIOxkzTpLK5A20Z5ej7WW2wArqCBU4DlCDQ5G7TtI/fBaaM50qHI228Yf9XOBDGrEURIgXYpen6GvI+37S285PcMVnVFiXdFprXPLYqbjYQl0wjtXwdAFDNQq8xXYIGasOe+pnES6qEAFm6sB0/TfNwK142vSjuBqlvDbYdxxb8HrZNLTRUS0avfL9dZDTWDLh3GV0iXQnQK98JQKkVDxFHHce3W6XFK7up6DSWrUxdxW2qruTIN++fdl/7Ljj7MEw/DpPvx4PTk91P2CdvyFvynqTkEzklX8mITAglFbkmN+RXVEXX0c/oZhsa9Xaa16Rl0e9/xe/2Jw==</latexit> o1:T <latexit sha1_base64="jJ70daV3Lv6VV/gW3FR+Rb1BrxU=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTVfBhTgJeAZzR3NtC+sxSJrGRug3sivpsh/TH+jfVGM70JnkgtDh3HvuSyoqzoxNkr+96MHDR4+f7D3tP3v+4uWr/YPX50bVmuKYKq70ZQEGOZM4tsxyvKw0gig4XhTLr43/4gdqw5Q8s+sKcwFzyWaMgg1UlhXGrfzUpcenfro/SIbJxuK7IN2BAdnZaHrQ+5OVitYCpaUcjJmkSWVzB9oyytH3s9pgBXQJc5wEKEGgyd2maR+/D0wZz5QOR9p4w/6vcCCMWYsiRAqwC9P1NeR9vkltZ0e5Y7KqLUq6LTSreWxV3GwgLplGavk6AKCahV5jugAN1IY99TOJV1QJAbJ0YTt+kubhVrxselHcDVLfGmw7ji34PWybmmuoFoyufL9dZHSmGXDvMrpAuhSgl74TgFIrHiIOO47TW6XFld1UdBrLVqau4jZVV3Lom/dPu699F5x/HKafh8n3T4OTo91P2CNvyTvygaTkCzkh38iIjAklFbkmN+RXVEXX0c/oZhsa9XaaN6Rl0e9/zaX2Kg==</latexit> x1:N <latexit sha1_base64="AisTKh5Vue6nwtVsE4yuaaGZKEs=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTTctLsRJwDOYO5prW1iPQdI0NkK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRffuP3j4aO9x/8nTZ89f7B+8PDOq1hTHVHGlLwowyJnEsWWW40WlEUTB8bxYfm785z9QG6bkqV1XmAuYSzZjFGygsqwwbuWnLj3+6qf7g2SYbCy+DdIdGJCdjaYHvT9ZqWgtUFrKwZhJmlQ2d6Atoxx9P6sNVkCXMMdJgBIEmtxtmvbx28CU8UzpcKSNN+z/CgfCmLUoQqQAuzBdX0Pe5ZvUdnaUOyar2qKk20KzmsdWxc0G4pJppJavAwCqWeg1pgvQQG3YUz+TeEmVECBLF7bjJ2kebsXLphfF3SD1rcG249iC38G2qbmGasHoyvfbRUanmgH3LqMLpEsBeuk7ASi14iHisOP4dqO0uLKbik5j2crUVdyk6koOffP+afe1b4Oz98P04zD5/mFwcrT7CXvkNXlD3pGUfCIn5AsZkTGhpCJX5Jr8iqroKvoZXW9Do95O84q0LPr9D8sl9ik=</latexit> x1:M or Pitch is crucial for music
  • 4. 4 <latexit sha1_base64="jJ70daV3Lv6VV/gW3FR+Rb1BrxU=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTVfBhTgJeAZzR3NtC+sxSJrGRug3sivpsh/TH+jfVGM70JnkgtDh3HvuSyoqzoxNkr+96MHDR4+f7D3tP3v+4uWr/YPX50bVmuKYKq70ZQEGOZM4tsxyvKw0gig4XhTLr43/4gdqw5Q8s+sKcwFzyWaMgg1UlhXGrfzUpcenfro/SIbJxuK7IN2BAdnZaHrQ+5OVitYCpaUcjJmkSWVzB9oyytH3s9pgBXQJc5wEKEGgyd2maR+/D0wZz5QOR9p4w/6vcCCMWYsiRAqwC9P1NeR9vkltZ0e5Y7KqLUq6LTSreWxV3GwgLplGavk6AKCahV5jugAN1IY99TOJV1QJAbJ0YTt+kubhVrxselHcDVLfGmw7ji34PWybmmuoFoyufL9dZHSmGXDvMrpAuhSgl74TgFIrHiIOO47TW6XFld1UdBrLVqau4jZVV3Lom/dPu699F5x/HKafh8n3T4OTo91P2CNvyTvygaTkCzkh38iIjAklFbkmN+RXVEXX0c/oZhsa9XaaN6Rl0e9/zaX2Kg==</latexit> x1:N <latexit sha1_base64="AisTKh5Vue6nwtVsE4yuaaGZKEs=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTTctLsRJwDOYO5prW1iPQdI0NkK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRffuP3j4aO9x/8nTZ89f7B+8PDOq1hTHVHGlLwowyJnEsWWW40WlEUTB8bxYfm785z9QG6bkqV1XmAuYSzZjFGygsqwwbuWnLj3+6qf7g2SYbCy+DdIdGJCdjaYHvT9ZqWgtUFrKwZhJmlQ2d6Atoxx9P6sNVkCXMMdJgBIEmtxtmvbx28CU8UzpcKSNN+z/CgfCmLUoQqQAuzBdX0Pe5ZvUdnaUOyar2qKk20KzmsdWxc0G4pJppJavAwCqWeg1pgvQQG3YUz+TeEmVECBLF7bjJ2kebsXLphfF3SD1rcG249iC38G2qbmGasHoyvfbRUanmgH3LqMLpEsBeuk7ASi14iHisOP4dqO0uLKbik5j2crUVdyk6koOffP+afe1b4Oz98P04zD5/mFwcrT7CXvkNXlD3pGUfCIn5AsZkTGhpCJX5Jr8iqroKvoZXW9Do95O84q0LPr9D8sl9ik=</latexit> x1:M or TTS & MIDI-to-Audio Synthesis https://commons.wikimedia.org/wiki/File:Computer_music_piano_roll.png 0 100 200 300 400 500 Frame index 256 513 Frequency bins Acoustic model Waveform model Wav MIDI Piano roll Acoustic features MIDI API Acoustic model Waveform model Context vectors Acoustic features Front end Wav .txt <latexit sha1_base64="AisTKh5Vue6nwtVsE4yuaaGZKEs=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTTctLsRJwDOYO5prW1iPQdI0NkK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRffuP3j4aO9x/8nTZ89f7B+8PDOq1hTHVHGlLwowyJnEsWWW40WlEUTB8bxYfm785z9QG6bkqV1XmAuYSzZjFGygsqwwbuWnLj3+6qf7g2SYbCy+DdIdGJCdjaYHvT9ZqWgtUFrKwZhJmlQ2d6Atoxx9P6sNVkCXMMdJgBIEmtxtmvbx28CU8UzpcKSNN+z/CgfCmLUoQqQAuzBdX0Pe5ZvUdnaUOyar2qKk20KzmsdWxc0G4pJppJavAwCqWeg1pgvQQG3YUz+TeEmVECBLF7bjJ2kebsXLphfF3SD1rcG249iC38G2qbmGasHoyvfbRUanmgH3LqMLpEsBeuk7ASi14iHisOP4dqO0uLKbik5j2crUVdyk6koOffP+afe1b4Oz98P04zD5/mFwcrT7CXvkNXlD3pGUfCIn5AsZkTGhpCJX5Jr8iqroKvoZXW9Do95O84q0LPr9D8sl9ik=</latexit> x1:M <latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit> a1:N <latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit> a1:N <latexit sha1_base64="JElF7obLjklrepu+jo2EhDrW7Zk=">AAAC7nicbVLLahsxFJWnj6TuK2mX3Qw1hW5iZkpKQ1eBbroqLthJwDOYO5prW1iPQdI0MUK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRQ8ePnq8t/+k//TZ8xcvDw5fnRlVa4oTqrjSFwUY5EzixDLL8aLSCKLgeF6svjT+8x+oDVNybNcV5gIWks0ZBRuoLCuMU37m0s9jPzsYJMNkY/FdkO7AgOxsNDvs/clKRWuB0lIOxkzTpLK5A20Z5ej7WW2wArqCBU4DlCDQ5G7TtI/fBaaM50qHI228Yf9XOBDGrEURIgXYpen6GvI+37S285PcMVnVFiXdFprXPLYqbjYQl0wjtXwdAFDNQq8xXYIGasOe+pnES6qEAFm6sB0/TfNwK142vSjuBqlvDbYdxxb8HrZNLTRUS0avfL9dZDTWDLh3GV0iXQnQK98JQKkVDxFHHce3W6XFK7up6DSWrUxdxW2qruTIN++fdl/7Ljj7MEw/DpPvx4PTk91P2CdvyFvynqTkEzklX8mITAglFbkmN+RXVEXX0c/oZhsa9Xaa16Rl0e9/xe/2Jw==</latexit> o1:T <latexit sha1_base64="JElF7obLjklrepu+jo2EhDrW7Zk=">AAAC7nicbVLLahsxFJWnj6TuK2mX3Qw1hW5iZkpKQ1eBbroqLthJwDOYO5prW1iPQdI0MUK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRQ8ePnq8t/+k//TZ8xcvDw5fnRlVa4oTqrjSFwUY5EzixDLL8aLSCKLgeF6svjT+8x+oDVNybNcV5gIWks0ZBRuoLCuMU37m0s9jPzsYJMNkY/FdkO7AgOxsNDvs/clKRWuB0lIOxkzTpLK5A20Z5ej7WW2wArqCBU4DlCDQ5G7TtI/fBaaM50qHI228Yf9XOBDGrEURIgXYpen6GvI+37S285PcMVnVFiXdFprXPLYqbjYQl0wjtXwdAFDNQq8xXYIGasOe+pnES6qEAFm6sB0/TfNwK142vSjuBqlvDbYdxxb8HrZNLTRUS0avfL9dZDTWDLh3GV0iXQnQK98JQKkVDxFHHce3W6XFK7up6DSWrUxdxW2qruTIN++fdl/7Ljj7MEw/DpPvx4PTk91P2CdvyFvynqTkEzklX8mITAglFbkmN+RXVEXX0c/oZhsa9Xaa16Rl0e9/xe/2Jw==</latexit> o1:T .txt Music score MIDI piano roll MIDI API
  • 5. 5 Acoustic model Waveform model Wav MIDI Piano roll Acoustic features MIDI API <latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit> a1:N <latexit sha1_base64="JElF7obLjklrepu+jo2EhDrW7Zk=">AAAC7nicbVLLahsxFJWnj6TuK2mX3Qw1hW5iZkpKQ1eBbroqLthJwDOYO5prW1iPQdI0MUK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRQ8ePnq8t/+k//TZ8xcvDw5fnRlVa4oTqrjSFwUY5EzixDLL8aLSCKLgeF6svjT+8x+oDVNybNcV5gIWks0ZBRuoLCuMU37m0s9jPzsYJMNkY/FdkO7AgOxsNDvs/clKRWuB0lIOxkzTpLK5A20Z5ej7WW2wArqCBU4DlCDQ5G7TtI/fBaaM50qHI228Yf9XOBDGrEURIgXYpen6GvI+37S285PcMVnVFiXdFprXPLYqbjYQl0wjtXwdAFDNQq8xXYIGasOe+pnES6qEAFm6sB0/TfNwK142vSjuBqlvDbYdxxb8HrZNLTRUS0avfL9dZDTWDLh3GV0iXQnQK98JQKkVDxFHHce3W6XFK7up6DSWrUxdxW2qruTIN++fdl/7Ljj7MEw/DpPvx4PTk91P2CdvyFvynqTkEzklX8mITAglFbkmN+RXVEXX0c/oZhsa9Xaa16Rl0e9/xe/2Jw==</latexit> o1:T <latexit sha1_base64="jJ70daV3Lv6VV/gW3FR+Rb1BrxU=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTVfBhTgJeAZzR3NtC+sxSJrGRug3sivpsh/TH+jfVGM70JnkgtDh3HvuSyoqzoxNkr+96MHDR4+f7D3tP3v+4uWr/YPX50bVmuKYKq70ZQEGOZM4tsxyvKw0gig4XhTLr43/4gdqw5Q8s+sKcwFzyWaMgg1UlhXGrfzUpcenfro/SIbJxuK7IN2BAdnZaHrQ+5OVitYCpaUcjJmkSWVzB9oyytH3s9pgBXQJc5wEKEGgyd2maR+/D0wZz5QOR9p4w/6vcCCMWYsiRAqwC9P1NeR9vkltZ0e5Y7KqLUq6LTSreWxV3GwgLplGavk6AKCahV5jugAN1IY99TOJV1QJAbJ0YTt+kubhVrxselHcDVLfGmw7ji34PWybmmuoFoyufL9dZHSmGXDvMrpAuhSgl74TgFIrHiIOO47TW6XFld1UdBrLVqau4jZVV3Lom/dPu699F5x/HKafh8n3T4OTo91P2CNvyTvygaTkCzkh38iIjAklFbkmN+RXVEXX0c/oZhsa9XaaN6Rl0e9/zaX2Kg==</latexit> x1:N <latexit sha1_base64="AisTKh5Vue6nwtVsE4yuaaGZKEs=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTTctLsRJwDOYO5prW1iPQdI0NkK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRffuP3j4aO9x/8nTZ89f7B+8PDOq1hTHVHGlLwowyJnEsWWW40WlEUTB8bxYfm785z9QG6bkqV1XmAuYSzZjFGygsqwwbuWnLj3+6qf7g2SYbCy+DdIdGJCdjaYHvT9ZqWgtUFrKwZhJmlQ2d6Atoxx9P6sNVkCXMMdJgBIEmtxtmvbx28CU8UzpcKSNN+z/CgfCmLUoQqQAuzBdX0Pe5ZvUdnaUOyar2qKk20KzmsdWxc0G4pJppJavAwCqWeg1pgvQQG3YUz+TeEmVECBLF7bjJ2kebsXLphfF3SD1rcG249iC38G2qbmGasHoyvfbRUanmgH3LqMLpEsBeuk7ASi14iHisOP4dqO0uLKbik5j2crUVdyk6koOffP+afe1b4Oz98P04zD5/mFwcrT7CXvkNXlD3pGUfCIn5AsZkTGhpCJX5Jr8iqroKvoZXW9Do95O84q0LPr9D8sl9ik=</latexit> x1:M or TTS & MIDI-to-Audio Synthesis 0 100 200 300 400 500 Frame index 256 513 Frequency bins Acoustic model Waveform model Context vectors Acoustic features Front end Wav .txt <latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit> a1:N <latexit sha1_base64="JElF7obLjklrepu+jo2EhDrW7Zk=">AAAC7nicbVLLahsxFJWnj6TuK2mX3Qw1hW5iZkpKQ1eBbroqLthJwDOYO5prW1iPQdI0MUK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRQ8ePnq8t/+k//TZ8xcvDw5fnRlVa4oTqrjSFwUY5EzixDLL8aLSCKLgeF6svjT+8x+oDVNybNcV5gIWks0ZBRuoLCuMU37m0s9jPzsYJMNkY/FdkO7AgOxsNDvs/clKRWuB0lIOxkzTpLK5A20Z5ej7WW2wArqCBU4DlCDQ5G7TtI/fBaaM50qHI228Yf9XOBDGrEURIgXYpen6GvI+37S285PcMVnVFiXdFprXPLYqbjYQl0wjtXwdAFDNQq8xXYIGasOe+pnES6qEAFm6sB0/TfNwK142vSjuBqlvDbYdxxb8HrZNLTRUS0avfL9dZDTWDLh3GV0iXQnQK98JQKkVDxFHHce3W6XFK7up6DSWrUxdxW2qruTIN++fdl/7Ljj7MEw/DpPvx4PTk91P2CdvyFvynqTkEzklX8mITAglFbkmN+RXVEXX0c/oZhsa9Xaa16Rl0e9/xe/2Jw==</latexit> o1:T 1 0 0 Phone ID (one-hot) Syllable ID (one-hot) #. Phone #. Syllables <latexit sha1_base64="YYtcP5HcZDv4wo3N4OOgsso9tf8=">AAAC6nicbVLLahsxFJWnr8R9Je0ym6Gm0E3MTElploFuugouxInBMxiN5toW1mOQ7jQ2Qj+RXUmW/Zz+QP+m8iPQmeSC0OHce+5LKirBLSbJ30705Omz5y/29rsvX71+8/bg8N2l1bVhMGRaaDMqqAXBFQyRo4BRZYDKQsBVsfi29l/9BGO5Vhe4qiCXdKb4lDOKgRplhXVLP1GTg17STzYWPwTpDvTIzgaTw86frNSslqCQCWrtOE0qzB01yJkA381qCxVlCzqDcYCKSrC52zTs44+BKeOpNuEojDfs/wpHpbUrWYRISXFu2741+ZhvXOP0NHdcVTWCYttC01rEqOP19HHJDTAUqwAoMzz0GrM5NZRh2FE3U3DNtJRUlS5sxo/TPNxalOtetHC91DcG246DhXiEbVIzQ6s5Z0vfbRYZXBhOhXcZmwNbSGoWvhUAymgRIo5bjvN7JcISNxWdgbKRqa24T9WWHHsf3j9tv/ZDcPm5n37pJz9Oemfp7ifskSPygXwiKflKzsh3MiBDwoggN+SW3EUiuol+Rbfb0Kiz07wnDYt+/wP+EPS4</latexit> xn <latexit sha1_base64="YYtcP5HcZDv4wo3N4OOgsso9tf8=">AAAC6nicbVLLahsxFJWnr8R9Je0ym6Gm0E3MTElploFuugouxInBMxiN5toW1mOQ7jQ2Qj+RXUmW/Zz+QP+m8iPQmeSC0OHce+5LKirBLSbJ30705Omz5y/29rsvX71+8/bg8N2l1bVhMGRaaDMqqAXBFQyRo4BRZYDKQsBVsfi29l/9BGO5Vhe4qiCXdKb4lDOKgRplhXVLP1GTg17STzYWPwTpDvTIzgaTw86frNSslqCQCWrtOE0qzB01yJkA381qCxVlCzqDcYCKSrC52zTs44+BKeOpNuEojDfs/wpHpbUrWYRISXFu2741+ZhvXOP0NHdcVTWCYttC01rEqOP19HHJDTAUqwAoMzz0GrM5NZRh2FE3U3DNtJRUlS5sxo/TPNxalOtetHC91DcG246DhXiEbVIzQ6s5Z0vfbRYZXBhOhXcZmwNbSGoWvhUAymgRIo5bjvN7JcISNxWdgbKRqa24T9WWHHsf3j9tv/ZDcPm5n37pJz9Oemfp7ifskSPygXwiKflKzsh3MiBDwoggN+SW3EUiuol+Rbfb0Kiz07wnDYt+/wP+EPS4</latexit> xn A4, velocity 0.4 0.9 0 D5, velocity 0.4 … 0.4 … 0 <latexit sha1_base64="YYtcP5HcZDv4wo3N4OOgsso9tf8=">AAAC6nicbVLLahsxFJWnr8R9Je0ym6Gm0E3MTElploFuugouxInBMxiN5toW1mOQ7jQ2Qj+RXUmW/Zz+QP+m8iPQmeSC0OHce+5LKirBLSbJ30705Omz5y/29rsvX71+8/bg8N2l1bVhMGRaaDMqqAXBFQyRo4BRZYDKQsBVsfi29l/9BGO5Vhe4qiCXdKb4lDOKgRplhXVLP1GTg17STzYWPwTpDvTIzgaTw86frNSslqCQCWrtOE0qzB01yJkA381qCxVlCzqDcYCKSrC52zTs44+BKeOpNuEojDfs/wpHpbUrWYRISXFu2741+ZhvXOP0NHdcVTWCYttC01rEqOP19HHJDTAUqwAoMzz0GrM5NZRh2FE3U3DNtJRUlS5sxo/TPNxalOtetHC91DcG246DhXiEbVIzQ6s5Z0vfbRYZXBhOhXcZmwNbSGoWvhUAymgRIo5bjvN7JcISNxWdgbKRqa24T9WWHHsf3j9tv/ZDcPm5n37pJz9Oemfp7ifskSPygXwiKflKzsh3MiBDwoggN+SW3EUiuol+Rbfb0Kiz07wnDYt+/wP+EPS4</latexit> xn <latexit sha1_base64="YYtcP5HcZDv4wo3N4OOgsso9tf8=">AAAC6nicbVLLahsxFJWnr8R9Je0ym6Gm0E3MTElploFuugouxInBMxiN5toW1mOQ7jQ2Qj+RXUmW/Zz+QP+m8iPQmeSC0OHce+5LKirBLSbJ30705Omz5y/29rsvX71+8/bg8N2l1bVhMGRaaDMqqAXBFQyRo4BRZYDKQsBVsfi29l/9BGO5Vhe4qiCXdKb4lDOKgRplhXVLP1GTg17STzYWPwTpDvTIzgaTw86frNSslqCQCWrtOE0qzB01yJkA381qCxVlCzqDcYCKSrC52zTs44+BKeOpNuEojDfs/wpHpbUrWYRISXFu2741+ZhvXOP0NHdcVTWCYttC01rEqOP19HHJDTAUqwAoMzz0GrM5NZRh2FE3U3DNtJRUlS5sxo/TPNxalOtetHC91DcG246DhXiEbVIzQ6s5Z0vfbRYZXBhOhXcZmwNbSGoWvhUAymgRIo5bjvN7JcISNxWdgbKRqa24T9WWHHsf3j9tv/ZDcPm5n37pJz9Oemfp7ifskSPygXwiKflKzsh3MiBDwoggN+SW3EUiuol+Rbfb0Kiz07wnDYt+/wP+EPS4</latexit> xn <latexit sha1_base64="AisTKh5Vue6nwtVsE4yuaaGZKEs=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTTctLsRJwDOYO5prW1iPQdI0NkK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRffuP3j4aO9x/8nTZ89f7B+8PDOq1hTHVHGlLwowyJnEsWWW40WlEUTB8bxYfm785z9QG6bkqV1XmAuYSzZjFGygsqwwbuWnLj3+6qf7g2SYbCy+DdIdGJCdjaYHvT9ZqWgtUFrKwZhJmlQ2d6Atoxx9P6sNVkCXMMdJgBIEmtxtmvbx28CU8UzpcKSNN+z/CgfCmLUoQqQAuzBdX0Pe5ZvUdnaUOyar2qKk20KzmsdWxc0G4pJppJavAwCqWeg1pgvQQG3YUz+TeEmVECBLF7bjJ2kebsXLphfF3SD1rcG249iC38G2qbmGasHoyvfbRUanmgH3LqMLpEsBeuk7ASi14iHisOP4dqO0uLKbik5j2crUVdyk6koOffP+afe1b4Oz98P04zD5/mFwcrT7CXvkNXlD3pGUfCIn5AsZkTGhpCJX5Jr8iqroKvoZXW9Do95O84q0LPr9D8sl9ik=</latexit> x1:M
  • 6. 6 <latexit sha1_base64="AisTKh5Vue6nwtVsE4yuaaGZKEs=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTTctLsRJwDOYO5prW1iPQdI0NkK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRffuP3j4aO9x/8nTZ89f7B+8PDOq1hTHVHGlLwowyJnEsWWW40WlEUTB8bxYfm785z9QG6bkqV1XmAuYSzZjFGygsqwwbuWnLj3+6qf7g2SYbCy+DdIdGJCdjaYHvT9ZqWgtUFrKwZhJmlQ2d6Atoxx9P6sNVkCXMMdJgBIEmtxtmvbx28CU8UzpcKSNN+z/CgfCmLUoQqQAuzBdX0Pe5ZvUdnaUOyar2qKk20KzmsdWxc0G4pJppJavAwCqWeg1pgvQQG3YUz+TeEmVECBLF7bjJ2kebsXLphfF3SD1rcG249iC38G2qbmGasHoyvfbRUanmgH3LqMLpEsBeuk7ASi14iHisOP4dqO0uLKbik5j2crUVdyk6koOffP+afe1b4Oz98P04zD5/mFwcrT7CXvkNXlD3pGUfCIn5AsZkTGhpCJX5Jr8iqroKvoZXW9Do95O84q0LPr9D8sl9ik=</latexit> x1:M TTS & MIDI-to-Audio Synthesis 0 100 200 300 400 500 Frame index 256 513 Frequency bins Acoustic model Waveform model Wav MIDI Piano roll Acoustic features MIDI API Acoustic model Waveform model Context vectors Acoustic features Front end Wav .txt <latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit> a1:N <latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit> a1:N <latexit sha1_base64="JElF7obLjklrepu+jo2EhDrW7Zk=">AAAC7nicbVLLahsxFJWnj6TuK2mX3Qw1hW5iZkpKQ1eBbroqLthJwDOYO5prW1iPQdI0MUK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRQ8ePnq8t/+k//TZ8xcvDw5fnRlVa4oTqrjSFwUY5EzixDLL8aLSCKLgeF6svjT+8x+oDVNybNcV5gIWks0ZBRuoLCuMU37m0s9jPzsYJMNkY/FdkO7AgOxsNDvs/clKRWuB0lIOxkzTpLK5A20Z5ej7WW2wArqCBU4DlCDQ5G7TtI/fBaaM50qHI228Yf9XOBDGrEURIgXYpen6GvI+37S285PcMVnVFiXdFprXPLYqbjYQl0wjtXwdAFDNQq8xXYIGasOe+pnES6qEAFm6sB0/TfNwK142vSjuBqlvDbYdxxb8HrZNLTRUS0avfL9dZDTWDLh3GV0iXQnQK98JQKkVDxFHHce3W6XFK7up6DSWrUxdxW2qruTIN++fdl/7Ljj7MEw/DpPvx4PTk91P2CdvyFvynqTkEzklX8mITAglFbkmN+RXVEXX0c/oZhsa9Xaa16Rl0e9/xe/2Jw==</latexit> o1:T <latexit sha1_base64="JElF7obLjklrepu+jo2EhDrW7Zk=">AAAC7nicbVLLahsxFJWnj6TuK2mX3Qw1hW5iZkpKQ1eBbroqLthJwDOYO5prW1iPQdI0MUK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRQ8ePnq8t/+k//TZ8xcvDw5fnRlVa4oTqrjSFwUY5EzixDLL8aLSCKLgeF6svjT+8x+oDVNybNcV5gIWks0ZBRuoLCuMU37m0s9jPzsYJMNkY/FdkO7AgOxsNDvs/clKRWuB0lIOxkzTpLK5A20Z5ej7WW2wArqCBU4DlCDQ5G7TtI/fBaaM50qHI228Yf9XOBDGrEURIgXYpen6GvI+37S285PcMVnVFiXdFprXPLYqbjYQl0wjtXwdAFDNQq8xXYIGasOe+pnES6qEAFm6sB0/TfNwK142vSjuBqlvDbYdxxb8HrZNLTRUS0avfL9dZDTWDLh3GV0iXQnQK98JQKkVDxFHHce3W6XFK7up6DSWrUxdxW2qruTIN++fdl/7Ljj7MEw/DpPvx4PTk91P2CdvyFvynqTkEzklX8mITAglFbkmN+RXVEXX0c/oZhsa9Xaa16Rl0e9/xe/2Jw==</latexit> o1:T Not completely aligned AI performer <latexit sha1_base64="Ce+Vsx7LGNLmrE4StAaBzOgsSt0=">AAADInicbVLLitswFFXc1zR9ZdplN6Kh0M0Eu7S0dFEGuummQwrJzEBswrWsxCJ6GEnuJBj9zXzN0E2Zbgr9mMpOBmpnLggdzrlHV/dKacGZsWH4pxfcuXvv/oODh/1Hj588fTY4fH5qVKkJnRLFlT5PwVDOJJ1aZjk9LzQFkXJ6lq6+1PrZD6oNU3JiNwVNBCwlWzAC1lPzwec4NdXazavo0zeHY82WuQWt1QWuBWiEk31BNcLEzQfDcBQ2gfdBtANDtIvx/LD3M84UKQWVlnAwZhaFhU0q0JYRTl0/Lg0tgKxgSWceShDUJFXTqMOvPZPhhdJ+SYsb9n9HBcKYjUh9pgCbm65Wk7dps9IuPiYVk0VpqSTbQouSY6twPTWcMU2J5RsPgGjm74pJDhqI9bPtx5JeECUEyKzy03GzKPG74ll9F8WrYeRajW3bsSm/hW1TSw1Fzsja9dtFxhPNgLsqJjklKwF65ToJVGrFfcZRRzi5cVq6tk3FStOsdVLXcXNU13Lk6vePuq+9D07fjqL3o/D7u+FxtPsJB+gleoXeoAh9QMfoKxqjKSLoEl2ha/Q7uAyugl/B9TY16O08L1Argr//ADF1C2g=</latexit> x1:M ! a1:N ! o1:T <latexit sha1_base64="uTpzxJX9ikqk62YuUY3zkv6CAV4=">AAADInicdVJda9swFFW8ry77SrfHvYiFwV4a7LGx0YdR2MueSgZJW4hNuJaVWESWjCS3CUL/pr+m7GV0L4P9mMlOCrPbXRA6nHOPru6V0pIzbcLwdy+4d//Bw0d7j/tPnj57/mKw//JEy0oROiWSS3WWgqacCTo1zHB6VioKRcrpabr6Wuun51RpJsXEbEqaFLAUbMEIGE/NB1/iVNu1m9vo8NjhWLFlbkApeYFrAf4nyEaYuPlgGI7CJvBtEO3AEO1iPN/v/YgzSaqCCkM4aD2LwtIkFpRhhFPXjytNSyArWNKZhwIKqhPbNOrwW89keCGVX8Lghv3XYaHQelOkPrMAk+uuVpN3abPKLD4nlomyMlSQbaFFxbGRuJ4azpiixPCNB0AU83fFJAcFxPjZ9mNBL4gsChCZ9dNxsyjxu+RZfRfJ7TByrca27ZiU38G2qaWCMmdk7frtIuOJYsCdjUlOyaoAtXKdBCqU5D7joCMc3zgNXZumolU0a53Uddwc1bUcuPr9o+5r3wYn70fRx1H4/cPwKNr9hD30Gr1B71CEPqEj9A2N0RQRdImu0DX6FVwGV8HP4HqbGvR2nleoFcGfvzQnC2k=</latexit> x1:N ! a1:N ! o1:T Aligned Silent instrument <latexit sha1_base64="jJ70daV3Lv6VV/gW3FR+Rb1BrxU=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTVfBhTgJeAZzR3NtC+sxSJrGRug3sivpsh/TH+jfVGM70JnkgtDh3HvuSyoqzoxNkr+96MHDR4+f7D3tP3v+4uWr/YPX50bVmuKYKq70ZQEGOZM4tsxyvKw0gig4XhTLr43/4gdqw5Q8s+sKcwFzyWaMgg1UlhXGrfzUpcenfro/SIbJxuK7IN2BAdnZaHrQ+5OVitYCpaUcjJmkSWVzB9oyytH3s9pgBXQJc5wEKEGgyd2maR+/D0wZz5QOR9p4w/6vcCCMWYsiRAqwC9P1NeR9vkltZ0e5Y7KqLUq6LTSreWxV3GwgLplGavk6AKCahV5jugAN1IY99TOJV1QJAbJ0YTt+kubhVrxselHcDVLfGmw7ji34PWybmmuoFoyufL9dZHSmGXDvMrpAuhSgl74TgFIrHiIOO47TW6XFld1UdBrLVqau4jZVV3Lom/dPu699F5x/HKafh8n3T4OTo91P2CNvyTvygaTkCzkh38iIjAklFbkmN+RXVEXX0c/oZhsa9XaaN6Rl0e9/zaX2Kg==</latexit> x1:N <latexit sha1_base64="AisTKh5Vue6nwtVsE4yuaaGZKEs=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTTctLsRJwDOYO5prW1iPQdI0NkK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRffuP3j4aO9x/8nTZ89f7B+8PDOq1hTHVHGlLwowyJnEsWWW40WlEUTB8bxYfm785z9QG6bkqV1XmAuYSzZjFGygsqwwbuWnLj3+6qf7g2SYbCy+DdIdGJCdjaYHvT9ZqWgtUFrKwZhJmlQ2d6Atoxx9P6sNVkCXMMdJgBIEmtxtmvbx28CU8UzpcKSNN+z/CgfCmLUoQqQAuzBdX0Pe5ZvUdnaUOyar2qKk20KzmsdWxc0G4pJppJavAwCqWeg1pgvQQG3YUz+TeEmVECBLF7bjJ2kebsXLphfF3SD1rcG249iC38G2qbmGasHoyvfbRUanmgH3LqMLpEsBeuk7ASi14iHisOP4dqO0uLKbik5j2crUVdyk6koOffP+afe1b4Oz98P04zD5/mFwcrT7CXvkNXlD3pGUfCIn5AsZkTGhpCJX5Jr8iqroKvoZXW9Do95O84q0LPr9D8sl9ik=</latexit> x1:M or
  • 7. 7 <latexit sha1_base64="AisTKh5Vue6nwtVsE4yuaaGZKEs=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTTctLsRJwDOYO5prW1iPQdI0NkK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRffuP3j4aO9x/8nTZ89f7B+8PDOq1hTHVHGlLwowyJnEsWWW40WlEUTB8bxYfm785z9QG6bkqV1XmAuYSzZjFGygsqwwbuWnLj3+6qf7g2SYbCy+DdIdGJCdjaYHvT9ZqWgtUFrKwZhJmlQ2d6Atoxx9P6sNVkCXMMdJgBIEmtxtmvbx28CU8UzpcKSNN+z/CgfCmLUoQqQAuzBdX0Pe5ZvUdnaUOyar2qKk20KzmsdWxc0G4pJppJavAwCqWeg1pgvQQG3YUz+TeEmVECBLF7bjJ2kebsXLphfF3SD1rcG249iC38G2qbmGasHoyvfbRUanmgH3LqMLpEsBeuk7ASi14iHisOP4dqO0uLKbik5j2crUVdyk6koOffP+afe1b4Oz98P04zD5/mFwcrT7CXvkNXlD3pGUfCIn5AsZkTGhpCJX5Jr8iqroKvoZXW9Do95O84q0LPr9D8sl9ik=</latexit> x1:M TTS & MIDI-to-Audio Synthesis 0 100 200 300 400 500 Frame index 256 513 Frequency bins Acoustic model Waveform model Wav MIDI Piano roll Acoustic features MIDI API Acoustic model Waveform model Context vectors Acoustic features Front end Wav .txt <latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit> a1:N <latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit> a1:N <latexit sha1_base64="JElF7obLjklrepu+jo2EhDrW7Zk=">AAAC7nicbVLLahsxFJWnj6TuK2mX3Qw1hW5iZkpKQ1eBbroqLthJwDOYO5prW1iPQdI0MUK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRQ8ePnq8t/+k//TZ8xcvDw5fnRlVa4oTqrjSFwUY5EzixDLL8aLSCKLgeF6svjT+8x+oDVNybNcV5gIWks0ZBRuoLCuMU37m0s9jPzsYJMNkY/FdkO7AgOxsNDvs/clKRWuB0lIOxkzTpLK5A20Z5ej7WW2wArqCBU4DlCDQ5G7TtI/fBaaM50qHI228Yf9XOBDGrEURIgXYpen6GvI+37S285PcMVnVFiXdFprXPLYqbjYQl0wjtXwdAFDNQq8xXYIGasOe+pnES6qEAFm6sB0/TfNwK142vSjuBqlvDbYdxxb8HrZNLTRUS0avfL9dZDTWDLh3GV0iXQnQK98JQKkVDxFHHce3W6XFK7up6DSWrUxdxW2qruTIN++fdl/7Ljj7MEw/DpPvx4PTk91P2CdvyFvynqTkEzklX8mITAglFbkmN+RXVEXX0c/oZhsa9Xaa16Rl0e9/xe/2Jw==</latexit> o1:T <latexit sha1_base64="JElF7obLjklrepu+jo2EhDrW7Zk=">AAAC7nicbVLLahsxFJWnj6TuK2mX3Qw1hW5iZkpKQ1eBbroqLthJwDOYO5prW1iPQdI0MUK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRQ8ePnq8t/+k//TZ8xcvDw5fnRlVa4oTqrjSFwUY5EzixDLL8aLSCKLgeF6svjT+8x+oDVNybNcV5gIWks0ZBRuoLCuMU37m0s9jPzsYJMNkY/FdkO7AgOxsNDvs/clKRWuB0lIOxkzTpLK5A20Z5ej7WW2wArqCBU4DlCDQ5G7TtI/fBaaM50qHI228Yf9XOBDGrEURIgXYpen6GvI+37S285PcMVnVFiXdFprXPLYqbjYQl0wjtXwdAFDNQq8xXYIGasOe+pnES6qEAFm6sB0/TfNwK142vSjuBqlvDbYdxxb8HrZNLTRUS0avfL9dZDTWDLh3GV0iXQnQK98JQKkVDxFHHce3W6XFK7up6DSWrUxdxW2qruTIN++fdl/7Ljj7MEw/DpPvx4PTk91P2CdvyFvynqTkEzklX8mITAglFbkmN+RXVEXX0c/oZhsa9Xaa16Rl0e9/xe/2Jw==</latexit> o1:T <latexit sha1_base64="jJ70daV3Lv6VV/gW3FR+Rb1BrxU=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTVfBhTgJeAZzR3NtC+sxSJrGRug3sivpsh/TH+jfVGM70JnkgtDh3HvuSyoqzoxNkr+96MHDR4+f7D3tP3v+4uWr/YPX50bVmuKYKq70ZQEGOZM4tsxyvKw0gig4XhTLr43/4gdqw5Q8s+sKcwFzyWaMgg1UlhXGrfzUpcenfro/SIbJxuK7IN2BAdnZaHrQ+5OVitYCpaUcjJmkSWVzB9oyytH3s9pgBXQJc5wEKEGgyd2maR+/D0wZz5QOR9p4w/6vcCCMWYsiRAqwC9P1NeR9vkltZ0e5Y7KqLUq6LTSreWxV3GwgLplGavk6AKCahV5jugAN1IY99TOJV1QJAbJ0YTt+kubhVrxselHcDVLfGmw7ji34PWybmmuoFoyufL9dZHSmGXDvMrpAuhSgl74TgFIrHiIOO47TW6XFld1UdBrLVqau4jZVV3Lom/dPu699F5x/HKafh8n3T4OTo91P2CNvyTvygaTkCzkh38iIjAklFbkmN+RXVEXX0c/oZhsa9XaaN6Rl0e9/zaX2Kg==</latexit> x1:N Apply TTS techniques to MIDI-to-Audio
  • 8. 8 Methods Wang, Y. et al. Tacotron: Towards End-to-End Speech Synthesis. in Proc. Interspeech 4006–4010 (2017). Yasuda, Y., Wang, X., Takaki, S. & Yamagishi, J. Investigation of enhanced Tacotron text-to-speech synthesis systems with self-attention for pitch accent language. in Proc. ICASSP 6905–6909 (2019). q Acoustic model 1. TTS model: Tacotron (Wang 2017, Yasuda 2019) Acoustic model Waveform model Wav MIDI API <latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit> a1:N <latexit sha1_base64="JElF7obLjklrepu+jo2EhDrW7Zk=">AAAC7nicbVLLahsxFJWnj6TuK2mX3Qw1hW5iZkpKQ1eBbroqLthJwDOYO5prW1iPQdI0MUK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRQ8ePnq8t/+k//TZ8xcvDw5fnRlVa4oTqrjSFwUY5EzixDLL8aLSCKLgeF6svjT+8x+oDVNybNcV5gIWks0ZBRuoLCuMU37m0s9jPzsYJMNkY/FdkO7AgOxsNDvs/clKRWuB0lIOxkzTpLK5A20Z5ej7WW2wArqCBU4DlCDQ5G7TtI/fBaaM50qHI228Yf9XOBDGrEURIgXYpen6GvI+37S285PcMVnVFiXdFprXPLYqbjYQl0wjtXwdAFDNQq8xXYIGasOe+pnES6qEAFm6sB0/TfNwK142vSjuBqlvDbYdxxb8HrZNLTRUS0avfL9dZDTWDLh3GV0iXQnQK98JQKkVDxFHHce3W6XFK7up6DSWrUxdxW2qruTIN++fdl/7Ljj7MEw/DpPvx4PTk91P2CdvyFvynqTkEzklX8mITAglFbkmN+RXVEXX0c/oZhsa9Xaa16Rl0e9/xe/2Jw==</latexit> o1:T <latexit sha1_base64="jJ70daV3Lv6VV/gW3FR+Rb1BrxU=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTVfBhTgJeAZzR3NtC+sxSJrGRug3sivpsh/TH+jfVGM70JnkgtDh3HvuSyoqzoxNkr+96MHDR4+f7D3tP3v+4uWr/YPX50bVmuKYKq70ZQEGOZM4tsxyvKw0gig4XhTLr43/4gdqw5Q8s+sKcwFzyWaMgg1UlhXGrfzUpcenfro/SIbJxuK7IN2BAdnZaHrQ+5OVitYCpaUcjJmkSWVzB9oyytH3s9pgBXQJc5wEKEGgyd2maR+/D0wZz5QOR9p4w/6vcCCMWYsiRAqwC9P1NeR9vkltZ0e5Y7KqLUq6LTSreWxV3GwgLplGavk6AKCahV5jugAN1IY99TOJV1QJAbJ0YTt+kubhVrxselHcDVLfGmw7ji34PWybmmuoFoyufL9dZHSmGXDvMrpAuhSgl74TgFIrHiIOO47TW6XFld1UdBrLVqau4jZVV3Lom/dPu699F5x/HKafh8n3T4OTo91P2CNvyTvygaTkCzkh38iIjAklFbkmN+RXVEXX0c/oZhsa9XaaN6Rl0e9/zaX2Kg==</latexit> x1:N Dense CBH- LSTM Self Attention Attention RNN Additive Attention Forward Attention Concat Concat Stop Token Pre-Net Decoder RNN Self Attention Sigmoid Linear Waveform model Post-Net Wav ⊕ ⊕ ⊕ Encoder Decoder • Taco2: 800-frame segments; 4x downsampling for better alignments • Taco3: Warm- started from taco2; input current piano- roll frame at the decoder pre-net • Taco4: Warm- started from taco2; no downsampling or piano-roll input to pre-net
  • 9. 9 Methods Wang, B. & Yang, Y.-H. PerformanceNet: Score-to-audio music generation with multi-band convolutional residual network. in Proceedings of the AAAI Conference on Artificial Intelligence vol. 33 1174–1181 (2019). q Acoustic model 2. Reference model from music field: PerformanceNet (Wang 2019) Acoustic model Waveform model Wav MIDI API <latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit> a1:N <latexit sha1_base64="JElF7obLjklrepu+jo2EhDrW7Zk=">AAAC7nicbVLLahsxFJWnj6TuK2mX3Qw1hW5iZkpKQ1eBbroqLthJwDOYO5prW1iPQdI0MUK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRQ8ePnq8t/+k//TZ8xcvDw5fnRlVa4oTqrjSFwUY5EzixDLL8aLSCKLgeF6svjT+8x+oDVNybNcV5gIWks0ZBRuoLCuMU37m0s9jPzsYJMNkY/FdkO7AgOxsNDvs/clKRWuB0lIOxkzTpLK5A20Z5ej7WW2wArqCBU4DlCDQ5G7TtI/fBaaM50qHI228Yf9XOBDGrEURIgXYpen6GvI+37S285PcMVnVFiXdFprXPLYqbjYQl0wjtXwdAFDNQq8xXYIGasOe+pnES6qEAFm6sB0/TfNwK142vSjuBqlvDbYdxxb8HrZNLTRUS0avfL9dZDTWDLh3GV0iXQnQK98JQKkVDxFHHce3W6XFK7up6DSWrUxdxW2qruTIN++fdl/7Ljj7MEw/DpPvx4PTk91P2CdvyFvynqTkEzklX8mITAglFbkmN+RXVEXX0c/oZhsa9Xaa16Rl0e9/xe/2Jw==</latexit> o1:T <latexit sha1_base64="jJ70daV3Lv6VV/gW3FR+Rb1BrxU=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTVfBhTgJeAZzR3NtC+sxSJrGRug3sivpsh/TH+jfVGM70JnkgtDh3HvuSyoqzoxNkr+96MHDR4+f7D3tP3v+4uWr/YPX50bVmuKYKq70ZQEGOZM4tsxyvKw0gig4XhTLr43/4gdqw5Q8s+sKcwFzyWaMgg1UlhXGrfzUpcenfro/SIbJxuK7IN2BAdnZaHrQ+5OVitYCpaUcjJmkSWVzB9oyytH3s9pgBXQJc5wEKEGgyd2maR+/D0wZz5QOR9p4w/6vcCCMWYsiRAqwC9P1NeR9vkltZ0e5Y7KqLUq6LTSreWxV3GwgLplGavk6AKCahV5jugAN1IY99TOJV1QJAbJ0YTt+kubhVrxselHcDVLfGmw7ji34PWybmmuoFoyufL9dZHSmGXDvMrpAuhSgl74TgFIrHiIOO47TW6XFld1UdBrLVqau4jZVV3Lom/dPu699F5x/HKafh8n3T4OTo91P2CNvyTvygaTkCzkh38iIjAklFbkmN+RXVEXX0c/oZhsa9XaaN6Rl0e9/zaX2Kg==</latexit> x1:N Adopted from figure 3 in (Wang 2019) <latexit sha1_base64="jJ70daV3Lv6VV/gW3FR+Rb1BrxU=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTVfBhTgJeAZzR3NtC+sxSJrGRug3sivpsh/TH+jfVGM70JnkgtDh3HvuSyoqzoxNkr+96MHDR4+f7D3tP3v+4uWr/YPX50bVmuKYKq70ZQEGOZM4tsxyvKw0gig4XhTLr43/4gdqw5Q8s+sKcwFzyWaMgg1UlhXGrfzUpcenfro/SIbJxuK7IN2BAdnZaHrQ+5OVitYCpaUcjJmkSWVzB9oyytH3s9pgBXQJc5wEKEGgyd2maR+/D0wZz5QOR9p4w/6vcCCMWYsiRAqwC9P1NeR9vkltZ0e5Y7KqLUq6LTSreWxV3GwgLplGavk6AKCahV5jugAN1IY99TOJV1QJAbJ0YTt+kubhVrxselHcDVLfGmw7ji34PWybmmuoFoyufL9dZHSmGXDvMrpAuhSgl74TgFIrHiIOO47TW6XFld1UdBrLVqau4jZVV3Lom/dPu699F5x/HKafh8n3T4OTo91P2CNvyTvygaTkCzkh38iIjAklFbkmN+RXVEXX0c/oZhsa9XaaN6Rl0e9/zaX2Kg==</latexit> x1:N <latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit> a1:N
  • 10. 10 Methods Librosa midi_to_hz: https://librosa.org/doc/0.7.0/generated/librosa.core.midi_to_hz.html q Acoustic features 1. Mel-spectrogram 2. MIDI-filterbank-spectrogram Acoustic model Waveform model Wav MIDI API <latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit> a1:N <latexit sha1_base64="JElF7obLjklrepu+jo2EhDrW7Zk=">AAAC7nicbVLLahsxFJWnj6TuK2mX3Qw1hW5iZkpKQ1eBbroqLthJwDOYO5prW1iPQdI0MUK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRQ8ePnq8t/+k//TZ8xcvDw5fnRlVa4oTqrjSFwUY5EzixDLL8aLSCKLgeF6svjT+8x+oDVNybNcV5gIWks0ZBRuoLCuMU37m0s9jPzsYJMNkY/FdkO7AgOxsNDvs/clKRWuB0lIOxkzTpLK5A20Z5ej7WW2wArqCBU4DlCDQ5G7TtI/fBaaM50qHI228Yf9XOBDGrEURIgXYpen6GvI+37S285PcMVnVFiXdFprXPLYqbjYQl0wjtXwdAFDNQq8xXYIGasOe+pnES6qEAFm6sB0/TfNwK142vSjuBqlvDbYdxxb8HrZNLTRUS0avfL9dZDTWDLh3GV0iXQnQK98JQKkVDxFHHce3W6XFK7up6DSWrUxdxW2qruTIN++fdl/7Ljj7MEw/DpPvx4PTk91P2CdvyFvynqTkEzklX8mITAglFbkmN+RXVEXX0c/oZhsa9Xaa16Rl0e9/xe/2Jw==</latexit> o1:T <latexit sha1_base64="jJ70daV3Lv6VV/gW3FR+Rb1BrxU=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTVfBhTgJeAZzR3NtC+sxSJrGRug3sivpsh/TH+jfVGM70JnkgtDh3HvuSyoqzoxNkr+96MHDR4+f7D3tP3v+4uWr/YPX50bVmuKYKq70ZQEGOZM4tsxyvKw0gig4XhTLr43/4gdqw5Q8s+sKcwFzyWaMgg1UlhXGrfzUpcenfro/SIbJxuK7IN2BAdnZaHrQ+5OVitYCpaUcjJmkSWVzB9oyytH3s9pgBXQJc5wEKEGgyd2maR+/D0wZz5QOR9p4w/6vcCCMWYsiRAqwC9P1NeR9vkltZ0e5Y7KqLUq6LTSreWxV3GwgLplGavk6AKCahV5jugAN1IY99TOJV1QJAbJ0YTt+kubhVrxselHcDVLfGmw7ji34PWybmmuoFoyufL9dZHSmGXDvMrpAuhSgl74TgFIrHiIOO47TW6XFld1UdBrLVqau4jZVV3Lom/dPu699F5x/HKafh8n3T4OTo91P2CNvyTvygaTkCzkh38iIjAklFbkmN+RXVEXX0c/oZhsa9XaaN6Rl0e9/zaX2Kg==</latexit> x1:N 120 233 346 460 Frame index 16 32 48 64 80 96 112 128 Pinao roll index (1-128) Piano roll Mel filter bank MIDI filter bank E4 E6 E8 Frequency (0-12 kHz) 7 29 61 Mel filter index (1-80) E4 E6 E8 Frequency (0-12 kHz) 64 88 112 MIDI filter index (1-128) 120 233 346 460 Frame index 7 29 61 Dimension index (1-80) Mel-spectrum 120 233 346 460 Frame index 64 88 112 Dimension index (1-128) MIDI-fb-spectrum Mel-spec. MIDI-fb-spec. <latexit sha1_base64="6X1VLiCNlG3P2EJtZGp3iOqzN64=">AAACl3icbZFbaxNBFMcn663GW6pPIsJgEOqDcTfEWh/EgiB9bMG0hewazszOJkPmssycbQ3LPvlpfNVP47dxcnlw2x4Y5s/vnMO5sVJJj3H8txPdun3n7r2d+90HDx89ftLbfXrqbeW4GHOrrDtn4IWSRoxRohLnpROgmRJnbPFl5T+7EM5La77hshSZhpmRheSAAU17Lwv6iQ6/13uLt/sf37yjybChKUotPB2N4mmvHw/itdHrItmKPtna8XS3w9Lc8koLg1yB95MkLjGrwaHkSjTdtPKiBL6AmZgEaSBUyur1HA19HUhOC+vCM0jX9P+MGrT3S81CpAac+6u+FbzJN6mwOMhqacoKheGbQkWlKFq6WgrNpRMc1TII4E6GXimfgwOOYXXd1IhLbrUGk9cp880kycJvVb7qxaq6nzStwTbjIFM30DaaOSjnkv9oU2btAiEUalFdKZTOXjbdcJXk6g2ui9PhIHk/iE9G/cOD7X12yAvyiuyRhHwgh+SIHJMx4eQn+UV+kz/R8+hz9DU62oRGnW3OM9Ky6OQfqXrMdw==</latexit> f = 2(k 69)/12 ⇥ 440 <latexit sha1_base64="86+t9jn0UkxzHxA2if7/PbZ7/JQ=">AAACeXicbZHNThtBDMcn2xbSUGhoj71sGyEhDtEuLSrHSL30CFIDSMkKeWa9ySjzsZrxlkarfYJe24fjWbh08nHoApas+etnW7bHvFTSU5Lcd6IXL1/t7HZf9/be7B+87R++u/K2cgLHwirrbjh4VNLgmCQpvCkdguYKr/ni2yp+/ROdl9b8oGWJmYaZkYUUQAFdFrf9QTJM1hY/FelWDNjWLm4PO3yaW1FpNCQUeD9Jk5KyGhxJobDpTSuPJYgFzHASpAGNPqvXkzbxUSB5XFgX3FC8pv9X1KC9X2oeMjXQ3D+OreBzsUlFxXlWS1NWhEZsGhWVisnGq7XjXDoUpJZBgHAyzBqLOTgQFD6nNzV4J6zWYPJ6yn0zSbPwWpWvZrGqHqRNa7HNOsTVM7SNZg7KuRS/2pRbuyAIjVpUV4qks3dNL1wlfXyDp+LqdJieDZPLL4PR+fY+XfaBfWLHLGVf2Yh9ZxdszARD9pv9YX87D9HH6Dg62aRGnW3Ne9ay6PM/WXrEyA==</latexit> f <latexit sha1_base64="JbTkpy7RgEzAoI7aVLnTtzC7Ygk=">AAACeXicbZHNThtBDMcn2xZoKBDaYy9LIyTEIdqFonJE4tIjSA0gJSvkmXWSUeZjNeMtjVb7BFzh4XgWLp18HFjAkjV//WzL9pgXSnpKkqdW9OHjp7X1jc/tzS9b2zud3a9X3pZOYF9YZd0NB49KGuyTJIU3hUPQXOE1n57P49d/0XlpzR+aFZhpGBs5kgIooMvpbaeb9JKFxW9FuhJdtrKL290WH+ZWlBoNCQXeD9KkoKwCR1IorNvD0mMBYgpjHARpQKPPqsWkdbwfSB6PrAtuKF7QlxUVaO9nmodMDTTxr2Nz+F5sUNLoNKukKUpCI5aNRqWKycbzteNcOhSkZkGAcDLMGosJOBAUPqc9NHgnrNZg8mrIfT1Is/Balc9nsarqpnVjseU6xNU7tInGDoqJFP+alFs7JQiNGlSXiqSzd3U7XCV9fYO34uqol570ksuf3bPT1X022Hf2gx2wlP1iZ+w3u2B9Jhiye/bAHlvP0V50EB0uU6PWquYba1h0/B9j48TN</latexit> k Librosa.midi_to_hz
  • 11. 11 Methods Blank lines in MIDI-fb-spec. due to frequency resolution of FFT (see appendix) q Acoustic features 1. Mel-spectrogram 2. MIDI-filterbank-spectrogram Acoustic model Waveform model Wav MIDI API <latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit> a1:N <latexit sha1_base64="JElF7obLjklrepu+jo2EhDrW7Zk=">AAAC7nicbVLLahsxFJWnj6TuK2mX3Qw1hW5iZkpKQ1eBbroqLthJwDOYO5prW1iPQdI0MUK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRQ8ePnq8t/+k//TZ8xcvDw5fnRlVa4oTqrjSFwUY5EzixDLL8aLSCKLgeF6svjT+8x+oDVNybNcV5gIWks0ZBRuoLCuMU37m0s9jPzsYJMNkY/FdkO7AgOxsNDvs/clKRWuB0lIOxkzTpLK5A20Z5ej7WW2wArqCBU4DlCDQ5G7TtI/fBaaM50qHI228Yf9XOBDGrEURIgXYpen6GvI+37S285PcMVnVFiXdFprXPLYqbjYQl0wjtXwdAFDNQq8xXYIGasOe+pnES6qEAFm6sB0/TfNwK142vSjuBqlvDbYdxxb8HrZNLTRUS0avfL9dZDTWDLh3GV0iXQnQK98JQKkVDxFHHce3W6XFK7up6DSWrUxdxW2qruTIN++fdl/7Ljj7MEw/DpPvx4PTk91P2CdvyFvynqTkEzklX8mITAglFbkmN+RXVEXX0c/oZhsa9Xaa16Rl0e9/xe/2Jw==</latexit> o1:T <latexit sha1_base64="jJ70daV3Lv6VV/gW3FR+Rb1BrxU=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTVfBhTgJeAZzR3NtC+sxSJrGRug3sivpsh/TH+jfVGM70JnkgtDh3HvuSyoqzoxNkr+96MHDR4+f7D3tP3v+4uWr/YPX50bVmuKYKq70ZQEGOZM4tsxyvKw0gig4XhTLr43/4gdqw5Q8s+sKcwFzyWaMgg1UlhXGrfzUpcenfro/SIbJxuK7IN2BAdnZaHrQ+5OVitYCpaUcjJmkSWVzB9oyytH3s9pgBXQJc5wEKEGgyd2maR+/D0wZz5QOR9p4w/6vcCCMWYsiRAqwC9P1NeR9vkltZ0e5Y7KqLUq6LTSreWxV3GwgLplGavk6AKCahV5jugAN1IY99TOJV1QJAbJ0YTt+kubhVrxselHcDVLfGmw7ji34PWybmmuoFoyufL9dZHSmGXDvMrpAuhSgl74TgFIrHiIOO47TW6XFld1UdBrLVqau4jZVV3Lom/dPu699F5x/HKafh8n3T4OTo91P2CNvyTvygaTkCzkh38iIjAklFbkmN+RXVEXX0c/oZhsa9XaaN6Rl0e9/zaX2Kg==</latexit> x1:N 120 233 346 460 Frame index 16 32 48 64 80 96 112 128 Pinao roll index (1-128) Piano roll Mel filter bank MIDI filter bank E4 E6 E8 Frequency (0-12 kHz) 7 29 61 Mel filter index (1-80) E4 E6 E8 Frequency (0-12 kHz) 64 88 112 MIDI filter index (1-128) 120 233 346 460 Frame index 7 29 61 Dimension index (1-80) Mel-spectrum 120 233 346 460 Frame index 64 88 112 Dimension index (1-128) MIDI-fb-spectrum Mel-spec. MIDI-fb-spec.
  • 12. 12 Methods Zhao, Y., Wang, X., Juvela, L. & Yamagishi, J. Transferring neural speech waveform synthesizers to musical instrument sounds generation. in Proc. ICASSP 6269–6273 (IEEE, 2020). doi:10.1109/ICASSP40776.2020.9053047 q Waveform model § Based on music neural source-filter (NSF) model (Zhao 2020) but • No harmonic-plus-noise structure Acoustic model Waveform model Wav MIDI API <latexit sha1_base64="sgmOfknO9phsp1rj4lSDCn9gz0Q=">AAAC7nicbVLLahsxFJWnr9R9JGmX3Qw1hW5iZkpLQleBbroKLsRJwDOYO5prW1iPQdK0MUK/kV1Jl/mY/ED/phrbgc4kF4QO595zX1JRcWZskvztRY8eP3n6bOd5/8XLV6939/bfnBlVa4pjqrjSFwUY5Ezi2DLL8aLSCKLgeF4svzX+85+oDVPy1K4qzAXMJZsxCjZQWVYYB37q0q8nfro3SIbJ2uL7IN2CAdnaaLrfu81KRWuB0lIOxkzSpLK5A20Z5ej7WW2wArqEOU4ClCDQ5G7dtI8/BKaMZ0qHI228Zv9XOBDGrEQRIgXYhen6GvIh36S2s6PcMVnVFiXdFJrVPLYqbjYQl0wjtXwVAFDNQq8xXYAGasOe+pnEX1QJAbJ0YTt+kubhVrxselHcDVLfGmwzji34A2ybmmuoFoxe+n67yOhUM+DeZXSBdClAL30nAKVWPEQcdBwnd0qLl3Zd0WksW5m6irtUXcmBb94/7b72fXD2aZh+GSY/Pg+Oj7Y/YYe8I+/JR5KSQ3JMvpMRGRNKKnJFrsmfqIquot/R9SY06m01b0nLopt/k5v2Ew==</latexit> a1:N <latexit sha1_base64="JElF7obLjklrepu+jo2EhDrW7Zk=">AAAC7nicbVLLahsxFJWnj6TuK2mX3Qw1hW5iZkpKQ1eBbroqLthJwDOYO5prW1iPQdI0MUK/kV1Jl/2Y/kD/phrbgc4kF4QO595zX1JRcWZskvztRQ8ePnq8t/+k//TZ8xcvDw5fnRlVa4oTqrjSFwUY5EzixDLL8aLSCKLgeF6svjT+8x+oDVNybNcV5gIWks0ZBRuoLCuMU37m0s9jPzsYJMNkY/FdkO7AgOxsNDvs/clKRWuB0lIOxkzTpLK5A20Z5ej7WW2wArqCBU4DlCDQ5G7TtI/fBaaM50qHI228Yf9XOBDGrEURIgXYpen6GvI+37S285PcMVnVFiXdFprXPLYqbjYQl0wjtXwdAFDNQq8xXYIGasOe+pnES6qEAFm6sB0/TfNwK142vSjuBqlvDbYdxxb8HrZNLTRUS0avfL9dZDTWDLh3GV0iXQnQK98JQKkVDxFHHce3W6XFK7up6DSWrUxdxW2qruTIN++fdl/7Ljj7MEw/DpPvx4PTk91P2CdvyFvynqTkEzklX8mITAglFbkmN+RXVEXX0c/oZhsa9Xaa16Rl0e9/xe/2Jw==</latexit> o1:T <latexit sha1_base64="jJ70daV3Lv6VV/gW3FR+Rb1BrxU=">AAAC7nicbVLLahsxFJWnr9R9Je2ym6Gm0E3MTGlpyCrQTVfBhTgJeAZzR3NtC+sxSJrGRug3sivpsh/TH+jfVGM70JnkgtDh3HvuSyoqzoxNkr+96MHDR4+f7D3tP3v+4uWr/YPX50bVmuKYKq70ZQEGOZM4tsxyvKw0gig4XhTLr43/4gdqw5Q8s+sKcwFzyWaMgg1UlhXGrfzUpcenfro/SIbJxuK7IN2BAdnZaHrQ+5OVitYCpaUcjJmkSWVzB9oyytH3s9pgBXQJc5wEKEGgyd2maR+/D0wZz5QOR9p4w/6vcCCMWYsiRAqwC9P1NeR9vkltZ0e5Y7KqLUq6LTSreWxV3GwgLplGavk6AKCahV5jugAN1IY99TOJV1QJAbJ0YTt+kubhVrxselHcDVLfGmw7ji34PWybmmuoFoyufL9dZHSmGXDvMrpAuhSgl74TgFIrHiIOO47TW6XFld1UdBrLVqau4jZVV3Lom/dPu699F5x/HKafh8n3T4OTo91P2CNvyTvygaTkCzkh38iIjAklFbkmN+RXVEXX0c/oZhsa9XaaN6Rl0e9/zaX2Kg==</latexit> x1:N Bi-LSTM 1D CNN Up-sampling Acoustic features Excitation signal Block 5 Wav … FC Dilated 1D conv Dilated 1D conv FC Block 1 … Condition module Neural filter module
  • 13. 13 Experiments Hawthorne, C. et al. Enabling factorized piano music modeling and generation with the MAESTRO dataset. in Proc. ICLR (2018). https://piano-e-competition.com/ q Database: MAESTRO v2.0 (Hawthorne 2018) § Real piano performances in International Piano-e-Competition § MIDI was recorded simultaneously during performance • Aligned audio & piano roll § For experiments: • Follow official data split • 24kHz, 16 bits PCM Data split of MAESTRO From https://magenta.tensorflow.org/datasets/maestro#v200
  • 14. 14 Experiments https://www.fluidsynth.org/ https://www.modartt.com/pianoteq https://github.com/bwang514/PerformanceNet q Models in comparison Software <latexit sha1_base64="QvEzsRPQOsJO2uPs95Rt4gD1HG8=">AAAM5XiclVbdbuNEFPYuC9mGAFu45MaiYrUg2bId56d3i1i6RNotQaVppLqqxpNJYtX2BM9428jyI3CHkOCGV+EW3oC34cxMksY/SYVTy9PxOd985ztzzthfhAHjlvXvo8fvPXn/g8bTg+aHrY8+/uTZ4acjRtMEk3NMQ5qMfcRIGMTknAc8JONFQlDkh+TCv/lWvL94RxIW0PgnvlyQqwjN4mAaYMRh6vqw0fZ8MgvijCM/DVGSZ3h15U2P00WShqTpRWnIg4TeZk6efZ1nZ0vGSaQPXuX6c7380mNzmnDGEb7JvsE0ZTzAuufpEZ2QMP8/HlOCeJqQB32+u8MBN4UHC2YxeniRC/SOmHWcQM80ioUHzrNhwPFcjwIWIRg8hPn2hzOB+CIiKP4qzz2v6WGRlKxr9PKmXnM9lz/xB3dMOYEHBsCJWAnco2Ai1T8VKqAQZo0dt2tabsHFY0s2HozKcbl5FkL2ULQIiSF2zUR/O3g1MDg1UDoJqM7olN8ikFyw6piOJZ5ds9cTFNtmtytXEeA7sIfzJYO9FRpS2X3ortl3FHpHrJK5ptPJi0Egn8kYRIgwGRhTH0YMRIXH6dmJFE5poPj1exJA+A0qfjEN2Max6Nfb8ts4ktBgC2wWV9z2c8yes/Greu1Zr29XsjVSK3uc3HHOoRoxdfJdcR+AbiBY186Vgm1X0jnurdMzeghsm9yBzAbkVmF1JUW7v0m1XoJq7+aloHp9BeX2FdTxGmow3g9VZdU/Xu2RdglqXInQ3SsXQK0DPFZi3Qc4KEfoPkQLlFYROgqrcy88TAxPTgnfrZFUqWN2ZGg903ZVaNZGpVEtRoHFGkRJ3TMdlTXrnsg2htyUpLyXNyAq9cBERGWLulwzGQx2wJTIKC7uiktHidLeypWUeA20BhEl8fqN5AEIHdNtq7bYheTIKdvsOpVK2VSoUS5NSEx71VNcS1Vof5Plba9SaYKfu+p0Vl+6dRV3n3JOI7UyiSebM/L62ZFlWvLSqwN7NTjSVtfw+vDJ796E4jQiMcchYuzSthb8KkMJ9M+QwFGbMrKAYwTNyCUMYxQRdpXJ4z3Xv0xFp57SBO6Y63J22yNDEWPLyAdLOKrmrPxOTNa9u0z5tH+VBfEi5STGaqFpGuqc6uJbQZ8ECcE8XMIA4SQQxzKeowRhDl8UTS8mt5hGEQJlPJ/ll/YVPGk4EVxomB3ZeSEwFQ73w5rZ4tQsQYt5gO+Ksz6lN5ABVpxdH8l5s0hINGU4oVdVDf8ZYv9C5gWpsmHZcurvtKwBhU1YZzqqgtZYiq15byjLHZY2YjatsyybAuBO0y3bxTQmfJcAsuVUbOslEJ2lDrY+sFGVwk4RRtt85Ym1j8Og3nqHvuOibXsPcq1pPey4RMHdB1vm6+4FruZO9Ezj9Zta662MnIRpAPUXQ8HXWG4ZDgMkPjl/zkUzs8utqzoYOabdMa0f3aOX/VVbe6p9rn2hvdBsrae91L7Xhtq5hht/NP5q/N34pzVr/dL6tfWbMn38aOXzmVa4Wn/+B5czF8c=</latexit> System ID Acoustic model Acoustic feature Excit. signal Wave. model Pitch mismatch MOS (mean) note chord Natural - - - - - - 4.04 Fluidsynth Sample-based MIDI-to-audio software 5.20 6.77 3.66 Pianoteq Physical-model MIDI-to-audio software 4.82 6.50 4.25 abs-mfb-sin - midi-fb sine NSF - - 3.87 abs-mfb-noi - midi-fb noise NSF - - 3.77 abs-mel-sin - mel-spc. sine NSF - - 2.72 abs-mel-noi - mel-spc. noise NSF - - 3.81 taco2-mfb-sin taco2 midi-fb sine NSF 4.61 6.34 2.97 taco2-mfb-noi taco2 midi-fb noise NSF 4.66 6.36 3.18 taco3-mfb-sin taco3 midi-fb sine NSF 4.78 6.48 3.19 taco3-mfb-noi taco3 midi-fb noise NSF 4.89 6.53 3.19 taco4-mfb-sin taco4 midi-fb sine NSF 4.86 6.39 2.98 taco4-mfb-noi taco4 midi-fb noise NSF 4.97 6.42 2.95 pfnet-mfb-sin PFNet midi-fb sine NSF 5.59 7.14 3.10 pfnet-mfb-noi PFNet midi-fb noise NSF 5.78 7.26 3.05 pfnet-mel-sin PFNet mel-spec. sine NSF 5.66 7.17 1.82 pfnet-mel-noi PFNet mel-spec. noise NSF 5.74 7.25 2.93 pfnet-spec-GL PFNet spec. - GL 5.43 6.98 1.62 midi-sin-nsf - - sine NSF 4.32 6.40 2.88 midi-noi-nsf - - noise NSF 4.40 6.08 2.63 Software Original PerformanceNet Open-source code
  • 15. 15 Experiments https://www.fluidsynth.org/ https://www.modartt.com/pianoteq q Models in comparison <latexit sha1_base64="QvEzsRPQOsJO2uPs95Rt4gD1HG8=">AAAM5XiclVbdbuNEFPYuC9mGAFu45MaiYrUg2bId56d3i1i6RNotQaVppLqqxpNJYtX2BM9428jyI3CHkOCGV+EW3oC34cxMksY/SYVTy9PxOd985ztzzthfhAHjlvXvo8fvPXn/g8bTg+aHrY8+/uTZ4acjRtMEk3NMQ5qMfcRIGMTknAc8JONFQlDkh+TCv/lWvL94RxIW0PgnvlyQqwjN4mAaYMRh6vqw0fZ8MgvijCM/DVGSZ3h15U2P00WShqTpRWnIg4TeZk6efZ1nZ0vGSaQPXuX6c7380mNzmnDGEb7JvsE0ZTzAuufpEZ2QMP8/HlOCeJqQB32+u8MBN4UHC2YxeniRC/SOmHWcQM80ioUHzrNhwPFcjwIWIRg8hPn2hzOB+CIiKP4qzz2v6WGRlKxr9PKmXnM9lz/xB3dMOYEHBsCJWAnco2Ai1T8VKqAQZo0dt2tabsHFY0s2HozKcbl5FkL2ULQIiSF2zUR/O3g1MDg1UDoJqM7olN8ikFyw6piOJZ5ds9cTFNtmtytXEeA7sIfzJYO9FRpS2X3ortl3FHpHrJK5ptPJi0Egn8kYRIgwGRhTH0YMRIXH6dmJFE5poPj1exJA+A0qfjEN2Max6Nfb8ts4ktBgC2wWV9z2c8yes/Greu1Zr29XsjVSK3uc3HHOoRoxdfJdcR+AbiBY186Vgm1X0jnurdMzeghsm9yBzAbkVmF1JUW7v0m1XoJq7+aloHp9BeX2FdTxGmow3g9VZdU/Xu2RdglqXInQ3SsXQK0DPFZi3Qc4KEfoPkQLlFYROgqrcy88TAxPTgnfrZFUqWN2ZGg903ZVaNZGpVEtRoHFGkRJ3TMdlTXrnsg2htyUpLyXNyAq9cBERGWLulwzGQx2wJTIKC7uiktHidLeypWUeA20BhEl8fqN5AEIHdNtq7bYheTIKdvsOpVK2VSoUS5NSEx71VNcS1Vof5Plba9SaYKfu+p0Vl+6dRV3n3JOI7UyiSebM/L62ZFlWvLSqwN7NTjSVtfw+vDJ796E4jQiMcchYuzSthb8KkMJ9M+QwFGbMrKAYwTNyCUMYxQRdpXJ4z3Xv0xFp57SBO6Y63J22yNDEWPLyAdLOKrmrPxOTNa9u0z5tH+VBfEi5STGaqFpGuqc6uJbQZ8ECcE8XMIA4SQQxzKeowRhDl8UTS8mt5hGEQJlPJ/ll/YVPGk4EVxomB3ZeSEwFQ73w5rZ4tQsQYt5gO+Ksz6lN5ABVpxdH8l5s0hINGU4oVdVDf8ZYv9C5gWpsmHZcurvtKwBhU1YZzqqgtZYiq15byjLHZY2YjatsyybAuBO0y3bxTQmfJcAsuVUbOslEJ2lDrY+sFGVwk4RRtt85Ym1j8Og3nqHvuOibXsPcq1pPey4RMHdB1vm6+4FruZO9Ezj9Zta662MnIRpAPUXQ8HXWG4ZDgMkPjl/zkUzs8utqzoYOabdMa0f3aOX/VVbe6p9rn2hvdBsrae91L7Xhtq5hht/NP5q/N34pzVr/dL6tfWbMn38aOXzmVa4Wn/+B5czF8c=</latexit> System ID Acoustic model Acoustic feature Excit. signal Wave. model Pitch mismatch MOS (mean) note chord Natural - - - - - - 4.04 Fluidsynth Sample-based MIDI-to-audio software 5.20 6.77 3.66 Pianoteq Physical-model MIDI-to-audio software 4.82 6.50 4.25 abs-mfb-sin - midi-fb sine NSF - - 3.87 abs-mfb-noi - midi-fb noise NSF - - 3.77 abs-mel-sin - mel-spc. sine NSF - - 2.72 abs-mel-noi - mel-spc. noise NSF - - 3.81 taco2-mfb-sin taco2 midi-fb sine NSF 4.61 6.34 2.97 taco2-mfb-noi taco2 midi-fb noise NSF 4.66 6.36 3.18 taco3-mfb-sin taco3 midi-fb sine NSF 4.78 6.48 3.19 taco3-mfb-noi taco3 midi-fb noise NSF 4.89 6.53 3.19 taco4-mfb-sin taco4 midi-fb sine NSF 4.86 6.39 2.98 taco4-mfb-noi taco4 midi-fb noise NSF 4.97 6.42 2.95 pfnet-mfb-sin PFNet midi-fb sine NSF 5.59 7.14 3.10 pfnet-mfb-noi PFNet midi-fb noise NSF 5.78 7.26 3.05 pfnet-mel-sin PFNet mel-spec. sine NSF 5.66 7.17 1.82 pfnet-mel-noi PFNet mel-spec. noise NSF 5.74 7.25 2.93 pfnet-spec-GL PFNet spec. - GL 5.43 6.98 1.62 midi-sin-nsf - - sine NSF 4.32 6.40 2.88 midi-noi-nsf - - noise NSF 4.40 6.08 2.63 Copy-synthesis MIDI-to-audio Wav NSF MIDI API Sine Wav NSF Noise Natural Natural Wav Acoustic model NSF MIDI API Sine Wav Acoustic model NSF MIDI API Noise
  • 16. 16 Experiments q Listening test: crowd-sourcing, 224 amateur participants Natural Fluidsynth Pianoteq abs-mfb-sin abs-mfb-noi abs-mel-sin abs-mel-noi taco2-mfb-sin taco2-mfb-noi taco3-mfb-sin taco3-mfb-noi taco4-mfb-sin taco4-mfb-noi pfnet-mfb-sin pfnet-mfb-noi pfnet-mel-sin pfnet-mel-noi pfnet-spec-GL midi-sin-nsf midi-noi-nsf 1 2 3 4 5 Quality (MOS) MIDI spectrogram Mel spectrogram Other / N/A
  • 17. 17 Experiments Mann-whitey-U, Holm-Boferroni correction: “*” statistical significance at alpha=0.05, “-” otherwise q Listening test: crowd-sourcing, 224 amateur participants Natural Fluidsynth Pianoteq abs-mfb-sin abs-mfb-noi abs-mel-sin abs-mel-noi taco2-mfb-sin taco2-mfb-noi taco3-mfb-sin taco3-mfb-noi taco4-mfb-sin taco4-mfb-noi pfnet-mfb-sin pfnet-mfb-noi pfnet-mel-sin pfnet-mel-noi pfnet-spec-GL midi-sin-nsf midi-noi-nsf 1 2 3 4 5 Quality (MOS) MIDI spectrogram Mel spectrogram Other / N/A * - <latexit sha1_base64="QvEzsRPQOsJO2uPs95Rt4gD1HG8=">AAAM5XiclVbdbuNEFPYuC9mGAFu45MaiYrUg2bId56d3i1i6RNotQaVppLqqxpNJYtX2BM9428jyI3CHkOCGV+EW3oC34cxMksY/SYVTy9PxOd985ztzzthfhAHjlvXvo8fvPXn/g8bTg+aHrY8+/uTZ4acjRtMEk3NMQ5qMfcRIGMTknAc8JONFQlDkh+TCv/lWvL94RxIW0PgnvlyQqwjN4mAaYMRh6vqw0fZ8MgvijCM/DVGSZ3h15U2P00WShqTpRWnIg4TeZk6efZ1nZ0vGSaQPXuX6c7380mNzmnDGEb7JvsE0ZTzAuufpEZ2QMP8/HlOCeJqQB32+u8MBN4UHC2YxeniRC/SOmHWcQM80ioUHzrNhwPFcjwIWIRg8hPn2hzOB+CIiKP4qzz2v6WGRlKxr9PKmXnM9lz/xB3dMOYEHBsCJWAnco2Ai1T8VKqAQZo0dt2tabsHFY0s2HozKcbl5FkL2ULQIiSF2zUR/O3g1MDg1UDoJqM7olN8ikFyw6piOJZ5ds9cTFNtmtytXEeA7sIfzJYO9FRpS2X3ortl3FHpHrJK5ptPJi0Egn8kYRIgwGRhTH0YMRIXH6dmJFE5poPj1exJA+A0qfjEN2Max6Nfb8ts4ktBgC2wWV9z2c8yes/Greu1Zr29XsjVSK3uc3HHOoRoxdfJdcR+AbiBY186Vgm1X0jnurdMzeghsm9yBzAbkVmF1JUW7v0m1XoJq7+aloHp9BeX2FdTxGmow3g9VZdU/Xu2RdglqXInQ3SsXQK0DPFZi3Qc4KEfoPkQLlFYROgqrcy88TAxPTgnfrZFUqWN2ZGg903ZVaNZGpVEtRoHFGkRJ3TMdlTXrnsg2htyUpLyXNyAq9cBERGWLulwzGQx2wJTIKC7uiktHidLeypWUeA20BhEl8fqN5AEIHdNtq7bYheTIKdvsOpVK2VSoUS5NSEx71VNcS1Vof5Plba9SaYKfu+p0Vl+6dRV3n3JOI7UyiSebM/L62ZFlWvLSqwN7NTjSVtfw+vDJ796E4jQiMcchYuzSthb8KkMJ9M+QwFGbMrKAYwTNyCUMYxQRdpXJ4z3Xv0xFp57SBO6Y63J22yNDEWPLyAdLOKrmrPxOTNa9u0z5tH+VBfEi5STGaqFpGuqc6uJbQZ8ECcE8XMIA4SQQxzKeowRhDl8UTS8mt5hGEQJlPJ/ll/YVPGk4EVxomB3ZeSEwFQ73w5rZ4tQsQYt5gO+Ksz6lN5ABVpxdH8l5s0hINGU4oVdVDf8ZYv9C5gWpsmHZcurvtKwBhU1YZzqqgtZYiq15byjLHZY2YjatsyybAuBO0y3bxTQmfJcAsuVUbOslEJ2lDrY+sFGVwk4RRtt85Ym1j8Og3nqHvuOibXsPcq1pPey4RMHdB1vm6+4FruZO9Ezj9Zta662MnIRpAPUXQ8HXWG4ZDgMkPjl/zkUzs8utqzoYOabdMa0f3aOX/VVbe6p9rn2hvdBsrae91L7Xhtq5hht/NP5q/N34pzVr/dL6tfWbMn38aOXzmVa4Wn/+B5czF8c=</latexit> System ID Acoustic model Acoustic feature Excit. signal Wave. model Pitch mismatch MOS (mean note chord Natural - - - - - - 4.04 Fluidsynth Sample-based MIDI-to-audio software 5.20 6.77 3.66 Pianoteq Physical-model MIDI-to-audio software 4.82 6.50 4.25 abs-mfb-sin - midi-fb sine NSF - - 3.87 abs-mfb-noi - midi-fb noise NSF - - 3.77 abs-mel-sin - mel-spc. sine NSF - - 2.72 abs-mel-noi - mel-spc. noise NSF - - 3.81 taco2-mfb-sin taco2 midi-fb sine NSF 4.61 6.34 2.97 taco2-mfb-noi taco2 midi-fb noise NSF 4.66 6.36 3.18 taco3-mfb-sin taco3 midi-fb sine NSF 4.78 6.48 3.19 taco3-mfb-noi taco3 midi-fb noise NSF 4.89 6.53 3.19 taco4-mfb-sin taco4 midi-fb sine NSF 4.86 6.39 2.98 taco4-mfb-noi taco4 midi-fb noise NSF 4.97 6.42 2.95 pfnet-mfb-sin PFNet midi-fb sine NSF 5.59 7.14 3.10 pfnet-mfb-noi PFNet midi-fb noise NSF 5.78 7.26 3.05 pfnet-mel-sin PFNet mel-spec. sine NSF 5.66 7.17 1.82 pfnet-mel-noi PFNet mel-spec. noise NSF 5.74 7.25 2.93 pfnet-spec-GL PFNet spec. - GL 5.43 6.98 1.62
  • 18. 18 Experiments Mann-whitey-U, Holm-Boferroni correction: “*” statistical significance at alpha=0.05, “-” otherwise q Listening test: crowd-sourcing, 224 amateur participants § Neural waveform model works well in copy-synthesis condition Natural Fluidsynth Pianoteq abs-mfb-sin abs-mfb-noi abs-mel-sin abs-mel-noi taco2-mfb-sin taco2-mfb-noi taco3-mfb-sin taco3-mfb-noi taco4-mfb-sin taco4-mfb-noi pfnet-mfb-sin pfnet-mfb-noi pfnet-mel-sin pfnet-mel-noi pfnet-spec-GL midi-sin-nsf midi-noi-nsf 1 2 3 4 5 Quality (MOS) MIDI spectrogram Mel spectrogram Other / N/A <latexit sha1_base64="QvEzsRPQOsJO2uPs95Rt4gD1HG8=">AAAM5XiclVbdbuNEFPYuC9mGAFu45MaiYrUg2bId56d3i1i6RNotQaVppLqqxpNJYtX2BM9428jyI3CHkOCGV+EW3oC34cxMksY/SYVTy9PxOd985ztzzthfhAHjlvXvo8fvPXn/g8bTg+aHrY8+/uTZ4acjRtMEk3NMQ5qMfcRIGMTknAc8JONFQlDkh+TCv/lWvL94RxIW0PgnvlyQqwjN4mAaYMRh6vqw0fZ8MgvijCM/DVGSZ3h15U2P00WShqTpRWnIg4TeZk6efZ1nZ0vGSaQPXuX6c7380mNzmnDGEb7JvsE0ZTzAuufpEZ2QMP8/HlOCeJqQB32+u8MBN4UHC2YxeniRC/SOmHWcQM80ioUHzrNhwPFcjwIWIRg8hPn2hzOB+CIiKP4qzz2v6WGRlKxr9PKmXnM9lz/xB3dMOYEHBsCJWAnco2Ai1T8VKqAQZo0dt2tabsHFY0s2HozKcbl5FkL2ULQIiSF2zUR/O3g1MDg1UDoJqM7olN8ikFyw6piOJZ5ds9cTFNtmtytXEeA7sIfzJYO9FRpS2X3ortl3FHpHrJK5ptPJi0Egn8kYRIgwGRhTH0YMRIXH6dmJFE5poPj1exJA+A0qfjEN2Max6Nfb8ts4ktBgC2wWV9z2c8yes/Greu1Zr29XsjVSK3uc3HHOoRoxdfJdcR+AbiBY186Vgm1X0jnurdMzeghsm9yBzAbkVmF1JUW7v0m1XoJq7+aloHp9BeX2FdTxGmow3g9VZdU/Xu2RdglqXInQ3SsXQK0DPFZi3Qc4KEfoPkQLlFYROgqrcy88TAxPTgnfrZFUqWN2ZGg903ZVaNZGpVEtRoHFGkRJ3TMdlTXrnsg2htyUpLyXNyAq9cBERGWLulwzGQx2wJTIKC7uiktHidLeypWUeA20BhEl8fqN5AEIHdNtq7bYheTIKdvsOpVK2VSoUS5NSEx71VNcS1Vof5Plba9SaYKfu+p0Vl+6dRV3n3JOI7UyiSebM/L62ZFlWvLSqwN7NTjSVtfw+vDJ796E4jQiMcchYuzSthb8KkMJ9M+QwFGbMrKAYwTNyCUMYxQRdpXJ4z3Xv0xFp57SBO6Y63J22yNDEWPLyAdLOKrmrPxOTNa9u0z5tH+VBfEi5STGaqFpGuqc6uJbQZ8ECcE8XMIA4SQQxzKeowRhDl8UTS8mt5hGEQJlPJ/ll/YVPGk4EVxomB3ZeSEwFQ73w5rZ4tQsQYt5gO+Ksz6lN5ABVpxdH8l5s0hINGU4oVdVDf8ZYv9C5gWpsmHZcurvtKwBhU1YZzqqgtZYiq15byjLHZY2YjatsyybAuBO0y3bxTQmfJcAsuVUbOslEJ2lDrY+sFGVwk4RRtt85Ym1j8Og3nqHvuOibXsPcq1pPey4RMHdB1vm6+4FruZO9Ezj9Zta662MnIRpAPUXQ8HXWG4ZDgMkPjl/zkUzs8utqzoYOabdMa0f3aOX/VVbe6p9rn2hvdBsrae91L7Xhtq5hht/NP5q/N34pzVr/dL6tfWbMn38aOXzmVa4Wn/+B5czF8c=</latexit> System ID Acoustic model Acoustic feature Excit. signal Wave. model Pitch mismatch MOS (mean note chord Natural - - - - - - 4.04 Fluidsynth Sample-based MIDI-to-audio software 5.20 6.77 3.66 Pianoteq Physical-model MIDI-to-audio software 4.82 6.50 4.25 abs-mfb-sin - midi-fb sine NSF - - 3.87 abs-mfb-noi - midi-fb noise NSF - - 3.77 abs-mel-sin - mel-spc. sine NSF - - 2.72 abs-mel-noi - mel-spc. noise NSF - - 3.81 taco2-mfb-sin taco2 midi-fb sine NSF 4.61 6.34 2.97 taco2-mfb-noi taco2 midi-fb noise NSF 4.66 6.36 3.18 taco3-mfb-sin taco3 midi-fb sine NSF 4.78 6.48 3.19 taco3-mfb-noi taco3 midi-fb noise NSF 4.89 6.53 3.19 taco4-mfb-sin taco4 midi-fb sine NSF 4.86 6.39 2.98 taco4-mfb-noi taco4 midi-fb noise NSF 4.97 6.42 2.95 pfnet-mfb-sin PFNet midi-fb sine NSF 5.59 7.14 3.10 pfnet-mfb-noi PFNet midi-fb noise NSF 5.78 7.26 3.05 pfnet-mel-sin PFNet mel-spec. sine NSF 5.66 7.17 1.82 pfnet-mel-noi PFNet mel-spec. noise NSF 5.74 7.25 2.93 pfnet-spec-GL PFNet spec. - GL 5.43 6.98 1.62 * *
  • 19. 19 Experiments q Listening test: crowd-sourcing, 224 amateur participants Natural Fluidsynth Pianoteq abs-mfb-sin abs-mfb-noi abs-mel-sin abs-mel-noi taco2-mfb-sin taco2-mfb-noi taco3-mfb-sin taco3-mfb-noi taco4-mfb-sin taco4-mfb-noi pfnet-mfb-sin pfnet-mfb-noi pfnet-mel-sin pfnet-mel-noi pfnet-spec-GL midi-sin-nsf midi-noi-nsf 1 2 3 4 5 Quality (MOS) MIDI spectrogram Mel spectrogram Other / N/A <latexit sha1_base64="QvEzsRPQOsJO2uPs95Rt4gD1HG8=">AAAM5XiclVbdbuNEFPYuC9mGAFu45MaiYrUg2bId56d3i1i6RNotQaVppLqqxpNJYtX2BM9428jyI3CHkOCGV+EW3oC34cxMksY/SYVTy9PxOd985ztzzthfhAHjlvXvo8fvPXn/g8bTg+aHrY8+/uTZ4acjRtMEk3NMQ5qMfcRIGMTknAc8JONFQlDkh+TCv/lWvL94RxIW0PgnvlyQqwjN4mAaYMRh6vqw0fZ8MgvijCM/DVGSZ3h15U2P00WShqTpRWnIg4TeZk6efZ1nZ0vGSaQPXuX6c7380mNzmnDGEb7JvsE0ZTzAuufpEZ2QMP8/HlOCeJqQB32+u8MBN4UHC2YxeniRC/SOmHWcQM80ioUHzrNhwPFcjwIWIRg8hPn2hzOB+CIiKP4qzz2v6WGRlKxr9PKmXnM9lz/xB3dMOYEHBsCJWAnco2Ai1T8VKqAQZo0dt2tabsHFY0s2HozKcbl5FkL2ULQIiSF2zUR/O3g1MDg1UDoJqM7olN8ikFyw6piOJZ5ds9cTFNtmtytXEeA7sIfzJYO9FRpS2X3ortl3FHpHrJK5ptPJi0Egn8kYRIgwGRhTH0YMRIXH6dmJFE5poPj1exJA+A0qfjEN2Max6Nfb8ts4ktBgC2wWV9z2c8yes/Greu1Zr29XsjVSK3uc3HHOoRoxdfJdcR+AbiBY186Vgm1X0jnurdMzeghsm9yBzAbkVmF1JUW7v0m1XoJq7+aloHp9BeX2FdTxGmow3g9VZdU/Xu2RdglqXInQ3SsXQK0DPFZi3Qc4KEfoPkQLlFYROgqrcy88TAxPTgnfrZFUqWN2ZGg903ZVaNZGpVEtRoHFGkRJ3TMdlTXrnsg2htyUpLyXNyAq9cBERGWLulwzGQx2wJTIKC7uiktHidLeypWUeA20BhEl8fqN5AEIHdNtq7bYheTIKdvsOpVK2VSoUS5NSEx71VNcS1Vof5Plba9SaYKfu+p0Vl+6dRV3n3JOI7UyiSebM/L62ZFlWvLSqwN7NTjSVtfw+vDJ796E4jQiMcchYuzSthb8KkMJ9M+QwFGbMrKAYwTNyCUMYxQRdpXJ4z3Xv0xFp57SBO6Y63J22yNDEWPLyAdLOKrmrPxOTNa9u0z5tH+VBfEi5STGaqFpGuqc6uJbQZ8ECcE8XMIA4SQQxzKeowRhDl8UTS8mt5hGEQJlPJ/ll/YVPGk4EVxomB3ZeSEwFQ73w5rZ4tQsQYt5gO+Ksz6lN5ABVpxdH8l5s0hINGU4oVdVDf8ZYv9C5gWpsmHZcurvtKwBhU1YZzqqgtZYiq15byjLHZY2YjatsyybAuBO0y3bxTQmfJcAsuVUbOslEJ2lDrY+sFGVwk4RRtt85Ym1j8Og3nqHvuOibXsPcq1pPey4RMHdB1vm6+4FruZO9Ezj9Zta662MnIRpAPUXQ8HXWG4ZDgMkPjl/zkUzs8utqzoYOabdMa0f3aOX/VVbe6p9rn2hvdBsrae91L7Xhtq5hht/NP5q/N34pzVr/dL6tfWbMn38aOXzmVa4Wn/+B5czF8c=</latexit> System ID Acoustic model Acoustic feature Excit. signal Wave. model P n Natural - - - - Fluidsynth Sample-based MIDI-to-audio software 5 Pianoteq Physical-model MIDI-to-audio software 4 abs-mfb-sin - midi-fb sine NSF abs-mfb-noi - midi-fb noise NSF abs-mel-sin - mel-spc. sine NSF abs-mel-noi - mel-spc. noise NSF taco2-mfb-sin taco2 midi-fb sine NSF 4 taco2-mfb-noi taco2 midi-fb noise NSF 4 taco3-mfb-sin taco3 midi-fb sine NSF 4 taco3-mfb-noi taco3 midi-fb noise NSF 4 taco4-mfb-sin taco4 midi-fb sine NSF 4 taco4-mfb-noi taco4 midi-fb noise NSF 4 pfnet-mfb-sin PFNet midi-fb sine NSF 5 pfnet-mfb-noi PFNet midi-fb noise NSF 5 pfnet-mel-sin PFNet mel-spec. sine NSF 5 pfnet-mel-noi PFNet mel-spec. noise NSF 5 pfnet-spec-GL PFNet spec. - GL 5 • Taco2: 800-frame segments; 4x downsampling for better alignments • Taco3: Warm-started from taco2; input current piano-roll frame at the decoder pre-net • Taco4: Warm-started from taco2; no downsampling or piano-roll input to pre-net * *
  • 20. 20 Experiments q Listening test: crowd-sourcing, 224 amateur participants q Pitch distortion § Natural audio as target § Evaluated on single notes § The lower the better Natural Fluidsynth Pianoteq abs-mfb-sin abs-mfb-noi abs-mel-sin abs-mel-noi taco2-mfb-sin taco2-mfb-noi taco3-mfb-sin taco3-mfb-noi taco4-mfb-sin taco4-mfb-noi pfnet-mfb-sin pfnet-mfb-noi pfnet-mel-sin pfnet-mel-noi pfnet-spec-GL midi-sin-nsf midi-noi-nsf 1 2 3 4 5 Quality (MOS) MIDI spectrogram Mel spectrogram Other / N/A 4 4.2 4.4 4.6 4.8 5 5.2 5.4 5.6 5.8 6
  • 21. 21 Messages q TTS & MIDI-to-audio § Techniques can be shared: acoustic model, waveform model § Performance bottleneck on acoustic model (and waveform model) q On waveform modeling § Physical-model performs well but lacks reverberation effect § Sample-based model replies on the sample database § Non-AR waveform model is OK in copy-synthesis • Reverberation is captured • Noise excitation is OK
  • 22. 22 Messages q On acoustic model § Obtaining good alignments for longer input sequences is challenging § Inputting the piano-roll frame to the decoder prenet helps improve alignments • Acceptable for perfectly-aligned performance-MIDI • Have to consider other strategies for non-aligned score-MIDI
  • 23. 23 Thank you! Tacotron code: https://github.com/nii-yamagishilab/self-attention-tacotron NSF code: https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts Samples: https://nii-yamagishilab.github.io/samples-xin/main-midi2audio.html
  • 24. 25 Appendix q On MIDI filterbank Center of MIDI filter bank Mel filter banks
  • 25. 26 Appendix q On MIDI filterbank Center of MIDI filter bank Mel filter banks
  • 26. 27 Appendix q On MIDI filterbank Mel-spectrogram MIDI-centered filter-bank CQT
  • 27. 28 Appendix q On MIDI filterbank Natural Mel- spectrogram to wave MIDI- centered filter-bank CQT Short clip 1 Short clip 2 Short clip 3 Short clip 4
  • 28. 29 Appendix https://www.fluidsynth.org/ https://www.modartt.com/pianoteq q Models in comparison <latexit sha1_base64="QvEzsRPQOsJO2uPs95Rt4gD1HG8=">AAAM5XiclVbdbuNEFPYuC9mGAFu45MaiYrUg2bId56d3i1i6RNotQaVppLqqxpNJYtX2BM9428jyI3CHkOCGV+EW3oC34cxMksY/SYVTy9PxOd985ztzzthfhAHjlvXvo8fvPXn/g8bTg+aHrY8+/uTZ4acjRtMEk3NMQ5qMfcRIGMTknAc8JONFQlDkh+TCv/lWvL94RxIW0PgnvlyQqwjN4mAaYMRh6vqw0fZ8MgvijCM/DVGSZ3h15U2P00WShqTpRWnIg4TeZk6efZ1nZ0vGSaQPXuX6c7380mNzmnDGEb7JvsE0ZTzAuufpEZ2QMP8/HlOCeJqQB32+u8MBN4UHC2YxeniRC/SOmHWcQM80ioUHzrNhwPFcjwIWIRg8hPn2hzOB+CIiKP4qzz2v6WGRlKxr9PKmXnM9lz/xB3dMOYEHBsCJWAnco2Ai1T8VKqAQZo0dt2tabsHFY0s2HozKcbl5FkL2ULQIiSF2zUR/O3g1MDg1UDoJqM7olN8ikFyw6piOJZ5ds9cTFNtmtytXEeA7sIfzJYO9FRpS2X3ortl3FHpHrJK5ptPJi0Egn8kYRIgwGRhTH0YMRIXH6dmJFE5poPj1exJA+A0qfjEN2Max6Nfb8ts4ktBgC2wWV9z2c8yes/Greu1Zr29XsjVSK3uc3HHOoRoxdfJdcR+AbiBY186Vgm1X0jnurdMzeghsm9yBzAbkVmF1JUW7v0m1XoJq7+aloHp9BeX2FdTxGmow3g9VZdU/Xu2RdglqXInQ3SsXQK0DPFZi3Qc4KEfoPkQLlFYROgqrcy88TAxPTgnfrZFUqWN2ZGg903ZVaNZGpVEtRoHFGkRJ3TMdlTXrnsg2htyUpLyXNyAq9cBERGWLulwzGQx2wJTIKC7uiktHidLeypWUeA20BhEl8fqN5AEIHdNtq7bYheTIKdvsOpVK2VSoUS5NSEx71VNcS1Vof5Plba9SaYKfu+p0Vl+6dRV3n3JOI7UyiSebM/L62ZFlWvLSqwN7NTjSVtfw+vDJ796E4jQiMcchYuzSthb8KkMJ9M+QwFGbMrKAYwTNyCUMYxQRdpXJ4z3Xv0xFp57SBO6Y63J22yNDEWPLyAdLOKrmrPxOTNa9u0z5tH+VBfEi5STGaqFpGuqc6uJbQZ8ECcE8XMIA4SQQxzKeowRhDl8UTS8mt5hGEQJlPJ/ll/YVPGk4EVxomB3ZeSEwFQ73w5rZ4tQsQYt5gO+Ksz6lN5ABVpxdH8l5s0hINGU4oVdVDf8ZYv9C5gWpsmHZcurvtKwBhU1YZzqqgtZYiq15byjLHZY2YjatsyybAuBO0y3bxTQmfJcAsuVUbOslEJ2lDrY+sFGVwk4RRtt85Ym1j8Og3nqHvuOibXsPcq1pPey4RMHdB1vm6+4FruZO9Ezj9Zta662MnIRpAPUXQ8HXWG4ZDgMkPjl/zkUzs8utqzoYOabdMa0f3aOX/VVbe6p9rn2hvdBsrae91L7Xhtq5hht/NP5q/N34pzVr/dL6tfWbMn38aOXzmVa4Wn/+B5czF8c=</latexit> System ID Acoustic model Acoustic feature Excit. signal Wave. model Pitch mismatch MOS (mean) note chord Natural - - - - - - 4.04 Fluidsynth Sample-based MIDI-to-audio software 5.20 6.77 3.66 Pianoteq Physical-model MIDI-to-audio software 4.82 6.50 4.25 abs-mfb-sin - midi-fb sine NSF - - 3.87 abs-mfb-noi - midi-fb noise NSF - - 3.77 abs-mel-sin - mel-spc. sine NSF - - 2.72 abs-mel-noi - mel-spc. noise NSF - - 3.81 taco2-mfb-sin taco2 midi-fb sine NSF 4.61 6.34 2.97 taco2-mfb-noi taco2 midi-fb noise NSF 4.66 6.36 3.18 taco3-mfb-sin taco3 midi-fb sine NSF 4.78 6.48 3.19 taco3-mfb-noi taco3 midi-fb noise NSF 4.89 6.53 3.19 taco4-mfb-sin taco4 midi-fb sine NSF 4.86 6.39 2.98 taco4-mfb-noi taco4 midi-fb noise NSF 4.97 6.42 2.95 pfnet-mfb-sin PFNet midi-fb sine NSF 5.59 7.14 3.10 pfnet-mfb-noi PFNet midi-fb noise NSF 5.78 7.26 3.05 pfnet-mel-sin PFNet mel-spec. sine NSF 5.66 7.17 1.82 pfnet-mel-noi PFNet mel-spec. noise NSF 5.74 7.25 2.93 pfnet-spec-GL PFNet spec. - GL 5.43 6.98 1.62 midi-sin-nsf - - sine NSF 4.32 6.40 2.88 midi-noi-nsf - - noise NSF 4.40 6.08 2.63 MIDI-to-audio Wav Acoustic model NSF MIDI API Sine Wav Acoustic model NSF MIDI API Noise MIDI-to-audio Wav NSF MIDI API Sine Wav NSF MIDI API Noise
  • 29. 30 Appendix https://www.fluidsynth.org/ https://www.modartt.com/pianoteq <latexit sha1_base64="QvEzsRPQOsJO2uPs95Rt4gD1HG8=">AAAM5XiclVbdbuNEFPYuC9mGAFu45MaiYrUg2bId56d3i1i6RNotQaVppLqqxpNJYtX2BM9428jyI3CHkOCGV+EW3oC34cxMksY/SYVTy9PxOd985ztzzthfhAHjlvXvo8fvPXn/g8bTg+aHrY8+/uTZ4acjRtMEk3NMQ5qMfcRIGMTknAc8JONFQlDkh+TCv/lWvL94RxIW0PgnvlyQqwjN4mAaYMRh6vqw0fZ8MgvijCM/DVGSZ3h15U2P00WShqTpRWnIg4TeZk6efZ1nZ0vGSaQPXuX6c7380mNzmnDGEb7JvsE0ZTzAuufpEZ2QMP8/HlOCeJqQB32+u8MBN4UHC2YxeniRC/SOmHWcQM80ioUHzrNhwPFcjwIWIRg8hPn2hzOB+CIiKP4qzz2v6WGRlKxr9PKmXnM9lz/xB3dMOYEHBsCJWAnco2Ai1T8VKqAQZo0dt2tabsHFY0s2HozKcbl5FkL2ULQIiSF2zUR/O3g1MDg1UDoJqM7olN8ikFyw6piOJZ5ds9cTFNtmtytXEeA7sIfzJYO9FRpS2X3ortl3FHpHrJK5ptPJi0Egn8kYRIgwGRhTH0YMRIXH6dmJFE5poPj1exJA+A0qfjEN2Max6Nfb8ts4ktBgC2wWV9z2c8yes/Greu1Zr29XsjVSK3uc3HHOoRoxdfJdcR+AbiBY186Vgm1X0jnurdMzeghsm9yBzAbkVmF1JUW7v0m1XoJq7+aloHp9BeX2FdTxGmow3g9VZdU/Xu2RdglqXInQ3SsXQK0DPFZi3Qc4KEfoPkQLlFYROgqrcy88TAxPTgnfrZFUqWN2ZGg903ZVaNZGpVEtRoHFGkRJ3TMdlTXrnsg2htyUpLyXNyAq9cBERGWLulwzGQx2wJTIKC7uiktHidLeypWUeA20BhEl8fqN5AEIHdNtq7bYheTIKdvsOpVK2VSoUS5NSEx71VNcS1Vof5Plba9SaYKfu+p0Vl+6dRV3n3JOI7UyiSebM/L62ZFlWvLSqwN7NTjSVtfw+vDJ796E4jQiMcchYuzSthb8KkMJ9M+QwFGbMrKAYwTNyCUMYxQRdpXJ4z3Xv0xFp57SBO6Y63J22yNDEWPLyAdLOKrmrPxOTNa9u0z5tH+VBfEi5STGaqFpGuqc6uJbQZ8ECcE8XMIA4SQQxzKeowRhDl8UTS8mt5hGEQJlPJ/ll/YVPGk4EVxomB3ZeSEwFQ73w5rZ4tQsQYt5gO+Ksz6lN5ABVpxdH8l5s0hINGU4oVdVDf8ZYv9C5gWpsmHZcurvtKwBhU1YZzqqgtZYiq15byjLHZY2YjatsyybAuBO0y3bxTQmfJcAsuVUbOslEJ2lDrY+sFGVwk4RRtt85Ym1j8Og3nqHvuOibXsPcq1pPey4RMHdB1vm6+4FruZO9Ezj9Zta662MnIRpAPUXQ8HXWG4ZDgMkPjl/zkUzs8utqzoYOabdMa0f3aOX/VVbe6p9rn2hvdBsrae91L7Xhtq5hht/NP5q/N34pzVr/dL6tfWbMn38aOXzmVa4Wn/+B5czF8c=</latexit> System ID Acoustic model Acoustic feature Excit. signal Wave. model Pitch mismatch MOS (mean) note chord Natural - - - - - - 4.04 Fluidsynth Sample-based MIDI-to-audio software 5.20 6.77 3.66 Pianoteq Physical-model MIDI-to-audio software 4.82 6.50 4.25 abs-mfb-sin - midi-fb sine NSF - - 3.87 abs-mfb-noi - midi-fb noise NSF - - 3.77 abs-mel-sin - mel-spc. sine NSF - - 2.72 abs-mel-noi - mel-spc. noise NSF - - 3.81 taco2-mfb-sin taco2 midi-fb sine NSF 4.61 6.34 2.97 taco2-mfb-noi taco2 midi-fb noise NSF 4.66 6.36 3.18 taco3-mfb-sin taco3 midi-fb sine NSF 4.78 6.48 3.19 taco3-mfb-noi taco3 midi-fb noise NSF 4.89 6.53 3.19 taco4-mfb-sin taco4 midi-fb sine NSF 4.86 6.39 2.98 taco4-mfb-noi taco4 midi-fb noise NSF 4.97 6.42 2.95 pfnet-mfb-sin PFNet midi-fb sine NSF 5.59 7.14 3.10 pfnet-mfb-noi PFNet midi-fb noise NSF 5.78 7.26 3.05 pfnet-mel-sin PFNet mel-spec. sine NSF 5.66 7.17 1.82 pfnet-mel-noi PFNet mel-spec. noise NSF 5.74 7.25 2.93 pfnet-spec-GL PFNet spec. - GL 5.43 6.98 1.62 midi-sin-nsf - - sine NSF 4.32 6.40 2.88 midi-noi-nsf - - noise NSF 4.40 6.08 2.63 q Objective test § Lower the better
  • 30. 31 Appendix q Significance test § Grey block: significant difference § White block: otherwise Natural Fluidsynth Pianoteq abs-mfb-sin abs-mfb-noi abs-mel-sin abs-mel-noi taco2-mfb-sin taco2-mfb-noi taco3-mfb-sin taco3-mfb-noi taco4-mfb-sin taco4-mfb-noi pfnet-mfb-sin pfnet-mfb-noi pfnet-mel-sin pfnet-mel-noi pfnet-spec-GL midi-sin-nsf midi-noi-nsf Natural Fluidsynth Pianoteq abs-mfb-sin abs-mfb-noi abs-mel-sin abs-mel-noi taco2-mfb-sin taco2-mfb-noi taco3-mfb-sin taco3-mfb-noi taco4-mfb-sin taco4-mfb-noi pfnet-mfb-sin pfnet-mfb-noi pfnet-mel-sin pfnet-mel-noi pfnet-spec-GL midi-sin-nsf midi-noi-nsf
  • 31. 32 Appendix Caetano, Marcelo, and Xavier Rodet. "A source-filter model for musical instrument sound transformation." ICASSP. IEEE, 2012. Klapuri, Anssi, Tuomas Virtanen, and Toni Heittola. "Sound source separation in monaural music signals using excitation-filter model and em algorithm." ICASSP. IEEE, 2010. q On music audio and speech waveform
  • 32. 33 Appendix Caetano, Marcelo, and Xavier Rodet. "A source-filter model for musical instrument sound transformation." ICASSP. IEEE, 2012. Klapuri, Anssi, Tuomas Virtanen, and Toni Heittola. "Sound source separation in monaural music signals using excitation-filter model and em algorithm." ICASSP. IEEE, 2010. q On music audio and speech waveform
  • 33. 34 Appendix q On music audio and speech waveform