Queen Mary, University of London 
Master's Project 
Real-Time Vowel Synthesis: 
A Magnetic Resonator Piano Based Project 
Author: 
Vasileios Valavanis 
Supervisor: 
Andrew McPherson 
A thesis submitted in fulfilment of the requirements 
for the degree of Master of Science 
in the 
School of Electronic Engineering and Computer Science 
Queen Mary, University of London 
August 2014
"If a picture paints a thousand words, then a 
naked picture paints a thousand words without 
any vowels" 
Josh Stern
Abstract 
Speech synthesis has been an important field of research since the beginning of the 
digital signal processing era. Vowels make words intelligible and, as Josh Stern so 
eloquently put it, without them words are naked. This project explores the 
development of a real-time vowel synthesis system based on a medium that 
no conventional system uses. Dr McPherson's magnetic resonator piano was used 
to vibrate its strings in such a way that they generate vowels. This 
paper walks the reader through a thorough investigation of the properties of the 
human voice, the spectral analysis of the magnetic resonator piano's structure, and the 
implementation of this vowel synthesis system, which includes a software synthesiser 
developed by the author. Results, potential improvements and expansions are 
discussed. 
Acknowledgements 
I would like to thank my project supervisor, Dr Andrew McPherson, for giving 
me the opportunity to work on one of the most fascinating subjects in the field 
of audio synthesis and computer science, for allowing me to use the magnetic 
resonator piano and also for his consistent support and guidance throughout the 
project and for putting up with my constant emailing and pestering. 
Contents 
List of Figures v 
1 Introduction 1 
1.1 History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 
1.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 
1.3 Paper Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 
2 Literature Review 5 
2.1 The Magnetic Resonator Piano (MRP) . . . . . . . . . . . . . . . . 5 
2.1.1 MRP Signal Flow . . . . . . . . . . . . . . . . . . . . . . . . 6 
2.2 The Human Voice . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 
2.2.1 Anatomy of the Human Voice . . . . . . . . . . . . . . . . . 8 
2.2.2 Mechanics of the Human Voice . . . . . . . . . . . . . . . . 11 
2.3 Speech Synthesis Models . . . . . . . . . . . . . . . . . . . . . . . . 17 
3 Implementation 23 
3.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 
3.2 Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 
3.3 Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 
3.4 Spectral Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 
3.5 Plugin Parameters Description . . . . . . . . . . . . . . . . . . . . . 36 
3.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 
4 Conclusion 43 
4.1 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 
4.2 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 
4.3 Improvements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 
4.4 Summary and Final Thoughts . . . . . . . . . . . . . . . . . . . . . 48 
Bibliography 49 
List of Figures 
2.1 MRP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 
2.2 Key Sensors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 
2.3 U.R.S. Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 
2.4 Vocal Cords . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 
2.5 Glottal Pulses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 
2.6 Glottal Source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 
2.7 Formant filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 
2.8 Source-filter model . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 
2.9 Articulators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 
2.10 Concatenative synthesis . . . . . . . . . . . . . . . . . . . . . . . . 22 
3.1 PRAAT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 
3.2 Impulse Train . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 
3.3 Magnitude Response . . . . . . . . . . . . . . . . . . . . . . . . . . 30 
3.4 Vowel A Spectrum . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 
3.5 C4 Frequency Response . . . . . . . . . . . . . . . . . . . . . . . . 34 
3.6 G3 Frequency Response . . . . . . . . . . . . . . . . . . . . . . . . 34 
3.7 G3 Average Frequency Response . . . . . . . . . . . . . . . . . . . . 35 
3.8 Plugin Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 
3.9 1st result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 
3.10 2nd Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 
3.11 3rd Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 
3.12 4th Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 
3.13 5th Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 
3.14 6th Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 
Chapter 1 
Introduction 
1.1 History 
The artificial recreation of the human voice has been a subject of study since long before 
the digital age. Papers from the late 1700s exploring the generation of vowels 
and the synthesis of consonants indicate just how early 
scientists took an interest in the subject. [1] Over a century later, in the 1930s, 
Bell Labs created the "vocoder", which stripped speech down into its fundamental 
frequency and harmonics, and within the next decade Homer Dudley developed 
a keyboard-based voice synthesiser. [2] 
With the evolution of electronic and digital signal processing came more advanced 
systems. In 1961 Bell Labs once again created an electronic speech synthesis system 
using an IBM 704 computer, and in the early 1970s the TSI Speech+ was developed 
by Handheld electronics, a breakthrough in portable speech calculators 
for the blind. [2] One of the most brilliant minds of our era, Stephen Hawking, 
also uses a Speech+ series system to communicate, his severe medical 
condition having rendered him unable to speak. 
Nowadays, the majority of these technologies are computer based, but there is still 
significant need for mechanical implementations in the market. [2] 
1.2 Background 
Digital imitation of speech, both as a concept and as an area of study, has the 
potential to drive the advancement of a variety of industries. Some of 
the industries currently experimenting with this technology have seen significant 
growth since its successful development and have pushed its methodology towards 
new boundaries. Medical science, education, music, gaming platforms and many 
other fields have produced a substantial number of speech synthesis techniques. All 
of these techniques rely on an understanding of how the human voice works and on the correct 
use of the tools available. 
The functionality of an electronic or digital audio synthesiser is quite simple. It 
involves the generation of electric or digital signals which represent waveforms, 
and their conversion to audible signals through speakers. A few of the most popular 
sound synthesis methods are additive synthesis, subtractive synthesis and 
wavetable synthesis. [3] All of these methods describe the generation of non-audible 
signals before the stage of transmission; however, what most waveform synthesis 
systems fail to elaborate on is the actual medium of sound reproduction. Modern 
synthesisers are in a way restricted to transmitting their output through speakers. 
This project takes a different angle and explores vowel synthesis transmitted via 
piano strings. 
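To make the waveform-generation stage concrete, the following sketch builds a tone by additive synthesis, summing a handful of weighted sine harmonics. The sample rate, fundamental and harmonic weights are illustrative choices for this sketch, not values from the project itself:

```python
import math

SAMPLE_RATE = 44100  # samples per second (illustrative choice)

def additive_tone(f0, harmonic_amps, duration=1.0):
    """Additive synthesis: sum sine harmonics of f0, each weighted by
    the corresponding entry of harmonic_amps."""
    n_samples = int(SAMPLE_RATE * duration)
    samples = []
    for n in range(n_samples):
        t = n / SAMPLE_RATE
        s = sum(a * math.sin(2 * math.pi * (k + 1) * f0 * t)
                for k, a in enumerate(harmonic_amps))
        samples.append(s)
    return samples

# A 220 Hz tone with three harmonics of decaying amplitude.
tone = additive_tone(220.0, [1.0, 0.5, 0.25])
```

Subtractive synthesis would instead start from a harmonically rich waveform and filter it down, which is closer to how the vocal tract shapes the glottal source discussed later in this thesis.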
The quality of any artificial system depends on how closely it approximates the system it 
is modelling. In the search for a convincing man-made speech processor, the research 
on vowel generation through piano strings described in this paper has resulted in 
a new speech synthesis method. The question raised by the author is whether it 
is possible to develop a vowel synthesis system and transmit its output through 
piano strings so that intelligible vowels are generated. 
1.3 Paper Structure 
The remainder of this paper is structured as follows: 
• Chapter 2 contains a literature review covering the mechanics of the magnetic 
resonator piano developed by Dr Andrew McPherson, the physics of 
the human voice and existing speech synthesis methods. 
• Chapter 3 presents the proposed method and its implementation (including 
the design process) for the real-time vowel synthesis system, together 
with a description of its parameters. 
• Chapter 4 concludes the project with a discussion of the results in the 
context of what is examined in the literature review, an evaluation and 
discussion of future work, and finally a summary of the project along 
with the author's final thoughts.
Chapter 2 
Literature Review 
2.1 The Magnetic Resonator Piano (MRP) 
The magnetic resonator piano can be considered an acoustic instrument 
with electronic prosthetics. As a project in its own right, the MRP takes the traditional 
grand piano and extends its capabilities. Its success rests on the electromagnetic 
actuators installed above each piano string, creating an electromagnetic field strong 
enough to force the strings to vibrate. This vibration allows the indefinite sustain 
of any note whilst giving the performer control over amplitude, frequency and timbre 
in real time. What is remarkable about the MRP is that it is perfectly audible through 
its existing acoustic structure, without any amplification or use of loudspeakers. 
All 88 keys of the piano are usable, and normal performance of the piano 
remains intact despite the installed hardware. [4] 
Figure 2.1: Magnetic Resonator Piano Block Diagram [4] 
2.1.1 MRP Signal Flow 
When in operation, the MRP can be divided into three main processes that constitute 
its workflow. 
1. The first task of the system is to receive an audio signal from a computer, as 
seen in figure 2.1. The most recent version of the MRP, and the one used 
for this project, omits the sensing of string vibration by the pickup as 
well as the feedback mechanism formed by the band-pass filters and PLL. 
The input feeds the audio amplifiers that drive each actuator directly. 
[5] 
2. Triggering the actuators is the next stage of operation. It is essential to 
mention here that the computer uses a Max/MSP patch to guide outgoing 
signals towards the string of the user's choice. The detection of which key 
is active is made by a continuous key-sensing mechanism. A modified Moog 
Piano Bar is used for this specific task. The Piano Bar uses optical and 
interrupt sensors above each piano key to keep track of their motion. 
The actuators are triggered by a slow pressing movement of any key and 
remain active as long as the hammer does not touch the string and the key 
has not returned to its default position. [5] 
3. The last process of this machine is the actuation. The electromagnets above 
each string generate a field in which the piano strings vibrate. Driven by 
the audio amplifiers, the actuators force a vibration that is in phase with the 
actual audio input, a process that results in a better spectral presence of the 
output. [5] 
Figure 2.2: Key sensing [5] 
To sum up and simplify, the magnetic resonator piano receives any audio signal 
from a computer and transmits it as accurately as possible, keeping the string 
vibrations in harmony. Limitations and non-linearities are of course inevitable 
and will be discussed in later chapters. 
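The routing logic described above can be caricatured in a few lines of Python. All names and the threshold value below are hypothetical; the real system runs as a Max/MSP patch driving hardware amplifiers via the Piano Bar's continuous key sensing:

```python
def route_to_actuators(key_positions, audio_frame, threshold=0.1):
    """Send the current audio frame only to actuators whose key is
    depressed past `threshold` (continuous key sensing), scaling the
    signal by how far the key has travelled (hypothetical mapping)."""
    outputs = {}
    for key, position in key_positions.items():
        if position > threshold:  # key active: this string's actuator is energised
            outputs[key] = [s * position for s in audio_frame]
    return outputs

# MIDI-style keys: 60 (C4) pressed to half travel, 55 (G3) at rest.
frame = [0.5, -0.5, 0.25]
out = route_to_actuators({60: 0.5, 55: 0.0}, frame)
```

The point of the sketch is only the gating behaviour: audio reaches a string's actuator while its key is held past the activation threshold, and other strings receive nothing.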
2.2 The Human Voice 
2.2.1 Anatomy of the Human Voice 
Most of us are oblivious to how many parts of our body are required to complete 
tasks like generating voice, as is the case for most of our bodily 
functions. [6] An examination of the anatomy of the upper respiratory system reveals 
the complexity of such a process whilst clarifying notions essential to its 
correct physical modelling. 
Our body is such an efficient machine that it uses the same organs, muscles and 
bones for a vast variety of different functions. This paper will focus on how these 
body parts are related to the production of sound rather than on their general use. 
Figures 2.3 and 2.4 show where these key body parts are located. 
• The pharynx is part of the throat area, located just behind the oral 
and nasal cavities. It is used for producing sound and has the ability to 
split into two muscular tubes. [8] 
• The larynx is an organ that aids breathing and is essential to sound creation 
because of its ability to manipulate pitch and volume. [9]
Figure 2.3: Upper Respiratory System Diagram [7] 
• The trachea sends air into the lungs. It is open almost all the time and is 
located right below the larynx. [9] 
• The epiglottis is an elastic cartilage flap attached to the entrance of the 
larynx. It covers the larynx and works as a valve when we eat or drink. [10] 
• The oesophagus allows food to pass from the pharynx into the stomach. It 
is only open when we swallow or vomit. [8] 
• The vocal cords or folds make phonation possible. They are twin membranes 
of muscle and ligament covered in mucus, located where the 
pharynx splits between the trachea and the oesophagus, stretching horizontally 
across the larynx. They are no bigger than 22 mm and are open during
Figure 2.4: Vocal Cords [11] 
inhalation and closed when we hold our breath; they vibrate when we speak 
or sing. [10] 
• The glottis is the combination of the vocal folds and the space between 
the folds. [12] 
The aforementioned body parts together form the most sophisticated musical instrument 
in existence. To the disappointment of some readers, it is well established 
that the vocal cords are not a set of strings that vibrate like a guitar's or a piano's 
to produce sound. Their function is to allow, block or partially block air travelling 
from the lungs through the trachea. The anatomy of the vocal system unveils key 
activities of certain organs that make voice-processing research more accurate.
2.2.2 Mechanics of the Human Voice 
Viewing the human vocal system as a musical instrument has helped the author 
examine the mechanics of voice production in a technical way. This section 
will look into how the generation of vowels depends on the interplay between key 
body parts. 
The human respiratory system works like a string instrument and a wind instrument simultaneously. [8] 
This complex apparatus can be broken down into three major subsystems that explain 
its function thoroughly. The major active processes responsible for the generation 
of vowels are respiration, phonation and resonation. 
1. Respiration 
The first component of the voice instrument is the lungs. We can consider 
the lungs the source of the kinetic energy responsible for sound 
generation, since air is the medium in which sound propagates. When we 
inhale, air is stored temporarily in the lungs in order to oxygenate 
the blood. To initiate speech, air from the lungs is forced through the trachea 
and the other vocal mechanisms before it exits the body. While speaking, 
breathing becomes faster and inhalation occurs mostly through the mouth. [8] 
One control parameter that this component provides is the volume of the 
produced sound. The force with which we contract our lungs while we speak 
controls the pressure within the lungs. When the pressure in the lungs is 
either higher or lower than atmospheric pressure, air starts flowing. To 
exhale we simply use certain muscles to decrease the lungs' capacity, thus 
increasing air pressure, which results in expiration. The higher the velocity of 
the airflow through the vocal tract, the greater the amplitude of the sound 
we produce. [9] 
2. Phonation 
The next segment of the vocal system concerns the actual generation of 
sound. Phonation is the process of converting air into audible sound 
waves. As air travels through the trachea it inevitably meets the larynx 
and the vocal cords located at its base. The vocal cords work as a gate 
which regulates the airflow from and towards the oral and nasal cavities. 
The ability of this gate to remain partially open while interrupting the air 
coming from the lungs is what makes it function as a vibrator. [8] 
Certain aerodynamic and myoelastic phenomena drive the vibration process 
of the vocal cords. Under the pressure of the pulmonary airflow the 
vocal cords separate, whereas due to a combination of factors, including elasticity, 
laryngeal muscle tension and the Bernoulli effect, the vocal folds close 
rapidly. [13] As the top of the folds is opening, the bottom is in the process 
of closing, and as soon as the top is closed, the pressure build-up below the 
glottis begins to open the bottom. If the process is maintained by a steady 
supply of pressurised air, the vocal cords will continue to open and close 
in a quasi-periodic fashion. Each vibration allows a small amount of air to 
escape, producing an audible sound at the frequency of this movement; this 
process generates voice. [13]
Figure 2.5: Pulses created by the vocal cord vibrations [14] 
Figure 2.6: Glottal source spectrum [15]
The frequency of this vibration sets the fundamental frequency of the glottal 
source and determines the perceived pitch of the voice. [16] The resulting 
waveform of this process is a periodic pulsating signal with high energy 
at the fundamental frequency and its harmonics, and gradually decreasing 
amplitude across the spectrum, as shown in figure 2.6. [6] 
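A crude approximation of such a glottal source can be sketched as a sum of harmonics whose amplitudes decay with frequency. The -12 dB/octave roll-off below is an assumption typical of simple glottal-flow models, not a figure taken from this thesis, and the sample rate and pitch are illustrative:

```python
import math

def glottal_source(f0, sample_rate=44100, duration=0.02,
                   rolloff_db_per_octave=12.0):
    """Quasi-periodic source: harmonics of f0 up to the Nyquist limit,
    each attenuated relative to the fundamental by a fixed dB/octave
    slope (assumed value, for illustration only)."""
    n_samples = int(sample_rate * duration)
    nyquist = sample_rate / 2
    out = [0.0] * n_samples
    k = 1
    while k * f0 < nyquist:
        octaves_up = math.log2(k)  # distance of harmonic k above the fundamental
        amp = 10 ** (-rolloff_db_per_octave * octaves_up / 20)
        for n in range(n_samples):
            out[n] += amp * math.sin(2 * math.pi * k * f0 * n / sample_rate)
        k += 1
    return out

source = glottal_source(110.0)  # roughly a low male speaking pitch
```

The resulting waveform is pulse-like in the time domain while its spectrum shows strong energy at the fundamental and gradually weaker harmonics, matching the qualitative description of figure 2.6.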
Like all string instruments, the pitch of the sound generated by the folds 
depends on their mass, length and tension. 
f_n = nv / (2L),  where v = √(T/μ)    (2.1) 
(T is tension, L is string length and μ is mass per unit length.) 
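Equation 2.1 can be checked numerically. The tension, length and linear density below are illustrative values chosen for the sketch, not measurements from the MRP or from human vocal folds:

```python
import math

def string_mode_frequency(n, tension, length, mu):
    """Frequency of the n-th mode of an ideal string:
    f_n = (n / 2L) * sqrt(T / mu)  (equation 2.1)."""
    wave_speed = math.sqrt(tension / mu)  # v = sqrt(T / mu)
    return n * wave_speed / (2 * length)

# Illustrative values: 0.62 m string, 700 N tension, 5 g/m linear density.
f1 = string_mode_frequency(1, tension=700.0, length=0.62, mu=0.005)
```

As the formula predicts, shortening the string or raising the tension raises the pitch, and the n-th mode sits at n times the fundamental.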
However, being organic and airflow-dependent, the vocal folds distinguish 
themselves in terms of functionality. The length of the vocal cords varies 
between 17-22 mm for an adult male and 12-17 mm for an adult female. These 
numbers mathematically explain why women most commonly have a higher-pitched 
voice than men. The fundamental frequency range of the glottal 
source is commonly between 50-500 Hz for all human beings. [17] 
Controlling the pitch of our voice is one of the most important faculties we 
have for communicating with one another. While the pressure of the 
pulmonary airflow is one method of controlling voice pitch, the primary 
mechanism resides in the larynx. The muscles within the 
larynx give us control over the elasticity and tension of the vocal folds. 
By manipulating these characteristics we effectively adjust the fundamental 
frequency of the glottal source being generated. [16] 
3. Resonation 
In the final stage of the voice generation process, the glottal source is 
shaped into intelligible vowels and consonants making up words. The 
first step of this transformation occurs in the cavity of the pharynx, which 
communicates with the nasal cavity, the oral cavity and the larynx. The pulses of 
air that escape the vocal cords are diffused into all of these cavities, 
which play the role of the resonator of the whole vocal system. The moving 
parts of this resonator give us the ability to shape the waveforms transmitted. 
The lips, the tongue, the velum and essentially all of our facial muscles 
give us dynamic control over the real-time filtering of the glottal source. The 
resonator's task is to attenuate some frequencies of the band-limited pulse 
produced by the vocal folds while amplifying others. Despite the fact that 
without the vocal folds we could have no voice whatsoever, it is this section 
of the whole mechanism that makes our voice versatile, interesting and 
identifiable. Human DNA ensures that each person has 
different dimensions of the aforementioned cavities and muscles. Therefore 
the filtering of the glottal source that occurs in the resonation stage is as 
unique as one's fingerprints. [18] 
At this point it is important to mention that speech is made up of more 
than one type of sound. The previous paragraphs examined the generation 
of voiced sound, but speech contains unvoiced and plosive sounds as well.
Unvoiced sounds result when air passes through certain constrictions in the oral 
cavity, whereas plosive sounds are sudden bursts of air coming from either 
the abrupt movement of the vocal tract or the mouth. [17] 
To describe the resonation process in terms of digital signal processing, 
the notion of formants must be introduced. Formants have more 
than one definition across research areas. The most common definition, 
and the one used by the author, describes formants as the spectral peaks 
of a given sound spectrum. As the resonator amplifies and attenuates 
frequencies of the source, formants occur in its spectral envelope that are 
unique to each individual. [19] 
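In signal-processing terms, each formant can be approximated by a two-pole resonator applied to the source signal. The sketch below cascades two such resonators over an impulse train; the centre frequencies and bandwidths are illustrative values loosely in the range of an /a/-like vowel, not figures taken from this thesis:

```python
import math

def resonator_coeffs(freq, bandwidth, sample_rate=44100):
    """Two-pole resonator: y[n] = x[n] + b1*y[n-1] + b2*y[n-2],
    with pole radius set by the desired bandwidth."""
    r = math.exp(-math.pi * bandwidth / sample_rate)
    b1 = 2 * r * math.cos(2 * math.pi * freq / sample_rate)
    b2 = -r * r
    return b1, b2

def apply_resonator(signal, freq, bandwidth, sample_rate=44100):
    """Filter `signal` through one resonator, emphasising `freq`."""
    b1, b2 = resonator_coeffs(freq, bandwidth, sample_rate)
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = x + b1 * y1 + b2 * y2
        out.append(y)
        y1, y2 = y, y1
    return out

# 100 Hz impulse train (a crude glottal stand-in) shaped by two
# formant-like resonances at illustrative frequencies.
pulses = [1.0 if n % 441 == 0 else 0.0 for n in range(4410)]
shaped = pulses
for formant_freq, bw in [(700.0, 110.0), (1200.0, 120.0)]:
    shaped = apply_resonator(shaped, formant_freq, bw)
```

Each resonator boosts energy around its centre frequency and attenuates energy away from it, so the flat harmonic spectrum of the pulse train acquires peaks at the chosen formant positions, which is exactly the shaping that figure 2.7 illustrates.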
Figure 2.7: Graphic representation of formant creation [20]
Two to three formants are enough to represent a person's voice, despite the fact that our 
resonator produces more. [8] As shown at the bottom of figure 2.7, formants look 
like the result of band-pass filtering.

More Related Content

What's hot

What's hot (10)

A proposed taxonomy of software weapons
A proposed taxonomy of software weaponsA proposed taxonomy of software weapons
A proposed taxonomy of software weapons
 
Dissertation
DissertationDissertation
Dissertation
 
Lartc
LartcLartc
Lartc
 
Nw E040 Jp
Nw E040 JpNw E040 Jp
Nw E040 Jp
 
thesis
thesisthesis
thesis
 
071513 How to Prepare for Schedule Quantitative Risk Analysis
071513 How to Prepare for Schedule Quantitative Risk Analysis071513 How to Prepare for Schedule Quantitative Risk Analysis
071513 How to Prepare for Schedule Quantitative Risk Analysis
 
Deep Blue: Examining Cerenkov Radiation Through Non-traditional Media
Deep Blue: Examining Cerenkov Radiation Through Non-traditional MediaDeep Blue: Examining Cerenkov Radiation Through Non-traditional Media
Deep Blue: Examining Cerenkov Radiation Through Non-traditional Media
 
Jazz and blues theory piano 1
Jazz and blues theory piano 1Jazz and blues theory piano 1
Jazz and blues theory piano 1
 
My thesis
My thesisMy thesis
My thesis
 
thesis
thesisthesis
thesis
 

Similar to Real-Time Vowel Synthesis - A Magnetic Resonator Piano Based Project_by_Vasileios_Valavanis

M1 - Photoconductive Emitters
M1 - Photoconductive EmittersM1 - Photoconductive Emitters
M1 - Photoconductive EmittersThanh-Quy Nguyen
 
MIMO-OFDM communication systems_ channel estimation and wireless.pdf
MIMO-OFDM communication systems_  channel estimation and wireless.pdfMIMO-OFDM communication systems_  channel estimation and wireless.pdf
MIMO-OFDM communication systems_ channel estimation and wireless.pdfSamerSamerM
 
Parallel Interference Cancellation in beyond 3G multi-user and multi-antenna ...
Parallel Interference Cancellation in beyond 3G multi-user and multi-antenna ...Parallel Interference Cancellation in beyond 3G multi-user and multi-antenna ...
Parallel Interference Cancellation in beyond 3G multi-user and multi-antenna ...David Sabater Dinter
 
Thesis yossie
Thesis yossieThesis yossie
Thesis yossiedmolina87
 
Tesi ph d_andrea_barucci_small
Tesi ph d_andrea_barucci_smallTesi ph d_andrea_barucci_small
Tesi ph d_andrea_barucci_smallAndrea Barucci
 
Analysis and Classification of ECG Signal using Neural Network
Analysis and Classification of ECG Signal using Neural NetworkAnalysis and Classification of ECG Signal using Neural Network
Analysis and Classification of ECG Signal using Neural NetworkZHENG YAN LAM
 
Dissertation wonchae kim
Dissertation wonchae kimDissertation wonchae kim
Dissertation wonchae kimSudheer Babu
 
(2013)_Rigaud_-_PhD_Thesis_Models_of_Music_Signal_Informed_by_Physics
(2013)_Rigaud_-_PhD_Thesis_Models_of_Music_Signal_Informed_by_Physics(2013)_Rigaud_-_PhD_Thesis_Models_of_Music_Signal_Informed_by_Physics
(2013)_Rigaud_-_PhD_Thesis_Models_of_Music_Signal_Informed_by_PhysicsFrançois Rigaud
 
Design Project Report - Tyler Ryan
Design Project Report - Tyler RyanDesign Project Report - Tyler Ryan
Design Project Report - Tyler RyanTyler Ryan
 
Thesis Fabian Brull
Thesis Fabian BrullThesis Fabian Brull
Thesis Fabian BrullFabian Brull
 
Dafx (digital audio-effects)
Dafx (digital audio-effects)Dafx (digital audio-effects)
Dafx (digital audio-effects)shervin shokri
 
Automated antlr tree walker
Automated antlr tree walkerAutomated antlr tree walker
Automated antlr tree walkergeeksec80
 

Similar to Real-Time Vowel Synthesis - A Magnetic Resonator Piano Based Project_by_Vasileios_Valavanis (20)

repport christian el hajj
repport christian el hajjrepport christian el hajj
repport christian el hajj
 
M1 - Photoconductive Emitters
M1 - Photoconductive EmittersM1 - Photoconductive Emitters
M1 - Photoconductive Emitters
 
shifas_thesis
shifas_thesisshifas_thesis
shifas_thesis
 
MIMO-OFDM communication systems_ channel estimation and wireless.pdf
MIMO-OFDM communication systems_  channel estimation and wireless.pdfMIMO-OFDM communication systems_  channel estimation and wireless.pdf
MIMO-OFDM communication systems_ channel estimation and wireless.pdf
 
Parallel Interference Cancellation in beyond 3G multi-user and multi-antenna ...
Parallel Interference Cancellation in beyond 3G multi-user and multi-antenna ...Parallel Interference Cancellation in beyond 3G multi-user and multi-antenna ...
Parallel Interference Cancellation in beyond 3G multi-user and multi-antenna ...
 
Thesis yossie
Thesis yossieThesis yossie
Thesis yossie
 
Tesi ph d_andrea_barucci_small
Tesi ph d_andrea_barucci_smallTesi ph d_andrea_barucci_small
Tesi ph d_andrea_barucci_small
 
Analysis and Classification of ECG Signal using Neural Network
Analysis and Classification of ECG Signal using Neural NetworkAnalysis and Classification of ECG Signal using Neural Network
Analysis and Classification of ECG Signal using Neural Network
 
Dissertation wonchae kim
Dissertation wonchae kimDissertation wonchae kim
Dissertation wonchae kim
 
(2013)_Rigaud_-_PhD_Thesis_Models_of_Music_Signal_Informed_by_Physics
(2013)_Rigaud_-_PhD_Thesis_Models_of_Music_Signal_Informed_by_Physics(2013)_Rigaud_-_PhD_Thesis_Models_of_Music_Signal_Informed_by_Physics
(2013)_Rigaud_-_PhD_Thesis_Models_of_Music_Signal_Informed_by_Physics
 
Design Project Report - Tyler Ryan
Design Project Report - Tyler RyanDesign Project Report - Tyler Ryan
Design Project Report - Tyler Ryan
 
Master Thesis
Master ThesisMaster Thesis
Master Thesis
 
Diplomarbeit
DiplomarbeitDiplomarbeit
Diplomarbeit
 
Thesis Fabian Brull
Thesis Fabian BrullThesis Fabian Brull
Thesis Fabian Brull
 
Dafx (digital audio-effects)
Dafx (digital audio-effects)Dafx (digital audio-effects)
Dafx (digital audio-effects)
 
Automated antlr tree walker
Automated antlr tree walkerAutomated antlr tree walker
Automated antlr tree walker
 
Thesis-DelgerLhamsuren
Thesis-DelgerLhamsurenThesis-DelgerLhamsuren
Thesis-DelgerLhamsuren
 
Dissertation A. Sklavos
Dissertation A. SklavosDissertation A. Sklavos
Dissertation A. Sklavos
 
論文
論文論文
論文
 
spurgeon_thesis_final
spurgeon_thesis_finalspurgeon_thesis_final
spurgeon_thesis_final
 

Real-Time Vowel Synthesis - A Magnetic Resonator Piano Based Project_by_Vasileios_Valavanis

  • 1. Queen Mary, University of London Master's Project Real-Time Vowel Synthesis: A Magnetic Resonator Piano Based Project Author: Vasileios Valavanis Supervisor: Andrew McPherson A thesis submitted in ful
  • 2. lment of the requirements for the degree of Master of Science in the School of Electronic Engineering and Computer Science Queen Mary, University of London August 2014
  • 3. If a picture paints a thousand words, then a naked picture paints a thousand words without any vowels" Josh Stern
  • 4. Abstract Speech synthesis has been an important
  • 5. eld of research since the beginning of the digital signal processing era. Vowels make words intelligible and as Josh Stern so eloquently quoted, without them words are naked. This project aims to explore the development of a real time vowel synthesis system based on a medium that no conventional systems use. Dr McPherson's magnetic resonator piano was used in order to vibrate its strings in such way so that they generate vowels. This paper walks the reader through the thorough investigation on the properties of the human voice, the spectral analysis the magnetic resonator piano's structure and the implementation of this vowel synthesis system that includes a software synthesiser developed by the author. Results, potential improvements and expansions are discussed. ii
  • 6. Acknowledgements I would like to thank my project supervisor, Dr Andrew McPherson, for giving me the opportunity to work on one of the most fascinating subjects in the
  • 7. eld of audio synthesis and computer science, for allowing me to use the magnetic resonator piano and also for his consistent support and guidance throughout the project and for putting up with my constant emailing and pestering. iii
1.3 Paper Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Literature Review 5
2.1 The Magnetic Resonator Piano (MRP) . . . . . . . . . . . . . . . . 5
2.1.1 MRP Signal Flow . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 The Human Voice . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.1 Anatomy of the Human Voice . . . . . . . . . . . . . . . . . 8
2.2.2 Mechanics of the Human Voice . . . . . . . . . . . . . . . . 11
2.3 Speech Synthesis Models . . . . . . . . . . . . . . . . . . . . . . . 17

3 Implementation 23
3.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2 Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3 Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.4 Spectral Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.5 Plugin Parameters Description . . . . . . . . . . . . . . . . . . . . 36
3.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

4 Conclusion 43
4.1 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.3 Improvements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.4 Summary and Final Thoughts . . . . . . . . . . . . . . . . . . . . 48

Bibliography 49

iv

List of Figures

2.1 MRP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Key Sensors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 U.R.S. Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4 Vocal Cords . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.5 Glottal Pulses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.6 Glottal Source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.7 Formant filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.8 Source-filter model . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.9 Articulators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.10 Concatenative synthesis . . . . . . . . . . . . . . . . . . . . . . . 22
3.1 PRAAT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Impulse Train . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3 Magnitude Response . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.4 Vowel A Spectrum . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.5 C4 Frequency Response . . . . . . . . . . . . . . . . . . . . . . . 34
3.6 G3 Frequency Response . . . . . . . . . . . . . . . . . . . . . . . 34
3.7 G3 Average Frequency Response . . . . . . . . . . . . . . . . . . . 35
3.8 Plugin Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.9 1st Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.10 2nd Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.11 3rd Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.12 4th Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.13 5th Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.14 6th Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

v
Chapter 1

Introduction

1.1 History

The artificial recreation of the human voice has been a subject of study since long before the digital age. Papers from the late 1700s exploring the generation of vowels and the synthesis of consonants indicate just how early scientists took an interest in the subject. [1]

More than a century later, in the 1930s, Bell Labs created the "vocoder", which stripped speech down into its fundamental frequency and harmonics, and within the next decade Homer Dudley developed a keyboard-based voice synthesiser. [2] With the evolution of electronic and digital signal processing came more advanced systems. In 1961 Bell Labs once more created an electronic speech synthesis system, this time using an IBM 704 computer, and in the early 1970s the TSI Speech+ handheld device was developed, a breakthrough in portable speech calculators for the blind. [2] One of the most brilliant minds of our time, Stephen Hawking, also uses a Speech+ series system to communicate, his severe medical condition having rendered him unable to speak. Nowadays the majority of these technologies are computer based, but there is still significant demand for mechanical implementations in the market. [2]

1.2 Background

Digital imitation of speech, both as a concept and as an area of study, has the potential to drive the advancement of a variety of industries. Some of the industries experimenting with this technology have seen significant growth since its successful development and have pushed its methodology towards new boundaries: medical science, education, music, gaming platforms and many other fields have produced a substantial number of speech synthesis techniques. All of these techniques rely on an understanding of how the human voice works and on the correct use of the tools available.

The functionality of an electronic or digital audio synthesiser is quite simple: it generates electric or digital signals representing waveforms and converts them to audible sound through speakers. A few of the most popular sound synthesis methods are additive synthesis, subtractive synthesis and wavetable synthesis. [3] All of these methods, including the ones not mentioned, describe the generation of non-audible signals prior to transmission; what most waveform synthesis systems fail to elaborate on, however, is the actual medium of sound reproduction. Modern synthesisers are in a way restricted to transmitting their output through speakers. This project takes a different angle and explores vowel synthesis transmitted via piano strings.

The quality of any artificial system depends on its approximation to the system it is modelling. In the search for a perfect man-made speech processor, the research on vowel generation through piano strings described in this paper has resulted in a new speech synthesis method. The question raised by the author is whether it is possible to develop a vowel synthesis system and transmit its output through piano strings so that intelligible vowels are generated.

1.3 Paper Structure

The remainder of this paper is structured as follows:

• Chapter 2 contains a literature review covering the mechanics of the magnetic resonator piano developed by Dr Andrew McPherson, the physics of the human voice and existing speech synthesis methods.

• Chapter 3 presents the proposed method and its implementation (including the design process) of the real-time vowel synthesis system, together with a description of its parameters.

• Chapter 4 concludes the project with a discussion of the results in the context of the literature review, an evaluation, a discussion of future work, and finally a summary of the project along with the author's final thoughts.
Chapter 2

Literature Review

2.1 The Magnetic Resonator Piano (MRP)

The magnetic resonator piano can be thought of as an acoustic instrument with electronic prosthetics. As a project in its own right, the MRP takes the traditional grand piano and extends its capabilities. Its success rests on the electromagnetic actuators attached above each piano string, which create an electromagnetic field strong enough to force the strings to vibrate. This vibration allows the indefinite sustain of any note while giving the performer real-time control over amplitude, frequency and timbre. What is remarkable about the MRP is that it is perfectly audible through its existing acoustic structure, without any amplification or use of loudspeakers. All 88 keys of the piano are usable, and normal piano performance remains intact despite the installed hardware. [4]
Figure 2.1: Magnetic Resonator Piano Block Diagram [4]

2.1.1 MRP Signal Flow

In operation, the MRP can be divided into three main processes that make up its workflow.

1. The first task of the system is to receive an audio signal from a computer, as seen in figure 2.1. The most recent version of the MRP, and the one used for this project, omits the sensing of string vibration by the pickup, as well as the feedback mechanism formed by the band-pass filters and PLL; the input feeds the audio amplifiers that drive each actuator directly. [5]

2. Triggering the actuators is the next stage of operation. It is essential to mention here that the computer uses a Max/MSP patch to route outgoing signals towards the string of the user's choice. Detecting which key is active is the job of a continuous key-sensing mechanism, performed by a modified Moog Piano Bar. The Piano Bar uses optical interruption sensors above each piano key to track its motion. The actuators are triggered by a slow pressing movement of any key and remain active as long as the hammer does not touch the string and the key has not returned to its resting position. [5]

3. The last process of the machine is the actuation itself. The electromagnets above each string generate a field in which the piano strings vibrate. Driven by the audio amplifiers, the actuators force a vibration that is in phase with the audio input, which results in a better spectral presence of the output. [5]

Figure 2.2: Key sensing [5]

To sum up and simplify: the magnetic resonator piano receives an audio signal from a computer and transmits it as accurately as possible, keeping the string vibrations in harmony. Limitations and non-linearities are of course inevitable and will be discussed in later chapters.

2.2 The Human Voice

2.2.1 Anatomy of the Human Voice

Most of us are oblivious to how many parts of our body are involved in tasks such as generating voice, as is the case for most of our bodily functions. [6] An examination of the anatomy of the upper respiratory system reveals the complexity of the process while clarifying notions essential to its correct physical modelling. Our body is such an efficient machine that it uses the same organs, muscles and bones for a vast variety of different functions; this paper will focus on how these body parts relate to the production of sound rather than on their general use. Figures 2.3 and 2.4 show where these key body parts are located.

• The pharynx is part of the throat located just behind the oral and nasal cavities. It is used for producing sound and has the ability to split into two muscular tubes. [8]

• The larynx is an organ that assists breathing and is essential to sound creation because of its ability to manipulate pitch and volume. [9]
Figure 2.3: Upper Respiratory System Diagram [7]

• The trachea carries air into the lungs. It is open almost all the time and is located directly below the larynx. [9]

• The epiglottis is an elastic cartilage flap attached to the entrance of the larynx. It covers the larynx and works as a valve when we eat or drink. [10]

• The oesophagus allows food to pass from the pharynx into the stomach. It is only open when we swallow or vomit. [8]

• The vocal cords, or folds, make phonation possible. They are twin membranes of muscle and ligament covered in mucus, located where the pharynx splits between the trachea and the oesophagus, stretching horizontally across the larynx. They are no bigger than 22 mm and are open during inhalation, closed when we hold our breath, and vibrate when we speak or sing. [10]

Figure 2.4: Vocal Cords [11]

• The glottis is the combination of the vocal folds and the space between them. [12]

Together, these body parts form the most sophisticated musical instrument in existence. To the disappointment of some readers, the vocal cords are not a set of strings vibrating like those of a guitar or a piano; their function is to allow, block or partially block air travelling from the lungs through the trachea. The anatomy of the vocal system thus reveals key activities of certain organs that make voice-processing research more accurate.
2.2.2 Mechanics of the Human Voice

Viewing the human vocal system as a musical instrument has helped the author examine the mechanics behind voice production in a technical way. This section looks at how the generation of vowels depends on the interplay between key body parts. The human respiratory system works like a string instrument and a wind instrument simultaneously. [8] This complex apparatus breaks down into three major sub-systems that explain its function thoroughly. The major active processes responsible for the generation of vowels are respiration, phonation and resonation.

1. Respiration

The first component of the voice instrument is the lungs. We can consider the lungs the source of the kinetic energy responsible for sound generation, since air is the medium in which sound propagates. When we inhale, air is stored temporarily in the lungs in order to oxygenate the blood. To initiate speech, air from the lungs is forced through the trachea and the other vocal mechanisms before it exits the body. While speaking, breathing becomes faster and inhalation occurs mostly through the mouth. [8]

One control parameter this component provides is the volume of the produced sound. The force with which we contract our lungs while speaking controls the pressure within them; when lung pressure is either higher or lower than atmospheric pressure, air starts flowing. To exhale, we simply use certain muscles to decrease the lungs' capacity and thus increase air pressure, which results in expiration. The higher the velocity of the airflow through the vocal tract, the greater the amplitude of the sound we produce. [9]

2. Phonation

The next segment of the vocal system concerns the actual generation of sound. Phonation is the conversion of airflow into audible sound waves. As air travels through the trachea it inevitably meets the larynx and the vocal cords located at its base. The vocal cords work as a gate regulating the airflow to and from the oral and nasal cavities; the ability of this gate to remain partially open while interrupting the air coming from the lungs is what makes it function as a vibrator. [8]

Certain aerodynamic and myoelastic phenomena drive the vibration of the vocal cords. Under the pressure of the pulmonary airflow the vocal cords separate, whereas a combination of factors, including elasticity, laryngeal muscle tension and the Bernoulli effect, closes them again rapidly. [13] As the top of the folds is opening, the bottom is in the process of closing, and as soon as the top is closed, the pressure build-up below the glottis begins to open the bottom. If the process is maintained by a steady supply of pressurised air, the vocal cords continue to open and close in a quasi-periodic fashion. Each vibration allows a small amount of air to escape, producing an audible sound at the frequency of this movement; this process generates voice. [13]
Figure 2.5: Pulses created by the vocal cord vibrations [14]

Figure 2.6: Glottal source spectrum [15]
The frequency of this vibration sets the fundamental frequency of the glottal source and determines the perceived pitch of the voice. [16] The resulting waveform is a periodic pulsating signal with high energy at the fundamental frequency and its harmonics, and gradually decreasing amplitude across the spectrum, as shown in figure 2.6. [6]

Like all string instruments, the pitch of the sound generated by the folds depends on their mass, length and tension:

$$ f_n = \frac{n v}{2L}, \qquad v = \sqrt{\frac{T}{\mu}} \qquad (2.1) $$

($T$ is tension, $L$ is string length and $\mu$ is mass per unit length.)

However, being organic and airflow-dependent, the vocal folds distinguish themselves in terms of functionality. The length of the vocal cords varies between 17 and 22 mm for an adult male and between 12 and 17 mm for an adult female; these numbers mathematically explain why women most commonly have higher-pitched voices than men. The fundamental frequency range of the glottal source is commonly between 50 and 500 Hz for all human beings. [17]

Controlling the pitch of our voice is one of the most important abilities we have for communicating with one another. While the pressure of the pulmonary airflow is one method of controlling voice pitch, the primary mechanism resides in the larynx: the muscles within the larynx give us control over the elasticity and the tension of the vocal folds.
By manipulating these characteristics we effectively adjust the fundamental frequency of the glottal source being generated. [16]

3. Resonation

In the final stage of the voice generation process, the glottal source is shaped into the intelligible vowels and consonants that make up words. The first step of this transformation occurs in the cavity of the pharynx, which communicates with the nasal cavity, the oral cavity and the larynx. The pulses of air that escape the vocal cords are diffused into all of these cavities, which together play the role of the resonator of the vocal system. The moving parts of this resonator give us the ability to shape the transmitted waveforms: the lips, the tongue, the velum and essentially all of our facial muscles give us dynamic control over the real-time filtering of the glottal source. The resonator's task is to attenuate some frequencies of the band-limited pulse produced by the vocal folds while amplifying others. Although without the vocal folds we could have no voice whatsoever, it is this part of the mechanism that makes our voice versatile, interesting and identifiable. Human DNA ensures that each person has different dimensions of the aforementioned cavities and muscles; the filtering of the glottal source that occurs in the resonation stage is therefore as unique as one's fingerprints. [18]

At this point it is important to mention that speech is made up of more than one type of sound. The previous section examined the generation of voiced sound, but speech contains unvoiced and plosive sounds as well.
Unvoiced sounds result when air passes through certain blockages in the oral cavity, whereas plosive sounds are sudden bursts of air coming from either an abrupt movement of the vocal tract or of the mouth. [17]

To describe the resonation process in terms of digital signal processing, the notion of formants must be introduced. Formants have more than one definition across research areas; the most common one, and the one used by the author, describes formants as the spectral peaks of a given sound spectrum. As the resonator amplifies and attenuates frequencies of the source, formants occur in its spectral envelope that are unique to each individual. [19]

Figure 2.7: Graphic representation of formant creation [20]

Two to three formants are enough to represent a person's voice, despite the fact that our resonator produces more. [8] As shown at the bottom of figure 2.7, formants look like the result of band-pass filters applied to the source. This parallel is not far from the truth, as formants are characterised by the same features as band-pass filters: centre frequency, gain and bandwidth.

2.3 Speech Synthesis Models

Traditionally, when we talk about speech synthesisers we refer to TTS (text-to-speech) systems, thanks to their intuitive input method and vast popularity. Such systems can be broken down into multiple components, one of the most important being the waveform generator. Based on the model used for sound generation, speech synthesis systems can be classified into three types: formant synthesis, articulatory synthesis and concatenative synthesis. They can also be divided into two sub-categories according to the extent of human intervention in the creation and execution process: synthesis by rule uses a collection of supervised rules to perform synthesis, while data-driven synthesis derives its parameters from actual speech data. [21]

1. Formant Synthesis

Formant synthesis uses a source-
filter model to generate intelligible sounds. The source-filter model can be characterised as a simplified version of the real-life voice generation process: the generation of a quasi-periodic pulsating signal (the glottal source) and its filtering by multiple variable band-pass filters with the appropriate formant parameters, so that intelligible vowels are produced. [21]

Figure 2.8: Source-filter model [20]

Modelling multiple formant resonances in the digital domain requires the implementation of 2nd-order IIR (infinite impulse response) filters. Equation 2.2 shows the transfer function of such a filter; [21] the derivation of a filter's frequency response from its transfer function will be examined in chapter 3.

$$ H_i(z) = \frac{1}{1 - 2 e^{-\pi b_i} \cos(2\pi f_i)\, z^{-1} + e^{-2\pi b_i}\, z^{-2}} \qquad (2.2) $$

(2nd-order IIR all-pole filter transfer function with $f_i = F_i/F_s$ and $b_i = B_i/F_s$, where $F_i$, $B_i$ and $F_s$ are the formant's centre frequency, bandwidth and sampling frequency, respectively.)

The choice of IIR
filters is not random. IIR filters, compared with FIR (finite impulse response) filters, are a lot more computationally efficient: the small number of filter coefficients ($a_i$, $b_i$) they require makes them faster and less memory-consuming. On the other hand, FIR filters are always stable, whereas IIR filters can have poles outside the unit circle, which renders them unstable. [21]

There are two ways to combine a number of IIR
filters together: in a cascaded array, or in parallel. The parallel method is considerably more complex and is mainly used for the production of fricative sounds; the cascaded method is ideal for vowel sounds and much easier to implement. One other very important difference between the two is that the cascaded technique results in an all-pole filter, whereas the parallel method results in a filter that has zeros in addition to poles. Poles and zeros disclose the frequency response characteristics of a filter and are often used as the basis for digital filter design. [21]

$$ H_1(z) = \sum_{k=0}^{M} b_k z^{-k} \qquad (2.3) $$

$$ H_2(z) = \frac{1}{1 - \sum_{k=1}^{N} a_k z^{-k}} \qquad (2.4) $$

(2.3 is the transfer function of an all-zero filter and 2.4 of an all-pole filter.)

In reality, voice signals are not stationary. Formant synthesis by rule takes into account the physical limitations of the vocal tract so that the change between formants does not occur abruptly, while giving the user the ability to manipulate pitch and formant sweeps in real time; however, it leaves out important reflections and nuances that would make the output sound realistic. [21]

2. Articulatory Synthesis

Articulatory synthesis is a lot closer to formant synthesis in terms of synthesising voice by rule. It models the motion of our articulators and the resulting distribution of waveforms in the lungs, the oral and nasal cavities and the larynx. This model drives a formant synthesiser and uses 5 articulatory parameters: area of lip opening, constriction formed by the tongue blade, opening of the nasal cavities, average glottal area, and rate of active expansion/contraction of the vocal tract tube behind a blockage. [21]

Figure 2.9: A list of the human articulators [22]
The nature of the human speech articulators does not allow them to perform large movements; precisely because they are so restricted, however, they are easier to model. The 5 articulatory parameters are interlinked with the fundamental frequency and the first 4 formant frequencies. Though this model may be the most promising in terms of speech quality, the methods for collecting the aforementioned area parameters are not very advanced, making articulatory synthesis the least accurate of the three speech synthesis models. [21]

3. Concatenative Synthesis

Concatenative synthesis attempts to imitate speech in a way that captures all of the small details and secondary reflections, in order to sound as realistic as possible. The principle behind this model is the concatenation of several speech excerpts from recordings, so that a natural sequence of speech is formed. Unlike synthesis by rule, this data-driven model does not require any manual adjustment, and since the selected segments are real speech, the output is expected to be of high quality. [21]

In reality, the concatenated segments often differ in spectral and prosodic continuity. If the formants of one segment are not exactly the same as those of its neighbour, or the perceived pitch differs from one clip to another, discontinuities occur at the point of concatenation. The speech excerpts may be perfectly normal; it is their sequence that sounds unnatural. Under ideal conditions the concatenative model produces the most natural output, but its design has to address many issues to avoid discontinuities: the more issues solved during the design process, the better the outcome and, naturally, the more complicated the system. In addition, high-quality data-driven synthesis models require large amounts of stored data, are computationally expensive and consume far more memory than rule-based models. [21]

Figure 2.10: A simple concatenative synthesis diagram [23]
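Before moving on to the implementation, the source-filter idea reviewed above can be made concrete. The sketch below implements the all-pole formant resonator of equation 2.2 and cascades several of them, as formant synthesis by rule does; it is a minimal illustration written for this summary, not code from the thesis plugin.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// One all-pole formant resonator (equation 2.2):
//   H(z) = 1 / (1 - 2 e^{-pi b} cos(2 pi f) z^-1 + e^{-2 pi b} z^-2)
// with f = Fc/Fs and b = B/Fs.
class FormantResonator
{
public:
    FormantResonator(double centreHz, double bandwidthHz, double sampleRateHz)
    {
        const double kPi = 3.14159265358979323846;
        const double f = centreHz / sampleRateHz;
        const double b = bandwidthHz / sampleRateHz;
        a1 = 2.0 * std::exp(-kPi * b) * std::cos(2.0 * kPi * f);
        a2 = std::exp(-2.0 * kPi * b);
    }

    // Direct-form difference equation: y[n] = x[n] + a1*y[n-1] - a2*y[n-2]
    double process(double x)
    {
        const double y = x + a1 * y1 - a2 * y2;
        y2 = y1;
        y1 = y;
        return y;
    }

private:
    double a1 = 0.0, a2 = 0.0;  // feedback coefficients
    double y1 = 0.0, y2 = 0.0;  // previous two outputs
};

// Cascading one resonator per formant yields the all-pole vowel filter.
double processCascade(std::vector<FormantResonator>& formants, double x)
{
    for (auto& f : formants)
        x = f.process(x);
    return x;
}
```

Because the bandwidth term keeps both poles strictly inside the unit circle, each resonator rings at its centre frequency and decays, which is exactly the behaviour the cascaded vowel filter of chapter 3 relies on.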
Chapter 3

Implementation

The task of the project described in this paper is to build a real-time, computer-based system that generates vowels in such a way that they can be transmitted intelligibly by Dr McPherson's magnetic resonator piano. The author has created a user-friendly digital audio synthesiser, in the form of an AU plugin, that plays the role of the input. Moreover, the spectral behaviour of the piano has been analysed and incorporated into the system so that clearer results are achieved. The scientific process behind this attempt is examined in the remainder of this chapter.

3.1 Methodology

The method proposed for the successful production of this system is the creation of a vowel synthesis AU plugin based on the formant-synthesis-by-rule model described in chapter 2. Two criteria were considered before making this decision: the computational requirements of the plugin, and the ability to modify certain parameters in real time. The formant synthesis model offers low computational cost, full real-time adjustment capabilities and a decent-quality output.

The implementation of the real-time vowel synthesis system consists of two primary phases: programming, and spectral analysis of the piano. The JUCE framework was used for the programming stage, which was carried out in C++ using Xcode 5. For the spectral analysis of the piano, two DPA 2006A microphones and a TASCAM US-122MKII audio interface were used to record the samples, and finally MATLAB was used for the analysis of the audio samples.

3.2 Preparation

The correct programming of the plugin required some pre-calculated data. Specifically, the formant centre frequency, bandwidth and gain values needed for the
filtering of the model were captured using PRAAT. PRAAT is an open-source speech feature extraction program that offers the option to detect voice formants from a single recording. [24] The first 4 formants of each vowel were analysed from recordings of the author's voice. As shown in figure 3.1, the red dots represent the formants of the vowels; the Y axis of the spectrogram is frequency in Hz and the X axis is time in seconds. PRAAT provides comprehensive formant data across time in segments of 5 ms.
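One way to lay out the per-vowel lookup tables that these PRAAT measurements feed is shown below. The formant numbers here are generic textbook-style averages for an adult male voice, inserted purely as placeholders, and the gain and Q figures are likewise invented for illustration; the plugin itself uses the values extracted from the author's recordings.

```cpp
#include <cassert>

// One formant entry: centre frequency (Hz), gain (dB) and Q (bandwidth),
// matching the three values stored per filter in the plugin's tables.
struct Formant { double centreHz; double gainDb; double q; };

// Four formants per vowel, five vowels: [a], [e], [i], [o], [ou].
struct VowelTable { const char* name; Formant f[4]; };

static const VowelTable kVowels[5] = {
    { "a",  { {730, 0, 10}, {1090, -7, 10}, {2440, -9, 10}, {3400, -12, 10} } },
    { "e",  { {530, 0, 10}, {1840, -7, 10}, {2480, -9, 10}, {3500, -12, 10} } },
    { "i",  { {270, 0, 10}, {2290, -7, 10}, {3010, -9, 10}, {3600, -12, 10} } },
    { "o",  { {570, 0, 10}, { 840, -7, 10}, {2410, -9, 10}, {3400, -12, 10} } },
    { "ou", { {300, 0, 10}, { 870, -7, 10}, {2240, -9, 10}, {3400, -12, 10} } },
};
```

Switching vowels in real time is then just a matter of choosing which of the five rows feeds the coefficient calculation described in section 3.3.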
Figure 3.1: PRAAT spectrogram showing the first 4 formants of 3 vowels [24]

3.3 Programming

The most comprehensible way to describe the creation of a source/filter audio plugin is to divide it into two main processes: the generation of the source, and the filtering of it.

1. Source

In chapter 2 we thoroughly investigated the nature of the glottal source and concluded that it is a periodic pulsating signal rich in harmonics, with amplitude gradually decreasing from the fundamental frequency upwards across the spectrum. Its digital representation is a band-limited pulse, which is essentially a series of harmonics, or sinusoids:

$$ x(t) = \sum_{k=1}^{N} \sin(2\pi k f_0 t) \qquad (3.1) $$

with the number of harmonics given by

$$ N = \frac{f_s}{2 f_0} \qquad (3.2) $$

so that aliasing is avoided.

A C++ oscillator class taken from Will Pirkle's book Designing Audio Plug-ins in C++ was used to generate the multiple sine waves. This clever approach uses a 1024-sample buffer to store sine wave values, which are constantly updated. To handle fractional read positions within the buffer, linear interpolation using weighted sums is applied: [25]

$$ y = \frac{x - x_1}{x_2 - x_1}\, y_2 + \left(1 - \frac{x - x_1}{x_2 - x_1}\right) y_1 \qquad (3.3) $$

for any $y$ between data points $(x_1, y_1)$ and $(x_2, y_2)$. This procedure ensures phase coherence across a large number of oscillator instances created simultaneously.

In practice, 34 instances of the oscillator class were created, the first representing the fundamental frequency and the remaining 33 its harmonics, creating a band-limited pulse of steady magnitude across its spectrum, as shown in figure 3.2.

2. Filtering
Figure 3.2: Impulse train in time and frequency domain

The filtering stage was designed so as to extend the traditional methods. The formant filter values obtained for 5 vowels (phonetically [a], [e], [i], [o], [ou]) were inserted into 5 lookup tables; each lookup table contains centre frequency (in Hz), gain (in dB) and Q (bandwidth) values for 4 filters. To adequately describe the operation of the plugin, this section is divided into the two main functions of the code.

• Calculation of coefficients: The aforementioned formant values are used to derive the necessary coefficients for the 20 filters in total. Given:
$$ G = 10^{g/20}, \qquad F = \frac{2\pi f_c}{f_s} $$

for $g \ge 1$:

$$ H = \frac{F}{2Q} $$

and for $g < 1$:

$$ H = \frac{F}{2 g Q} $$

then the filter coefficients $a_i$, $b_i$ are given by:

$$ a_2 = \frac{0.5\,(1 - H)}{1 + H} $$
$$ a_1 = -(0.5 + a_2)\cos(F) $$
$$ b_0 = (G - 1)(0.25 + 0.5\,a_2) + 0.5 $$
$$ b_1 = a_1 $$
$$ b_2 = -(G - 1)(0.25 + 0.5\,a_2) - a_2 $$

($f_c$ is centre frequency, $g$ is gain and $Q$ is the bandwidth of each
filter.)

• Calculation of frequency response: Normally, a source/filter model would use the coefficients to design filters that shape the generated band-limited pulse; this plugin, however, incorporates something beyond just filtering. In a separate function, each filter's frequency response is derived from its transfer function. Starting from the 2nd-order filter transfer function:

$$ H(z) = \frac{b_0 + b_1 z^{-1} + b_2 z^{-2}}{a_0 + a_1 z^{-1} + a_2 z^{-2}} $$
we calculate the frequency response by evaluating the transfer function on the unit circle, substituting $z = e^{j\omega}$ (so that $z^{-1} = \cos\omega - j\sin\omega$), and then rationalising with the complex conjugate of the denominator:

$$ H(\omega) = \frac{b_0 + b_1\cos(\omega) - j b_1\sin(\omega) + b_2\cos(2\omega) - j b_2\sin(2\omega)}{a_0 + a_1\cos(\omega) - j a_1\sin(\omega) + a_2\cos(2\omega) - j a_2\sin(2\omega)} $$

$$ H(\omega) = \frac{[b_0 + b_1\cos(\omega) + b_2\cos(2\omega)] - j\,[b_1\sin(\omega) + b_2\sin(2\omega)]}{[a_0 + a_1\cos(\omega) + a_2\cos(2\omega)] - j\,[a_1\sin(\omega) + a_2\sin(2\omega)]} $$

For:

$$ A = b_0 + b_1\cos(\omega) + b_2\cos(2\omega) $$
$$ B = -\,[b_1\sin(\omega) + b_2\sin(2\omega)] $$
$$ C = a_0 + a_1\cos(\omega) + a_2\cos(2\omega) $$
$$ D = -\,[a_1\sin(\omega) + a_2\sin(2\omega)] $$

we get:

$$ H(\omega) = \frac{A + jB}{C + jD} \cdot \frac{C - jD}{C - jD} = \frac{(AC + BD) + j\,(BC - AD)}{C^2 + D^2} $$

and the magnitude response of the filter:

$$ |H(\omega)| = \frac{1}{C^2 + D^2}\sqrt{(AC + BD)^2 + (AD - BC)^2} \qquad (3.4) $$
Note here that $\omega$ is frequency in radians ($0$ to $2\pi$) and can be written as $\omega = 2\pi f_i/f_s$. The magnitude response of a filter encodes gain information for every frequency within its bandwidth.

Figure 3.3: Magnitude response of the cascaded filter for vowel A

For any generated sine wave with frequency $f_i$ that is part of the source waveform, the function above calculates its amplitude according to the 4 formant filters of each vowel. Essentially, the plugin generates 34 pre-filtered signals, which gives a lot more flexibility in terms of transmission. The resulting method is additive synthesis, because the source is constructed rather than carved into shape. [3] Figure 3.4 demonstrates how the frequency response function has shaped the band-limited pulse according to the 4 resonances of the vowel A; it clearly shows how different frequencies are generated with different magnitudes.

Figure 3.4: The vowel A transmitted from the plugin

The 4
  • 129. lters are combined in a cascaded array at the end of the C++ function. As mentioned in chapter 2, the cascaded method is ideal for vowel synthesis and very easy to implement. Expressing the
  • 130. lter transfer function in a factorized form as seen below: H (z) = PM k=0 bkzk 1 PN k=1 akzk = G (1 z1z1) (1 z 1z1) (1 z2z1) (1 zz1) : : : (1 p1z1) (1 p1 z1) (1 p2z1) (1 p2 z1) : : : (3.5) where G is the cascaded
  • 131. lter gain and the poles pi and zeros zi are either complex conjugate pairs or real-valued. By grouping the factorized equation in terms of the complex conjugate and real valued pairs we
  • 132. Chapter 3. Implementation 32 get: H (z) = G (1 z1z1) (1 z 1z1) (1 p1z1) (1 p1 z1) (1 z2z1) (1 z 2z1) (1 p2z1) (1 p2 z1) : : : (3.6) Equation 3.6 represents the cascaded expression of 2nd order IIR
  • 133. lters enclosed in the big brackets which can also be expressed as: H (z) = G YK k=1 Hk (z) (3.7) Finally, a low pass
  • 134. lter with centre frequency at 2.5kHz has been added to the cascaded array so that unwanted harmonics are neglected. 3.4 Spectral Analysis The magnetic resonator piano, like most structures, has some speci
  • 135. c acoustic properties. When sound waves travel through its body their spectrum is inevitably shaped in a similar way the vocal tract shapes the glottal source. In the attempt to transmit the output of the plugin through the MRP, the piano's formants have to be taken under consideration. The accurate reproduction of the vowels generated from the AU plugin requires a medium with a at frequency response. Given that the piano certainly does not have a at frequency response a compensative method was necessary.
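Before turning to the analysis of the piano itself, the synthesis stage described above can be summarised in code: evaluate the cascaded response of equation (3.7) at each partial of the fundamental, and use those values as the partials' amplitudes in an additive render. This is a sketch under stated assumptions, not the plugin's actual code; the names and the 34-partial count mirror the text, everything else is illustrative:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

const double kPi = 3.14159265358979323846;

// One 2nd-order IIR (biquad) section.
struct Biquad { double b0, b1, b2, a0, a1, a2; };

// Magnitude of one biquad at angular frequency w (equation 3.4).
double magnitude(const Biquad& f, double w) {
    double A = f.b0 + f.b1 * std::cos(w) + f.b2 * std::cos(2.0 * w);
    double B = -(f.b1 * std::sin(w) + f.b2 * std::sin(2.0 * w));
    double C = f.a0 + f.a1 * std::cos(w) + f.a2 * std::cos(2.0 * w);
    double D = -(f.a1 * std::sin(w) + f.a2 * std::sin(2.0 * w));
    return std::sqrt((A * C + B * D) * (A * C + B * D)
                   + (B * C - A * D) * (B * C - A * D)) / (C * C + D * D);
}

// Per-partial gains: the cascaded response G * H1 * H2 * ...
// (equation 3.7) evaluated at f0, 2*f0, ..., numPartials*f0.
std::vector<double> partialGains(const std::vector<Biquad>& cascade,
                                 double G, double f0, double fs,
                                 std::size_t numPartials = 34) {
    std::vector<double> gains(numPartials, G);
    for (std::size_t k = 0; k < numPartials; ++k) {
        double w = 2.0 * kPi * f0 * static_cast<double>(k + 1) / fs;
        for (const Biquad& f : cascade)
            gains[k] *= magnitude(f, w);
    }
    return gains;
}

// One output sample of the additively built, pre-filtered source.
double renderSample(const std::vector<double>& gains, double f0, double t) {
    double s = 0.0;
    for (std::size_t k = 0; k < gains.size(); ++k)
        s += gains[k] * std::sin(2.0 * kPi * f0 * static_cast<double>(k + 1) * t);
    return s;
}
```

Because the filtering is baked into the per-partial gains, the render loop itself is a plain sum of sines, which is what makes the method additive rather than subtractive.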
The strings used for this project were specifically chosen so that their resonant frequencies cover the vocal range (50–500 Hz). C2, E2, G2, D3, E3, G3, C4, E4 and G4 were tested and their output was recorded in order to be analysed. Their resonant frequencies are:
• C2 - 65 Hz
• E2 - 82 Hz
• G2 - 97 Hz
• D3 - 146 Hz
• E3 - 164 Hz
• G3 - 197 Hz
• C4 - 260 Hz
• E4 - 327 Hz
• G4 - 389 Hz
The idea is to feed the strings with all the vowels, a version of all the vowels filtered at 1 kHz, and white noise, all at different dB levels; record all the outputs; and derive an average frequency response for each string and, consequently, the whole body of the piano. Knowing how each string responds is very important towards the implementation of an inverse filter which will flatten that response and make vowel transmission more intelligible.
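For reference, the nominal equal-temperament fundamentals of these notes can be computed from their MIDI numbers; the measured resonances listed above deviate slightly from the nominal values, as expected for real piano strings. This helper is an assumption for illustration, not part of the thesis software:

```cpp
#include <cmath>

// Nominal 12-tone equal-temperament frequency for a MIDI note number,
// with A4 (MIDI 69) = 440 Hz. For the nine strings used here:
// C2=36, E2=40, G2=43, D3=50, E3=52, G3=55, C4=60, E4=64, G4=67.
double midiToHz(int midiNote) {
    return 440.0 * std::pow(2.0, (midiNote - 69) / 12.0);
}
// e.g. midiToHz(36) gives about 65.4 Hz (C2) and midiToHz(64) about
// 329.6 Hz (E4): close to, but not exactly, the measured resonances.
```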
Figure 3.5: The frequency response of C4 under all test conditions for vowel A

Figure 3.6: The frequency response of G3 for all vowels
The response of a system $h(n)$, knowing its input $x(n)$ and output $y(n)$, is a simple comparison of $y(n)$ with regard to $x(n)$, and in theory it should be the same no matter what the input is. Figures 3.5 and 3.6 show the response of the string C4 under a variety of test conditions. The responses of each string for all vowels under every test condition were gathered in a significant number of arrays. By averaging these arrays, a clear view of how each string behaves to a number of harmonic inputs is revealed, and by taking the inverse of the average response a quantifiable pattern emerges towards the creation of the inverse filter.

Figure 3.7: Average frequency response and its inverse for string G3

The amplitude differences between the red line and the blue line shown in figure 3.7, at the frequencies corresponding to the peaks of the red line, are the gain and frequency values to be imported to the inverse filter.
3.5 Plugin Parameters Description

The design of the vowel generator AU plugin is such that it offers a variety of adjustable parameters to the user. This section includes a description of its architecture and an examination of its features.

Figure 3.8: A picture of the plugin in operation

Figure 3.8 shows the layout of the plugin. Its parameters, from top to bottom, are:
• A master ON/OFF button enables and disables the output. The button turns red to indicate when it is in OFF mode.
• A volume knob ranging from 0 to 10 dB with a step of 0.1.
• The wave type drop-down list offers 4 choices of waveform types to be generated. The user has the option to create vowels with sine waves, triangle waves, sawtooth waves or rectangular waves.
• The vowel type drop-down list is the parameter with which the user can switch between the 5 different vowels.
• A frequency knob that sets the fundamental frequency of the vowel. This parameter ranges from 50 to 500 Hz with a step of 1.
• 9 buttons, one for each string. These buttons configure the settings of the inverse filter according to the analysis conducted on each string separately. The user may switch between 9 filters depending on which piano key is being used. Each button turns white when active and red when not.
• 34 volume sliders, one for each harmonic and the fundamental frequency f0. These sliders function as a graphic equaliser and range between -50 and +50 dB with a step of 1. When an inverse filter button is pressed, these sliders automatically take the values of the inverse filter of the corresponding string.

3.6 Results

This project set out to create a vowel synthesis system that receives intelligible vowels from an AU plugin and plays them back through the MRP strings while keeping their intelligibility. The results of this attempt are difficult to show on paper. Evaluating whether a vowel is intelligible means one has to listen to it to comprehend it. Being able to identify whether a sound is a vowel, natural or not, is the goal here. The spectral content of the output of the piano in comparison to the output of the plugin is the optimal way to view this project's results. Due to the sheer volume of results, this section will show the three most intelligible vowels transmitted from the piano and the three least intelligible.

Figure 3.9: Spectrum of piano output vs plugin output of G3 playing the vowel O
Figure 3.10: Spectrum of piano output vs plugin output of E4 playing the vowel A

Figure 3.11: Spectrum of piano output vs plugin output of C4 playing the vowel O

Figure 3.12: Spectrum of piano output vs plugin output of G2 playing the vowel A

Figure 3.13: Spectrum of piano output vs plugin output of C2 playing the vowel I

Figure 3.14: Spectrum of piano output vs plugin output of D3 playing the vowel E

The figures of this section show the spectral behaviour of 6 piano strings, represented by the blue lines, superimposed on the spectrum of the plugin output (the piano input), represented by the dashed red lines. Intelligibility of the piano output is shown when the peaks of the blue line follow the pattern created by the peaks of the red line. In other words, when the spectral shape of the blue line is close to the spectral shape of the red line, or on some occasions the same, regardless of the general difference in dB level, it means that the plugin output has retained the magnitude relationship of its spectral peaks through the piano strings. On the occasions where the formants of the two lines do not follow the same pattern, it is clear that the acoustical structure of the piano has distorted the input's frequency content, and thus the inverse filter has not fully compensated for it.

It is clear that figures 3.9, 3.10 and 3.11 are graphs of intelligible vowels, whereas figures 3.12, 3.13 and 3.14 are plots of vowels that are quite difficult to identify. Further discussion on the results will take place in the conclusion.
Chapter 4

Conclusion

This final chapter concludes the examination of the project. A discussion on the chosen method, the implementation and the results is included, as well as a final evaluation. The last two sections of the chapter concern possible improvements in terms of programming and analysis, and a summary of the report.

4.1 Discussion

A thorough investigation was carried out on the science behind the human voice, the most popular speech synthesis systems on the market, and the magnetic resonator piano. In the literature review, this paper examined the mechanics of how voice is produced, deduced measurable parameters of voicing, and explained the major differences between three speech synthesis models.

The formant synthesis model was chosen, which according to the investigation was the appropriate method in terms of best quality for vowel generation and low computational cost. The JUCE framework was used within Xcode 5 on a Macintosh operating system for the creation of an audio plugin. The implementation of this model required the generation of a source waveform and its filtering by a cascaded array of 2nd order IIR filters. Chapter 3 includes a detailed analysis of the source/filter model, providing a description of the filtering stage which takes this method a step further.

After intelligible vowels were produced by the AU plugin, an analysis of how the structure of the MRP affects the frequency content of the input was conducted. The analysis revealed a pattern in the frequency response of the 9 strings of the piano that were used in this project. Nine inverse filters were incorporated into the audio plugin in order to obtain a flat frequency response. Finally, 6 plots were shown representing the most intelligible and the least intelligible vowels generated by this complex system.

4.2 Evaluation

The project's aims were to create a vowel generation system using a digital audio synthesiser and to transmit its output accurately via the strings of the MRP. Considering that the AU plugin produced intelligible vowels that are in no way realistic, an understanding emerges of how the evaluation process will be carried out.

In terms of the spectral shape relationship between the input and the output (a close relationship meaning success, a distant one meaning failure), this project has been a success. Most of the 45 vowels transmitted via the piano strings at 9 different fundamental frequencies had a very close spectral shape relationship with the input.

The vowels generated by the digital audio synthesiser neglect a significant number of characteristics that would make them sound real. Early reflections in the nasal and oral cavities, unvoiced sounds, small pitch variations and many other parameters are not included in the source/filter model. The resulting sound lacks the transients of the natural voice, and its closest approximation to reality is as if we took the steady-state part of a real vowel and put it in an indefinite loop. Although real, it would not sound natural or intelligible. Within a context of successively changing vowels, though, the perception of intelligibility changes dramatically. For example, when the piano plays the vowel A at 65 Hz through C2, it sounds nothing like an A. But when it plays all the vowels in successive order, we start to recognise the different types. Even the vowels whose spectrum plots did not meet the success criteria become intelligible in a context of successive transmission of different vowels. And vice versa: the most intelligible vowels in terms of spectral shape relationship are not intelligible when played out of any context. To conclude, the author's evaluation is that the project is a successful first step towards a usable vowel generation system, but as it stands it is incomplete.

4.3 Improvements

Possible improvements towards the better functioning of this vowel synthesis system concern programming and spectral analysis.

– Algorithm improvements: The AU plugin, although very successful, could incorporate some more parameters to generate a more natural output. Small pitch variations of the fundamental frequency of the source and of the centre frequency of each formant filter could be implemented in order to model the human voice more accurately. White noise could model the unvoiced air escaping the vocal folds, resulting in a harmonically richer output. Finally, a parallel array of 2nd order IIR filters could make consonant generation possible and would model vowel reflections in the mouth and nasal cavities.

– Spectral analysis improvements: The inverse filtering applied in this project is based on a linear relationship between input and output. The MRP is a physical structure, and like all physical structures it is non-linear. One of the main reasons that this project had only partial success in the frequency content relationship between input and output was intermodulation distortion. This non-linear distortion, produced by the acoustic architecture of the piano, resulted in the appearance of some very unpredictable harmonics in the output spectrum. A major improvement for this project would be to analyse the non-linearity of the piano and derive a mathematical model that predicts the distortion and designs its inverse filter more accurately. This might also allow the creation of a function performing vowel transitions across their corresponding spectral targets. Such transitions were attempted during this project, but distortion coming from the piano excited a vast variety of harmonics during the vowel transition; the result was very noisy and contradicted the goals of the project.
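The pitch-variation and noise ideas proposed above could be prototyped roughly as follows. This is a hypothetical sketch, not part of the plugin; the class name, jitter depth and noise mix are all illustrative assumptions:

```cpp
#include <algorithm>
#include <cmath>
#include <random>

const double kPi = 3.14159265358979323846;

// A sine oscillator whose fundamental wanders slightly around f0
// (pitch jitter), with a little white noise mixed in to stand in for
// the unvoiced air escaping the vocal folds.
class JitteredVoiceSource {
public:
    JitteredVoiceSource(double f0, double fs)
        : f0_(f0), fs_(fs), rng_(42),
          jitter_(-0.0001, 0.0001), noise_(-1.0, 1.0) {}

    double nextSample() {
        // Slow random walk of the detune ratio, clamped to +/-1%.
        detune_ += jitter_(rng_);
        detune_ = std::max(-0.01, std::min(0.01, detune_));
        phase_ += 2.0 * kPi * f0_ * (1.0 + detune_) / fs_;
        if (phase_ > 2.0 * kPi) phase_ -= 2.0 * kPi;
        // 95% voiced tone, 5% noise: output stays within [-1, 1].
        return 0.95 * std::sin(phase_) + 0.05 * noise_(rng_);
    }

private:
    double f0_, fs_;
    double phase_ = 0.0, detune_ = 0.0;
    std::mt19937 rng_;
    std::uniform_real_distribution<double> jitter_, noise_;
};
```

The same jittered source could replace each partial's fixed-frequency sine in the additive render, and an analogous wander could be applied to the formant filters' centre frequencies.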
4.4 Summary and Final Thoughts

The goals of the bold attempt described in this paper have been achieved only in part. The research and implementation by the author resulted in a robust vowel synthesis system which successfully transmits the majority of the vowels from a digital audio synthesiser via the MRP strings; however, it is not yet a usable instrument. Improvements have been proposed, and room for further research by other scientists has been left by the author. Overall, it is hoped that this project has been a positive step within the fascinating field of speech processing.
Bibliography

[1] History and development of speech synthesis. 2006. URL http://www.acoustics.hut.fi/publications/files/theses/lemmetty_mst/chap2.html.
[2] Richard W. Sproat. Multilingual Text-to-Speech Synthesis: The Bell Labs Approach, volume 4. Springer, 1997.
[3] Sam O'Sullivan. Understanding the basics of sound synthesis. The Pro Audio Files, February 2012. URL http://theproaudiofiles.com/sound-synthesis-basics/.
[4] Andrew McPherson. The magnetic resonator piano: Electronic augmentation of an acoustic grand piano. Journal of New Music Research, 39(3):189–202, 2010.
[5] Andrew McPherson and Youngmoo Kim. Augmenting the acoustic piano with electromagnetic string actuation and continuous key position sensing. NIME, 1:217–222, 2010. URL http://www.educ.dab.uts.edu.au/nime/PROCEEDINGS/papers/Paper%20K1-K5/P217_McPherson.pdf.
[6] Glen Lee. Voice synthesis. The Encyclopaedia of Virtual Environments, 1, 1993. URL http://www.hitl.washington.edu/projects/knowledge_base/virtual-worlds/EVE/I.B.2.VoiceSynthesis.html.
[7] Upper respiratory system diagram. URL http://quizlet.com/12905648/module-1-the-respiratory-system-anatomy-and-physiology-flash-cards/.
[8] Johan Sundberg. The acoustics of the singing voice. 1997. URL http://www.zainea.com/voices.htm.
[9] Jackie R. Haynes and Ronald Netsell. The mechanics of speech breathing: a tutorial. Department of Communication Sciences and Disorders, Southwest Missouri State University, 2001.
[10] Deirdre D. Michael. About the voice. Lions Voice Clinic, 2014. URL http://www.lionsvoiceclinic.umn.edu/page2.htm.
[11] Larynx. URL http://learnhumananatomy.com/larynx/.
[12] The Voice Foundation. Voice anatomy physiology. 2014. URL http://voicefoundation.org/health-science/voice-disorders/anatomy-physiology-of-voice-production/.
[13] Janwillem van den Berg. Myoelastic aerodynamic theory of voice production. Journal of Speech, Language, and Hearing Research, September 1958. URL http://jslhr.pubs.asha.org/article.aspx?articleid=1749406.
[14] C. Julian Chen. Physics of human voice: A new theory with application. Research Conference, Columbia University, 1(1):1–19, November 2012. URL http://www.google.co.uk/url?sa=trct=jq=esrc=ssource=webcd=1ved=0CCMQFjAAurl=http.
[15] Glottal source spectrum. URL http://www.ncvs.org/ncvs/tutorials/voiceprod/images/5.5.jpg.
[16] Dinesh K. Chhetri. Neuromuscular control of fundamental frequency and glottal posture at phonation onset. Acoustical Society of America, November 2011. URL http://headandnecksurgery.ucla.edu/workfiles/Academics/Articles/neuromusc_control_chhetri_et_al.pdf.
[17] Joe Wolfe, Maeva Garnier and John Smith. Voice acoustics: an introduction. UNSW, 2, May 2009. URL http://www.phys.unsw.edu.au/jw/voice.html.
[18] Eric Armstrong. Journey of the voice: Anatomy, physiology and the care of the voice. Voice and Speech Source, 2008. URL http://www.yorku.ca/earmstro/journey/resonation.html.
[19] Gunnar Fant. Acoustic Theory of Speech Production. Mouton & Co, 1960.
[20] Source/filter model. URL http://www.phys.unsw.edu.au/jw/speechmodel.html.
[21] Xuedong Huang, Alex Acero and Hsiao-Wuen Hon. Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Prentice Hall, first edition, 2001.
[22] Articulators. URL http://educationcing.blogspot.co.uk/2012/08/articulatory-phonetics-vocal-organs.html.
[23] Concatenative synthesis. URL http://www.acoustics.hut.fi/publications/files/theses/lemmetty_mst/chap9.html.
[24] Praat. URL http://www.fon.hum.uva.nl/praat/.
[25] Will Pirkle. Designing Audio Effect Plug-Ins in C++. Focal Press,