Shishir Tandale
Teaching a Computer to Speak
Sound is everywhere in our society, from Mozart’s symphonies to Dr. Martin Luther
King’s speeches to the faint noise of rain hitting a window during an evening drizzle. Sound
communicates more than a simple message. Speech in particular carries bias, intrinsic emotion,
and at times a hint of sarcasm whose source is, try as you might, nearly impossible to isolate.
What word makes something sarcastic? Is it a tone, or a rhythm of speech? The root of sarcasm
might even differ between people, with one person dragging each syllable and modulating the
pitch of each phrase to exaggerate his speech to the point of sounding sarcastic. The analysis of
the human voice is an ongoing problem with many obstacles to work around and no single
answer to the question “what about our speech makes it human?” Back in the 1980s, medical
researchers published a paper in the American Journal of Nursing describing an artificial speech
device that played back a series of recorded vocal audio clips in a way that emulated human
speech. This rather simple invention was heralded at the time for revolutionizing the options
patients with speech disabilities had for communicating (McCormick et al. 1982). If we heard
this invention working today, with its jittery, muffled, electronic-sounding emulation of human
speech, none of us would consider this “artificial speech device” a suitable solution to a speech
disability. Today’s artificial speech
devices are tremendously different in both form and function. Any recent smartphone can launch
Siri, Google Now, or Microsoft’s Cortana, virtual assistants that fill the room with a crystal-clear
voice asking what the user needs today, with some already compiling a daily schedule and
reminding you about an upcoming flight you should be packing for (Hughes 2014). Voice
analysis is still a growing field; current research problems include recognizing emotion in speech
and creating the next stage of artificial speech device, one that vastly surpasses McCormick’s
invention of the early ‘80s and even the Siri sitting in your pocket.
This is an important science that is paramount to product innovation because our voices are
fundamentally what make us human, and understanding the humanity and emotion in voice lets
us build technology that can assist us in ways we never thought possible before.
A first step in understanding the human voice is to try to emulate it. The current method
of creating artificial human speech is to record a speaker, break each passage he or she reads into
its phonetic sounds, and store everything in a database for later playback (a text-to-speech
program). When the program is later fed a passage of text, it associates each word with its
syllables and each syllable with the appropriate phonetic sound. The parts are assembled into a
whole audio file and played back (Acapela Group 2014). There are plenty of
limitations to this method, such as the lack of flexibility when recording a new voice; if
somebody wanted to give Siri a man’s voice, the only possible way to create one would be to
find a speaker willing to sit down and record the long passages needed for a proper virtualized
voice. There is also a tremendous range of different ways a certain passage could be spoken,
each with a unique connotation and tone. Without locking the speaker in a sound-proof room for
weeks at a time reading passage after passage, there is no feasible way to make a complete
database to map everything, and this leads to holes in the database.
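
To make the database approach concrete, here is a minimal Python sketch of concatenative playback under stated assumptions: the phoneme inventory, the pronunciation table, and the clip file names are hypothetical stand-ins for the recorded database described above, not any shipping product’s implementation.

```python
# Minimal sketch of the database-playback idea described above.
# The phoneme inventory, lookup table, and file names are hypothetical;
# a real concatenative text-to-speech system uses far larger unit databases.
import wave

# Hypothetical mapping from phonetic units to pre-recorded clips of the speaker.
PHONEME_CLIPS = {
    "HH": "clips/hh.wav",
    "EH": "clips/eh.wav",
    "L":  "clips/l.wav",
    "OW": "clips/ow.wav",
}

# Hypothetical word-to-phoneme lookup built when the recordings were segmented.
PRONUNCIATIONS = {"hello": ["HH", "EH", "L", "OW"]}

def synthesize(text, out_path="out.wav"):
    """Concatenate stored phoneme recordings for each word in the text."""
    frames, params = [], None
    for word in text.lower().split():
        for phoneme in PRONUNCIATIONS.get(word, []):
            with wave.open(PHONEME_CLIPS[phoneme], "rb") as clip:
                params = clip.getparams()
                frames.append(clip.readframes(clip.getnframes()))
    if params is None:
        raise ValueError("no phonemes found for the given text")
    with wave.open(out_path, "wb") as out:
        out.setparams(params)   # assumes all clips share one sample rate/format
        out.writeframes(b"".join(frames))

# synthesize("hello")  # would write out.wav if the hypothetical clips existed
```

Every gap in PRONUNCIATIONS or PHONEME_CLIPS corresponds to one of the “holes in the database” mentioned above: a unit the speaker never recorded simply cannot be played back.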
In a study of the efficacy of these vocal databases, judges listened to sentences spoken by
a popular Swedish text-to-speech program that was given different target emotions to express;
the program proved fairly accurate in some cases while struggling elsewhere. One issue was that
a falsetto voice used by the program was interpreted as both fearful and bored, two mutually
exclusive feelings (Gobl & Chasaide 2003). This shows a
disconnect between current speech synthesis (another term for text-to-speech) capabilities and
what is needed to emulate a human voice more accurately. Emotions specifically seem to be an
issue that needs further analysis.
Emotional information is what differentiates authentic human speech from everything
else. A paper titled “Human Voice Perception” published in Current Biology wanted to map out
what makes the human voice so distinctive from other sounds. The study placed subjects inside
an fMRI machine while playing audio clips of people reading sentences, and the researchers
observed a consistently high level of activity in the region of the brain which specializes in
language processing. Later, the study was repeated, but this time the sound clips were of
different people reading sentences in various languages, some of which the subjects did not
know. Mixed in were even clips of a person reading a sentence that had been passed through a
500 Hz filter to strip out the spoken words we associate with human speech. Despite a good
portion of the clips being essentially unrecognizable to most English speakers, the researchers
found a similar level of activation in the same regions of the brain. As a control, the subjects also
listened to non-human sounds such as white noise and static, which produced no such activation
in those parts of the brain (Latinus & Belin
2011). This indicates that there is something special about human speech beyond the words
being spoken; otherwise, the group that heard the filtered clips, or even the group that heard
foreign-language clips, would not have shown the same level of brain activity as the group that
heard ordinary sentences read in English. The only thing the activating clips had in common was
that they were read aloud by a human, suggesting that something in the voice itself causes us to
listen more intently and to treat the source as human.
This is evidenced in “Recognition of Emotion from Vocal Cues” in the Archives of
General Psychiatry, in which Johnson, Emde, Scherer, and Klinnert performed a study similar to
Gobl and Chasaide’s work on artificial voices, but using real human speakers told to convey a
particular emotion. The speakers were recorded, and the audio clips were given to judges to
decide which emotion each clip expressed. One limitation to note is that when told to act out an
emotion, the speakers naturally exaggerated aspects of their speech to portray it more clearly. An
interesting feature of the study was that some clips were given to the judges in scrambled form:
reversed, clipped, passed through vocal filters (which isolate specific frequencies of sound), or
otherwise modified. The result was that the judges almost always guessed the correct emotion,
even when given the altered clips (Johnson et al. 1986). Combined
with the results of Latinus and Belin’s study, which showed that the human voice has some
intrinsic quality that cannot be erased, Johnson’s study suggests that this quality is emotion.
Emotional data is a quality of the human voice that does not change when the audio is reversed
or distorted, and even when the entire spoken-word portion of a voice is removed from a clip,
emotional data is left behind. Most importantly, emotion is a trait of the human voice that no
current text-to-speech program can emulate.
The problem of figuring out what to do with this emotional data goes hand-in-hand with
finding specifically where emotion hides in human speech – the idea being if we know what
makes something emotional, we can emulate it. The first real leap in emotional speech analysis
was a patent filed in 2000 titled “Apparatus and Methods for Detecting Emotions in the Human
Voice” (Liverman) which described a procedural, mathematical algorithm for categorizing and
quantifying emotions in human speech. The first step in his algorithm is to break a single voice
waveform (simply the sound wave of a spoken voice) into multiple composite waveforms, each
at a discrete frequency. Any sound we speak is really a vibration of the vocal cords, and a
program called a spectrograph takes the whole voice and isolates the individual frequency
components that make it up. This is analogous to using a prism to refract white light into all the
colors of a rainbow. Liverman states that the lowest few frequencies correlate strongly with high
emotional activity. The next step, to find out what those emotions might be, is to compute the
fundamental frequency (f0), essentially an average frequency of the lowest parts of a voice and,
according to Liverman, an indicator of a person’s mood. Based on the actual spoken-word
component of the voice, a guess at the classification of the emotion (e.g., happiness, sadness,
anger) can then be made and quantified by analyzing other aspects of vocal topology. Ultimately,
emotions are located using activity in the lowest frequencies of a voice and then characterized
and quantified using f0 and various other traits of the voice.
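
As a rough illustration of the kind of analysis this step relies on, the Python sketch below measures how much spectral energy sits in the lowest frequencies and estimates f0 from a short signal. The 500 Hz cutoff, the search range, and the autocorrelation-based pitch estimate are generic signal-processing assumptions for illustration, not the procedure in Liverman’s patent.

```python
# Rough sketch of the spectral decomposition and f0 estimate described above,
# using NumPy's FFT. The frame length and the autocorrelation-based pitch
# estimate are generic signal-processing choices, not Liverman's exact method.
import numpy as np

def lowest_band_energy(signal, sample_rate, cutoff_hz=500.0):
    """Share of spectral energy below cutoff_hz -- a stand-in for the
    'activity in the lowest few frequencies' discussed in the text."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return spectrum[freqs < cutoff_hz].sum() / spectrum.sum()

def estimate_f0(signal, sample_rate, fmin=60.0, fmax=400.0):
    """Estimate the fundamental frequency via autocorrelation peak picking."""
    signal = signal - signal.mean()
    corr = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    lo, hi = int(sample_rate / fmax), int(sample_rate / fmin)
    lag = lo + np.argmax(corr[lo:hi])
    return sample_rate / lag

# Example with a synthetic 120 Hz "voice" instead of a real recording:
sr = 16000
t = np.arange(2048) / sr
voiced = np.sin(2 * np.pi * 120 * t) + 0.3 * np.sin(2 * np.pi * 240 * t)
print(round(estimate_f0(voiced, sr)), round(lowest_band_energy(voiced, sr), 2))
```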
Characterizing a person’s voice does not stop there. Even with Liverman’s algorithm,
holes still exist in the accuracy of our models. In an article titled “Acoustic Markers of Emotions
based on Voice Physiology” published as a part of the Speech Prosody International Conference,
Sona Patel and her team dismissed the use of fundamental frequency (f0) as a metric for
emotional classification, arguing that no scientifically rigorous study had achieved statistically
significant results classifying a particular emotion from a person’s voice using f0 as its main
metric. Instead, Patel’s team found a statistically verifiable classification approach using other
topological traits such as the alpha ratio, Leg ratio, LTAS, mean characteristic frequencies, and
varying types of shimmer (Patel et al. 2010). This directly
contradicts the method Liverman used to specify emotion. However, Liverman’s algorithm still
describes a procedure for identifying parts of a voice that signify high emotional activity, a
premise supported by Latinus and Belin’s and Johnson’s research on the relevance of emotion in
the human voice. Patel’s team did replace Liverman’s dated assumption of the usefulness
of f0 with a much more complicated multi-part metric spanning the whole topology of the voice
(voice “topology” is used to describe physical characteristics of a voice that can be directly
observed without analysis), showing that emotional characterization is not as simple as
measuring one quantifiable property of voice. Instead, a sophisticated voice analysis engine is
required that can measure multiple properties with great precision, which makes a dead end of
the quest for a simple way to emulate human emotion in speech. Since speech synthesis programs
of the future should respond quickly and without delay, incorporating all of the qualities Patel’s
team studied is impractical and points to a limitation in the current way we attempt to emulate
speech.
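
To give a sense of what a multi-part metric involves, the sketch below computes two simplified acoustic measures and bundles them into a feature vector for a classifier. The 1 kHz band split and the crude frame-based shimmer proxy are illustrative assumptions, not the exact definitions used by Patel’s team.

```python
# Hedged sketch of a multi-feature approach like the one Patel's team argues
# for: several acoustic measures computed together rather than f0 alone. The
# band edges and the simplified shimmer estimate are illustrative assumptions.
import numpy as np

def alpha_ratio(signal, sample_rate, split_hz=1000.0):
    """Energy above split_hz divided by energy below it (one common definition)."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    low = spectrum[(freqs > 50) & (freqs <= split_hz)].sum()
    high = spectrum[freqs > split_hz].sum()
    return high / low

def shimmer(signal, frame=256):
    """Crude shimmer proxy: mean relative change in peak amplitude per frame."""
    peaks = np.array([np.abs(signal[i:i + frame]).max()
                      for i in range(0, len(signal) - frame, frame)])
    return np.mean(np.abs(np.diff(peaks)) / peaks[:-1])

def feature_vector(signal, sample_rate):
    """Bundle several traits into one classification input, as the text suggests."""
    return np.array([alpha_ratio(signal, sample_rate), shimmer(signal)])
```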
A promising alternative to the classic database text-to-speech program is something
called a need-based emotional model, or NEMO. This concept is described by researcher
Syaheerah Lutfi and colleagues in their paper “I Feel You: The Design and Evaluation of a
Domotic Affect-Sensitive Spoken Conversational Agent” in the journal Sensors. A NEMO is a
machine-learning text-to-speech program, most commonly seen with additional lossless
frequency and amplitude modulation capabilities. Compared to a classical text-to-speech
program, a NEMO rates two to three times as effective at communicating emotion in speech,
with comparable speed and efficiency (Lutfi et al. 2013). The term machine learning usually
refers to a neural network, a complex mathematical algorithm which accomplishes a task (in this
case, transforming text to speech), analyzes the correctness of its result, and adjusts its
procedures so that the next iteration is more accurate and conforms more closely to the required
model. This delegation of learning to the program itself is powerful: after a training period, the
speech synthesis program becomes very accurate, leading to the results Lutfi’s team saw.
Incorporating the details Patel’s team discovered would be as simple as training a NEMO on
those parameters and letting it teach itself the correct associations. At its heart, a NEMO seeks to
emulate a human thinking process to ensure maximum emotional potency, so at its best, a
NEMO does not require a hard-wired catalog of things it can say.
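
The learn-check-adjust cycle described above can be illustrated with a toy gradient-descent loop: a tiny network maps made-up text features to made-up prosody targets (say, pitch and loudness) and nudges its weights after every pass. This is a generic sketch of neural-network training, not the actual NEMO architecture from Lutfi’s paper.

```python
# Toy illustration of the learn-check-adjust loop described above. The
# "text features" and "prosody targets" here are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))             # hypothetical per-sentence text features
Y = X @ rng.normal(size=(8, 2)) * 0.5     # hypothetical prosody targets (pitch, energy)

W1 = rng.normal(scale=0.1, size=(8, 16))
W2 = rng.normal(scale=0.1, size=(16, 2))
lr = 0.01

for epoch in range(200):
    hidden = np.tanh(X @ W1)              # accomplish the task: predict prosody
    pred = hidden @ W2
    err = pred - Y                        # analyze the correctness of the result
    loss = (err ** 2).mean()
    grad_W2 = hidden.T @ err / len(X)     # adjust procedures for the next iteration
    grad_hidden = err @ W2.T * (1 - hidden ** 2)
    grad_W1 = X.T @ grad_hidden / len(X)
    W1 -= lr * grad_W1
    W2 -= lr * grad_W2

print(f"final training loss: {loss:.4f}")
```

After enough passes the loss shrinks, which is the point of the arrangement: the program converges toward the required model without anyone hand-coding the mapping.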
This is different from the classical text-to-speech database model, which explicitly
transforms sentences into speech. Several text-to-speech applications are currently in use around
the world for a variety of purposes; Dennis Mitzner, chief editor of Inside3DP and a commentator
on technical issues for many popular news sources, recently wrote about Voxdox, a text-to-
speech app catered specifically to people with dyslexia (Mitzner 2014). Since people with
dyslexia have trouble reading text, having the text read aloud is an effective solution, and
text-to-speech apps allow this without another reader being present. For simple tasks like this
there is no need for a NEMO-based system’s inherent complexity or flexibility; the true
usefulness of a NEMO lies in applying it directly to an artificial intelligence such as Siri or
Google Now, assistants already omnipresent in many of our lives.
Imagine being able to talk to Siri or Google Now but in a completely informal setting
without worrying whether or not the apps can comprehend you. Better yet, imagine Siri being
able to talk back to you with any emotion she feels is relevant to the current situation. You tell a
joke, and she laughs; you ask a ridiculous question, she gives back a sarcastic response; you
sleep through your alarm in the morning and she’s angry when she wakes you up. While this
sounds like living with a girlfriend or wife, it could very well be the future of artificial
intelligence thanks to a NEMO calibrated to give the best response for you in particular. As
devices become faster and more widespread, increasingly complex NEMO-based systems can be
implemented in phones, cars, laptops, even office buildings, all connected to a huge artificial
intelligence system behind the scenes. Secretary jobs may give way to a highly tailored artificial
intelligence, and cars may drive themselves while keeping you updated on everything going on
at work, but only if we have the capacity to build such systems.
Configuring a machine to understand human demands is quite separate from
programming one, like a NEMO, to think like a human. While a NEMO might need a week-long
training session to learn enough to roughly emulate human emotion, we humans spend years of
our lives learning about everything around us, and our experiences and memories play a large
role in how we think. For a comparably sophisticated artificial intelligence to exist, a NEMO
would need the same learning opportunities, and it is impractical to send a NEMO through the
18 to 60 years of training behind an average secretary. Sophisticated object and command
recognition could be built into a NEMO so it would not require as much training, but as
illustrated by Google researcher Oriol Vinyals and his team’s computer vision research, object
recognition systems still describe only around fifty percent of scenes correctly without major
inaccuracies (Google Research Blog 2014). Even the finest neural
network-based computer vision programs show similar, if not worse, failure rates when looking
at a picture of Richmond, VA’s Monroe Park and attempting to describe it, producing anything
from “Apple Tree” to “Large Field with Trees.” Very seldom does an object recognition engine
reach a level of competence that allows such a picture to be described simply as a “Park.” While
the previous descriptions are not inherently incorrect (there may or may not be an apple tree, and
Monroe Park does have a few field-like areas), the overall description is flawed. This is
analogous to a secretary telling you to fly to the Swiss Alps rather than the Andes mountain
range. Fundamentally, mountains are mountains, but if you happen to be an archeologist
investigating the Inca civilization, Lindt chocolates do not help with your research.
Speech analysis and synthesis have come a long way since McCormick’s team revealed
their “artificial speech device” in the early ‘80s (McCormick et al. 1982). Developments in
speech analysis have included a first look at what makes up the human voice in terms of intrinsic
properties. Emotion was revealed to be a major component of the human voice (Latinus & Belin
2011), persisting even through attempts to modify a voice by shearing, reversing, and filtering
the audio. The fact that emotion could still be detected after heavy modification of the source
audio pointed toward it being a physical property revealed in directly observable traits of the
speech signal rather than a product of analyzing the verbal content (Johnson et al. 1986). This
allowed Liverman to develop an algorithm describing the emotion hidden
inside a voice by running the audio through a spectrograph and analyzing the lowest few
frequencies and using other characteristics of the voice to determine classification (Liverman
2000). While Liverman’s method of classification was later challenged by Patel and her team
(Patel et al. 2010), his method of finding regions of high emotional activity remains relevant and
could possibly be used to improve Patel’s team’s analysis. Because Patel’s team’s classification
metrics are so detailed and so complicated to emulate when synthesizing a voice from scratch,
they are of little use in current speech synthesis techniques, pointing to a deficiency in
present-day emulation. A need-based emotional
model is a suitable replacement for current database techniques by augmenting a database with a
sophisticated neural network that allows for greater emotional sensitivity. And because a NEMO
emulates a human thought process, it can be applied directly to an artificial intelligence rather
than serving only as a sophisticated text-to-speech engine. Research in speech analysis has led to
rapid product
innovation taking us from a rudimentary medical device from the 1980s to globally useful
applications such as Siri and Google Now today. Speech analysis is an invaluable field and is
very important for the future of technology.
Works Cited
Gobl, C., & Chasaide, A. (2003). The role of voice quality in communicating emotion, mood and
attitude. Speech Communication, 40(1), 189-212.
Google Research Blog. (2014, September 5). Building a deeper understanding of images. Author.
Hughes, N. (2014). Tests find Apple's Siri improving, but Google Now voice search slightly
better. AppleInsider.
Latinus, M., & Belin, P. (2011, February 22). Human Voice Perception. Current Biology, 21(4),
R143-R145. http://dx.doi.org/10.1016/j.cub.2010.12.033
Liverman, A. (2000, April 11). Apparatus and Methods for Detecting Emotions in the Human
Voice. International Application Published under the Patent Cooperation Treaty (PCT).
Application Number: PCT/IL00/00216.
Lutfi, S. L., Fernandez-Martinez, F., Lorenzo-Trueba, J., Barra-Chicote, R., & Montero, J. M.
(2013). I Feel You: The Design and Evaluation of a Domotic Affect-Sensitive Spoken
Conversational Agent. Sensors, 13(8), 10519.
McCormick, G. P., White, M. J., Simtoski, F. J., Stanley, L. H., & McGuinness, K. (1982,
January). Artificial Speech Devices. The American Journal of Nursing, 82(1), 121-122.
Mitzner, D. (2014). Text-to-Speech apps aid Students with Dyslexia. InformationWeek.
Patel, S., Scherer, K. R., Sundberg, J., & Bjorkner, E. (2010). Acoustic Markers of Emotions
Based on Voice Physiology. Speech Prosody International Conference.