Imaging the human voice

Imaging the Human Voice
As a Three Dimensional Surface
Ebe Helm
ABSTRACT
Nature often hides its most interesting qualities and patterns until curiosity or imagination conceives new
ways to discover and visualize them. The Nautilus shell and the Barnsley fern are just two examples of this.
The possibility that human speech, if viewed in new ways and in the expanded perspective of a three
dimensional surface, might also exhibit such patterns was the basis for this investigation. As a more practical
application, the goal was to determine whether a new approach might yield observable patterns that would
expand the science of computer speech recognition, and do so in a way that would make the topic more
approachable to a general audience.
Introduction
This paper is a follow-on effort to an earlier work that focused primarily on examining acoustic waveforms in the
more traditional linear time domain, and attempted to expand on some of the challenges encountered in computer
speech recognition [1]. That work demonstrated a function for separating the waveforms without applying Discrete
Fourier Transforms or Mel Frequency Cepstral Coefficients, and discussed the considerations of Dynamic Time Warping.
This presentation continues from that point by providing additional perspective beyond two
dimensions. The goal of the application was founded in two questions. What does a word actually look like? Is it
possible to associate the meaning of a sound with its appearance [2]? The questions become more complex because
the perceptions are governed by the methods used to render the images. The renderings most commonly
encountered are the linear time-line and the two-dimensional sonograph, as shown below. This initial
hypothesis required a means for transforming coordinate data from a form of ordered triples into an ordered pair
such that these could be displayed and rotated on a virtual plane. While this was accomplished, it was a remarkable
experience to observe that when the acoustic waveforms were processed and displayed, they exhibited an
inherently three-dimensional quality all their own. It was not necessary to force the perspective of three dimensions.
It appears that the quality was already there.
The following five sections demonstrate techniques for processing complex audio waveforms: 1)
separation of the waveform into individual lines of constituent frequencies, including the fundamental [sometimes
known as the glottal] frequency; 2) generation of waveform profiles that may be viewed and rotated as a three
dimensional surface using coordinate transformation; 3) examination of waveforms across long and short intervals
of the time domain; 4) a description and illustration comparing the frequency profile across a variable time domain,
and how this profile might yield patterns that are unique to individual phonetic structures and recognizable; 5) finally, a
means for reducing the background noise floor and separating individual words and consonants.
[Figure: linear waveform (left); two-dimensional spectrogram (right)]
The more traditional examples of rendering: a linear waveform (left) and a two-dimensional spectrogram, sometimes
known as a sonogram, sonograph, voice print, or spectrograph (right).

I. Waveform Filtering
Fast Fourier Transforms (FFT), Discrete Fourier Transforms (DFT), and Mel Frequency Cepstral Coefficients (MFCC)
have traditionally been the chosen means for separating complex waveforms into their constituent frequencies [3]
[4] [5] [6] [7]. One of the first objectives in this effort was to find an alternative that would accomplish essentially
the same thing, perhaps by smoothing away the higher frequencies from the lower frequencies and revealing detail
that is otherwise occluded. The desired result was observed by taking the average of f(x) and the two points f(x-n)
and f(x+n) on either side. On the first iteration of this function the waveform immediately displayed the higher
frequencies and lower energies of the consonants |t| and |th|, as shown in figure 1a below. These energies are
normally all but indiscernible when viewed as a composite two dimensional waveform. As the number of iterations
of this function increased, the waveforms smoothed out to reveal the lower frequencies and higher energies until the
fundamental frequency itself came into relief (figure 1b). With each following iteration, the value of n is increased
outward on either side of f(x) in what might be referred to as an expanding average. The results, however, are not
immediately applied to the line f(x). They are kept in a buffer so as not to affect the following iterations of f(x). In
this way the algorithm might be described as semi-recursive. Only after all values in the line f(x) have been calculated
are they moved to become g(x). This is not necessarily implicit in the equation below, but does have a significant effect
on the resultant data. Initially, a significant amount of noise was observed with each line iteration. A variety of
techniques and functions were explored to filter and remove this noise from the signal with varying degrees of
success. Ultimately it was found that by simply subtracting the previous line from the current line, the noise was
effectively removed. This was increasingly evident with regard to the lower frequency noise. As a side effect, the
overall waveform demonstrated a more accurately defined shape and envelope.
Waveform Filtering:
$$g_i(x) = \frac{f(x-n) + f(x) + f(x+n)}{3}, \qquad (h \circ g)_i(x) = g_i(x) - g_{i-1}(x)$$
The depth and distribution of the waveform on a two dimensional plane, from highest to lowest frequencies, is also
affected by the rate at which the two offsets -n and +n from f(x) increase. Treated as a non-linear
function, this rate makes it possible to adjust and scale the distribution of the overall envelope of the waveform across
the 100 wave-lines of frequencies resolved, as illustrated in section II below. In this case there are 100 lines, where
i = 1 to 100 and n = i + i/2. The results of frequency filtering are shown in figures 1a and 1b. With the first iteration, the
highest frequencies come into relief, showing the otherwise less apparent lower energy of |t| in “Testing” and “Two” and
|th| in the word “Three” (figure 1a). The fundamental frequency of 147 Hz is shown at 50 iterations (figure 1b). The
significance of the range in time domain is illustrated here. The higher frequencies relating to soft palate sounds are
easily seen with no magnification at all, while the glottal sounds and fundamental frequency are more apparent at
a 75 millisecond window.
figure 1a: n ≈ 1 iteration. 400:1 compression, 2000 milliseconds.
figure 1b: n ≈ 50 iterations. 15:1 compression, 75 milliseconds.
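
The expanding average and line differencing described in this section can be sketched in Python as follows. This is a minimal illustration and not the Wavescope code itself. The function name, the use of NumPy, the integer division in n = i + i/2, the edge padding at the ends of the signal, the cascading of each finished line into the next iteration, and the use of the original signal as line 0 for the first subtraction are all assumptions.

```python
import numpy as np

def expanding_average_lines(signal, num_lines=100):
    """Return (smoothed, differenced), each of shape (num_lines, len(signal))."""
    original = np.asarray(signal, dtype=float)
    current = original
    smoothed = np.empty((num_lines, original.size))
    for i in range(1, num_lines + 1):
        n = i + i // 2  # offset rule n = i + i/2 (integer division is an assumption)
        # Repeat the end samples so f(x - n) and f(x + n) exist at the edges (assumption).
        padded = np.pad(current, n, mode="edge")
        # Average of f(x - n), f(x), f(x + n). The result is a new array, so the
        # line being read is never modified part-way through a pass (the "buffer").
        buffer = (padded[:-2 * n] + padded[n:-n] + padded[2 * n:]) / 3.0
        smoothed[i - 1] = buffer
        current = buffer  # the finished line feeds the next pass (assumed reading of "semi-recursive")
    # Subtract the previous line from the current line; line 0 is the original signal (assumption).
    stacked = np.vstack([original, smoothed])
    differenced = stacked[1:] - stacked[:-1]
    return smoothed, differenced
```

Under these assumptions, the first differenced line carries mostly the highest frequencies, and later lines carry progressively lower ones.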

II. Waveform Profile
These complex waveforms must be examined across a broad range of magnification in the time domain, from
seconds to milliseconds. In the longer time period views the signal is better perceived by looking at it as a profile.
This profile can be obtained with an arithmetic mean function as shown below.
Waveform Profile:
$$\frac{1}{n} \sum_{x=m}^{n+m} |f(x)|, \qquad n = \frac{\text{samples/sec}}{100}, \qquad m = -\frac{n}{2}$$
The process of averaging all this data was found to be processor intensive and slowed rendering of the waveforms
considerably. To reduce the rendering time, the function value calculated at a given point is copied into a range of
following elements. The next calculation point is then advanced by that number of elements. This effectively reduces
the number of function iterations. This is not implicit in the ‘Waveform Profile’ equation above, but rather managed
in the program code. The effect is a significantly increased rendering speed with acceptable results of imaging the
waveform as shown in figure 2b below.
figure 2a figure 2b
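
The profile calculation and the speed-up of copying each computed value into the following elements might look like the sketch below. It is illustrative only. The function name, the `step` parameter and its default, the treatment of the window as n samples centred on x, and the averaging over only the available samples at the ends of the signal are assumptions that the text does not specify.

```python
import numpy as np

def waveform_profile(signal, sample_rate, step=None):
    """Mean of |f(x)| over a window of n samples, computed every `step` samples."""
    f = np.abs(np.asarray(signal, dtype=float))
    n = max(1, int(sample_rate) // 100)  # window length: (samples/sec) / 100
    m = -(n // 2)                        # window start offset: -n / 2
    if step is None:
        step = max(1, n // 4)            # hypothetical default, not from the paper
    profile = np.empty(f.size)
    for x in range(0, f.size, step):
        lo = max(0, x + m)
        hi = min(f.size, x + m + n)
        # Compute one mean, then copy it into the next `step` elements
        # instead of recomputing the window at every sample.
        profile[x:x + step] = f[lo:hi].mean()
    return profile
```

With `step=1` the function evaluates the mean at every sample. Larger steps trade resolution for speed, which is the compromise the text describes.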
One of the most interesting observations was made when all 100 data lines were stacked one atop another and
displayed simultaneously. The resemblance to a two-dimensional sonograph, like the one shown on page 1, was
immediately evident, as in figure 2c. Note: The amplitude of each individual line is scaled to a min/max of 0-100 for
the purposes of calculations and display. This is what makes it possible to bring the higher frequency (lower energy)
components into relief. The soft palate sounds of |t| and |th| are not normally visible in the composite
waveform, but clearly stand out with a topographical quality when processed in this way.
figure 2c figure 2d
In extending the results of the Waveform Profile technique to view the entire array of lines, the image began to
display an interesting degree of detail, despite the fact that the resolution of the data [in the profile view]
had actually been smoothed away. It was at this point that the [naturally occurring] three dimensional quality of
these waveforms first resolved. Only on magnification to the millisecond level did the same quality become
dramatically apparent in the raw data.

III. Coordinate Transformations
Initially the thought was that recognizing human speech patterns might be possible by bringing their phonetic
patterns into relief as a three dimensional surface. These structures, impossible to see in linear plots and barely
discernible in two-dimensional spectrographs, might become apparent if viewed in this way. The first goal
was to find or develop a means of coordinate transformation of the ordered triples (x, y, z) into ordered pairs (x’, y’)
such that they could be rotated on a virtual plane. On this surface x would remain the time domain, while z would
extend along the range of the individual frequencies derived, and y would continue to represent the amplitude of
values of the signal f(x). These could then be displayed on a computer screen. A modification of the two-dimensional
trigonometric identity for the addition of angles, and subsequently the addition of a second angle of rotation
provided the desired result for creating a rotational plane. To visualize the concept, hold a cylindrical object, perhaps
a drinking glass. Imagine the rim of this glass existing only on a two-dimensional plane. Viewing the glass in this way,
its rim should first appear as a straight line. Now tilt this object forward about its x-axis and observe the rim
transforming into an ellipse. Finally, as the glass continues to tilt about x, the rim becomes a circle. Notice also that
as the glass is tilted, the rim also translates down along the y-axis. The final modification to this identity provided for
the effect of an ordered triple to represent a point on orbit about the origin of a plane laid tangent onto the surface
of a sphere.
Figure 3a
Coordinate transformation:
$$x' = x\cos\theta - z\sin\theta, \qquad y' = \sin\phi\,(x\sin\theta + z\cos\theta) + y\cos\phi$$
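
The coordinate transformation can be written directly as a small Python function. The function name and the use of radians are assumptions. The arithmetic follows the equation term by term, with x as time, y as amplitude, and z as the frequency line.

```python
import math

def project(x, y, z, theta, phi):
    """Map the ordered triple (x, y, z) to the ordered pair (x', y').

    theta rotates the surface about the vertical axis, phi tilts it
    toward the viewer. Both angles are in radians (assumption).
    """
    x_prime = x * math.cos(theta) - z * math.sin(theta)
    y_prime = (math.sin(phi) * (x * math.sin(theta) + z * math.cos(theta))
               + y * math.cos(phi))
    return x_prime, y_prime
```

This matches the drinking glass illustration. With phi = 0 every point on the rim maps to the same y', so the rim appears as a straight line. With phi = π/2 the depth z appears directly as y', so the rim appears as a circle.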
It may be beneficial to consider an analogy. The antiquated NTSC television transmission system might be described
as one of the most complex analog encoding systems evolved before the advent of digital transmission. In this
system, specialized oscilloscopes are used for observing the various aspects of the composited signal: the frame
rate at fractions of a second, the line rate in microseconds, and finally the color subcarrier, observed and measured
in nanoseconds. It is of course impossible to visualize all aspects of this signal at the same time. They must be
observed in incremental steps as one would first use the unaided eye, then a magnifying glass, and finally a
microscope. The analogy is relevant here because, like the old TV broadcasts, recognizable components of a spoken
word also appear to exist over a wide range in the time domain. Hence the need of being able to zoom in and out
from whole seconds to milliseconds becomes even more important.

Of particular import also is that the data is not only being shown over a broadly varying range of time, but is also
being represented both as a profile, as in figure 3a, and as the raw data, as in figure 3b. The overall structure
of the words is more easily seen in the long time domain if shown as a smoothed profile; the raw data itself, however,
requires no smoothing when zoomed in to short time intervals.
Figure 3b
In figure 3a the high frequency energies of |t| and |th| are easily seen at the top of the plot, while the more subtle
phonetic structures and harmonics of consonants and vowels such as |a| and |ah| are examined at magnifications
of twenty five milliseconds shown in figure 3b. It is necessary to view the patterns of the spoken word both near and
far to comprehend the relevant features that make a word unique. It may be interesting to note that two seconds
of recorded speech shown in figure 3a requires a compression of 400:1. Were the sample expanded to reveal its full
detail at 1:1 compression, the data would require a computer screen seventy five feet wide. The contrast is important
here to further demonstrate the diversity and scope to which the unique patterns of a word extend. While the higher
frequency soft palate sounds are easily seen in real time, the subtle differences in vowels require closer examination.
Having successfully rendered waveforms from a virtual 3D plane onto the 2D display, the need for accurately tracking
a cursor across the two planes for measurements and selections becomes evident. In effect, it is necessary to reverse
the coordinate transformation between the two planes. The ordered pair (x, y) represents the mouse pointer
coordinates on the computer display and is transformed to the virtual plane as (x’, y’), as shown in the ‘Surface
cursor lines’ equation below. This allows the mouse pointer to more accurately relate position and track across the
virtual 3D plane. Note: n in y’ compensates for a 10:1 ratio between 1000 data points across 100 data lines. In this
case the value of n = 10.
Surface cursor lines:
$$x' = \left(\frac{y}{\sin\phi}\right)\sin\theta + x\cos\theta, \qquad y' = \left(\frac{y}{\sin\phi / n}\right)\cos\theta - x\sin\theta$$
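
As a check on the cursor equations, the sketch below maps a point lying on the virtual plane (amplitude y = 0) to the screen with the coordinate transformation of this section, then maps it back with the surface cursor equations. Both function names are hypothetical. The scale factor n defaults to 1 here so that the round trip is exact. The value n = 10 quoted in the text applies to the author's 1000-point by 100-line layout.

```python
import math

def plane_to_screen(x, z, theta, phi):
    """Forward transformation for a point on the plane (amplitude y = 0)."""
    sx = x * math.cos(theta) - z * math.sin(theta)
    sy = math.sin(phi) * (x * math.sin(theta) + z * math.cos(theta))
    return sx, sy

def screen_to_plane(sx, sy, theta, phi, n=1.0):
    """Surface cursor equations: screen (sx, sy) back to plane (x, z)."""
    if abs(math.sin(phi)) < 1e-9:
        # At phi = 0 the plane is seen edge-on and depth cannot be recovered.
        raise ValueError("phi too close to zero to invert the projection")
    px = (sy / math.sin(phi)) * math.sin(theta) + sx * math.cos(theta)
    pz = (sy / (math.sin(phi) / n)) * math.cos(theta) - sx * math.sin(theta)
    return px, pz
```

The division by sin φ shows why the cursor can only be tracked while the surface is tilted away from the edge-on view.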
While it was discovered that the data had a naturally occurring three dimensional quality all its own, there was still
a distinct advantage to being able to examine the images from a continuously variable perspective. Nuances of form
and shape that would otherwise be unnoticed became more evident.

IV. Spectrum sampling
The symmetry of harmonics in the waveform may be brought into greater contrast by removing the negative going
values in a way analogous to Nyquist filtering. One of the first observations in doing this was that the positive and
negative sides of the waveforms are not symmetrical. It appears that both positive and negative going aspects of the
waveform could be relevant for discriminating patterns for recognition. Likewise, the details and subtleties of the
spectrum of these waveforms across the 100 data lines are also brought into greater relief when observed in varying
three-dimensional perspectives (figures 4a and 4b below).
Figure 4a Figure 4b
As can be seen in these illustrations, the fundamental frequency is clearly evident, but more importantly a spectral
profile at any given instant across the time domain is now also shown. Techniques such as ‘Dynamic
Time Warping’ [5] [8] for fitting and matching waveform patterns are not necessary, as it does not matter where these
patterns occur in the time domain, only that they do occur relative to the fundamental (figures 4c and 4d below).
figure 4c figure 4d
The ability to observe these systems in three dimensions affords a unique visual perspective for the two components
of amplitude and frequency. In addition to this, another perspective is made possible by observing a sampling of
both amplitude and frequency for a chosen time period. As a starting point, the fundamental frequency is a key
reference. The frequency/amplitude profiles are then summed to form a two dimensional pattern. The visualizations
obtained may suggest the potential for recognizable patterns. The overall shape of these patterns may in fact be
unique and recognizable while at the same time independent of amplitude, frequency and time. An example of this
concept is shown below as a period is taken from a point on the fundamental zero crossing at -/4 to /4 (figures 4e
and 4f).
figure 4e figure 4f
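
One possible reading of the sampling step is sketched below: locate zero crossings on the fundamental line, take the span between two of them, keep the positive-going values, and sum each frequency line over that span. The function names are hypothetical. Using one full period between consecutive rising zero crossings, and dividing by the peak so that the shape does not depend on overall amplitude, are assumptions, as the text does not fix either choice.

```python
import numpy as np

def rising_zero_crossings(line):
    """Indices where the line goes from negative to non-negative."""
    s = np.asarray(line, dtype=float)
    return np.flatnonzero((s[:-1] < 0) & (s[1:] >= 0)) + 1

def spectral_profile(lines, fundamental_index):
    """Sum each frequency line over one period of the fundamental."""
    lines = np.atleast_2d(np.asarray(lines, dtype=float))
    crossings = rising_zero_crossings(lines[fundamental_index])
    if crossings.size < 2:
        raise ValueError("need at least two rising zero crossings")
    start, stop = crossings[0], crossings[1]   # one period (assumption)
    window = np.clip(lines[:, start:stop], 0.0, None)  # drop negative-going values
    profile = window.sum(axis=1)               # one value per frequency line
    peak = profile.max()
    return profile / peak if peak > 0 else profile     # normalisation is an assumption
```

Because the window is anchored to the fundamental rather than to absolute time, the same profile results wherever the period occurs in the recording.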

V. Noise Floor Suppression
The general purpose of the Wavescope platform was to provide as limitless an environment as possible for exploring
techniques in manipulating waveform data. As an example, noise floor suppression continues to be an important
subject for improving audio quality of both speech and music. It also relates to the quandary of separating what is
not wanted from what is wanted. Applied to computer speech recognition, removing the noise floor from a signal
also provides separation of elements allowing for pattern recognition. Separating ‘connected speech’ has long been
one of the greatest challenges in computer speech recognition. This concept also extends to the goal of separation
and isolation of individual consonants and vowels.
Noise Floor Suppression:
$$h(x) = g \circ f(x) \left(1 - \frac{g \circ f(x)_{\max} - g \circ f(x)}{g \circ f(x)_{\max}}\right)$$
The equation above provides signal attenuation inverse to the amplitude of g∘f(x). As the signal level increases, the
amount of attenuation decreases. This eliminates lower level noise while affecting the desired signal
to an increasingly lesser degree as the amplitude increases. The important distinction in this case is that this
attenuation can now be applied to each of the 100 wave-data lines individually. The ability to discretely filter each
line provides for a more exact and discriminating result. The example above was inspired by the more commonly
known techniques of μ-law and A-law compression and expansion algorithms used in telecommunications to limit
bandwidth.
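
Applied line by line, the suppression equation might be sketched as follows. The function name is hypothetical. Taking the attenuation from the magnitude of each sample, so that negative-going values keep their sign, and leaving all-zero lines untouched are assumptions.

```python
import numpy as np

def suppress_noise_floor(lines):
    """Attenuate each frequency line in inverse proportion to its amplitude."""
    g = np.atleast_2d(np.asarray(lines, dtype=float))
    magnitude = np.abs(g)
    g_max = magnitude.max(axis=1, keepdims=True)   # per-line maximum
    safe_max = np.where(g_max > 0, g_max, 1.0)     # avoid dividing by zero on silent lines
    attenuation = (safe_max - magnitude) / safe_max  # 1 at zero level, 0 at the line's peak
    return g * (1.0 - attenuation)
```

Algebraically the factor 1 - (max - g)/max reduces to g/max, so each sample is scaled by its own level relative to the line's peak. Samples near the peak pass almost unchanged, while samples near the floor are pushed toward zero.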
figure 5a figure 5b
Shown in figures 5a and 5b, the filter is applied to the frequency lines individually as opposed to applying the function
to the unprocessed composite waveform as a whole. As a result, the effectiveness of the filter appears to increase
as its application is more selective. Separation of consonants and vowels begins to come into relief, and the |t|,
|th| and |s| sounds are shown more clearly separated from the lower frequencies.
Conclusion
Computer speech recognition has been attained and mostly perfected. What has not been perfected is a general
accessibility and understanding of the subject. It remains one of both esoteric obscurity and significantly advanced
mathematics to those wishing to explore, or expand the science with new concepts. Perhaps a next logical effort
might be to explore pattern matching techniques using the spectrum profiles found using these or similar
techniques. As described earlier, this could prove an effective means of removing the challenges related to Dynamic
Time Warping, as matching these profiles to sample patterns would not be subject to alignment in the time domain.
The zero-crossings of the fundamental frequency might be used as a reference for selecting and extracting the
spectral profile of a specific period in time. Finally, the term ‘Wavescope’ was given to this program as a descriptive
akin to a telescope or microscope. What might be seen is usually entirely unknown until the thing is built and one
looks through it. That was the purpose of the program: to see what has never been seen, perhaps in a way that it
has never been seen before. It is intended, and hoped, that the techniques and equations presented here will
prove sufficient for anyone wishing to reproduce the results shown.

References
[1] Y. Chow, M. Dunham, O. Kimball, M. Krasner, Kubala, J. G. Makhoul, S. Price, S. Roucos and R. M.
Schwarz, "BYBLOS: The BBN Continuous Speech Recognition System," vol. 12, pp. 89-92, 1987.
[2] E. P. Lewenstein and D. Musello, "His Master’s (Digital) Voice," Time, vol. 125, no. 13, pp. 83-84, 1
April 1985.
[3] M. Gales and S. Young, "The Application of Hidden Markov Models in Speech Recognition,"
Foundations and Trends in Signal Processing, vol. 1, no. 3, pp. 195-304, 2007.
[4] S. Levinson, "Continuous speech recognition by means of acoustic/phonetic classification obtained
from a hidden Markov model," in IEEE International Conference on ICASSP, Acoustics, Speech, and
Signal Processing, 1987.
[5] L. Muda, M. Begam and I. Elamvazuthi, "Voice Recognition Algorithms using Mel Frequency Cepstral
Coefficient (MFCC) and Dynamic Time Warping (DTW) Techniques," Journal of Computing, vol.
2, no. 3, March 2010.
[6] D. B. Paul, "Speech Recognition Using Hidden Markov Models," The Lincoln Laboratory Journal, vol.
3, no. 1, 1990.
[7] W. Ward, "Hidden Markov Models In Speech Recognition," Carnegie Mellon University, Pittsburgh.
[8] E. J. Keogh and M. J. Pazzani, "Derivative Dynamic Time Warping," in Proceedings of the
2001 SIAM International Conference on Data Mining, 2001.
