Page 1 of 8
Imaging the Human Voice
As a Three Dimensional Surface
Ebe Helm
ABSTRACT
Nature often hides its most interesting qualities and patterns until curiosity or imagination conceives new
ways to discover and visualize them. The Nautilus shell and the Barnsley fern are just two examples of this.
The possibility that human speech, viewed in new ways and in the expanded perspective of a three-dimensional
surface, might also exhibit such patterns was the basis for this investigation. As a more practical
application, the goal was to determine whether a new approach might yield observable patterns that would
expand the science of computer speech recognition, and do so in a way that makes the topic more
approachable to a general audience.
Introduction
This paper is a follow-on to an earlier work that focused primarily on examining acoustic waveforms in the
more traditional linear time domain and attempted to expand on some of the challenges encountered in computer
speech recognition [1]. A function for separating the waveforms without the application of Discrete Fourier
Transforms or Mel Frequency Cepstral Coefficients was demonstrated, along with the considerations of Dynamic
Time Warping. This presentation continues from that point by providing additional perspective beyond two
dimensions. The application was founded on two questions. What does a word actually look like? Is it
possible to associate the meaning of a sound with its appearance [2]? The questions become more complex because
perception is governed by the methods used to render the images. The renderings most commonly
encountered are the linear timeline and the two-dimensional sonograph, both shown below. The initial
hypothesis required a means of transforming coordinate data from ordered triples into ordered pairs
so that they could be displayed and rotated on a virtual plane. While this was accomplished, it was remarkable
to observe that when the acoustic waveforms were processed and displayed, they exhibited an
inherently three-dimensional quality all their own. It was not necessary to force the perspective of three dimensions.
It appears that the quality was already there.
The following five sections demonstrate techniques for processing complex audio waveforms: 1) separation
of the waveform into individual lines of constituent frequencies, including the fundamental (sometimes
known as the glottal) frequency; 2) generation of waveform profiles that may be viewed and rotated as a
three-dimensional surface using coordinate transformation; 3) examination of waveforms across long and short
intervals of the time domain; 4) a description and illustration comparing the frequency profile across a variable
time domain, and how this profile might yield unique and recognizable patterns for individual phonetic structures;
and 5) a means of reducing the background noise floor and separating individual words and consonants.
Figure: linear waveform (left) and two-dimensional spectrogram (right).
The more traditional examples of waveform rendering: a linear waveform (left) and a two-dimensional spectrogram
(right), the latter also known as a sonogram, sonograph, voice print, or spectrograph.
I. Waveform Filtering
Fast Fourier Transforms (FFT), Discrete Fourier Transforms (DFT), and Mel Frequency Cepstral Coefficients (MFCC)
have traditionally been the chosen means of separating complex waveforms into their constituent frequencies [3]
[4] [5] [6] [7]. One of the first objectives of this effort was to find an alternative that would accomplish essentially
the same thing, perhaps by smoothing the higher frequencies away from the lower frequencies and revealing detail
that is otherwise occluded. The desired result was obtained by taking the average of f(x) and the two points f(x-n)
and f(x+n) on either side. On the first iteration of this function the waveform immediately displayed the higher
frequencies and lower energies of the consonants |t| and |th|, as shown in figure 1a below. These energies are
normally all but indiscernible when viewed as a composite two-dimensional waveform. As the number of iterations
of this function increased, the waveforms smoothed out to reveal the lower frequencies and higher energies until the
fundamental frequency itself came into relief (figure 1b). With each following iteration, the value of n is increased
outward on either side of f(x) in what might be referred to as an expanding average. The results, however, are not
immediately applied to the line f(x). They are kept in a buffer so as not to affect the remaining calculations on
the same line. In this way the algorithm might be described as semi-recursive. Only after all values in the line f(x)
have been calculated are they moved to become g(x). This is not necessarily implicit in the equation below, but it
does have a significant effect on the resultant data. Initially, a significant amount of noise was observed with each
line iteration. A variety of techniques and functions were explored to filter and remove this noise from the signal,
with varying degrees of success. Ultimately it was found that simply subtracting the previous line from the current
line effectively removed the noise. This was increasingly evident for the lower frequency noise. As a side effect, the
overall waveform demonstrated a more accurately defined shape and envelope.
Waveform Filtering → $g(x)^{\circ i} = \left( \dfrac{f(x-n) + f(x) + f(x+n)}{3} \right) \qquad h \circ g(x)^{\circ i} = g(x)^{\circ i} - g(x)^{\circ (i-1)}$
The depth and distribution of the waveform on a two-dimensional plane, from highest to lowest frequencies, is also
affected by the rate at which the two offsets -n and +n from f(x) increase. If this rate is treated as a non-linear
function, it is possible to adjust and scale the distribution of the overall envelope of the waveform across the 100
wave-lines of frequencies resolved, as illustrated in section II below. In this case, i runs from 1 to 100 lines
and n = i + i/2. The results of frequency filtering are shown in figures 1a and 1b. With the first iteration, the highest
frequencies come into relief, showing the otherwise less apparent lower energy of |t| in “Testing” and “Two” and
|th| in the word “Three” (figure 1a). The fundamental frequency of 147 Hz is shown at 50 iterations (figure 1b). The
significance of the range in the time domain is illustrated here. The higher frequencies relating to soft palate sounds
are easily seen with no magnification at all, while the glottal sounds and fundamental frequency are more apparent
in a 75 millisecond window.
figure 1a: n ≈ 1 iteration. 400:1 compression, 2000 milliseconds.
figure 1b: n ≈ 50 iterations. 15:1 compression, 75 milliseconds.
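The expanding average and the line subtraction described in this section can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the Wavescope code itself: the name `filter_lines`, the clamping of indices at the ends of the signal, the rounding down of n = i + i/2, the choice to feed each iteration with the previous smoothed line, and the use of the raw signal as line 0 are all assumptions made for illustration.

```python
def filter_lines(signal, num_lines=100):
    """Return (g, h): smoothed lines g[i] and difference lines h[i]."""
    size = len(signal)
    previous = [float(v) for v in signal]   # line 0: the raw signal (assumption)
    g, h = [], []
    for i in range(1, num_lines + 1):
        n = i + i // 2                      # n = i + i/2, rounded down (assumption)
        # Write into a separate buffer so that values already computed on
        # this pass never feed back into the same pass.
        current = [0.0] * size
        for x in range(size):
            left = previous[max(x - n, 0)]          # clamp at the edges (assumption)
            right = previous[min(x + n, size - 1)]
            current[x] = (left + previous[x] + right) / 3.0
        g.append(current)
        # Subtract the previous line from the current line.
        h.append([c - p for c, p in zip(current, previous)])
        previous = current                  # only now does the buffer become g(x)
    return g, h
```

For a recording held in a list `samples`, `g, h = filter_lines(samples)` would give 100 smoothed lines in `g` and 100 difference lines in `h`, one pair per iteration.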
II. Waveform Profile
These complex waveforms must be examined across a broad range of magnification in the time domain, from
seconds to milliseconds. In the longer time period views the signal is better perceived as a profile.
This profile can be obtained with an arithmetic mean function, as shown below.
Waveform Profile → $\dfrac{1}{n} \sum_{x=m}^{n+m} |f(x)| \qquad n = \dfrac{\text{samples/sec}}{100} \qquad m = -\dfrac{n}{2}$
Averaging all of this data was found to be processor intensive and slowed rendering of the waveforms
considerably. To reduce the rendering time, the function value calculated at a given point is copied into a range of
following elements. The next calculation point is then advanced by that number of elements. This effectively reduces
the number of function iterations. It is not implicit in the ‘Waveform Profile’ equation above, but is instead managed
in the program code. The effect is a significantly increased rendering speed with an acceptable image of the
waveform, as shown in figure 2b below.
figure 2a figure 2b
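The profile calculation and the copy-forward shortcut described above might look like the following. This is a sketch under assumptions: the name `waveform_profile`, the `step` parameter (the number of elements each computed value is copied into), and the handling of the window at the ends of the signal are illustrative choices, and the window is taken as n samples starting at offset m = -n/2 from the current point.

```python
def waveform_profile(samples, sample_rate, step=1):
    """Mean absolute amplitude around each point, held for `step` points."""
    n = max(1, sample_rate // 100)      # n = (samples/sec) / 100
    m = -(n // 2)                       # m = -n/2, so the window is centered
    total = len(samples)
    profile = [0.0] * total
    x = 0
    while x < total:
        lo = max(0, x + m)              # clip the window at the signal edges
        hi = min(total, x + m + n)
        value = sum(abs(s) for s in samples[lo:hi]) / (hi - lo)
        # Copy the value into the following elements instead of recomputing.
        for k in range(x, min(total, x + step)):
            profile[k] = value
        x += step                       # advance by the number of elements filled
    return profile
```

With `step=1` every point is averaged individually; a larger `step` trades resolution for speed, since the averaging runs only once per `step` points.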
One of the most interesting observations was made when all 100 data lines were stacked one atop another and
displayed simultaneously. The resemblance to a two-dimensional sonograph, like the one shown on page 1, was
immediately evident, as in figure 2c. Note: the amplitude of each individual line is scaled to a min/max of 0-100 for
the purposes of calculation and display. This is what makes it possible to bring the higher frequency (lower energy)
components into relief. The soft palate sounds of |t| and |th| are not normally visible in the composite
waveform, but clearly stand out with a topographical quality when processed in this way.
figure 2c figure 2d
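The per-line scaling mentioned in the note above can be illustrated with a short sketch. The function name `scale_line` and the treatment of a completely flat line are assumptions; the point is only that each line is stretched to the same 0-100 range regardless of its original energy.

```python
def scale_line(line, low=0.0, high=100.0):
    """Rescale one data line so its minimum maps to low and its maximum to high."""
    smallest, largest = min(line), max(line)
    if largest == smallest:
        return [low] * len(line)        # flat line: nothing to stretch (assumption)
    span = largest - smallest
    return [low + (v - smallest) * (high - low) / span for v in line]
```

Because every line is scaled independently, a low-energy line such as one carrying |t| occupies the same display range as a high-energy line near the fundamental.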
When the Waveform Profile technique was extended to view the entire array of lines, the image began to
display an interesting degree of detail, even though the resolution of the data [in the profile view]
had actually been smoothed away. It was at this point that the [naturally occurring] three-dimensional quality of
these waveforms first resolved. It was only on magnification to the millisecond level that the same quality became
dramatically apparent in the raw data.
III. Coordinate Transformations
Initially the thought was that recognizing human speech patterns might be possible by bringing their phonetic
patterns into relief as a three-dimensional surface. These structures, impossible to see in linear plots and barely
discernible in two-dimensional spectrographs, might become apparent if viewed in this way. The first goal
was to find or develop a means of coordinate transformation of the ordered triples (x, y, z) into ordered pairs (x’, y’)
such that they could be rotated on a virtual plane. On this surface x would remain the time domain, z would
extend along the range of the individual frequencies derived, and y would continue to represent the amplitude
of the signal f(x). These could then be displayed on a computer screen. A modification of the two-dimensional
trigonometric identity for the addition of angles, followed by the addition of a second angle of rotation,
provided the desired result for creating a rotational plane. To visualize the concept, hold a cylindrical object, perhaps
a drinking glass. Imagine the rim of this glass existing only on a two-dimensional plane. Viewed edge-on in this way,
the rim first appears as a straight line. Now tilt the glass forward about its x-axis and observe the rim
transforming into an ellipse. Finally, as the glass continues to tilt about x, the rim becomes a circle. Notice also that
as the glass is tilted, the rim translates down along the y-axis. The final modification to this identity allowed
an ordered triple to represent a point in orbit about the origin of a plane laid tangent to the surface
of a sphere.
Figure 3a
Coordinate transformation → $x' = x\cos\theta - z\sin\theta$ and $y' = \sin\phi\,(x\sin\theta + z\cos\theta) + y\cos\phi$
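The coordinate transformation above translates directly into code. The sketch below assumes angles in radians and uses the illustrative names `project` and `rim`; it also reproduces the drinking-glass picture, with the rim modeled as a circle lying in the x-z plane.

```python
import math

def project(x, y, z, theta, phi):
    """Map an ordered triple (x, y, z) to a screen pair (x', y')."""
    x_screen = x * math.cos(theta) - z * math.sin(theta)
    y_screen = (math.sin(phi) * (x * math.sin(theta) + z * math.cos(theta))
                + y * math.cos(phi))
    return x_screen, y_screen

# The rim of the glass: a unit circle in the x-z plane at height y = 0.
rim = [(math.cos(a), 0.0, math.sin(a))
       for a in (2 * math.pi * k / 12 for k in range(12))]

edge_on = [project(x, y, z, 0.0, 0.0) for x, y, z in rim]          # phi = 0
tilted = [project(x, y, z, 0.0, math.pi / 4) for x, y, z in rim]   # partly tilted
top_down = [project(x, y, z, 0.0, math.pi / 2) for x, y, z in rim] # fully tilted
```

In `edge_on` every point has y’ = 0, so the rim collapses to a straight line; in `tilted` it is an ellipse; in `top_down` the screen pair equals (x, z) and the rim is a full circle.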
It may be beneficial to consider an analogy. The antiquated NTSC television transmission system might be described
as one of the most complex analog encoding systems developed before the advent of digital transmission. In this
system, specialized oscilloscopes are used for observing the various aspects of the composite signal: the frame
rate in fractions of a second, the line rate in microseconds, and finally the color subcarrier, which is observed and
measured in nanoseconds. It is of course impossible to visualize all aspects of this signal at the same time. They must
be observed in incremental steps, as one would first use the unaided eye, then a magnifying glass, and finally a
microscope. The analogy is relevant here because, like the old TV broadcasts, recognizable components of a spoken
word also appear to exist over a wide range in the time domain. Hence the ability to zoom in and out
from whole seconds to milliseconds becomes even more important.
Of particular importance is that the data is shown not only over a broadly varying range of time, but also
both as a profile, as in figure 3a, and as the raw data, as in figure 3b. The overall structure
of the words is more easily seen in the long time domain if shown as a smoothed profile, whereas the raw data itself
requires no smoothing when zoomed in to short time intervals.
Figure 3b
In figure 3a the high frequency energies of |t| and |th| are easily seen at the top of the plot, while the more subtle
phonetic structures and harmonics of consonants and vowels such as |a| and |ah| are examined at a magnification
of twenty-five milliseconds, shown in figure 3b. It is necessary to view the patterns of the spoken word both near and
far to comprehend the relevant features that make a word unique. It may be interesting to note that the two seconds
of recorded speech shown in figure 3a require a compression of 400:1. Were the sample expanded to reveal its full
detail at 1:1 compression, the data would require a computer screen seventy-five feet wide. The contrast is important
here because it further demonstrates the diversity and scope over which the unique patterns of a word extend. While
the higher frequency soft palate sounds are easily seen in real time, the subtle differences in vowels require closer
examination. Having successfully rendered waveforms from a virtual 3D plane onto the 2D display, the need to
accurately track a cursor across the two planes for measurements and selections becomes evident. In effect, it is
necessary to reverse the coordinate transformation between the two planes. The ordered pair (x, y) represents the
mouse pointer coordinates on the computer display and is transformed to the virtual plane as (x’, y’), as shown in the
‘Surface cursor lines’ equation below. This allows the mouse pointer to relate position and track more accurately
across the virtual 3D plane. Note: n in y’ compensates for the 10:1 ratio between the 1000 data points and the 100
data lines. In this case the value of n = 10.
Surface cursor lines → $x' = \left( \dfrac{y}{\sin\phi} \right) \sin\theta + x\cos\theta$ and $y' = \left( \dfrac{y}{\sin\phi / n} \right) \cos\theta - x\sin\theta$
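The cursor mapping above can be checked against the forward transformation with a short round trip. The sketch below uses the illustrative names `surface_to_screen` and `screen_to_surface`, assumes angles in radians, and restricts the forward mapping to points lying on the surface plane (amplitude y = 0). With n = 1 the two functions are exact inverses whenever sin φ is not zero; the paper's n = 10 additionally rescales the first term of the second coordinate for the 10:1 point-to-line ratio.

```python
import math

def surface_to_screen(x, z, theta, phi):
    """Forward transformation for a point on the surface plane (y = 0)."""
    sx = x * math.cos(theta) - z * math.sin(theta)
    sy = math.sin(phi) * (x * math.sin(theta) + z * math.cos(theta))
    return sx, sy

def screen_to_surface(sx, sy, theta, phi, n=1.0):
    """Map a pointer position (sx, sy) back onto the virtual plane."""
    if math.sin(phi) == 0.0:
        # The plane is seen edge-on, so the position along it cannot be recovered.
        raise ValueError("phi must not be a multiple of pi")
    x_plane = (sy / math.sin(phi)) * math.sin(theta) + sx * math.cos(theta)
    y_plane = (sy / (math.sin(phi) / n)) * math.cos(theta) - sx * math.sin(theta)
    return x_plane, y_plane

# Round trip: a surface point is drawn on screen, then located again by the cursor.
screen = surface_to_screen(120.0, 35.0, 0.4, 0.9)
recovered = screen_to_surface(screen[0], screen[1], 0.4, 0.9)
```

Here `recovered` comes back as (120.0, 35.0) to within floating-point rounding.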
While it was discovered that the data had a naturally occurring three dimensional quality all its own, there was still
a distinct advantage to being able to examine the images from a continuously variable perspective. Nuances of form
and shape that would otherwise be unnoticed became more evident.
IV. Spectrum sampling
The symmetry of harmonics in the waveform may be brought into greater contrast by removing the negative-going
values (in effect, half-wave rectification) in a way analogous to Nyquist filtering. One of the first observations in
doing this was that the positive and negative sides of the waveforms are not symmetrical. It appears that both the
positive and negative-going aspects of the waveform could be relevant for discriminating patterns for recognition.
Likewise, the details and subtleties of the spectrum of these waveforms across the 100 data lines are brought into
greater relief when observed in varying three-dimensional perspectives, as in figures 4a and 4b below.
Figure 4a Figure 4b
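Removing the negative-going values, and checking how unequal the two sides of a line are, can be sketched as follows. The function names `positive_part` and `polarity_energy` are illustrative, and summed squares are only one possible way to compare the two sides; the paper does not specify a measure.

```python
def positive_part(line):
    """Keep positive-going values and replace everything else with zero."""
    return [v if v > 0 else 0.0 for v in line]

def polarity_energy(line):
    """Return (positive, negative) energy as sums of squared values."""
    positive = sum(v * v for v in line if v > 0)
    negative = sum(v * v for v in line if v < 0)
    return positive, negative
```

A line whose two energies differ markedly is asymmetric in the sense described above, which is why the negative side may be worth keeping as a separate feature.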
As can be seen in these illustrations, the fundamental frequency is clearly evident, but more importantly a spectral
profile at any given instant across the time domain is now also shown. Techniques such as ‘Dynamic
Time Warping’ [5] [8] for fitting and matching waveform patterns are not necessary here, as it does not matter where
these patterns occur in the time domain, only that they occur relative to the fundamental. See figures 4c and 4d below.
figure 4c figure 4d
The ability to observe these systems in three dimensions affords a unique visual perspective on the two components
of amplitude and frequency. In addition, another perspective is made possible by observing a sampling of
both amplitude and frequency for a chosen time period. As a starting point, the fundamental frequency is a key
reference. The frequency/amplitude profiles are then summed to form a two-dimensional pattern. The visualizations
obtained suggest the potential for recognizable patterns. The overall shape of these patterns may in fact be
unique and recognizable while at the same time independent of amplitude, frequency and time. An example of this
concept is shown below, where a period is taken from a point on the fundamental zero crossing at -/4 to /4. Figures 4e
and 4f.
figure 4e figure 4f
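One way to read the summing step above is sketched here: locate the zero crossings of the fundamental line, choose a window relative to one of them, and sum the amplitude of every data line over that window. The names `rising_zero_crossings` and `spectral_profile`, the use of absolute values, and the normalization by the largest sum (so the shape does not depend on overall loudness) are assumptions, and the window limits are left to the caller.

```python
def rising_zero_crossings(line):
    """Indices where the line passes from negative to non-negative."""
    return [k for k in range(1, len(line)) if line[k - 1] < 0 <= line[k]]

def spectral_profile(lines, start, stop):
    """Sum |amplitude| of each data line over [start, stop), scaled to a peak of 1."""
    sums = [sum(abs(v) for v in line[start:stop]) for line in lines]
    peak = max(sums)
    if peak == 0:
        return [0.0] * len(sums)        # silent window: no shape to report
    return [s / peak for s in sums]
```

Given the list of data lines and the fundamental line, one might call `rising_zero_crossings` on the fundamental and then `spectral_profile` for a window around each crossing; the resulting lists could be compared between utterances without aligning them in time.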
V. Noise Floor Suppression
The general purpose of the Wavescope platform was to provide as limitless an environment as possible for exploring
techniques for manipulating waveform data. As an example, noise floor suppression continues to be an important
subject for improving the audio quality of both speech and music. It also relates to the problem of separating what is
not wanted from what is wanted. Applied to computer speech recognition, removing the noise floor from a signal
also provides separation of elements, allowing for pattern recognition. Separating ‘connected speech’ has long been
one of the greatest challenges in computer speech recognition. This concept also extends to the goal of separating
and isolating individual consonants and vowels.
Noise Floor Suppression → $h(x) = g \circ f(x) \left( 1 - \left( \dfrac{g \circ f(x)_{max} - g \circ f(x)}{g \circ f(x)_{max}} \right) \right)$
The equation above provides signal attenuation that is inverse to the amplitude of g ∘ f(x). As the signal level
increases, the amount of attenuation decreases. This eliminates lower level noise while affecting the desired signal
to a progressively lesser degree as the amplitude increases. The important distinction in this case is that this
attenuation can now be applied to each of the 100 wave-data lines individually. The ability to filter each
line discretely provides a more exact and discriminating result. The example above was inspired by the more
commonly known techniques of μ-law and A-law compression and expansion used in telecommunications to limit
bandwidth.
figure 5a figure 5b
As shown in figures 5a and 5b, the filter is applied to the frequency lines individually, as opposed to applying the
function to the unprocessed composite waveform as a whole. As a result, the effectiveness of the filter appears to
increase as its application becomes more selective. Separation of consonants and vowels begins to come into relief,
and the |t|, |th| and |s| sounds are shown more clearly separated from the lower frequencies.
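The suppression equation and its line-by-line application can be sketched as below. The names `suppress_noise_floor` and `suppress_all_lines` are illustrative, and using the absolute value of each sample to set the gain (so that negative-going samples are treated like positive ones) is an assumption; the equation itself is written for the amplitude of g ∘ f(x).

```python
def suppress_noise_floor(line):
    """Attenuate each sample by a gain that rises with its own amplitude."""
    peak = max((abs(v) for v in line), default=0.0)
    if peak == 0:
        return list(line)               # silent or empty line: nothing to attenuate
    # gain = 1 - (max - |v|) / max, so the peak passes unchanged and
    # low-level samples are pushed toward zero.
    return [v * (1.0 - (peak - abs(v)) / peak) for v in line]

def suppress_all_lines(lines):
    """Apply the suppression to every data line separately."""
    return [suppress_noise_floor(line) for line in lines]
```

Because each line is scaled against its own maximum, a quiet high-frequency line is cleaned relative to its own peak instead of being swamped by the louder low-frequency lines.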
Conclusion
Computer speech recognition has been attained and mostly perfected. What has not been perfected is general
accessibility and understanding of the subject. To those wishing to explore or expand the science with new concepts,
it remains a field of both esoteric obscurity and significantly advanced mathematics. Perhaps a next logical effort
might be to explore pattern matching techniques using the spectrum profiles found with these or similar
techniques. As described earlier, this could prove an effective means of removing the challenges related to Dynamic
Time Warping, as matching these profiles to sample patterns would not be subject to alignment in the time domain.
The zero-crossings of the fundamental frequency might be used as a reference for selecting and extracting the
spectral profile of a specific period in time. Finally, the term ‘Wavescope’ was given to this program as a descriptive
name akin to a telescope or microscope. What might be seen is usually entirely unknown until the thing is built and
one looks through it. That was the purpose of the program: to see what has never been seen, perhaps in a way that it
has never been seen before. It is intended, and hoped, that the techniques and equations presented here will
prove sufficient for anyone wishing to reproduce the results shown.
References
[1] Y. Chow, M. Dunham, O. Kimball, M. Krasner, Kubala, J. G. Makhoul, S. Price, S. Roucos and R. M.
Schwarz, "BYBLOS: The BBN Continuous Speech Recognition System," vol. 12, pp. 89-92, 1987.
[2] E. P. Lewenstein and D. Musello, "His Master’s (Digital) Voice," Time, vol. 125, no. 13, pp. 83-84, 1
April 1985.
[3] M. Gales and S. Young, "The Application of Hidden Markov Models in Speech Recognition,"
Foundations and Trends in Signal Processing, vol. 1, no. 3, pp. 195-304, 2007.
[4] S. Levinson, "Continuous speech recognition by means of acoustic/phonetic classification obtained
from a hidden Markov model," in IEEE International Conference on Acoustics, Speech, and
Signal Processing (ICASSP), 1987.
[5] L. Muda, M. Begam and I. Elamvazuthi, "Voice Recognition Algorithms using Mel Frequency Cepstral
Coefficient (MFCC) and Dynamic Time Warping (DTW) Techniques," Journal of Computing, vol.
2, no. 3, March 2010.
[6] D. B. Paul, "Speech Recognition Using Hidden Markov Models," The Lincoln Laboratory Journal, vol.
3, no. 1, 1990.
[7] W. Ward, "Hidden Markov Models In Speech Recognition," Carnegie Mellon University, Pittsburgh.
[8] E. J. Keogh and M. J. Pazzani, "Derivative Dynamic Time Warping," in Proceedings of the
2001 SIAM International Conference on Data Mining, 2001.

Comparative structure of adrenal gland in vertebrates
 
Hemoglobin metabolism_pathophysiology.pptx
Hemoglobin metabolism_pathophysiology.pptxHemoglobin metabolism_pathophysiology.pptx
Hemoglobin metabolism_pathophysiology.pptx
 
Structural Classification Of Protein (SCOP)
Structural Classification Of Protein  (SCOP)Structural Classification Of Protein  (SCOP)
Structural Classification Of Protein (SCOP)
 
platelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptxplatelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptx
 
NuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final versionNuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final version
 
Lab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerinLab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerin
 
Large scale production of streptomycin.pptx
Large scale production of streptomycin.pptxLarge scale production of streptomycin.pptx
Large scale production of streptomycin.pptx
 
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
 
GBSN - Biochemistry (Unit 5) Chemistry of Lipids
GBSN - Biochemistry (Unit 5) Chemistry of LipidsGBSN - Biochemistry (Unit 5) Chemistry of Lipids
GBSN - Biochemistry (Unit 5) Chemistry of Lipids
 
filosofia boliviana introducción jsjdjd.pptx
filosofia boliviana introducción jsjdjd.pptxfilosofia boliviana introducción jsjdjd.pptx
filosofia boliviana introducción jsjdjd.pptx
 
ESR_factors_affect-clinic significance-Pathysiology.pptx
ESR_factors_affect-clinic significance-Pathysiology.pptxESR_factors_affect-clinic significance-Pathysiology.pptx
ESR_factors_affect-clinic significance-Pathysiology.pptx
 
insect taxonomy importance systematics and classification
insect taxonomy importance systematics and classificationinsect taxonomy importance systematics and classification
insect taxonomy importance systematics and classification
 
Richard's entangled aventures in wonderland
Richard's entangled aventures in wonderlandRichard's entangled aventures in wonderland
Richard's entangled aventures in wonderland
 
Viksit bharat till 2047 India@2047.pptx
Viksit bharat till 2047  India@2047.pptxViksit bharat till 2047  India@2047.pptx
Viksit bharat till 2047 India@2047.pptx
 
RNA INTERFERENCE: UNRAVELING GENETIC SILENCING
RNA INTERFERENCE: UNRAVELING GENETIC SILENCINGRNA INTERFERENCE: UNRAVELING GENETIC SILENCING
RNA INTERFERENCE: UNRAVELING GENETIC SILENCING
 
Citrus Greening Disease and its Management
Citrus Greening Disease and its ManagementCitrus Greening Disease and its Management
Citrus Greening Disease and its Management
 

Imaging the human voice

Imaging the Human Voice As a Three Dimensional Surface
Ebe Helm

ABSTRACT
Nature often hides its most interesting qualities and patterns until curiosity or imagination conceives new ways to discover and visualize them. The Nautilus shell and the Barnsley fern are just two examples of this. The possibility that human speech, if viewed in new ways and in the expanded perspective of a three dimensional surface, might also exhibit such patterns was the basis for this investigation. As a more practical application, the goal was to determine whether a new approach might yield observable patterns that would expand the science of computer speech recognition, and do so in a way that makes the topic more approachable to a general audience.

Introduction
This paper is a follow-on effort to an earlier work that focused primarily on examining acoustic waveforms in the more traditional linear time domain, and attempted to expand on some of the challenges encountered in computer speech recognition [1]. That work demonstrated a function for separating the waveforms without the application of Discrete Fourier Transforms or the use of Mel Frequency Cepstral Coefficients, as well as the considerations of Dynamic Time Warping. This presentation continues from that point by providing additional perspective beyond two dimensions.

The goal of the application was founded in two questions. What does a word actually look like? Is it possible to associate the meaning of a sound with its appearance [2]? The questions become more complex because the perceptions are governed by the methods used to render the images. The traditional and more commonly encountered renderings are the linear time-line and the two-dimensional Sonograph, as shown below.

The initial hypothesis required a means for transforming coordinate data from ordered triples into ordered pairs such that these could be displayed and rotated on a virtual plane. While this was accomplished, it was a remarkable experience to observe that when the acoustic waveforms were processed and displayed, they exhibited an inherently three-dimensional quality all their own. It was not necessary to force the perspective of three dimensions. It appears that the quality was already there.

The following five sections demonstrate techniques for processing complex audio waveforms:
1) Separation of the waveform into individual lines of constituent frequencies, including the Fundamental [sometimes known as the Glottal] frequency.
2) Generation of waveform profiles that may be viewed and rotated as a three dimensional surface using coordinate transformation.
3) Examination of waveforms across long and short intervals of the time domain.
4) A description and illustration comparing the frequency profile across a variable time domain, and how this profile might yield unique and recognizable patterns for individual phonetic structures.
5) Finally, a means for reducing the background noise floor and separating individual words and consonants.

Figure: The more traditional examples of waveform rendering: a linear waveform (left) and a two-dimensional Spectrogram, sometimes known as a Sonogram, Sonograph, voice print, or spectrograph (right).
I. Waveform Filtering
Fast Fourier Transforms (FFT), Discrete Fourier Transforms (DFT), and Mel Frequency Cepstral Coefficients (MFCC) have traditionally been the chosen means for separating complex waveforms into their constituent frequencies [3] [4] [5] [6] [7]. One of the first objectives in this effort was to find an alternative that would accomplish essentially the same thing, perhaps by smoothing away the higher frequencies from the lower frequencies and revealing detail that is otherwise occluded.

The desired result was observed by taking the average of f(x) and the two points f(x-n) and f(x+n) on either side. On the first iteration of this function the waveform immediately displayed the higher frequencies and lower energies of the consonants |t| and |th|, as shown in figure 1a. These energies are normally all but indiscernible when viewed as a composite two dimensional waveform. As the number of iterations of this function increased, the waveforms smoothed out to reveal the lower frequencies and higher energies until the fundamental frequency itself came into relief (figure 1b).

With each following iteration, the value of n is increased outward on either side of f(x) in what might be referred to as an expanding average. The results, however, are not immediately applied to the line f(x). They are kept in a buffer so as not to affect the remaining calculations on f(x). In this way the algorithm might be described as semi-recursive. Only after all values in the line f(x) have been calculated are they moved to become g(x). This is not necessarily implicit in the equation below, but it does have a significant effect on the resultant data.

Initially, a significant amount of noise was observed with each line iteration. A variety of techniques and functions were explored to filter and remove this noise from the signal, with varying degrees of success. Ultimately it was found that by simply subtracting the previous line from the current line, the noise was effectively removed. This was increasingly evident with regard to the lower frequency noise. As a side effect, the overall waveform demonstrated a more accurately defined shape and envelope.

Waveform Filtering:
$$g(x)^{\circ i} = \frac{f(x-n) + f(x) + f(x+n)}{3}$$
$$h \circ g(x)^{\circ i} = g(x)^{\circ i} - g(x)^{\circ (i-1)}$$

The depth and distribution of the waveform on a two dimensional plane, from highest to lowest frequencies, is also affected by the rate at which the two offsets -n and +n from f(x) increase. Treated as a non-linear function, this rate makes it possible to affect and scale the distribution of the overall envelope of the waveform across the 100 wave-lines of frequencies resolved, as illustrated in section II. In this case i runs from 1 to 100 lines and $n = i + i/2$.

The results of frequency filtering are shown in figures 1a and 1b. With the first iteration, the highest frequencies come into relief, showing the otherwise less apparent lower energy of |t| in “Testing” and “Two” and |th| in the word “Three” (figure 1a). The fundamental frequency of 147 Hz is shown at 50 iterations (figure 1b). The significance of the range in the time domain is illustrated here. The higher frequencies relating to soft palate sounds are easily seen with no magnification at all, while the glottal sounds and fundamental frequency are more apparent in a 75 millisecond window.

Figure 1a: n ≈ 1 iterations. 400:1 compression, 2000 milliseconds.
Figure 1b: n ≈ 50 iterations. 15:1 compression, 75 milliseconds.
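The filtering procedure of section I can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Wavescope code: the function names `smooth_line` and `filter_lines` are hypothetical, indices are clamped at the ends of the signal, n = i + i/2 is evaluated with integer division, each pass is assumed to smooth the output of the previous pass, and the raw input stands in for the line before the first iteration.

```python
def smooth_line(samples, n):
    """Average each sample with its two neighbours n steps away.

    Results are written to a fresh list (the 'buffer'), so values computed
    earlier in the pass never feed into later ones. Clamping the indices at
    the edges is an assumption; the paper does not describe edge handling.
    """
    last = len(samples) - 1
    out = []
    for x in range(len(samples)):
        left = samples[max(0, x - n)]
        right = samples[min(last, x + n)]
        out.append((left + samples[x] + right) / 3.0)
    return out


def filter_lines(samples, line_count=100):
    """Return line_count difference lines, each g_i minus g_(i-1).

    Assumptions: each pass smooths the previous pass's output, and the raw
    input plays the role of g_0.
    """
    lines = []
    previous = list(samples)
    for i in range(1, line_count + 1):
        n = i + i // 2  # n = i + i/2, using integer division
        current = smooth_line(previous, n)
        # Subtracting the previous line leaves the band removed by this pass.
        lines.append([c - p for c, p in zip(current, previous)])
        previous = current
    return lines
```

Calling `filter_lines(samples, 100)` on a list of amplitude values returns 100 lines, with the earliest lines carrying the highest-frequency detail. Building each smoothed line in a separate list is what keeps a pass from feeding on its own partial results.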
II. Waveform Profile
These complex waveforms must be examined across a broad range of magnification in the time domain, from seconds to milliseconds. In the longer time period views the signal is better perceived by looking at it as a profile. This profile can be obtained with an arithmetic mean function as shown below.

Waveform Profile:
$$\frac{1}{n} \sum_{x=m}^{n+m} |f(x)|, \qquad n = \frac{\text{samples/sec}}{100}, \qquad m = -\frac{n}{2}$$

The process of averaging all this data was found to be processor intensive and slowed rendering of the waveforms considerably. To reduce the rendering time, the function value calculated at a given point is copied into a range of following elements. The next calculation point is then advanced by that number of elements. This effectively reduces the number of function iterations. This is not implicit in the ‘Waveform Profile’ equation above, but rather managed in the program code. The effect is a significantly increased rendering speed with acceptable results in imaging the waveform, as shown in figure 2b.

(Figures 2a and 2b)

One of the most interesting observations was made when all 100 data lines were stacked one atop another and displayed simultaneously. The resemblance to a two-dimensional sonograph, like the one shown on page 1, was immediately evident, as in figure 2c.

Note: The amplitude of each individual line is scaled to a min/max of 0-100 for the purposes of calculations and display. This is what makes it possible to bring the higher frequency (lower energy) components into relief. The soft palate sounds of |t| and |th| are not normally visible in the composite waveform, but clearly stand out with a topographical quality when processed in this way.

(Figures 2c and 2d)

In extending the results of the Waveform Profile technique to the entire array of lines, the image began to display an interesting degree of detail. This is in contrast to the fact that the resolution of the data [in the profile view] had actually been smoothed away. It was at this point that the [naturally occurring] three dimensional quality of these waveforms first resolved. It was only on magnification to the millisecond level that the same quality became dramatically apparent in the raw data.
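The profile calculation and the rendering shortcut described in section II might look like the following sketch. The function name `waveform_profile` and the `hold` parameter (how many following elements receive a copy of each computed value) are hypothetical, and the window uses n terms starting n/2 samples before the current point, truncated at the ends of the signal; the paper leaves these details to the program code.

```python
def waveform_profile(samples, sample_rate, hold=1):
    """Mean absolute amplitude in a window of n = sample_rate / 100 samples.

    hold is a hypothetical parameter: each computed value is copied into
    the next `hold` elements and the calculation point then skips ahead by
    the same amount, trading resolution for speed.
    """
    n = max(1, sample_rate // 100)
    half = n // 2
    total = len(samples)
    profile = [0.0] * total
    x = 0
    while x < total:
        lo = max(0, x - half)           # window starts n/2 before x
        hi = min(total, x - half + n)   # and covers n samples where possible
        value = sum(abs(v) for v in samples[lo:hi]) / n
        for k in range(x, min(total, x + hold)):
            profile[k] = value
        x += hold
    return profile
```

With `hold=1` every point is computed individually. A larger `hold` reduces the number of window sums by roughly that factor, which is the speed-up the text describes, at the cost of a stepped profile.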
III. Coordinate Transformations
Initially the thought was that recognizing human speech patterns might be possible by bringing their phonetic patterns into relief as a three dimensional surface. These structures, impossible to see in linear plots and barely discernible in two-dimensional spectrographs, might possibly become apparent if viewed in this way.

The first goal was to find or develop a means of coordinate transformation of the ordered triples (x, y, z) into ordered pairs (x’, y’) such that they could be rotated on a virtual plane. On this surface x would remain the time domain, z would extend along the range of the individual frequencies derived, and y would continue to represent the amplitude of the signal f(x). These could then be displayed on a computer screen. A modification of the two-dimensional trigonometric identity for the addition of angles, followed by the addition of a second angle of rotation, provided the desired result for creating a rotational plane.

To visualize the concept, hold a cylindrical object, perhaps a drinking glass. Imagine the rim of this glass existing only on a two-dimensional plane. Viewed edge-on in this way, the rim should first appear as a straight line. Now tilt the glass forward about its x-axis and observe the rim transforming into an ellipse. Finally, as the glass continues to tilt about x, the rim becomes a circle. Notice also that as the glass is tilted, the rim translates down along the y-axis. The final modification to this identity allowed an ordered triple to represent a point in orbit about the origin of a plane laid tangent to the surface of a sphere.

(Figure 3a)

Coordinate transformation:
$$x' = x\cos\theta - z\sin\theta$$
$$y' = \sin\phi\,(x\sin\theta + z\cos\theta) + y\cos\phi$$

It may be beneficial to consider an analogy. The antiquated NTSC television transmission system might be described as one of the most complex analog encoding systems evolved before the advent of digital transmission. In this system, specialized oscilloscopes are used for observing the various aspects of the composite signal: the frame rate at fractions of a second, the line rate in microseconds, and finally the color subcarrier, which is observed and measured in nanoseconds. It is of course impossible to visualize all aspects of this signal at the same time. They must be observed in incremental steps, as one would first use the unaided eye, then a magnifying glass, and finally a microscope. The analogy is relevant here because, like the old TV broadcasts, recognizable components of a spoken word also appear to exist over a wide range in the time domain. Hence the ability to zoom in and out from whole seconds to milliseconds becomes even more important.
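Returning to the coordinate transformation given earlier in this section, the two equations translate directly into code. The sketch below is illustrative only; the function name `project` is hypothetical and both angles are assumed to be in radians.

```python
import math


def project(x, y, z, theta, phi):
    """Map a surface point (x, y, z) to screen coordinates (x', y').

    theta rotates the plane about the vertical axis and phi tilts it
    toward the viewer, following the two equations in section III.
    """
    xp = x * math.cos(theta) - z * math.sin(theta)
    yp = (math.sin(phi) * (x * math.sin(theta) + z * math.cos(theta))
          + y * math.cos(phi))
    return xp, yp


# The drinking-glass picture: points on a unit rim lying in the x-z plane.
rim = [(math.cos(t), 0.0, math.sin(t))
       for t in (k * math.pi / 8 for k in range(16))]
edge_on = [project(x, y, z, 0.0, 0.0) for x, y, z in rim]           # a line
face_on = [project(x, y, z, 0.0, math.pi / 2) for x, y, z in rim]   # a circle
```

With phi = 0 every rim point lands on y' = 0, so the rim is seen as a straight line. At phi = π/2 the projected points lie on a unit circle, and intermediate tilts give the ellipse described in the text.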
Of particular importance is that the data is not only being shown over a broadly varying range of time, but is also being represented both as a profile, as in figure 3a, and as the raw data, as in figure 3b. The overall structure of the words is more easily seen in the long time domain if shown as a smoothed profile. The raw data itself, however, requires no smoothing when zoomed in to short time intervals.

(Figure 3b)

In figure 3a the high frequency energies of |t| and |th| are easily seen at the top of the plot, while the more subtle phonetic structures and harmonics of consonants and vowels such as |a| and |ah| are examined at magnifications of twenty five milliseconds, as shown in figure 3b. It is necessary to view the patterns of the spoken word both near and far to comprehend the relevant features that make a word unique.

It may be interesting to note that the two seconds of recorded speech shown in figure 3a require a compression of 400:1. Were the sample expanded to reveal its full detail at 1:1 compression, the data would require a computer screen seventy five feet wide. The contrast is important here to further demonstrate the diversity and scope over which the unique patterns of a word extend. While the higher frequency soft palate sounds are easily seen in real time, the subtle differences in vowels require closer examination.

Having successfully rendered waveforms from a virtual 3D plane onto the 2D display, the need for accurately tracking a cursor across the two planes for measurements and selections becomes evident. In effect, it is necessary to reverse the coordinate transformation between the two planes. The ordered pair (x, y) represents the mouse pointer coordinates on the computer display and is transformed to the virtual plane as (x’, y’), as shown in the ‘Surface cursor lines’ equation below. This allows the mouse pointer to relate position more accurately and to track across the virtual 3D plane.

Note: n in y’ compensates for a 10:1 ratio between 1000 data points across 100 data lines. In this case the value of n = 10.

Surface cursor lines:
$$x' = \left(\frac{y}{\sin\phi}\right)\sin\theta + x\cos\theta$$
$$y' = \left(\frac{y}{\sin\phi / n}\right)\cos\theta - x\sin\theta$$

While it was discovered that the data had a naturally occurring three dimensional quality all its own, there was still a distinct advantage to being able to examine the images from a continuously variable perspective. Nuances of form and shape that would otherwise go unnoticed became more evident.
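The reversal of the coordinate transformation used for cursor tracking can be checked with a round trip. The sketch below is a minimal illustration with hypothetical function names. It assumes the cursor lies on the base plane (amplitude y = 0) and that sin φ is non-zero, and it omits the line-spacing factor n, so the second returned value is in the same units as x rather than in data-line numbers.

```python
import math


def surface_to_screen(x, z, theta, phi):
    """Forward transform of section III for a point on the base plane (y = 0)."""
    sx = x * math.cos(theta) - z * math.sin(theta)
    sy = math.sin(phi) * (x * math.sin(theta) + z * math.cos(theta))
    return sx, sy


def screen_to_surface(sx, sy, theta, phi):
    """Recover surface coordinates (x, z) from screen coordinates.

    Assumes the point lies on the base plane and sin(phi) != 0. The
    scaling factor n from the paper is deliberately left out here.
    """
    depth = sy / math.sin(phi)  # undo the tilt
    x = depth * math.sin(theta) + sx * math.cos(theta)
    z = depth * math.cos(theta) - sx * math.sin(theta)
    return x, z
```

Applying `screen_to_surface` to the output of `surface_to_screen` returns the original point for any view angles where the plane is not seen exactly edge-on. A caller that needs a data-line index would rescale the returned z by the point-to-line ratio.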
IV. Spectrum sampling
The symmetry of harmonics in the waveform may be brought into greater contrast by removing the negative going values (in effect, half-wave rectification) in a way analogous to Nyquist filtering. One of the first observations in doing this was that the positive and negative sides of the waveforms are not symmetrical. It appears that both the positive and negative going aspects of the waveform could be relevant for discriminating patterns for recognition. Likewise, the details and subtleties of the spectrum of these waveforms across the 100 data lines are brought into greater relief when observed in varying three-dimensional perspectives, as in figures 4a and 4b.

(Figures 4a and 4b)

As can be seen in these illustrations, the fundamental frequency is clearly evident, but more importantly a spectral profile at any given instant across the time domain is now also shown. Techniques such as ‘Dynamic Time Warping’ [5] [8] for fitting and matching waveform patterns are not necessary, as it does not matter where these patterns occur in the time domain, only that they occur relative to the fundamental. See figures 4c and 4d.

(Figures 4c and 4d)

The ability to observe these systems in three dimensions affords a unique visual perspective on the two components of amplitude and frequency. In addition, another perspective is made possible by observing a sampling of both amplitude and frequency for a chosen time period. As a starting point, the fundamental frequency is a key reference. The frequency/amplitude profiles are then summed to form a two dimensional pattern. The visualizations obtained may suggest the potential for recognizable patterns. The overall shape of these patterns may in fact be unique and recognizable while at the same time independent of amplitude, frequency and time. An example of this concept is shown in figures 4e and 4f, where a period is taken from a point on the fundamental zero crossing at -/4 to /4.

(Figures 4e and 4f)
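One possible reading of the spectrum sampling step is sketched below; it is an interpretation, not the author's procedure. The assumptions are that `lines` holds the filtered data lines as equal-length lists, that the sample window `[start, stop)` has already been chosen relative to a zero crossing of the fundamental, and that dividing by the peak is an acceptable way to make the pattern independent of overall amplitude. All function names are hypothetical.

```python
def half_wave(line):
    """Keep positive-going values and replace the rest with zero."""
    return [v if v > 0 else 0.0 for v in line]


def spectral_profile(lines, start, stop):
    """Sum each rectified line over samples [start, stop).

    Returns one value per data line, i.e. an amplitude-versus-frequency
    pattern for the chosen time period.
    """
    return [sum(half_wave(line[start:stop])) for line in lines]


def normalise(profile):
    """Scale a profile so its largest value is 1 (assumed normalisation)."""
    peak = max(profile)
    return [v / peak for v in profile] if peak > 0 else list(profile)
```

Because each profile is a short list with one entry per data line, two profiles taken at different times or loudness levels can be compared entry by entry after normalisation.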
V. Noise Floor Suppression
The general purpose of the Wavescope platform was to provide as limitless an environment as possible for exploring techniques in manipulating waveform data. As an example, noise floor suppression continues to be an important subject for improving the audio quality of both speech and music. It also relates to the quandary of separating what is not wanted from what is wanted. Applied to computer speech recognition, removing the noise floor from a signal also provides separation of elements, allowing for pattern recognition. Separating ‘connected speech’ has long been one of the greatest challenges in computer speech recognition. This concept also extends to the goal of separating and isolating individual consonants and vowels.

Noise Floor Suppression:
$$h(x) = g \circ f(x)\left(1 - \frac{g \circ f(x)_{max} - g \circ f(x)}{g \circ f(x)_{max}}\right)$$

The equation above provides signal attenuation inverse to the amplitude of g∘f(x). As the signal level increases, the amount of attenuation decreases. This eliminates lower level noise while affecting the desired signal to an increasingly lesser degree as the amplitude increases. The important distinction in this case is that the attenuation can now be applied to each of the 100 wave-data lines individually. The ability to filter each line discretely provides a more exact and discriminating result. The equation was inspired by the more commonly known μ-law and A-law compression and expansion algorithms used in telecommunications to limit bandwidth.

(Figures 5a and 5b)

As shown in figures 5a and 5b, the filter is applied to the frequency lines individually, as opposed to applying the function to the unprocessed composite waveform as a whole. As a result, the effectiveness of the filter appears to increase as its application becomes more selective.
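The noise floor suppression equation can be sketched per data line as follows. The function names are hypothetical, and using the absolute value to compute the gain is an assumption made so that the same code works on signed data as well as on lines already scaled to 0-100. For non-negative values the bracketed term reduces to g/g_max, so the output is g²/g_max.

```python
def suppress_noise_floor(line):
    """Attenuate one data line in inverse proportion to its amplitude.

    The gain is 1 - (peak - |v|) / peak, which equals |v| / peak: samples
    near the peak pass almost unchanged, samples near zero are suppressed.
    Taking absolute values is an assumption for signed input.
    """
    peak = max((abs(v) for v in line), default=0.0)
    if peak == 0:
        return list(line)  # silent or empty line: nothing to scale
    out = []
    for v in line:
        gain = 1.0 - (peak - abs(v)) / peak
        out.append(v * gain)
    return out


def suppress_all(lines):
    """Apply the filter to every data line independently."""
    return [suppress_noise_floor(line) for line in lines]
```

Because the peak is found separately for each line, a quiet high-frequency line is scaled against its own maximum rather than against the much louder fundamental, which is the selectivity the text attributes to per-line filtering.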
Separation of consonants and vowels begins to come into relief, and the |t|, |th| and |s| sounds are shown more clearly separated from the lower frequencies.

Conclusion
Computer speech recognition has been attained and mostly perfected. What has not been perfected is general accessibility and understanding of the subject. To those wishing to explore or expand the science with new concepts, it remains a subject of both esoteric obscurity and significantly advanced mathematics.

Perhaps a next logical effort might be to explore pattern matching techniques using the spectrum profiles found with these or similar techniques. As described earlier, this could prove an effective means of removing the challenges related to Dynamic Time Warping, as matching these profiles to sample patterns would not be subject to alignment in the time domain. The zero-crossings of the fundamental frequency might be used as a reference for selecting and extracting the spectral profile of a specific period in time.

Finally, the term ‘Wavescope’ was given to this program as a descriptive akin to a telescope or microscope. What might be seen is usually entirely unknown until the thing is built and one looks through it. That was the purpose of the program: to see what has never been seen, perhaps in a way that it has never been seen before. It is intended and hoped that the techniques and equations presented here will prove sufficient for anyone wishing to reproduce the results shown.
References
[1] Y. Chow, M. Dunham, O. Kimball, M. Krasner, Kubala, J. G. Makhoul, S. Price, S. Roucos and R. M. Schwarz, "BYBLOS: The BBN Continuous Speech Recognition System," vol. 12, pp. 89-92, 1987.
[2] E. P. Lewenstein and D. Musello, "His Master’s (Digital) Voice," Time, vol. 125, no. 13, pp. 83-84, 1 April 1985.
[3] M. Gales and S. Young, "The Application of Hidden Markov Models in Speech Recognition," Foundations and Trends in Signal Processing, vol. 1, no. 3, pp. 195-304, 2007.
[4] S. Levinson, "Continuous speech recognition by means of acoustic/phonetic classification obtained from a hidden Markov model," in IEEE International Conference on ICASSP, Acoustics, Speech, and Signal Processing, 1987.
[5] L. Muda, M. Begam and I. Elamvazuthi, "Voice Recognition Algorithms using Mel Frequency Cepstral Coefficient (MFCC) and Dynamic Time Warping (DTW) Techniques," Journal of Computing, vol. 2, no. 3, March 2010.
[6] D. B. Paul, "Speech Recognition Using Hidden Markov Models," The Lincoln Laboratory Journal, vol. 3, no. 1, 1990.
[7] W. Ward, "Hidden Markov Models In Speech Recognition," Carnegie Mellon University, Pittsburgh.
[8] E. J. Keogh and M. J. Pazzani, "Derivative Dynamic Time Warping," in Proceedings of the 2001 SIAM International Conference on Data Mining, 2001.