This Tutorial is Downloaded from: https://sites.google.com/site/enggprojectece
An Introduction to Various Features of Speech Signal
Compiled by: Sivaranjan Goswami, Pursuing M. Tech. (2013-15 batch)
Dept. of ECE, Gauhati University, Guwahati, India
Contact: sivgos@gmail.com
Speech is the most fundamental mode of communication among human beings as well as many
other creatures, and we use it constantly in day-to-day life. In the last few decades
a large body of research has been carried out on using speech to control various electronic
systems. Speech has a number of advantages over manual control through panels and switches: since
speech can be easily transmitted over a telephone channel, remote control of devices
becomes easier using speech.
The audible range of the human ear is 20 Hz to 20 kHz. However, the frequency band of human speech
used in telephony extends from 300 Hz to 3400 Hz. Thus, according to the Nyquist theorem, the sampling
rate should be at least twice the highest frequency, that is, greater than or equal to 6800 Hz. In
telecommunication, the sampling rate is taken as 8 kHz. Therefore, the analog-to-digital converters of
mobile phones sample the speech signal at 8 kHz. For multimedia applications, however, the sampling
rate is usually much higher: in the MP3 songs that we download, it is usually 44100 Hz. This is one
reason why sound recorded using a mobile phone is of much poorer quality than the MP3 songs that we download.
A. What is Speech Signal
Speech, like any sound, is an acoustic signal that travels through air or any other material
through compression and expansion of the particles. It is hence a pressure wave. A microphone is a
transducer that converts this pressure wave into a voltage signal.
A detailed description of the human speech generation system is beyond the scope of this discussion.
However, a brief description, which is unavoidable in the context of feature extraction, is presented. The
human speech production system is a complex mechanical system. The air exhaled by the lungs is
modulated by various hard and soft tissues, first by the vocal folds (glottis) and then by the parts of the
vocal tract such as the tongue, lips, jaw, and velum. In digital speech processing, this process is
represented as a discrete-time model as shown in figure 1. The lungs and the
vocal folds together form the block Excitation Generator. The vocal tract is modeled as a linear system,
which is usually a digital filter (an all-pole filter in linear predictive analysis). The vocal tract
parameters are the parameters of this digital filter.
Fig. 1. Block diagram of speech generation model

Based on the type of the excitation signal, a speech signal can be classified into two major types:
1. Voiced Speech: Voiced sounds are produced by forcing air through the glottis, the opening
between the vocal folds. The excitation is a quasi-stationary impulse train, that is, a signal
whose frequency remains constant for a short amount of time, sometimes referred to as the
stationarity period. Examples of voiced speech are the vowel sounds in cat, hear, too, etc.
2. Unvoiced Speech: Unvoiced sounds are generated by forming a constriction at some point
along the vocal tract and forcing air through the constriction to produce turbulence. The
excitation is a random signal, which can be modeled as white Gaussian noise. Examples are
unvoiced consonant sounds such as the "sh" in ship and the "k" in key.
It can be said that the voiced component of a word is responsible for its tone and for the overall shape
of its waveform, whereas the unvoiced sections carry the actual meaning. The two waveforms below
correspond to the words CUP and DUCK. Their voiced parts are similar, so the waveforms are also
similar.
As shown in figure 1 the speech production mechanism is modeled as a cascade combination of an
excitation generator and a digital filter. The excitation of the filter determines the type of speech and
the digital filter simulates the effect of various organs or tissues of the vocal tract on the excitation.
The parameters of the filter are known as vocal tract parameters. The excitation is either an impulse
train or a random noise based on whether the speech is voiced or unvoiced respectively. Thus figure 1
can be drawn as figure 2.

Figure 2: Block diagram of speech generation model for Linear Predictive Analysis
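The cascade of excitation generator and digital filter described above can be sketched in a few lines of Python. This is only an illustrative sketch: it assumes an all-pole (recursive) filter of the kind used in linear predictive analysis, and the function names, the pitch period of 80 samples, and the two filter coefficients are hypothetical values chosen for the example, not parameters of a real vocal tract.

```python
import random

def impulse_train(num_samples, pitch_period):
    """Voiced excitation: a unit impulse every `pitch_period` samples."""
    return [1.0 if n % pitch_period == 0 else 0.0 for n in range(num_samples)]

def white_noise(num_samples, seed=0):
    """Unvoiced excitation: zero-mean Gaussian noise."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(num_samples)]

def all_pole_filter(excitation, a, gain=1.0):
    """Vocal tract model: y[n] = gain * x[n] - sum_k a[k-1] * y[n-k], k = 1..p."""
    y = []
    for n, x in enumerate(excitation):
        acc = gain * x
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * y[n - k]
        y.append(acc)
    return y

# Hypothetical "vocal tract parameters": one stable resonance (pole radius < 1).
a = [-1.3, 0.8]

# One 20 ms frame at 8 kHz (160 samples) of each type of speech.
voiced = all_pole_filter(impulse_train(160, pitch_period=80), a)
unvoiced = all_pole_filter(white_noise(160), a)
```

Only the excitation differs between the two outputs; the filter, and hence the vocal tract parameters, is the same in both cases.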
B. Short-Time Analysis of Speech Signal
From the above discussion we have seen that the properties of a speech signal remain the same only
for a short duration of time. Therefore, any kind of speech processing first requires segmentation of
the speech signal into frames of short duration. The duration for which the properties of a speech
signal remain stationary varies from speaker to speaker. It usually ranges from 15 to 25 milliseconds.
It is common practice to take the frame duration as 20 milliseconds. If the speech signal is sampled at a rate
of 8 kHz, this means that there are 160 samples per frame.
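As a minimal sketch of this segmentation, assuming non-overlapping frames with no window function (the function and parameter names are illustrative, not from any library):

```python
def split_into_frames(samples, sample_rate_hz=8000, frame_ms=20):
    """Split a signal into consecutive non-overlapping frames.

    Trailing samples that do not fill a whole frame are discarded.
    """
    frame_len = sample_rate_hz * frame_ms // 1000  # 8000 Hz * 20 ms = 160 samples
    num_frames = len(samples) // frame_len
    return [samples[k * frame_len:(k + 1) * frame_len] for k in range(num_frames)]

# A dummy "signal" of 500 samples yields 3 full frames of 160 samples each.
frames = split_into_frames(list(range(500)))
```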
Sometimes, to overcome difficulties particular to a given problem, the speech frames are
overlapped or multiplied by a window function. Such cases are not covered in this tutorial; they
are discussed in the tutorial on “Short Term Spectral and Cepstral Analysis of Speech
Signal”.
C. Features of Speech Signal
So far we have had a brief introduction to the generation and types of speech signal. Now we come
to feature extraction.
1. Zero-Crossing Rate:
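Zero-crossing rate is a rough measure of the frequency of the signal over a short period. It is
obtained by counting the number of times the sign of the signal changes. Since a periodic signal
crosses zero twice per period, dividing the number of sign changes per second by two gives an
estimate of the frequency.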

Figure 3: Zero crossing
It can be seen that during one period, the signal crosses zero twice. Thus for any frame, the zero-crossing
rate (ZCR) is given by:
$$ \mathrm{ZCR} = \frac{\text{No. of sign changes in the frame}}{\text{Frame duration}} \quad (\mathrm{sec}^{-1}) $$
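The ZCR definition above can be sketched as follows. The function name is illustrative, and treating a sample value of exactly zero as positive is an assumption of this sketch; other conventions are possible.

```python
import math

def zero_crossing_rate(frame, sample_rate_hz):
    """Number of sign changes in the frame divided by the frame duration (per second)."""
    sign_changes = sum(
        1 for prev, cur in zip(frame, frame[1:]) if (prev >= 0) != (cur >= 0)
    )
    # Frame duration is len(frame) / sample_rate_hz seconds.
    return sign_changes * sample_rate_hz / len(frame)

# Example: a 500 Hz sine sampled at 8 kHz, one 20 ms frame (160 samples).
fs = 8000
frame = [math.sin(2 * math.pi * 500 * (n + 0.5) / fs) for n in range(160)]
zcr = zero_crossing_rate(frame, fs)
frequency_estimate_hz = zcr / 2  # two zero crossings per period
```

For this short frame the estimate comes out slightly below 500 Hz, because only the crossings that fall inside the frame are counted.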
2. Mean Square or Mean Magnitude value:
This is a mean value of the signal for a particular frame ignoring the sign. The mean square value of
the k-th frame is given by:
$$ P_{avg}(k) = \frac{1}{N} \sum_{n=Nk}^{Nk+N-1} x^{2}(n); \quad k = 0, 1, 2, \ldots, \frac{L}{N} - 1 $$
Similarly, the mean magnitude is given by:
$$ A_{avg}(k) = \frac{1}{N} \sum_{n=Nk}^{Nk+N-1} |x(n)|; \quad k = 0, 1, 2, \ldots, \frac{L}{N} - 1 $$
Here, N is the number of samples in a frame and L is the total number of samples in a given audio clip.
Both mean square and mean magnitude carry information about the short-time energy of the signal.
If the magnitude of the signal is normalized to the range [-1, 1], then the mean square value
and the mean magnitude value both lie in the same range [0, 1]. The choice between these two parameters
is usually made according to which one allows a suitable threshold value to be determined. In some
operations it is easier to select a threshold on the mean square value.
In textbooks, these equations are usually written using a sliding window. In this introductory
tutorial I avoid that notation, as the present notation is easier to implement in a computer program.
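A direct, minimal translation of the two frame-wise formulas into Python might look like this, with `x` the whole clip, `k` the frame index and `N` the frame length in samples (the function names are illustrative):

```python
def mean_square(x, k, N):
    """Mean square value of the k-th frame: (1/N) * sum of x(n)^2, n = N*k .. N*k+N-1."""
    frame = x[N * k:N * k + N]
    return sum(v * v for v in frame) / N

def mean_magnitude(x, k, N):
    """Mean magnitude of the k-th frame: (1/N) * sum of |x(n)|, n = N*k .. N*k+N-1."""
    frame = x[N * k:N * k + N]
    return sum(abs(v) for v in frame) / N

# Tiny example with L = 4 samples and N = 2, so k runs over 0 and 1.
x = [0.5, -0.5, 1.0, -1.0]
values = [(mean_square(x, k, 2), mean_magnitude(x, k, 2)) for k in range(len(x) // 2)]
```

For a signal normalized to [-1, 1], the mean square of a frame is never larger than its mean magnitude, which is worth keeping in mind when moving a threshold from one measure to the other.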
3. Voice Activity Detection

We do not speak continuously. During speech there are many pauses and breaks. To perform any
speech processing, it is necessary to distinguish between the presence and absence of speech in an audio
clip. The presence or absence of speech in a short-duration frame can be easily determined if there is no
background noise. If speech is not present, the mean magnitude value or mean square value is very
small. On the other hand, a high value of mean magnitude or mean square indicates the presence of
speech. If there is background noise, determining voice activity is a challenging task. Many
papers have been published on the detection of voice activity in the presence of background noise.
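For the noise-free case, the rule described above reduces to comparing the mean magnitude of each frame with a threshold. The following is a minimal sketch; the threshold of 0.01 is an arbitrary assumption for a signal normalized to [-1, 1], and the function name is illustrative.

```python
import math

def detect_voice_activity(samples, frame_len=160, threshold=0.01):
    """Return one boolean per frame: True if the mean magnitude exceeds the threshold."""
    flags = []
    for k in range(len(samples) // frame_len):
        frame = samples[k * frame_len:(k + 1) * frame_len]
        mean_mag = sum(abs(v) for v in frame) / frame_len
        flags.append(mean_mag > threshold)
    return flags

# Example: silence, then a 20 ms tone standing in for speech, then silence again.
silence = [0.0] * 160
tone = [0.5 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(160)]
activity = detect_voice_activity(silence + tone + silence)
```

With background noise a fixed threshold like this is not reliable, which is why noise-robust methods are needed in practice.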
4. Detection of Voiced and Unvoiced Speech
Voiced and unvoiced speech have already been introduced. It is relevant to mention here that most of the
features of a speech signal are extracted from voiced speech. Hence, identification of voiced and
unvoiced speech is another important task after voice activity detection.
It is to be noted that, for voiced speech, the mean square value is large, whereas the zero-crossing rate
is small. On the other hand, for unvoiced speech the zero-crossing rate is large and the average
magnitude is very small.
Detection of voiced and unvoiced speech is also a challenging task in the presence of background noise.
Voiced speech is usually somewhat easy to distinguish if the background noise is stationary; unvoiced
speech, however, is difficult. A number of papers have been published in this field as well.
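The rule of thumb above (large mean square with small ZCR for voiced speech, large ZCR with small mean square for unvoiced speech) can be sketched as a simple two-threshold classifier. Both threshold values and the function name are assumptions made for illustration; in practice they would have to be tuned, and the sketch assumes a noise-free frame that has already passed voice activity detection.

```python
import math

def classify_frame(frame, sample_rate_hz=8000, energy_threshold=0.01, zcr_threshold=3000.0):
    """Label a frame 'voiced', 'unvoiced' or 'uncertain' from mean square value and ZCR."""
    mean_sq = sum(v * v for v in frame) / len(frame)
    sign_changes = sum(
        1 for prev, cur in zip(frame, frame[1:]) if (prev >= 0) != (cur >= 0)
    )
    zcr = sign_changes * sample_rate_hz / len(frame)
    if mean_sq >= energy_threshold and zcr < zcr_threshold:
        return "voiced"
    if mean_sq < energy_threshold and zcr >= zcr_threshold:
        return "unvoiced"
    return "uncertain"

# Stand-in test signals: a strong 200 Hz tone (voiced-like) and a weak
# 3500 Hz tone (crossing zero often, like low-level unvoiced noise).
fs = 8000
voiced_like = [0.5 * math.sin(2 * math.pi * 200 * (n + 0.5) / fs) for n in range(160)]
unvoiced_like = [0.02 * math.sin(2 * math.pi * 3500 * (n + 0.5) / fs) for n in range(160)]
labels = (classify_frame(voiced_like), classify_frame(unvoiced_like))
```

The "uncertain" outcome covers frames where the two measures disagree, which is the case that becomes common once background noise is present.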
5. Pitch and Pitch Period Estimation
Pitch is the perceived fundamental frequency of a musical note or of voiced speech. It may not be the same as
the actual fundamental frequency of the speech signal. However, in much of the literature, the terms pitch
and fundamental frequency are used interchangeably. Pitch period is the fundamental period of voiced
speech. Pitch estimation is a challenging problem.
Pitch is one of the most important parameters that are required for high level speech processing like
speech recognition, speaker recognition etc. Everyone has a pitch range to which he or she is
constrained by simple physics of his or her larynx. For men, the possible pitch range is usually found
somewhere between the two bounds 50-250 Hz, while for women the range usually falls somewhere
in the interval 120-500 Hz. Everyone has a "habitual pitch level," which is a sort of "preferred" pitch
that will be used naturally on the average. Pitch is shifted up and down in speaking in response to
factors relating to stress, intonation, and emotion. Stress* refers to a change in fundamental frequency
and loudness to signify a change in emphasis of a syllable, word, or phrase. Intonation* is associated
with the pitch contour over time and performs several functions in a language, the most important
being to signal grammatical structure. The markings of sentence, clause, and other boundaries are
accomplished through intonation patterns.
Many papers on pitch estimation techniques have been published in various journals and
conferences worldwide. A classical approach using cepstral analysis is discussed in
the tutorial on “Short Term Spectral and Cepstral Analysis of Speech Signal”.
*The terms stress and intonation are explained below under feature no. 7.
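To make the idea of a pitch period concrete, here is a minimal sketch of a pitch estimator based on the autocorrelation of a voiced frame. Note that this is a different and simpler technique than the cepstral approach mentioned above, and it is not taken from this tutorial. The search range of 50 to 500 Hz follows the pitch ranges quoted earlier; the function name and the 40 ms frame length are assumptions of this sketch.

```python
import math

def estimate_pitch_autocorr(frame, sample_rate_hz=8000, f_min=50.0, f_max=500.0):
    """Return the pitch in Hz as sample_rate / lag of the largest autocorrelation peak.

    Returns None if no positive autocorrelation value is found (e.g. silence).
    """
    min_lag = int(sample_rate_hz / f_max)
    max_lag = min(int(sample_rate_hz / f_min), len(frame) - 1)
    best_lag, best_value = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Autocorrelation of the frame at this lag.
        r = sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
        if r > best_value:
            best_lag, best_value = lag, r
    return None if best_lag is None else sample_rate_hz / best_lag

# Example: a 200 Hz tone, 40 ms frame at 8 kHz (pitch period = 40 samples).
fs = 8000
frame = [0.5 * math.sin(2 * math.pi * 200 * (n + 0.5) / fs) for n in range(320)]
pitch_hz = estimate_pitch_autocorr(frame, fs)
```

Real voiced speech is far less regular than a pure tone, so a sketch like this can easily lock onto a multiple of the true pitch period.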

6. Phonemics and Phonetics of Speech Signal
Phonemes are the basic theoretical units of speech. Each phoneme can be considered to be a code that
consists of a unique set of articulatory gestures. In English there are about 42 phonemes. Due to many
different factors including, for example, accents, gender, and, most importantly, coarticulatory effects,
a given "phoneme" will have a variety of acoustic manifestations in the course of flowing speech.
Therefore, any acoustic utterance that is clearly "supposed to be" that ideal phoneme, would be
labeled as that phoneme. The phonemes of a language, therefore, comprise a minimal theoretical set
of units that are sufficient to convey all meaning in the language.
One common approach to speech recognition is to segment the speech and identify the phonemes from
the phones (the sounds actually produced in speaking).
The study of the abstract units (phonemes) and their relationships in a language is called phonemics,
while the study of the actual sounds of the language is called phonetics. More specifically, there are
three branches of phonetics each of which approaches the subject somewhat differently:
(a) Articulatory phonetics is concerned with the manner in which speech sounds are produced by
the articulators of the vocal system.
(b) Acoustic phonetics studies the sounds of speech through analysis of the acoustic waveform.
(c) Auditory phonetics studies the perceptual response to speech sounds as reflected in listener
trials.
In speech recognition systems or speech-to-text conversion systems, a corpus has to be built that
contains the whole set of phonemes of a particular language and the corresponding letters or meanings. In
languages like English, the same phoneme may correspond to a number of letters, as there is no
one-to-one correspondence between sounds and letters. In languages like Hindi or Assamese, it is
somewhat simpler. But all languages have their own challenges.
When a test utterance is input to the system, the system divides the speech into segments and identifies the
corresponding phoneme in the corpus using a suitable algorithm such as Dynamic Time Warping
(DTW). The preceding and succeeding phonemes help to resolve ambiguity if a phoneme
corresponds to more than one letter. These systems are really complex and beyond the scope of this
basic tutorial.
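As a rough illustration of the matching step, the basic DTW recurrence can be sketched as below. This is a textbook form of the algorithm applied to sequences of scalar feature values, with the absolute difference as the local cost; real systems compare sequences of feature vectors and add path constraints. The function name and the template values are hypothetical.

```python
def dtw_distance(seq_a, seq_b):
    """Dynamic Time Warping distance between two sequences of scalar features."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    # cost[i][j] = minimal accumulated cost of aligning seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(seq_a[i - 1] - seq_b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # stretch seq_b
                                     cost[i][j - 1],      # stretch seq_a
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]

# Hypothetical per-frame feature tracks for two stored templates and one test segment.
templates = {"template_1": [1.0, 2.0, 3.0], "template_2": [3.0, 1.0, 0.0]}
test_segment = [1.0, 2.0, 2.0, 3.0]  # same shape as template_1, spoken more slowly
best_match = min(templates, key=lambda name: dtw_distance(test_segment, templates[name]))
```

Because the alignment may stretch either sequence in time, the slower test segment still matches its template with zero cost.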
7. Prosodic Features: Stress and Intonation of Speech
The tonal and rhythmic aspects of speech are generally called prosodic features. These features have
significant contributions to the formal linguistic structure of a language. These features extend over
more than one phoneme; therefore such features are also known as suprasegmental.
Prosodic features are created by certain special manipulations of the speech production system during
the normal sequence of phoneme production. These manipulations are categorized as either source
factors or vocal-tract shaping factors. The source factors are based on subtle changes in the speech
breathing muscles and vocal folds, while the vocal-tract shaping factors operate via movements of the
upper articulators. The acoustic patterns of prosodic features are heard in systematic changes in
duration, intensity, fundamental frequency, and spectral patterns of the individual phonemes.
Stress and intonation are the most important prosodic features of a speech signal. Stress refers to a change
in fundamental frequency and loudness to signify a change in emphasis of a syllable, word, or phrase.
Intonation is associated with the pitch contour over time and performs several functions in a language,
the most important being to signal grammatical structure. The marking of sentence, clause, and other
boundaries is accomplished through intonation patterns.

Stress is used to distinguish similar phonetic sequences or to highlight a syllable or word against a
background of unstressed syllables. For example, consider the two phrases "That is insight" and "That
is in sight." In the first phrase there is stress on "in" but "sight" is unstressed, while the converse is
true in the second phrase.
Extraction of features like stress or intonation can be performed with a pattern-recognition-based
approach, for which various methods are available.
*Prosodic feature extraction and speech recognition need very in-depth study of the subject.
Here I have given only a hint of such features.
Suggested book: “Discrete-Time Processing of Speech Signals” by John R. Deller, John H. L.
Hansen and John G. Proakis.
References:
[1] Lawrence R. Rabiner and Ronald W. Schafer, “Introduction to Digital Speech Processing”, now
Publishers Inc.
[2] John R. Deller, John H. L. Hansen and John G. Proakis, “Discrete-Time Processing of Speech
Signals”, The Institute of Electrical and Electronics Engineers (IEEE), Inc., New York.
*Download links for both of these books are available at the website.