2. Content
• Speech Signal
a) Introduction
b) Speech Production & Perception
c) Sampling Theorem
d) Need for Short Term Processing
e) Fundamental Frequency
f) Zero-Crossing Rate
g) Short Term Energy
h) Spectrogram
• Librosa Library
Dr. Shikha Baghel, IISc Bengaluru
3. Speech Signal: An introduction
A primary medium for our day-to-day communication.
Applications
Speech recognition
Speech coding
Speech synthesis (text-to-speech conversion)
Speaker verification / recognition
Speech enhancement
Aids to the handicapped
Biomedical Applications
Image Credit: https://www.vectorstock.com/royalty-free-vector/bubble-people-bubbling-speech-communication-vector-25841557
4. Speech Production & Perception
5. Speech Production
• A speech signal is composed of a sequence of sound units (phonemes).
• Sound unit production:
Video Credit: https://www.youtube.com/watch?v=JF8rlKuSoFM
6. Sampling Theorem
Image Credit: https://www.tutorialspoint.com/signals_and_systems/signals_sampling_theorem.htm
Sampling Rate: Number of samples per second
Sampling frequency (fs) ≥ 2 × maximum frequency (fm)
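The condition above can be checked numerically. The following is a minimal sketch, not taken from the slides: the function names are hypothetical, and the folding formula is the standard description of aliasing for a pure tone sampled at fs.

```python
def min_sampling_rate(f_max_hz):
    """Smallest sampling rate (Hz) that satisfies fs >= 2 * fm."""
    return 2.0 * f_max_hz


def apparent_frequency(f_hz, fs_hz):
    """Frequency (Hz) at which a pure tone at f_hz appears after sampling at fs_hz.

    Tones above fs/2 fold back (alias) into the range 0..fs/2.
    """
    f_folded = f_hz % fs_hz
    return min(f_folded, fs_hz - f_folded)


# A signal whose highest frequency is 4 kHz needs fs of at least 8 kHz.
print(min_sampling_rate(4000))          # 8000.0

# At fs = 8 kHz a 3 kHz tone is preserved, but a 5 kHz tone violates the
# condition and becomes indistinguishable from a 3 kHz tone.
print(apparent_frequency(3000, 8000))   # 3000
print(apparent_frequency(5000, 8000))   # 3000
```

This is why a signal is usually low-pass filtered to below fs/2 before it is sampled.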
7. Need for Short Term Processing of Speech
• Speech is produced by a time-varying vocal tract system driven by a time-varying excitation.
• The speech signal is therefore non-stationary.
• Most of the tools studied in signals and systems and in signal processing assume a time-invariant system and a time-invariant excitation, i.e., a stationary signal.
• Hence these tools are not directly applicable to speech.
• Over short blocks of 10-30 ms, however, the speech signal can be treated as approximately stationary.
• Speech is therefore processed in blocks of 10-30 ms. Such processing is termed Short-Term Processing (STP).
8. Audio File format
• .mp3
Lossy compressed format. Some information is discarded during compression and cannot be recovered.
• .flac
Lossless compressed format. The data is compressed, but the original signal can be reconstructed perfectly.
• .wav
Typically an uncompressed format. No loss in audio quality, but the file size is the largest.
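To see why uncompressed audio is large, the sketch below computes the size of raw PCM data and compares it with a WAV file written by Python's standard `wave` module. The sample rate, bit depth, and duration are example values chosen for illustration, and the helper names are hypothetical.

```python
import io
import wave


def pcm_data_bytes(sample_rate, bits_per_sample, channels, seconds):
    """Size in bytes of the raw (uncompressed) PCM samples."""
    return int(sample_rate * seconds) * (bits_per_sample // 8) * channels


def write_silent_wav(sample_rate, seconds):
    """Write 16-bit mono silence to an in-memory WAV file and return its bytes."""
    n_frames = int(sample_rate * seconds)
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)       # mono
        wav_file.setsampwidth(2)       # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * n_frames)
    return buffer.getvalue()


# One second of 16 kHz, 16-bit mono audio (assumed example settings).
expected = pcm_data_bytes(16000, 16, 1, 1.0)
wav_bytes = write_silent_wav(16000, 1.0)
print(expected)                   # 32000
print(len(wav_bytes) > expected)  # True: the samples plus a small file header
```

The size grows linearly with sample rate, bit depth, channel count, and duration, regardless of what the audio contains.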
9. Windowing
Frame size: 25 ms and frame shift: 10 ms
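The frame size and frame shift above can be turned into sample counts once a sampling rate is fixed. The sketch below assumes a 16 kHz sampling rate and a Hamming window; neither is stated on the slide, and the function names are hypothetical.

```python
import math


def frame_signal(samples, sample_rate, frame_ms=25.0, shift_ms=10.0):
    """Split a signal into overlapping frames (incomplete trailing frames are dropped)."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    frame_shift = int(round(sample_rate * shift_ms / 1000.0))
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += frame_shift
    return frames


def hamming_window(length):
    """Hamming window coefficients (an assumed choice of window)."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def apply_window(frame, window):
    """Multiply a frame by the window, sample by sample."""
    return [x * w for x, w in zip(frame, window)]


# One second of a synthetic 200 Hz tone at an assumed 16 kHz sampling rate.
sample_rate = 16000
signal = [math.sin(2.0 * math.pi * 200.0 * n / sample_rate) for n in range(sample_rate)]

frames = frame_signal(signal, sample_rate)
window = hamming_window(len(frames[0]))
windowed = [apply_window(frame, window) for frame in frames]

print(len(frames[0]))  # 400 samples per 25 ms frame
print(len(frames))     # 98 frames with a 10 ms (160-sample) shift
```

Because the shift is shorter than the frame, consecutive frames overlap by 15 ms, so samples attenuated at the edge of one window fall nearer the centre of a neighbouring one.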
10. Audio as a Function of Time and Frequency
https://towardsdatascience.com/extract-features-of-music-75a3f9bc265d
11. Fundamental Frequency (F0)
• Rate at which the vocal folds vibrate.
• Fundamental frequency (F0) = 1 / (time taken to complete one vocal-fold vibration)
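The relation F0 = 1 / period suggests one simple way to estimate F0 from a short frame: find the lag at which the frame best matches a shifted copy of itself, and take that lag as the period. The sketch below uses plain autocorrelation, which is one common approach and not necessarily the method intended in the slides. The 60-400 Hz search range and the function name are assumptions.

```python
import math


def estimate_f0_autocorr(frame, sample_rate, f0_min=60.0, f0_max=400.0):
    """Estimate F0 (Hz) as sample_rate / lag of the strongest autocorrelation peak.

    Returns None when no positive peak is found in the search range.
    """
    min_lag = max(1, int(sample_rate / f0_max))
    max_lag = min(int(sample_rate / f0_min), len(frame) - 1)
    best_lag, best_value = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Correlate the frame with itself shifted by `lag` samples.
        value = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    if best_lag is None:
        return None
    return sample_rate / best_lag


# A 40 ms synthetic frame: a 200 Hz tone sampled at 16 kHz.
# One period lasts 1/200 s = 5 ms = 80 samples.
sample_rate = 16000
frame = [math.sin(2.0 * math.pi * 200.0 * n / sample_rate) for n in range(640)]

f0 = estimate_f0_autocorr(frame, sample_rate)
print(f0)  # 200.0
```

The frame needs to span at least two periods of the lowest F0 of interest, which is why a frame somewhat longer than the usual 25 ms is used here. Real speech is not a pure tone, so practical pitch trackers add voicing decisions and peak-picking safeguards on top of this idea.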
Video Credit: https://youtu.be/mJedwz_r2Pc
Image Credit: https://wiki.aalto.fi/pages/viewpage.action?pageId=149890776
12. Zero-Crossing Rate
• The zero-crossing rate is the rate of sign-changes along a signal, i.e., the rate at which
the signal changes from positive to negative or back.
• This feature has been used heavily in both speech recognition and music information
retrieval.
• It usually has higher values for highly percussive sounds like those in metal and rock.
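The definition above translates almost directly into code. In this minimal sketch, the function name and the convention of treating an exact zero as positive are assumptions made for illustration.

```python
import math


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ.

    A sample equal to zero is treated as positive (an assumed convention).
    """
    if len(frame) < 2:
        return 0.0
    crossings = sum(1 for prev, curr in zip(frame, frame[1:])
                    if (prev >= 0) != (curr >= 0))
    return crossings / (len(frame) - 1)


def tone(freq_hz, sample_rate=8000, n_samples=400):
    """Synthetic sine tone with a small phase offset to avoid exact zeros."""
    return [math.sin(2.0 * math.pi * freq_hz * n / sample_rate + 0.1)
            for n in range(n_samples)]


print(zero_crossing_rate([1, -1, 1, -1]))  # 1.0: every adjacent pair changes sign
print(zero_crossing_rate([1, 2, 3]))       # 0.0: no sign change
print(zero_crossing_rate(tone(100)) < zero_crossing_rate(tone(2000)))  # True
```

Higher-frequency or noise-like content crosses zero more often, which is why the feature separates percussive or noisy sounds from smoother, low-frequency ones.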
https://www.analyticsvidhya.com/blog/2022/01/analysis-of-zero-crossing-rates-of-different-music-genre-tracks/
14. Short Term Energy
• The energy associated with speech is time varying in nature.
• By the nature of its production, the speech signal consists of voiced, unvoiced and silence regions.
• The energy of a voiced region is large compared to that of an unvoiced region, and a silence region has the least, or negligible, energy.
• Thus short term energy can be used for voiced, unvoiced and silence classification of
speech.
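The classification idea above can be sketched as follows. Short-term energy is taken here as the sum of squared samples in a frame. The two thresholds (fractions of the largest frame energy) are hypothetical values chosen only for this toy example, and real speech would normally need additional cues such as the zero-crossing rate.

```python
def short_term_energy(frame):
    """Sum of squared samples in one frame."""
    return sum(x * x for x in frame)


def frame_energies(samples, frame_len, frame_shift):
    """Short-term energy of each complete frame in the signal."""
    return [short_term_energy(samples[start:start + frame_len])
            for start in range(0, len(samples) - frame_len + 1, frame_shift)]


def label_frames(energies, silence_ratio=0.01, voiced_ratio=0.3):
    """Label frames by energy relative to the peak (thresholds are assumptions)."""
    peak = max(energies) if energies else 0.0
    labels = []
    for energy in energies:
        if peak == 0.0 or energy < silence_ratio * peak:
            labels.append("silence")
        elif energy < voiced_ratio * peak:
            labels.append("unvoiced")
        else:
            labels.append("voiced")
    return labels


# Toy signal: silence, then a low-amplitude segment, then a high-amplitude segment.
toy_signal = [0.0] * 100 + [0.2, -0.2] * 50 + [1.0, -1.0] * 50

energies = frame_energies(toy_signal, frame_len=100, frame_shift=100)
print(label_frames(energies))  # ['silence', 'unvoiced', 'voiced']
```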
15. Short Term Energy
16. Spectrogram
• A spectrogram is a visual representation of the spectrum of frequencies of sound or
other signals as they vary with time.
• Time runs along one axis and frequency along the other; the colour or intensity at each point indicates the strength of that frequency at that time.
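A spectrogram can be computed by taking the magnitude spectrum of successive windowed frames. The sketch below uses only the standard library and a direct DFT so that each step is visible; in practice an FFT routine from a numerical library would be used instead. The frame length, hop size, Hann window, and function names are assumptions for illustration.

```python
import cmath
import math


def hann_window(length):
    """Periodic Hann window coefficients (an assumed choice of window)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / length) for n in range(length)]


def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0..N/2 of one frame, computed directly (slow but explicit)."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
            for k in range(n // 2 + 1)]


def spectrogram(samples, frame_len=64, hop=32):
    """Return a list of magnitude spectra, indexed as [time frame][frequency bin]."""
    window = hann_window(frame_len)
    columns = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + i] * window[i] for i in range(frame_len)]
        columns.append(magnitude_spectrum(frame))
    return columns


# Synthetic 1 kHz tone sampled at 8 kHz (assumed example values).
sample_rate = 8000
samples = [math.sin(2.0 * math.pi * 1000.0 * n / sample_rate) for n in range(512)]

spec = spectrogram(samples)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(len(spec), len(spec[0]))      # 15 33
print(peak_bin * sample_rate / 64)  # 1000.0
```

Each bin k corresponds to the frequency k × fs / N, so a steady tone appears as a horizontal line in the spectrogram, while speech shows bands that move over time.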
https://towardsdatascience.com/understanding-audio-data-fourier-transform-fft-spectrogram-and-speech-recognition-a4072d228520