Audio Based Speech Recognition Using KNN Classification Method
Instructor: Dr. Umasankar Kandaswamy
Santosh Kumar Chikoti
schikoti@ltu.edu
Hansong Xu
Hxu1@ltu.edu
ABSTRACT
Road traffic crashes are the ninth leading cause of death worldwide, as reported by the Association for Safe
International Road Travel [1]. Over 37 thousand people die in car accidents in the US each year. The
third-ranked cause of car accidents is other forms of distracted driving (such as dialing, adjusting the
temperature, or reaching for objects) [2], while the top two causes, speeding and mobile phone use, can be
noticed and reduced by following traffic rules. In this work, we introduce Automatic Speech Recognition
(ASR) for car drivers, which can adjust or control some of the most frequently used functions through the
driver's voice, such as adjusting the temperature, controlling the windows, opening the GPS, and dialing,
but is not limited to those functions. Applying an ASR system while driving can largely reduce distracted
actions and minimize the chance of car accidents.
INTRODUCTION
To get maximum applicability of ASR, we split our design into two parts: driver identification and
command classification. More specifically, for the first part, we use recorded voices from the authorized
drivers as training data and a new voice recording as testing data to recognize the driver. Only if the new
recording is classified into a known group of training data is the command section activated. For the
second part, we recorded a total of 8 speech commands for training: 'air conditioner', 'window up',
'window down', 'engine off', 'start engine', 'make phone call', 'GPS on', and 'play music'. After driver
recognition succeeds in the first step, the authorized driver may input speech commands to control
functions while driving.
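The two-stage flow above can be sketched as follows. This is an illustrative Python sketch only, not our Matlab implementation; the `driver_model` and `command_model` objects and the authorized-driver set are hypothetical placeholders:

```python
# Hypothetical sketch of the two-stage design: the command classifier is
# only activated once the speaker is identified as an authorized driver.
AUTHORIZED_DRIVERS = {"Jason", "Santosh"}  # assumed driver names

def process_utterance(audio, driver_model, command_model):
    """Stage 1: identify the driver; Stage 2: classify the command."""
    driver = driver_model.classify(audio)   # driver identification on the input voice
    if driver not in AUTHORIZED_DRIVERS:
        return None                         # reject: command stage stays locked
    return command_model.classify(audio)    # command classification for the same input
```

The point of the gate is that an unrecognized voice never reaches the command classifier at all.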
Paper [3] introduced a voice-based robot control system that uses the Linear Predictive Coefficient (LPC)
feature, since LPC performs well on recognition of isolated words; for robot commands such as 'go
origin', 'up', 'down', and 'turn', the authors therefore did not choose a more complex algorithm. Paper [4]
surveys a total of 8 voice features that can be used for voice information extraction, such as LPC, Linear
Predictive Cepstral Coefficients (LPCC), Perceptual Linear Predictive coefficients (PLP), and Mel-
Frequency Cepstral Coefficients (MFCC). As for classifiers, K Nearest Neighbor (KNN) was used in
paper [5] for Parkinson's disease detection. The KNN was trained with voices from both healthy people
and Parkinson's disease patients. They obtained a 94.8% accuracy rate with 7 optimized features and
98.2% with 9 features, which is considered optimal performance for KNN.
Our system structure is described in the Methodology section, followed by details of the LPC and MFCC
features. The simulation results are explained in the Experimental Result section, followed by the
Conclusion.
METHODOLOGY
1. Linear Predictive Coefficient (LPC)
LPC offers a wide range of performance analysis parameters for evaluation purposes, including bit rate,
overall system delay, computational complexity, and objective performance evaluation. LPC analysis
finds the coefficients of a 15th-order linear predictor that predicts the current value of the collected
audio sample from the past samples. Linear predictive analysis is used for compressing signals for
transmission and storage, and LPC is widely used in medium and low bit-rate coders. When the voice or
speech signal is passed through the prediction filter, the residual error is generated as an output.
In the pre-processing step, zero values are removed, as shown in the left part of Figure 1. The 13 LPC
feature values extracted from a name-speech recording are shown in the right part of Figure 1.
Figure 1: Zero value removal and LPC feature extraction
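Our simulation computed the LPC features in Matlab; as an illustrative sketch only, the coefficients can be obtained with the autocorrelation method and the Levinson-Durbin recursion (here in Python/NumPy; the default prediction order of 10 is an assumption for the sketch):

```python
import numpy as np

def lpc(x, order=10):
    """LPC coefficients via the autocorrelation method and Levinson-Durbin.

    Returns [1, a_1, ..., a_order]; the predictor estimates
    x[n] ~ -(a_1*x[n-1] + ... + a_order*x[n-order]).
    """
    n = len(x)
    # Autocorrelation for lags 0..order
    r = np.array([x[:n - k] @ x[k:] for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]                                 # prediction error power
    for i in range(1, order + 1):
        acc = r[i] + a[1:i] @ r[1:i][::-1]
        k = -acc / err                         # reflection coefficient
        a[1:i + 1] += k * a[i - 1::-1][:i]     # Levinson-Durbin coefficient update
        err *= 1.0 - k * k
    return a
```

For example, for a first-order autoregressive signal x[n] = 0.9 x[n-1] + w[n], the estimated a_1 comes out close to -0.9.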
2. MFCC
The use of Mel-Frequency Cepstral Coefficients can be considered one of the standard methods for
feature extraction. MFCCs are coefficients that collectively make up the Mel-frequency cepstrum (MFC).
They are derived from a type of cepstral representation of the voice recording, in which the frequency
bands are equally spaced on the Mel scale. In this method, the spectrum is warped according to the Mel
scale. This is very similar to perceptual linear predictive analysis of speech, where the short-term
spectrum is modified based on a spectral transformation. Mel-scale cepstral analysis uses cepstral
smoothing to smooth the modified power spectrum. This is done by directly transforming the log power
spectrum to the cepstral domain using an inverse Discrete Fourier Transform (DFT).
Figure 2: The steps for extracting MFCC features
Figure 3: 13 rows of MFCC feature data from a command speech
FFT: Fourier transform of the collected audio X
Mel scale: Mel-frequency warping to obtain the mel spectrum
DCT: discrete cosine transform matrix computation
MFCC: 13 rows of cepstral coefficients
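The Matlab code is not reproduced here; the pipeline above can be sketched for a single frame roughly as follows (Python/NumPy with SciPy's DCT; the FFT size, 26-filter mel bank, and Hamming window are assumptions for the sketch, not values from our simulation):

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, fs=8000, n_filters=26, n_ceps=13, nfft=512):
    """13 cepstral coefficients for one frame: FFT -> mel filterbank -> log -> DCT."""
    # FFT: power spectrum of the windowed frame
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), nfft)) ** 2
    # Mel scale: triangular filters equally spaced on the mel axis
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, ctr, hi = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, lo:ctr] = (np.arange(lo, ctr) - lo) / max(ctr - lo, 1)
        fbank[i - 1, ctr:hi] = (hi - np.arange(ctr, hi)) / max(hi - ctr, 1)
    # DCT of the log filterbank energies -> cepstral coefficients
    log_energies = np.log(fbank @ spec + 1e-10)
    return dct(log_energies, type=2, norm='ortho')[:n_ceps]
```

In a full extractor this runs over every frame of the recording, producing the 13 rows of feature data per frame shown in Figure 3.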
3. KNN
K Nearest Neighbors is a simple algorithm that stores all available cases and classifies new cases based on
a similarity measure (e.g., the Euclidean distance function). The classification proceeds as follows:
1. Set the training features and classes.
2. Extract the features of the new data.
3. Compare the testing data with the training data sets.
4. Assign the new data the label of the minimum-distance training feature.
Figure 4: General structure of our proposed method
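Our classifier was implemented in Matlab; a minimal Python/NumPy sketch of the same Euclidean-distance KNN vote looks like this:

```python
import numpy as np
from collections import Counter

def knn_classify(train_X, train_y, x, k=3):
    """Label x by majority vote among its k Euclidean-nearest training rows."""
    dists = np.linalg.norm(train_X - x, axis=1)   # distance to every training sample
    nearest = np.argsort(dists)[:k]               # indices of the k closest samples
    votes = Counter(train_y[i] for i in nearest)  # vote over their labels
    return votes.most_common(1)[0][0]
```

With k = 1 this reduces to labeling the new feature with the single minimum-distance training feature, as in the minimum-distance labeling step above.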
EXPERIMENTAL RESULT
Our simulation work proceeds in five steps:
1. First, we give a voice input to the system by recording with the Matlab recorder.
2. From the created data set, we extract features from the given input voice.
3. As we have two data sets, 'Jason' and 'Santosh', feature extraction is done on each speaker's own
data set.
4. Using KNN, the nearest neighbors are selected and matched with the given input, where
K = [1, 10].
5. The class with the maximum number of nearest-neighbor matches gives the result.
In the above simulation work, each of us recorded his own name for 2 seconds, 20 times, at a sampling
rate of Fs = 8000 samples per second. The total training data set is therefore 2 × 8000 × 20 samples, i.e.,
20 frames per speaker. For the LPC feature, 10 feature values were extracted from each frame, so the total
LPC feature set is 10 × 20 from each of us. For command recognition, we likewise recorded 2 seconds of
each of the 8 command voices from both of us, 20 times each. For the command speech, the MFCC
feature was calculated for each command data frame; in total, 13 rows of MFCC feature data were
extracted from each command's speech.
Then, for testing, we calculated the LPC and MFCC features from newly recorded voice. The K Nearest
Neighbor classifier then compared the Euclidean distance of the input features with the training sets. The
accuracy of driver identification is shown in Figure 6 (10 tests per speaker). The accuracy for each
command is shown in Figure 7 (30 tests per command).
Person identification (20 tests)   10 (Jason)   10 (Santosh)
Correctly identified               8            9
Performance                        80%          90%

Figure 6: Driver identification performance
Total tested 30 times   Air conditioner  Window up  Start engine  Engine off  Window down  Make a phone call  GPS on  Play music
Correctly identified    29               18         15            29          22           29                 15      28
Performance             96.6%            60%        50%           96.6%       73.3%        96.6%              50%     93.3%

Figure 7: Commands classification accuracy
CONCLUSION
In this project, we built a speech-based voice recognition system that can be used for controlling in-
vehicle functions. Based on this method, we introduced driver identification and speech-based command
classification. In this way, the security of the vehicle itself can be increased, and the potential risk of car
accidents can be reduced as well. We trained our KNN classifier with recorded training command speech
and name speech. Then, we tested this method with new recordings for both commands and names. The
classification accuracy of our system reaches up to 90% and 96.6%, respectively. The future scope of our
work will be word (letter) recognition and voice (person) recognition individually.
REFERENCES:
[1] ASIRT, "Annual Global Road Crash Statistics," http://asirt.org/Initiatives/Informing-Road-Users/Road-Safety-Facts/Road-Crash-Statistics
[2] "Top 10 Causes of Car Accidents," Car Accident Attorney, December 18, 2013, http://www.losangelespersonalinjurylawyers.co/top-10-causes-of-car-accidents/
[3] Luo Zhizeng and Zhao Jinghing, "Speech Recognition and Its Application in Voice-based Robot Control System," Proceedings of the 2004 International Conference on Intelligent Mechatronics and Automation, 2004.
[4] U. Shrawankar and V. Thakare, "Techniques for Feature Extraction in Speech Recognition System: A Comparative Study," arXiv e-prints, May 2013.
[5] R. Arefi Shirvan and E. Tahami, "Voice analysis for detecting Parkinson's disease using genetic algorithm and KNN classification method," Proc. 18th Int. Conf. on Biomedical Engineering, Tehran, pp. 550-555, 2011.
[6] Tsang-Long Pao, Wen-Yuan Liao, and Yu-Te Chen, "Audio-Visual Speech Recognition with Weighted KNN-based Classification in Mandarin Database," Third International Conference on Intelligent Information Hiding and Multimedia Signal Processing (IIHMSP 2007), Vol. 1, Nov. 26-28, 2007, pp. 39-42.