2. Miss Gaganpreet Kaur Cheema, Mr. Sukhveer Singh and Ms. Jagminder Kaur Cheema
http://www.iaeme.com/IJCIET/index.asp 52 editor@iaeme.com
Speech recognition is a part of pattern recognition that includes two processes:
speech training and speech recognition. The first stage is training, also known as
the modeling stage. In this stage, the system learns and summarizes human
language, and the learned knowledge is stored to establish a language reference
model. The second stage is identification, also known as the testing stage.
The system matches incoming voice input against the reference models in
the library and returns the nearest meaning or semantic recognition result.
2. WORKING OF SPEECH RECOGNITION SYSTEM
2.1. Creation of database
This section shows the step-by-step procedure used to create the database,
followed by waveform figures extracted from the signal that has to be matched.
Acquiring Data with a Sound Card: As shown in Figure 1, a typical data
acquisition session consists of these four steps:
1. Initialization: Creating a device object.
2. Configuration: Adding channels and controlling acquisition behavior with properties.
3. Execution: Starting the device object and acquiring or sending data.
4. Termination: Deleting the device object.
Figure 1 Data acquisition system
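The subsections below implement these four steps with the MATLAB Data Acquisition Toolbox. As a rough, hypothetical sketch of the same acquisition parameters in Python, the following synthesizes a 2-second, 8000 Hz signal in place of live sound-card input (the 440 Hz tone is an arbitrary stand-in for the tuning fork):

```python
import numpy as np

# Configuration: the same parameters as the MATLAB session below
Fs = 8000                            # sample rate (Hz)
duration = 2                         # acquisition length (seconds)
samples_per_trigger = duration * Fs  # total samples to collect

# Execution: synthesize a 440 Hz tone standing in for sound-card input
t = np.arange(samples_per_trigger) / Fs
data = np.sin(2 * np.pi * 440 * t)

print(len(data))  # 16000 samples = 2 s at 8000 Hz
```

In a real session the synthesized tone would be replaced by samples streamed from the sound card.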
2.2. Initialization
The first step is to create the analog input object (AI) for the sound card.
AI = analoginput('winsound');
2.3. Configuration
Next, we add a single channel to AI and set the sample rate to 8000 Hz with an
acquisition duration of 2 seconds:
addchannel(AI, 1);
Fs = 8000; % Sample rate is 8000 Hz
set(AI, 'SampleRate', Fs)
duration = 2; % 2 second acquisition
set(AI, 'SamplesPerTrigger', duration*Fs);
2.4. Execution
Now, we are ready to start the acquisition. The default trigger behavior is to start
collecting data as soon as the start command is issued. Before doing so, you should strike
the tuning fork to begin supplying a tone to the microphone (whistling will work as well).
start(AI);
To retrieve all the data:
data = getdata(AI);
2.5. Termination
The acquisition ends once all the data is acquired. To end the acquisition session, we
can delete the AI object from the workspace:
delete(AI)
2.6. Results
Let’s now determine the frequency components of the tuning fork and plot the results.
First, we calculate the absolute value of the FFT of the data.
xfft = abs(fft(data));
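As an illustration of this step outside MATLAB, the sketch below computes the same magnitude spectrum with NumPy and locates the dominant frequency; a 440 Hz synthetic tone stands in for the recorded tuning fork:

```python
import numpy as np

Fs = 8000                                # sample rate used during acquisition (Hz)
duration = 2
t = np.arange(duration * Fs) / Fs
data = np.sin(2 * np.pi * 440 * t)       # stand-in for the recorded tuning fork

xfft = np.abs(np.fft.fft(data))          # magnitude spectrum, as in the MATLAB line

# FFT bin k corresponds to frequency k*Fs/N; search only the first half
# of the spectrum (up to the Nyquist frequency Fs/2)
N = len(data)
freqs = np.arange(N) * Fs / N
peak = freqs[np.argmax(xfft[:N // 2])]
print(peak)  # 440.0 Hz for this synthetic tone
```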
3. FREQUENCY SPECTRUM USING EUCLIDEAN DISTANCE
The frequency spectrum of a time-domain signal is a representation of that signal
in the frequency domain. (The time domain is the analysis of mathematical
functions or physical signals with respect to time; in the time domain, the signal
or function's value is known for all real numbers. The frequency domain, by
contrast, refers to the analysis of mathematical functions or signals with respect
to frequency rather than time.) The frequency spectrum can be generated via a
Fourier transform of the signal, and the resulting values are usually presented as
amplitude and phase, both plotted versus frequency. A musical tone's timbre
is characterized by its harmonic spectrum.
is characterized by its harmonic spectrum. Spectrum analysis, also referred to as
frequency domain analysis or spectral density estimation, is the technical process of
decomposing a complex signal into simpler parts. As described above, many physical
processes are best described as a sum of many individual frequency components. Any
process that quantifies the various amounts (e.g. amplitudes, powers, intensities, or
phases), versus frequency can be called spectrum analysis. When a sound signal contains
frequencies, distributed equally over the audio spectrum, it is called white noise [2]. In
mathematics, the Euclidean distance or Euclidean metric is the “ordinary” distance
between two points that one would measure with a ruler, and is given by the Pythagorean
formula. The theorem can be written as an equation relating the lengths of the sides a, b
and c, often called the Pythagorean equation.
a² + b² = c² (4.1)
where c represents the length of the hypotenuse, and a and b represent the lengths of
the other two sides. By using this formula as distance, Euclidean space (or even any
inner product space) becomes a metric space. The associated norm is called the
Euclidean norm. The Euclidean distance between points p and q is the length of the
line segment connecting them. The squared distance between two vectors x =
[x1 x2] and y = [y1 y2] is the sum of squared differences in their coordinates. To
denote the distance between vectors x and y we can use the notation dxy, so that this
last result can be written as:
dxy² = (x1 - y1)² + (x2 - y2)² (4.2)
i.e., the distance itself is the square root:
dxy = √((x1 - y1)² + (x2 - y2)²) (4.3)
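Equations (4.2) and (4.3) can be checked with a few lines of Python; the 3-4-5 right triangle is the classic Pythagorean case of equation (4.1):

```python
import math

def euclidean_distance(x, y):
    """d_xy = sqrt of the sum of squared coordinate differences, eq. (4.3)."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# 3-4-5 right triangle: distance from the origin to (3, 4) is 5
print(euclidean_distance([0, 0], [3, 4]))  # 5.0
```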
4. PROPOSED SPEECH PROCESSING
4.1. Procedure
The technique works as follows to recognize a person's speech and control appliances:
1. Record the voices of two or three persons separately. These are treated as inputs to the system, and their frequency ranges are observed through plots; separate plots show the frequency range of each input.
2. Record the voices of ten persons, with the voices of the above three included among the ten. This is done to establish the identity of a person through voice; the voices are matched with the help of English speech recognition software.
3. The ten recorded voices, including the first three, are taken as the database for the whole technique. A second database is created for storing English commands using the data acquisition.
4. The technique considers the ten voices and the first three simultaneously and runs the system to obtain results. If the first voice matches one of the ten, the corresponding results are obtained; likewise, if the second voice matches one of the ten, the corresponding frequency range is obtained.
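A minimal sketch of the matching step, assuming each database entry is an FFT magnitude spectrum and the match is the entry with the smallest Euclidean distance to the input. The tone frequencies and person names here are invented for illustration; real entries would be spectra of recorded voices:

```python
import numpy as np

Fs = 8000
t = np.arange(Fs) / Fs  # 1 second of samples

def spectrum(freq):
    """Magnitude spectrum of a pure tone, standing in for a recorded voice."""
    return np.abs(np.fft.fft(np.sin(2 * np.pi * freq * t)))

# Database of ten "voices" (hypothetical tone frequencies for illustration)
database = {f"person_{i}": spectrum(200 + 50 * i) for i in range(10)}

def identify(sample_spectrum):
    # Nearest entry by Euclidean distance, as in eq. (4.3)
    return min(database,
               key=lambda name: np.linalg.norm(database[name] - sample_spectrum))

# An input matching person_3's tone (350 Hz) is identified as person_3
print(identify(spectrum(350)))  # person_3
```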
5. CONCLUSION
The proposed technique (FSAED) gives accuracy of up to 96% for different user
voices, as compared with the conventional technique, i.e. Fourier-Bessel cepstral
coefficients for robust speech recognition [2]. Results for different SNRs (dB) show
accuracy in %. With white noise added at 30 dB, the conventional technique gives
92.3% accuracy; similarly, at 30 dB, car noise gives 90.8% and music noise gives
92.6%. The proposed technique gives 94% accuracy with white noise at 30 dB,
90.9% with car noise and 93.4% with music noise at 30 dB. Performance evaluation
results are shown graphically and indicate that the proposed algorithm gives
considerably better results than the existing conventional model.
REFERENCES
[1] Resmi K, Satish Kumar, H. K. Sardana, Radhika Chhabra, "Graphical Speech Training System for Hearing Impaired", 2011 International Conference on Image Information Processing (ICIIP), 3-5 Nov 2011, ISBN 978-1-61284-859-4, pp. 1-6.
[2] Prakash Chetana, Gangashetty Suryakanth V., "Fourier-Bessel Cepstral Coefficients for Robust Speech Recognition", 2012 International Conference on Signal Processing and Communications (SPCOM), 22-25 Aug 2012, ISBN 978-1-4673-2013-9, pp. 1-5.
[3] He Guangji, Sugahara Takanobu, Miyamoto Yuki, Fujinaga Tsuyoshi, Hiroki Noguchi, Shintaro Izumi, "A 40 nm 144 mW VLSI Processor for Real-Time 60-kWord Continuous Speech Recognition", IEEE Transactions on Circuits and Systems, vol. 59, no. 8, Aug 2012, pp. 1656-1666.
[4] Virginia Estellers, Mihai Gurban, Jean-Philippe Thiran, "On Dynamic Stream Weighting for Audio-Visual Speech Recognition", IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 4, May 2012, pp. 1145-1157.
[5] Shing-Tai Pan, Xu-Yu Li, "An FPGA-Based Embedded Robust Speech Recognition System Designed by Combining Empirical Mode Decomposition and a Genetic Algorithm", IEEE Transactions on Instrumentation and Measurement, vol. 61, no. 9, Sept 2012, pp. 2560-2572.
[6] Punit Kumar Sharma, B. R. Lakshmikantha, K. Shanmukha Sundar, "Real Time Control of DC Motor Drive using Speech Recognition", 2010 India International Conference on Power Electronics (IICPE), 28-30 Jan 2011, ISBN 978-1-4244-7883-5, pp. 1-5.
[7] Qun Feng Tan, Panayiotis G. Georgiou, Shrikanth Narayanan, "Enhanced Sparse Imputation Techniques for a Robust Speech Recognition Front-End", IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 8, Nov 2011, pp. 2418-2429.
[8] Nam Soo Kim, Tae Gyoon Kang, Shin Jae Kang, Chang Woo Han, Doo Hwa Hong, "Speech Feature Mapping Based on Switching Linear Dynamic System", IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 2, Feb 2012, pp. 620-631.
[9] Dexin Zhou, Jiacang Kang, Zhicheng Fan, Wenlin Zhang, "The Application of Improved Apriori Algorithm in Continuous Speech Recognition", 2011 Second International Conference on Mechanic Automation and Control Engineering (MACE), 15-17 July 2011, ISBN 978-1-4244-9436-1, pp. 756-758.
[10] Baifen Liu, "Research and Implementation of the Speech Recognition Technology Based on DSP", 2011 2nd International Conference on Artificial Intelligence, Management Science and Electronic Commerce (AIMSEC), 8-10 Aug 2011, ISBN 978-1-4577-0535-9, pp. 4188-4191.
[11] Shing-Tai Pan, Sheng-Fu Liang, Tzung-Pei Hong, Jian-Hong Zeng, "Apply Fuzzy Vector Quantization to Improve the Observation-Based Discrete Hidden Markov Model: An Example on Electroencephalogram (EEG) Signals Recognition", 2011 IEEE International Conference on Fuzzy Systems (FUZZ), 27-30 June 2011, ISBN 978-3-642-13497-5, pp. 1674-1680.
[12] Yong Lu, Haining Huang, "Research on a Kind of Noisy Tibetan Speech Recognition Algorithm Based on WNN", 2011 Seventh International Conference on Natural Computation (ICNC), vol. 2, July 2011, pp. 605-608.
[13] Jian Wang, Zhiyan Han, Shuxian Lun, "Speech Emotion Recognition System Based on Genetic Algorithm and Neural Network", 2011 International Conference on Image Analysis and Signal Processing (IASP), 21-23 Oct 2011, ISBN 978-1-61284-879-2, pp. 578-582.
[14] Nemanja Majstorovic, Milenko Andric, Davorin Mikluc, "Entropy-Based Algorithm for Speech Recognition in Noisy Environment", 2011 Telecommunications Forum (TELFOR), Nov 2011, ISBN 978-1-4577-1499-3, pp. 667-670.
[15] Shing-Tai Pan, Ching-Fa Chen, Wei-Der Chang, Yi-Heng Tsai, "Performance Comparison between Improved DHMM and Gaussian Mixture HMM for Speech Recognition", 2011 4th International Congress on Image and Signal Processing (CISP), vol. 5, Oct 2011, pp. 2426-2430.
[16] Qinglin Qu, Liangguang Li, "Realization of Embedded Speech Recognition Module Based on STM32", 11th International Symposium on Communications and Information Technologies (ISCIT), Oct 2011, ISBN 978-1-4577-1294-4, pp. 73-77.
[17] M. Kudinov, "Comparison of Some Algorithms for Endpoint Detection for Speech Recognition Device Used in Cars", 2011 International Siberian Conference on Control and Communications (SIBCON), Sept 2011, ISBN 978-1-4577-1069-8, pp. 230-233.
[18] C. Y. Fook, M. Hariharan, Sazali Yaacob, A. H. Adom, "A Review: Malay Speech Recognition and Audio Visual Speech Recognition", 2012 International Conference on Biomedical Engineering (ICoBE), Feb 2012, ISBN 978-1-4577-1990-5, pp. 479-484.
[19] Vikramjit Mitra, Hosung Nam, Carol Y. Espy-Wilson, Elliot Saltzman, Louis Goldstein, "Articulatory Information for Noise Robust Speech Recognition", IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, Sept 2011, pp. 1913-1924.
[20] Gabriele Fanelli, Juergen Gall, Harald Romsdorfer, "A 3-D Audio-Visual Corpus of Affective Communication", IEEE Transactions on Multimedia, vol. 12, no. 6, Oct 2010, pp. 591-598.
[21] Panikos Heracleous, Viet-Anh Tran, Takayuki Nagai, Kiyohiro Shikano, "Analysis and Recognition of NAM Speech Using HMM Distances and Visual Information", IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, Aug 2010, pp. 1528-1538.
[22] Mihai Gurban, Jean-Philippe Thiran, "Information Theoretic Feature Extraction for Audio-Visual Speech Recognition", IEEE Transactions on Signal Processing, vol. 57, Dec 2009, pp. 4765-4776.
[23] Jong-Seok Lee, Cheol Hoon Park, "Robust Audio-Visual Speech Recognition Based on Late Integration", IEEE Transactions on Multimedia, vol. 10, Aug 2008, pp. 767-779.
[24] Bengt Jonas Borgström, Abeer Alwan, "A Low-Complexity Parabolic Lip Contour Model with Speaker Normalization for High-Level Feature Extraction in Noise-Robust Audiovisual Speech Recognition", IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, vol. 38, Nov 2008, pp. 1273-1280.
[25] Valentin Ion, Reinhold Haeb-Umbach, "A Novel Uncertainty Decoding Rule With Applications to Transmission Error Robust Speech Recognition", IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 5, July 2008, pp. 1047-1060.
[26] Zhihong Zeng, Jilin Tu, Ming Liu, Thomas S. Huang, Brian Pianfetti, Dan Roth, Stephen Levinson, "Audio-Visual Affect Recognition", IEEE Transactions on Multimedia, vol. 9, no. 2, July 2005, pp. 424-428.
[27] Carlos Busso, Shrikanth S. Narayanan, "Interrelation between Speech and Facial Gestures in Emotional Utterances: A Single Subject Study", IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 8, Nov 2007, pp. 2331-2347.
[28] Belgacem Ben Mosbah, "Speech Recognition for Disabilities People", IEEE Information and Communication Technologies, vol. 1.
[29] http://www.mathworks.in/help/techdoc/matlab_prog/f2-43934.html
[30] Jane J. Stephan, Rasha H. Ali, "Speech Recognition using Genetic Algorithm", International Journal of Computer Engineering and Technology, 5(5), 2015, pp. 76-81.