Ultimate Speech Search
Page i
ABSTRACT
In the modern era people tend to seek information wherever they can, as efficiently as
possible. They search for knowledge about past events as well as present ones, and searching
for a particular item involves a search engine and the necessary information. When people
want to learn from speeches or lectures given by anyone, they often resort to a desperate
search without knowing whether useful results exist. A search engine that returned the
required results would be a blessing for their work.
This project aims to build a search engine able to search for speeches and lectures by
their content. Every search engine supports searching, but the results may be jargon: the user
has to go through them one by one, and sometimes at the end of the day they end up with a
null result. The main goal of this project is to provide a search facility based on content.
This research covers converting speech into text with some noise analysis, maintaining
a database with clustered indexing, and a simple content-based search facility. The system to
be built operates on limited data, namely speeches and lectures recorded in a low-noise
environment. As a future enhancement it would be able to search music or any other sound
stream by analysing its spectrum, with a user-friendly search facility.
KEY WORDS: Search Engine, Speeches, Lectures, Noise Analysis, Content, Spectrum
ACKNOWLEDGEMENTS
My sincere gratitude goes to my grandfather, who taught me the ways of life, who raised me
from my childhood to my teenage years, and who left me one May.
I would like to thank my friends, who helped me in my difficult times and praised me in my
good times. I would like to thank my college teachers, who disciplined me with the cane to
make me a good man and gave me the knowledge to face society.
I would like to thank my sister, who has always been a mother to me, and I would like to show
my gratitude to my supervisor, Mrs. Nadeera Ahangama, who guided me throughout the project.
Finally, I would like to thank the APIIT staff, who provided us with the facilities necessary to
achieve our higher education and make it a success.
Table of Contents
ABSTRACT................................................................................................................................i
ACKNOWLEDGEMENTS.......................................................................................................ii
List of Figures..........................................................................................................................vii
List of Equations.................................................................................................................... viii
List of Tables ............................................................................................................................ix
INTRODUCTION .....................................................................................................................1
1.1 Project Background.....................................................................................................1
1.2 Problem Description....................................................................................................2
1.3 Project Overview.........................................................................................................4
1.3.1 Noise analysis ......................................................................................................4
1.3.2 Speech recognition...............................................................................................4
1.3.3 Speech to text conversion ....................................................................................4
1.3.4 The database.........................................................................................................5
1.3.5 The search engine ......................................................................................................5
1.4 Project Scope...............................................................................................................6
1.5 Project Objectives .......................................................................................................7
RESEARCH...............................................................................................................................8
2.1 Speech Recognition..........................................................................................................8
2.2 Speech recognition methods...........................................................................................13
2.2.1 Hidden Markov methods in speech recognition......................................................13
2.2.2 Client side speech recognition.................................................................................16
2.2.5 Continuous speech recognition................................................................................18
2.2.6 Direct Speech Recognition ......................................................................................18
2.3 Speaker Characteristics ..................................................................................................19
2.3.1 Speaker Dependent..................................................................................................19
2.3.2 Speaker Independent................................................................................................19
2.3.3 Conclusion...................................................................................................................20
2.4 Speech Recognition mechanisms...................................................................................21
2.4.1 Isolated word recognition........................................................................................21
2.4.2 Continuous speech recognition................................................................................22
2.4.3 Conclusion...............................................................................................................23
2.5 Vocabulary Size .............................................................................................................24
2.5.1 Limited Vocabulary.................................................................................................24
2.5.2 Large Vocabulary ....................................................................................................24
2.5.3 Conclusion...............................................................................................................24
2.6 Speech recognition APIs..................................................................................25
2.6.1 Microsoft Speech API 5.3 .......................................................................................25
2.6.2 Java Speech API......................................................................................................26
2.7 Speech Recognition Algorithms ....................................................................................31
2.8 Noise Filtering ...........................................................................................32
2.8.1 Wiener filtering..................................................................................33
2.8.2 Conclusion .........................................................................................33
2.9 Database and data structure............................................................................................34
2.9.1 Conclusion...............................................................................................................34
2.10 Search Engine...............................................................................................................35
2.11 MATLAB.....................................................................................................................36
ANALYSIS..............................................................................................................................37
3.0 System requirements .................................................................................................37
3.1.1 Functional requirements ........................................................................37
3.1.2 Non functional requirements ...................................................................................37
3.1.3 Software Requirements............................................................................................38
3.1.4 Hardware requirements............................................................................................39
3.2 System Development Methodologies.............................................................................40
3.2.1 Rational Unified Process .........................................................................................40
3.2.2 Agile Development Method ....................................................................................43
3.2.3 Scrum Development Methodology..........................................................45
3.3 Test Plan.........................................................................................................47
3.3.1 System testing...........................................................................................47
SYSTEM DESIGN..................................................................................................................48
4.1 Use Case Diagram.....................................................................................................48
4.2 Use case description.......................................................................................................50
4.2.1 Use case description for file upload ........................................................................50
4.2.2 Use Case description for play an audio file..............................................51
4.2.3 Use Case description for search...............................................................................52
4.2.4 Use Case description for noise reduced output .......................................................53
4.2.5 Use Case description for noise filtering ..................................................................54
4.3 Activity Diagrams ..........................................................................................................55
4.3.1 Activity Diagram for Speech Recognition System...................................55
4.3.2 Activity Diagram for Noise filtering .......................................................................56
4.4 Sequence Diagrams........................................................................................................57
4.4.1 Select a file ..............................................................................................................57
4.4.2 Play wav file ............................................................................................................58
4.4.3 Speech recognition pre stage ....................................................................59
4.4.4 Speech Recognition post stage .................................................................60
4.5 Class Diagrams...............................................................................................................61
4.5.1 GUI and the system .................................................................................................61
4.5.2 Speech recognition ..................................................................................................62
4.6 Noise Filtering................................................................................................................64
4.7 Code to filter noise in C Language.................................................................................67
CHAPTER 5 ............................................................................................................................73
5.0 Implementation ..................................................................................................................73
CHAPTER 6 ............................................................................................................................78
6.0 Test Plan.............................................................................................................................78
6.1 Background ....................................................................................................................78
6.2 Introduction....................................................................................................................78
6.3 Assumptions...................................................................................................................79
6.4 Features to be tested .......................................................................................................79
6.5 Suspension and resumption criteria ...............................................................................80
6.6 Environmental needs......................................................................................................81
6.7 System testing ................................................................................................................82
6.8 Unit testing.....................................................................................................................83
6.9 Performance Testing ......................................................................................................89
6.10 Integration Testing .......................................................................................................92
CHAPTER 7 ............................................................................................................................94
CRITICAL EVALUATION AND FUTURE ENHANCEMENTS ........................................94
7.1 Critical evaluation ...........................................................................................94
7.2 Suggestions for future enhancements.............................................................................99
8.0 Conclusion ..................................................................................................................101
REFERENCES ......................................................................................................................102
BIBLIOGRAPHY..................................................................................................................106
APPENDIX A........................................................................................................................107
APPENDIX B........................................................................................................................114
Gantt chart..........................................................................................................................114
List of Figures
Figure 1: Overview of Steps in Speech Recognition.................................................................8
Figure 2 : Graphical Overview of the Recognition Process ....................................................10
Figure 3: Components of a typical speech recognition system................................................12
Figure 4 : example of HMM for word “Yes” on an utterance.................................................15
Figure 5: Overview of Microsoft Speech Recognition API ...................................................25
Figure 6 : Java Sound API Architecture ..................................................................................29
Figure 7 : JSGF Architecture...................................................................................................30
Figure 8: Noise in Speech........................................................................................................32
Figure 9 : Database Indexing...................................................................................................34
Figure 10 : Google Architecture ..............................................................................................35
Figure 11 Phases in RUP .........................................................................................................41
Figure 12 : Overview of Agile.................................................................................................43
Figure 13 : Scrum Overview....................................................................................................46
Figure 15 : Use Case Diagram for System...............................................................................48
Figure 16 Speech Recognition.................................................................................................55
Figure 17 Activity Diagram Noise Filtering...........................................................................56
Figure 18 Sequence Diagram Select a file...............................................................................57
Figure 19 Sequence Diagram Play File ...................................................................................58
Figure 20 Sequence Diagram SR Pre Stage............................................................................59
Figure 21 Sequence Diagram SR Post Stage..........................................................................60
Figure 22 Class Diagrams GUI & System...............................................................................61
Figure 23 Class Diagram SR System.......................................................................................62
Figure 24 : Speech Search Class Diagram...............................................................................63
Figure 25: SR Engine...............................................................................................................73
Figure 26 Open file..................................................................................................................74
Figure 27: Text output .............................................................................................................75
Figure 28 Speech Search Engine .............................................................................................77
List of Equations
Equation 1 : First order Markov chain.....................................................................................13
Equation 2: Stationary states Transition ..................................................................................14
Equation 3: Observations independence..................................................................................14
Equation 4: observation sequence............................................................................................14
Equation 5 : Left Right topology constraints...........................................................................15
Equation 6: CSR Equations .....................................................................................................22
List of Tables
Table 1: Typical parameters used to characterize the capability of a speech recognition system....9
Table 2 : Comparison in different techniques in speech recognition.......................................17
Table 3: Isolated word recognition ..........................................................................................21
Table 4 : Use Case description file upload ..............................................................................50
Table 5 Use Case description play audio.................................................................................51
Table 6 Use Case description search .......................................................................................52
Table 7 Use Case description noise reduction .........................................................................53
Table 8 Use Case description noise process ............................................................................54
Table 9 Test Case 1..................................................................................................................83
Table 10 Test Case 2................................................................................................................84
Table 11 Test Case 3................................................................................................................85
Table 12 Test Case 4................................................................................................................86
Table 13 Test Case 5................................................................................................................87
Table 14 Test Case 6................................................................................................................88
Table 15: Performance testing windows XP............................................................................89
Table 16 : Performance Testing on UBUNTU ........................................................................90
CHAPTER 1
INTRODUCTION
1.1 Project Background
Throughout the history of human civilization, time has played a key role. Humans have
achieved technological advancement and scientific breakthroughs, and unfortunately suffered
setbacks, within certain time goals. In many cases these time goals were set by nature.
In truth, we now live in an advanced era compared to prehistoric ones. We are all actors in
another act of a chronicle play in our time. Due to globalization, distances on this planet are
narrowing. Within shorter time limits people are forced to accomplish objectives and goals,
and most of the time they lack the time needed to make them a success.
When some part of society is asked to accomplish a goal, they may undertake research,
interviews or various other fact-finding techniques. Imagine that they need to find certain
information from lectures and speeches. Can they find the appropriate resource materials in
minimum time and with minimum effort?
They have to go through many search results and commit most of their valuable time to a
fruitless task. If there were a way to find lectures and speeches by searching their content, we
could guarantee a respectable saving of that valuable time, which could then be invested in
deeds for the sake of the planet.
1.2 Problem Description
The problem is to provide users with a search engine that can search lectures and speeches by
their content, for various purposes.
To do this, we have to come up with fair solutions to the challenges met throughout the
process. They are as follows.
Noise analysis: we have to analyze the nature of the speech or lecture. Speeches and lectures
may be recorded under various surrounding environmental conditions, which directly affect
the vocal part of the recording, so we have to reduce the noise as much as possible.
Speech recognition: speech recognition is a vast area. Speeches can be given by many
personalities with different accents; each individual has his or her own accent when speaking
English or any other language. To recognize the words they speak, we have to do deep
research in order to build a speech recognition server that overcomes this challenge.
Speech to text conversion: speech to text conversion is one of the key areas of this project,
because it is the key to building the database that contains the text versions of the speeches
and lectures.
The database: all the converted versions of the speeches and lectures will be saved in the
database.
The search engine: this is another challenging area of the project. The search engine will
show the appropriate search results from the database. I need to find suitable searching
mechanisms and methods in order to give the user efficient and accurate results.
The database and the search engine are two parallel problems that need to be developed
precisely. Without a proper structure for the database, it is tedious to implement the search
functionality.
1.3 Project Overview
The main challenge of this project is to build the database containing the text versions of
speeches and lectures. To accomplish this, we have to perform several tasks.
1.3.1 Noise analysis
A noise analysis will be performed to ensure an efficient speech to text conversion. This
enables us to isolate the human voice and remove the background environment in the audio
file, which may include background noise such as tape hiss, electric fans or hums, etc.
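As a sketch of the kind of noise reduction described here, the fragment below applies a simple amplitude noise gate to a sampled signal: samples whose magnitude stays below a threshold are treated as background noise and attenuated. This is only an illustrative assumption about how such filtering might work, not the project's actual mechanism; the threshold and attenuation values are hypothetical.

```python
# Simple amplitude noise gate: attenuate samples below a threshold.
# Illustrative sketch only; the threshold (0.1) and attenuation (0.05)
# are hypothetical values, not taken from this project.
import math

def noise_gate(samples, threshold=0.1, attenuation=0.05):
    """Return a copy of `samples` with low-level background noise damped."""
    return [s if abs(s) >= threshold else s * attenuation for s in samples]

# A 440 Hz tone mixed with a constant low-level hum.
rate = 8000
tone = [0.8 * math.sin(2 * math.pi * 440 * t / rate) for t in range(rate)]
hum = [0.05] * rate  # steady background noise
noisy = [a + b for a, b in zip(tone, hum)]

cleaned = noise_gate(noisy)
```

A real system would work in the frequency domain (for example the Wiener filtering discussed in Chapter 2), but the gate illustrates the basic goal: keep the voice, suppress the background.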
1.3.2 Speech recognition
Speech recognition comes in two flavors: speaker independent and speaker dependent.
Because the voice of the speaker or lecturer changes from recording to recording, this project
uses speaker independent speech recognition.
1.3.3 Speech to text conversion
The system converts the speech into text format in order to build the database, which consists
of the converted text versions of the speeches and lectures.
1.3.4 The database
The database consists of two parts: the converted (speech to text) speech or lecture files, and
the actual source files containing the audio.
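As a sketch of such a two-part store, the fragment below uses SQLite to keep pointers to the original audio files and their converted transcripts in separate tables, with an index on the transcript text. The table names, column names and sample rows are invented for illustration; they are not the project's actual schema.

```python
# Minimal sketch of the two-part store: one table for the original audio
# files and one for their converted transcripts. Names are hypothetical,
# not taken from the project's actual database design.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE audio_files (
        id   INTEGER PRIMARY KEY,
        path TEXT NOT NULL           -- location of the source audio file
    );
    CREATE TABLE transcripts (
        audio_id INTEGER REFERENCES audio_files(id),
        content  TEXT NOT NULL       -- speech-to-text output
    );
    CREATE INDEX idx_transcript_content ON transcripts(content);
""")

conn.execute("INSERT INTO audio_files VALUES (1, 'speeches/sample_lecture.wav')")
conn.execute("INSERT INTO transcripts VALUES (1, 'i have a dream that one day')")

# Content search: find the audio files whose transcript mentions a phrase.
rows = conn.execute(
    "SELECT a.path FROM audio_files a "
    "JOIN transcripts t ON t.audio_id = a.id "
    "WHERE t.content LIKE ?", ("%dream%",)
).fetchall()
```

Keeping the bulky audio out of the transcript table means the content search only ever scans text.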
1.3.5 The search engine
The search engine searches the database for the content of a speech or lecture and returns the
matching results. We might need to add something like summarizing, so that the user can
search the content more easily by typing a sentence or a word.
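A content-based search of this kind is usually backed by an inverted index that maps each word to the speeches containing it. The sketch below shows the idea in a few lines; the sample transcripts and identifiers are invented for illustration.

```python
# Toy inverted index for content-based search over transcripts.
# The transcripts below are invented examples, not project data.
from collections import defaultdict

transcripts = {
    "speech1": "ask not what your country can do for you",
    "speech2": "i have a dream that one day this nation will rise",
}

# Build the index: word -> set of speech ids containing that word.
index = defaultdict(set)
for speech_id, text in transcripts.items():
    for word in text.lower().split():
        index[word].add(speech_id)

def search(query):
    """Return ids of speeches containing every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = index.get(words[0], set()).copy()
    for word in words[1:]:
        results &= index.get(word, set())  # require all query words
    return results
```

For example, `search("your country")` matches only the first transcript, while an unknown word yields an empty result set.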
1.4 Project Scope
Existing search engines do not facilitate searching for a speech by its content. This system
gives you the facility to search a speech by its content. The system contains data about
English speeches and lectures.
These speeches and lectures must have been recorded in a low-noise environment, because
the system performs only limited noise analysis. The system will not store music, because the
amount of noise analysis required is higher than for a low-noise environment.
The speech recognition engine to be built supports only English speeches and lectures, and
the noise analysis likewise supports only English speeches and lectures.
The system will convert speeches and lectures (low noise) to text format. After the
development process, users will be able to search from anywhere on the planet for a required
result.
Speaker independent speech recognition will be used because the system deals with different
types of speeches performed by different persons with different accents.
1.5 Project Objectives
1.0 Noise analysis and reduction
The system will perform noise filtering, which helps the speech recognition process. The
noisy signal channel will be analyzed and split into two parts, and the amplitude of the noisy
channel set to a low value. An efficient noise filtering mechanism will be used.
2.0 Continuous speech recognition system
To develop an efficient speech recognition engine that converts speeches and lectures to text
format. Speeches performed by various persons will be translated into text.
3.0 The Database
Converted versions of the speeches and lectures will be stored in the database in text format,
and the relevant speech or lecture audio will be stored in another database.
4.0 The search engine
The search engine searches the database for the content of a speech or lecture and returns the
matching results. We might need to add something like summarizing, so that the user can
search the content more easily by typing a sentence or a word.
CHAPTER 2
RESEARCH
2.1 Speech Recognition
The process of converting a phonic signal, captured by a phone, a microphone or any other
audio device, into a set of words is called speech recognition. Speech recognition is used in
command based applications such as data entry control systems, document preparation and
the automation of telephone relay systems, in mobile devices such as mobile phones, and to
help people with hearing disabilities.
According to Professor Todd Austin (2007), speech recognition is the task of translating an
acoustic waveform representing human speech into its corresponding textual representation.
Source: Austin, T. (2007). Speech Recognition. Available:
http://cccp.eecs.umich.edu/research/speech.php. Last accessed 17 July 2009.
Figure 1: Overview of Steps in Speech Recognition
Applications that support speech recognition are “introduced on a weekly basis and speech
technology are rapidly entering new technical domains and new markets” (Java Speech API
Programmers Guide, 1998)
According to Zue et al. (2003), Speech recognition is a process that converts an acoustic
signal which can be captured by a microphone, to a set of words. Speech recognition systems
can be categorized by many parameters.
Parameter          Range
Speaking mode      Isolated words to continuous speech
Speaking style     Read speech to spontaneous speech
Enrolment          Speaker dependent to speaker independent
Vocabulary         Small (<20 words) to large (>20,000 words)
Language model     Finite state to context sensitive
Perplexity         Small (<10) to large (>100)
SNR                High (>30 dB) to low (<10 dB)
Transducer         Voice-cancelling microphone to telephone
Table 1: Typical parameters used to characterize the capability of a speech recognition
system
According to Hosom et al. (2003), "The dominant technology used in Speech Recognition is
called the Hidden Markov Model (HMM)". There are four basic steps in performing speech
recognition, shown in the figure below.
[Source: Hosom et al., 1999]
Figure 2 : Graphical Overview of the Recognition Process
During the past few years speech recognition systems have achieved remarkable success,
with recognition accuracy rates sometimes exceeding 98 percent. However, such accuracy
rates were achieved in quiet environments and using sample words in training. It has been
said that a good speech recognition system must be able to achieve good performance in
many circumstances, such as a noisy environment, and noise comes in many flavors.
Air conditioners, fans, radios, coughs, tape hiss, cross talk, channel distortions, lip smacks,
breath noise, pops and sneezes are the basic factors that create a noisy environment.
A typical speech recognition system is composed of training data, an acoustic model, a
language model, a training model, a lexical model, the speech signal, representation, model
classification, search, and recognized words.
The figure below shows how these components fit together in a speech recognition system.
Figure 3: Components of a typical speech recognition system.
2.2 Speech recognition methods
Only a few speech recognition methods are prevalent. They are categorized as methods for
mobile devices and methods for standalone applications.
2.2.1 Hidden Markov methods in speech recognition
Andrei Markov is the founder of the Markov process. A Markov model involves probabilities over a finite set, usually called its states.
When a state transition occurs it generates a character from the process. The model consists of a finite-state Markov chain and a finite set of output probability distributions.
Hidden Markov constraints for speech recognition systems:
1 – First order Markov chain.
This assumption states that the probability of a transition to a state depends only on the current state:
P(q(t+1) = Sj | q(t) = Si, q(t−1) = Sk, q(t−2) = Sw, ..., q(t−n) = Sz) = P(q(t+1) = Sj | q(t) = Si)
Equation 1: First order Markov chain
2 – Stationary states Transition.
This assumption states that the transition probabilities are independent of time:
aij = P(q(t+1) = Sj | q(t) = Si)
Equation 2: Stationary states Transition
3 – Observations independence.
This assumption states that the observations depend only on the underlying Markov chain. However, this assumption has been deprecated:
P(O(t) | O(t−1), O(t−2), ..., O(t−p), q(t), q(t−1), q(t−2), ..., q(t−p)) = P(O(t) | q(t), q(t−1), q(t−2), ..., q(t−p))
Equation 3: Observations independence
where p represents the considered history of the observation sequence.
bj(O(t)) = P(O(t) | q(t) = j)
Equation 4: observation sequence.
4 – Left-Right topology constraint:
aij = 0 for all j > i + 2 and j < i
π(i) = P(q(1) = Si) = 1 for i = 1, and 0 for 1 < i ≤ N
Equation 5 : Left Right topology constraints
The figure below shows an example of HMM for word “Yes” on an utterance.
Figure 4 : example of HMM for word “Yes” on an utterance
2.2.2 Client side speech recognition
According to Hosom et al. (2003), client-side speech recognition is a technology that allows a computer to identify the words that a person speaks into a microphone or telephone. The basic advantage of client-side speech recognition is a faster response time, because all the processing is handled on the client side. Another advantage is that it does not use any network connection such as GPRS. According to Hagen et al. (2003, p.66) the problems of client-side speech recognition are recognition accuracy and running time (power consumption).
2.2.3 Dynamic Time Warping based speech recognition
This method was used in past decades but has now been deprecated. The algorithm measures similarity between two sequences which may vary in time or speed. A number of templates are used to perform automatic speech recognition in Dynamic Time Warping based speech recognition. The process involves computing a normalized distortion against each template, and the template with the lowest normalized distortion is identified as the spoken word.
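The template-matching idea can be sketched as follows (an illustrative Python sketch, assuming scalar features and toy templates; real recognizers compare vectors of spectral features such as MFCC frames):

```python
def dtw_distance(a, b):
    """Minimum cumulative alignment cost between sequences a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of: insertion, deletion, or diagonal match
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def recognize(utterance, templates):
    """Report the template word with the lowest length-normalized distortion."""
    return min(templates,
               key=lambda w: dtw_distance(utterance, templates[w]) / len(templates[w]))
```

For example, an utterance that is a time-stretched version of a template aligns to it with zero distortion, so that template's word is returned.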
2.2.4 Artificial Neural Networks
The mechanism inside an ANN here is to filter the human speech frequencies from other frequencies, exploiting the fact that non-speech sound covers a higher frequency range than speech.
The table below shows a comparison between different speech recognition mechanisms.
Source: anon. (n.d.). School of Electrical, Computer and Telecommunications Engineering. Available: http://www.elec.uow.edu.au/staff/wysocki/dspcs/papers/004.pdf. Last accessed 23rd August 2009.
Table 2 : Comparison in different techniques in speech recognition
2.2.5 Continuous speech recognition
Continuous speech recognition is used when a speaker pronounces words, sentences or phrases that come in a series or a specific order and are dependent on each other, as if linked together. The system operates on the basis that words are connected to each other and not separated by pauses.
Because a greater variety of effects is present, continuous speech is tedious to handle. Co-articulation is another serious issue in continuous speech recognition: the effect of the surrounding phonemes on a single phoneme is high. The start and end of a word are affected by the following words and also by the speed of the speech.
It is harder to track fast speech. Two algorithms are usually involved in continuous speech recognition: the Viterbi algorithm and the Baum-Welch algorithm.
2.2.6 Direct Speech Recognition
This process recognizes speech word by word, with each word followed by a pause.
2.3 Speaker Characteristics
2.3.1 Speaker Dependent
Speaker dependent speech recognition systems are developed for a single user only. No other user can use the system; it functions with only a single user. These systems must be trained by the user before they function.
One advantage is that these systems support a larger vocabulary than speaker independent systems; the disadvantage is the limitation on the type of user. This technology is used in steno masks.
2.3.2 Speaker Independent
Speaker independent speech recognition systems are harder to implement than speaker dependent ones. The system needs to recognize the patterns and different accents of many users. The advantage of such a system is that it can be used by many users without training.
The most important step in building a speaker independent SRS is to identify which parts of speech are generic and which vary from person to person. Speaker independent speech recognition systems can serve many users even though they are harder to implement.
2.3.3 Conclusion
A speaker independent speech recognition system has been selected for the project because the system has to deal with many speeches given by many speakers.
Speech accent and phoneme patterns differ from speaker to speaker, and it is not possible to perform individual training for each and every speaker.
The Java Speech API only supports speaker independent speech recognition, which is another reason to select it.
2.4 Speech Recognition mechanisms
2.4.1 Isolated word recognition
This identifies a single word at a time, with pauses between words. Isolated word recognition is the primary stage of speech recognition and it is widely used in command-based applications.
Isolated word recognition needs less processing power and involves only primary pattern matching algorithms.
Table 3: Isolated word recognition
2.4.2 Continuous speech recognition
According to Hunt, A. (1997), continuous speech is more difficult to handle because it is difficult to find the start and end points of words, and because of co-articulation: the production of each phoneme is affected by the production of surrounding phonemes.
According to Peinado & Segura (2006, p.9), there are three types of errors in Continuous
speech recognition systems.
Substitutions - the recognized sentence has different words substituted for original words.
Deletions - the recognized sentence is missing words.
Insertions - the recognized sentence has new/extra words.
Error rates in continuous speech recognition are calculated as follows (Stephen et al., 2003, p.2):
Correct Percentage = (H / N) x 100%
Accuracy = ((N − D − S − I) / N) x 100%
Accuracy = ((H − I) / N) x 100%
Word Error Rate = ((S + D + I) / N) x 100%
Equation 6: CSR Equations
where H is the number of words correctly recognized, N is the total number of words in the actual speech, D the number of deletions, S the number of substitutions and I the number of insertions.
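These error counts can be obtained from a standard edit-distance alignment between the reference transcript and the recognizer output. The following is an illustrative Python sketch, not the scoring procedure of any particular toolkit:

```python
def error_counts(reference, hypothesis):
    """Count substitutions, deletions and insertions via a Levenshtein alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)
    # d[i][j] = (total errors, S, D, I) aligning ref[:i] with hyp[:j]
    d = [[(0, 0, 0, 0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = (i, 0, i, 0)          # i deletions
    for j in range(1, m + 1):
        d[0][j] = (j, 0, 0, j)          # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if ref[i - 1] == hyp[j - 1]:
                d[i][j] = d[i - 1][j - 1]      # exact match, no error added
            else:
                es, ss, ds, is_ = d[i - 1][j - 1]   # substitution
                ed, sd, dd, id_ = d[i - 1][j]       # deletion
                ei, si, di, ii = d[i][j - 1]        # insertion
                d[i][j] = min((es + 1, ss + 1, ds, is_),
                              (ed + 1, sd, dd + 1, id_),
                              (ei + 1, si, di, ii + 1))
    _, s, dele, ins = d[n][m]
    return s, dele, ins, n

def wer(reference, hypothesis):
    """Word Error Rate = (S + D + I) / N x 100%."""
    s, d_, i, n = error_counts(reference, hypothesis)
    return 100.0 * (s + d_ + i) / n
```

For instance, if the recognizer drops the word "on" from a six-word reference, there is one deletion and the WER is 100/6 ≈ 16.7%.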
2.4.3 Conclusion
The continuous speech recognition mechanism has been chosen for the project because the system is going to deal with continuous speeches in order to build the database, and the back end of the system serves as a standalone application.
2.5 Vocabulary Size
Vocabulary is the set of words known by a person; the greater the vocabulary, the greater the depth of knowledge. The same rule applies to speech recognition systems.
2.5.1 Limited Vocabulary
Limited vocabulary systems have a limited number of words. This can vary from 100 to 10,000 words. These systems need less processing power and are more suitable for mobile devices.
2.5.2 Large Vocabulary
A large vocabulary speech recognition system is mainly used in servers or standalone applications and involves more processing power. It will identify almost every word spoken by a person. Such a vocabulary has more than 10,000 words.
2.5.3 Conclusion
A large vocabulary has been chosen for the project because the project's main processes are handled by standalone applications that have to deal with many speeches.
2.6 Speech recognition API’s
2.6.1 Microsoft Speech API 5.3
The Microsoft Speech API reduces the coding overhead for programmers. It is equipped with both speech-to-text and text-to-speech capabilities.
This API requires a .NET based build environment and has to be purchased. The scope of the Speech Application Programming Interface, or SAPI, lies within Windows environments. It allows the use of speech recognition and speech synthesis within Windows applications.
Applications that use SAPI include Microsoft Office, Microsoft Agent and Microsoft Speech Server.
In general SAPI defines a set of interfaces and classes to develop dynamic speech recognition systems. SAPI uses two libraries, one for its front end and one for its back end: the “FastFormat” library for the front end and “Pantheios” for the back end. Both are open source C++ libraries.
Figure 5: Overview of Microsoft Speech Recognition API
2.6.2 Java Speech API
The Java Speech API provides both speech recognition and synthesis capabilities and is freely available. JSAPI supports multi-platform development and supports open source and non open source third party tools. The JSAPI package comprises javax.speech, javax.speech.recognition and javax.speech.synthesis.
Sun Microsystems built JSAPI in collaboration with
 Apple Computer, Inc.
 AT&T
 Dragon Systems, Inc.
 IBM Corporation
 Novell, Inc.
 Philips Speech Processing
 Texas Instruments Incorporated
It supports speaker independent speech recognition and W3C standards.
Speech recognizer's capabilities:
 Built-in grammars (device specific)
 Application defined grammars
Speech synthesizer's capabilities:
 Formant synthesis
 Concatenative synthesis
The Java Speech API specifies a cross-platform interface to support command and control recognizers, dictation systems and speech synthesizers. It covers two technologies: speech synthesis and speech recognition. Speech synthesis provides the reverse process, producing synthetic speech from text generated by an application, an applet, or a user.
With the synthesis capabilities, developers can build applications that generate speech from text.
There are two primary steps to produce speech from a text.
Structure analysis: Processes the input text to determine where paragraphs, sentences, and
other structures start and end. For most languages, punctuation and formatting data are used
in this stage.
Text pre-processing: Analyzes the input text for special constructs of the language. In
English, special treatment is required for abbreviations, acronyms, dates, times, numbers,
currency amounts, e-mail addresses, and many other forms. Other languages need special
processing for these forms, and most languages have other specialized requirements.
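The text pre-processing step can be sketched as follows. This is an illustrative Python sketch: the lookup tables here are hypothetical toys, and real synthesizers use much richer rule sets.

```python
# Hypothetical expansion tables for a tiny English text pre-processor.
ABBREVIATIONS = {"Dr.": "doctor", "St.": "street", "e.g.": "for example"}
DIGITS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
          "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def preprocess(text):
    """Expand abbreviations and numbers into speakable words."""
    words = []
    for token in text.split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
        elif token.isdigit():
            # read the digits out one by one
            words.extend(DIGITS[d] for d in token)
        else:
            words.append(token)
    return " ".join(words)
```

With these tables, "Dr. Smith lives at 42 Elm St." becomes "doctor Smith lives at four two Elm street".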
Speech recognition enables the computer to listen to human speech, understand and recognize it, and convert it into text.
There are some steps in order to build a speech recognition system.
 Grammar design: Defines the words that may be spoken by a user and the patterns in
which they may be spoken.
 Signal processing: Analyzes the spectrum characteristics of the incoming audio.
 Phoneme recognition: Compares the spectrum patterns to the patterns of the
phonemes of the language being recognized.
 Word recognition: Compares the sequence of likely phonemes against the words and
patterns of words specified by the active grammars.
 Result generation: Provides the application with information about the words the
recognizer has detected in the incoming audio.
Alongside JSAPI we need two more Java APIs: the Java Sound API and the Java Media Framework. The Java Sound API has the capability of handling sound and is equipped with a rich set of classes and interfaces that deal directly with incoming sound signals. The Java Sound API is widely used in the following areas and industries.
 Communication frameworks, such as conferencing and telephony
 End-user content delivery systems, such as media players and music using streamed
content
 Interactive application programs, such as games and Web sites that use dynamic
content
 Content creation and editing
 Tools, toolkits, and utilities
The Java Sound API uses a hardware-independent architecture. It is designed to allow different sorts of audio components to be installed on a system and accessed by the API.
With the Java Sound API we can process both MIDI (Musical Instrument Digital Interface) and WAV sound formats.
The Java Media Framework is a recently developed framework which can be used to build dynamic multimedia applications.
Figure 6 : Java Sound API Architecture
2.6.2.1 Java Speech and Grammar format
JSGF, or the Java Speech Grammar Format, was created by Sun Microsystems. It defines the set of rules and words for speech recognition. JSGF is a platform-independent specification derived from the Speech Recognition Grammar Specification.
The Java Speech Grammar Format has been developed for use with recognizers that implement the Java Speech API. However, it may also be used by other speech recognizers and in other types of applications.
A typical grammar rule is a composition of what may be spoken, the text to be spoken and references to other grammar rules. A JSGF file comes in a plain text format or in XML format.
Source: anon. (n.d.). JSGF Architecture. Available: http://www.cs.cmu.edu/. Last accessed 24th July 2009.
Figure 7 : JSGF Architecture
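As a concrete illustration, a minimal JSGF grammar might look like the following (the grammar and rule names here are hypothetical examples, not rules used by the project):

```
#JSGF V1.0;
grammar commands;

public <command> = <action> <object>;
<action> = open | close | search;
<object> = file | database | speech;
```

A recognizer loaded with this grammar would accept utterances such as "search speech" or "open file".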
2.7 Speech Recognition Algorithms
The Viterbi algorithm is widely used in speech recognition. It is based on dynamic programming and deals directly with hidden Markov models. The Baum-Welch algorithm is another algorithm used in this process; it involves probability and maximum likelihood estimation. The forward-backward algorithm, which also deals directly with hidden Markov models, is used as well. There are three steps in this algorithm.
 Computing forward probabilities
 Computing backward probabilities
 Computing smoothed values
A combination of the above algorithms (a customized version) will be used in the project.
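The Viterbi algorithm can be sketched as follows for a discrete HMM. This is an illustrative Python sketch with toy state names and probabilities, not a real acoustic model:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely state sequence for obs."""
    # V[t][s] = (best probability of reaching s at time t, best path ending in s)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for o in obs[1:]:
        prev, cur = V[-1], {}
        for s in states:
            # choose the best predecessor state r for s at this time step
            p, path = max(
                (prev[r][0] * trans_p[r][s] * emit_p[s][o], prev[r][1])
                for r in states
            )
            cur[s] = (p, path + [s])
        V.append(cur)
    return max(V[-1].values())
```

Because each step keeps only the best path into each state, the search runs in time proportional to (number of observations) x (number of states)^2 rather than enumerating all state sequences.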
2.8 Noise Filtering
Noise can emerge in a speech through tape hiss, clapping, coughs or other environmental or machinery factors. Noise plays a major role in speech recognition.
Source (anon. (nd). Departement Elektrotechniek. Available:
http://www.esat.kuleuven.be/psi/spraak/theses/08-09-en/clp_lp_mask.png. Last accessed 22
September 2009)
Figure 8: Noise in Speech
According to Khan, E., and Levinson, R (1998) Speech recognition has achieved quite
remarkable progress in the past years.
 Many speech recognition systems are capable of producing very high recognition
accuracies (over 98%).
 But such recognition accuracy only applies for a quiet environment (very low noise)
and for speakers whose sample words were used during training.
Spectral subtraction and Wiener filtering are the two most popular methods available for noise reduction because they are straightforward to implement.
2.8.1 Wiener filtering
Wiener filtering is a common model for filtering noise. The observed signal z(k) is a signal s(k) plus additive noise n(k) that is uncorrelated with the signal: z(k) = s(k) + n(k). If the noise is also stationary, then the power spectra of the signal and noise add: Pz(w) = Ps(w) + Pn(w).
2.8.2 Conclusion
The Wiener filtering method has been chosen for the project because it is a widely accepted method and easy to implement.
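Since Pz(w) = Ps(w) + Pn(w), the per-bin Wiener gain H(w) = Ps(w) / (Ps(w) + Pn(w)) can be sketched as follows. This is an illustrative Python sketch which assumes the noise power spectrum has already been estimated, for example from a speech-free segment:

```python
def wiener_gains(noisy_power, noise_power, floor=0.0):
    """Per-bin Wiener gain H(w) = Ps(w) / Pz(w), with Ps estimated as Pz - Pn."""
    gains = []
    for pz, pn in zip(noisy_power, noise_power):
        ps = max(pz - pn, 0.0)            # estimated clean-speech power
        h = ps / pz if pz > 0 else floor  # guard against empty bins
        gains.append(max(h, floor))
    return gains

def apply_gains(spectrum, gains):
    """Scale each spectral bin by its Wiener gain."""
    return [x * g for x, g in zip(spectrum, gains)]
```

Bins dominated by noise get a gain near zero, while bins dominated by speech pass through nearly unchanged; a small positive floor is often used in practice to avoid the "musical noise" artifacts of hard zeroing.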
2.9 Database and data structure
The database contains the text versions of speeches and their locations. A sample database is maintained on the hard disk and the locations are saved in a file. Database indexing is used for efficient search results.
Database indexing improves the speed of data retrieval. Indexing can be divided into two types: clustered and non-clustered.
Non-clustered indexing does not preserve the order of the actual records. This results in additional input and output operations to get the actual results.
Clustered indexing reorders the data according to the index, as data blocks. It is more efficient for searching purposes.
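The benefit of keeping records in index order can be sketched as follows (an illustrative Python sketch; a real DBMS implements clustered indexes with B-trees, and the record keys below are hypothetical):

```python
import bisect

class ClusteredTable:
    """Records stored physically sorted by key, like a clustered index."""

    def __init__(self, rows):
        # rows: iterable of (key, value) pairs, stored in key order
        self.rows = sorted(rows)
        self.keys = [k for k, _ in self.rows]

    def lookup(self, key):
        """Binary search instead of a full scan, thanks to the sorted layout."""
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.rows[i][1]
        return None
```

A lookup touches O(log n) keys rather than scanning every record, which is why clustered indexing suits the speech search operation.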
2.9.1 Conclusion
Clustered indexing has been chosen for the project because the system involves search operations over speeches.
Figure 9 : Database Indexing
2.10 Search Engine
The search engine acts as the terminal for searching speeches and lectures. It checks for search results in a locally deployed database that contains the text versions of speeches and lectures. A search engine operates in the order of web crawling, indexing and searching.
Source: Brin, S. and Page, L. (n.d.). The Anatomy of a Large-Scale Hypertextual Web Search Engine. Available: http://infolab.stanford.edu/~backrub/google.html. Last accessed 24 March 2009.
Figure 10 : Google Architecture
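The indexing and content-search steps can be sketched with an inverted index, where each word maps to the set of transcripts containing it. This is an illustrative Python sketch with hypothetical document ids and transcripts:

```python
def build_index(docs):
    """docs: {doc_id: transcript text} -> {word: set of doc_ids}."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = index.get(words[0], set()).copy()
    for w in words[1:]:
        result &= index.get(w, set())  # intersect posting sets
    return result
```

A multi-word query intersects the posting sets of its words, so only transcripts containing all query words are returned.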
2.11 MATLAB
MATLAB was developed by MathWorks, a privately held multinational company specializing in technical software.
MATLAB is a multi-platform, fourth-generation programming language. Like many other languages, MATLAB supports the following features.
 Matrix manipulation
 Plotting of functions and data
 Algorithm implementation
 Create Graphical user interfaces
 Interfacing with other programming languages
Most MATLAB code is numerical in nature. Regardless of that, with MATLAB we can build systems in a precise manner, and the number of lines of code required to build a system is relatively small compared with languages such as Java or C#.
Like other object oriented languages, MATLAB supports classes, interfaces and functions. They are used in high-level MATLAB programming.
MATLAB directly supports both analogue and digital signal processing and defines a rich set of features to work with them. Signal transforms and spectral analysis, digital system design, digital filtering, adaptive filtering, and coding and compression algorithms are among the features supported by MATLAB.
CHAPTER 3
ANALYSIS
3.0 System requirements
3.1.1 Functional requirements
1. The application must convert the speech or lecture to a text format.
2. The converted text should be visible to the user.
3. If the speech or lecture contains noise, it must be reduced to a level suitable for the speech recognition process.
4. Speeches with different accents need to be identified reasonably well by the system.
5. The search results must be efficient and reliable.
3.1.2 Non functional requirements
1. The search algorithm needs to be efficient.
2. It should not produce duplicate search results.
3. It should not take excessive time to search.
4. Speech to text conversion must be efficient and accurate.
5. Noise reduction must maintain fair performance.
3.1.3 Software Requirements
Java JDK 1.6: JDK 1.6 is equipped with state of the art technology and includes much functionality. The newest version of the Java Sound API is required.
NetBeans IDE 6.5: an open source IDE equipped with PHP, JavaScript and Ajax editing features, the Java Persistence API, and tighter GlassFish v3 and MySQL integration. It also facilitates architectural drawing of the system and is equipped with powerful J2EE components that are essential to build the search engine. Third party components used for the system can be integrated without much effort, and the IDE offers code generation. Many non open source plugins support this IDE.
Windows XP or equivalent operating system: Windows XP supports both open source and commercial components, so everything essential for the project can be deployed on it. Windows XP is a robust, user friendly operating system compared to other Windows operating systems.
Apache Tomcat server, version 5.5.27 (available at http://tomcat.apache.org/): a freely available, robust, open source server on which web programs can run. It has many third party components that are essential for integrating standalone, mobile and web based applications with each other. This server comes with the NetBeans IDE.
XML database: a popular open source database. It directly supports the Apache Tomcat server and the NetBeans IDE, and its crash rate is low compared to other databases used with web services.
Proper sound driver software is required in order to achieve the best results.
MATLAB is required to perform noise filtering.
3.1.4 Hardware requirements
32-bit Intel dual core processor or greater: during the development phase of the project a large amount of processing power is required for speech recognition, noise analysis, speech-to-text conversion and search. It is advisable to have a high-end machine in order to prevent deadlocks.
 64-bit PCI sound card: a high-end sound card is required to process digital audio signals.
 A minimum of 1 GB DDR3 RAM is required, and 2 GB of virtual memory must be present in the system.
 The default components of a personal computer are required.
 A modem or a router is required in order to test searching between many users.
 A 1 Mb ADSL internet connection or greater is required for data gathering.
 A microphone is needed for future enhancements, so that users can store their own speech and anyone can later search for particular speeches and lectures.
A 20 GB hard disk with a rotation rate of 7200 rpm or more is required because the system is going to maintain the database on the development machine.
Note: at least a dual core processor is required because the speech recognition process needs a large amount of processing power.
3.2 System Development Methodologies
All the methodologies compared here are extended versions of previously common methodologies.
3.2.1 Rational Unified Process
The Rational Unified Process is a development methodology created by the Rational Software division of IBM in 2003. It is an iterative system development process, and it explains in detail how specific goals are achieved.
RUP is a methodology for managing object oriented software development. According to Kroll and Kruchten (2003), “The RUP is a software development approach that is iterative, architecture-centric, and use-case-driven.” RUP has extensible features, as follows.
 Iterative Development
 Requirements Management
 Component-Based Architectural Vision
 Visual Modeling of Systems
 Quality Management
 Change Control Management
The figure below shows basic overviews of its phases.
Source: (anon. (nd). Department of Computer Science. Available:
people.cs.uchicago.edu/~matei/CSPP523/lect4.ppt. Last accessed 24 march 2009.)
Figure 11 Phases in RUP
Advantages of RUP
 It is a well-defined and well-structured software engineering process.
 It supports changing requirements and provides means to manage the change and
related risks
 It promotes higher level of code reuse.
 It reduces integration time and effort, as the development model is iterative.
 It allows systems to start running earlier than with other processes, which is essential for this system.
 Its risk management feature allows risks to be identified before the development process.
 It has the unique “plan a little, design a little, code a little” feature.
 RUP is an idea driven, principle based methodology.
 RUP methodology is a worldwide commercial standard.
Disadvantages of RUP
 For many projects RUP is an insufficient methodology on its own.
 The processes need to be customized for various situations.
 It has poor usability support.
 The process is relatively complex and heavyweight.
3.2.2 Agile Development Method
The Agile development methodology is an iterative process. Agile has short iterations and therefore minimal risk. Agile software development breaks tasks into small increments with minimal planning and does not directly involve long term planning. Agile strongly supports object oriented development.
Above all, Agile has the distinctive practice of Extreme Programming, now widely used in the software development process.
According to Ambler (2005) Agile is an iterative and incremental (evolutionary) approach to
software development which is performed in a highly collaborative manner by self-
organizing teams within an effective governance framework that produces high quality
software in a cost effective and timely manner which meets the changing needs of its
stakeholders.
Figure 12 : Overview of Agile
Advantages in Agile Software Development
 Increased Control
 Rapid Learning
 Early Return on Investment
 Satisfied Stakeholders
 Responsiveness to Change
Disadvantages in Agile Software Development
 Agile involves heavy documentation.
 Agile requirements are often insufficient for projects.
 It is not an organized methodology.
 Because testing is integrated throughout development, the development cost is relatively high.
 Too much user involvement may spoil the project.
3.2.3Scrum Development Methodology
According to Mikneus, S. and Akinde, A. (2003):
 Scrum is an Agile Software Development Process.
 Scrum is not an acronym
 Name taken from the sport of Rugby, where everyone in the team pack acts together
to move the ball down the field
 Analogy to development is the team works together to successfully develop quality
software
According to Jeff Sutherland (2003), “Scrum assumes that the systems development process is an unpredictable, complicated process that can only be roughly described as an overall progression.” “Scrum is an enhancement of the commonly used iterative/incremental object-oriented development cycle.” Scrum principles include:
 Quality work: empowers everyone involved to be feeling good about their job.
 Assume Simplicity: Scrum is a way to detect and cause removal of anything that gets
in the way of development.
 Embracing Change: Team based approach to development where requirements are
rapidly changing.
 Incremental changes: Scrum makes these possible using sprints where a team is able
to deliver a product (iteration) deliverable within 30 days.
Advantages in Scrum
 Scrum has the ability to respond to unforeseen software development risks.
 It is a specialized process for commercial application development.
 It gives developers the facility to deliver a functional application to clients.
Disadvantages in Scrum
 Not suitable for research based software development.
Source: anon. (n.d.). Available: http://www.methodsandtools.com/archive/scrum1.gif. Last accessed 26th March 2009.
3.2.4 Conclusion
The Agile software development methodology has been chosen for the development process because it supports object oriented development, has short iterations and supports Extreme Programming.
Figure 13 : Scrum Overview
3.3 Test Plan
The system's main functionalities are noise analysis, speech recognition, database indexing (which directly affects the search) and the search engine.
The system takes data (speeches and lectures) recorded under various conditions with low noise, but it cannot guarantee the absence of noise. For that reason we perform noise analysis and try to reduce the noise; otherwise it will affect the speech recognition process.
3.3.1 System testing
Speeches and lectures with different accents (English only, USA and British): in order to test the speech recognition engine's accuracy, it will be tested against different accents. The recognized results must differ minimally from the actual speech, with minimum errors.
Content search: when the user searches by content by typing a word or a phrase, the appropriate search results are displayed, i.e. the speeches or lectures containing the specified word or phrase.
CHAPTER 4
SYSTEM DESIGN
4.1 Use Case Diagram
The noise filtering functionality is implemented separately from the speech recognition system. The noise filtering system is represented as an “Actor”.
Figure 14 : Use Case Diagram for System
The figure above shows the use case diagram for the entire system. The system mainly consists of two actors. A user can upload a speech file in WAV format to perform speech recognition.
Noise filtering is handled by a separate system. The user has to upload a noisy speech file and the noise filtering system will produce a file with lowered noise.
4.2 Use case description
4.2.1 Use case description for file upload
Use Case Use Case One
Description User uploads a file
Actors user
Assumptions User uploads a file in .wav format. The user has to upload a file without
noise.
Steps User has to run the system, press open button and have to select a file
Variations A user may upload a file with or without noise.
Non functional
requirements
All the necessary hardware configurations must be met.
Issues None
Table 4 : Use Case description file upload
4.2.2 Use Case description for playing an audio file
Use Case Use Case Two
Description User plays a .wav file
Actors User
Assumptions User can only play a file in wav format
Steps User has to open a file, and then the button play gets enabled. User has to
press the play button.
Variations No variations; only files in WAV format can be played.
Non functional
Requirements
All the necessary hardware configurations must be met.
Issues None
Table 5 Use Case description play audio
4.2.3 Use Case description for search
Use Case Use Case Three
Description User searches for a speech by content
Actors User
Assumptions User can search for a speech by typing a sentence
Steps User has to run the speech search program, type what he/she wants to
search for, and press search
Variations No variations
Non functional All the necessary hardware configurations must be met.
Issues None
Table 6 Use Case description search
4.2.4 Use Case description for noise reduced output
Use Case Use Case Four
Description Noise reduction output produced by the system
Actors Noise filtering system
Assumptions Complete elimination of the noise is unachievable.
User uploads a noisy file in a wav format
Steps
User has to run the noise filtering program in MATLAB
User has to input a file which includes the noise
Variations No variations
Non functional All the necessary hardware configurations must be met.
Issues None
Table 7 Use Case description noise reduction
4.2.5 Use Case description for noise filtering
Use Case Use Case Five
Description The process of filtering noise
Actors Noise filtering system
Assumptions Complete elimination of the noise is unachievable.
User uploads a noisy file in a wav format
The chosen mechanism for noise filtering is the most suitable one
Steps
User has to run the noise filtering program in MATLAB
User has to input a file which includes the noise
Variations No variations
Non functional All the necessary hardware configurations must be met.
Issues None
Table 8 Use Case description noise process
4.3 Activity Diagrams
4.3.1 Activity Diagram for Speech Recognition System
Figure 15 Speech Recognition
4.3.2 Activity Diagram for Noise filtering
Figure 16 Activity Diagram Noise Filtering
4.4 Sequence Diagrams
4.4.1 Select a file
Figure 17 Sequence Diagram Select a file
4.4.2 Play wav file
The system can play a file. Two main control classes are involved in this process. The
WavFileRecognition class acts as a mediator, passing messages between the functionality
of the other classes.
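The mediator role described above can be sketched as follows (an illustrative Python sketch; the project's actual classes are written in Java, and the method names here are invented for the example):

```python
class Player:
    """Plays audio files (stand-in for the real playback class)."""
    def play(self, path):
        return "playing " + path

class Recognizer:
    """Runs speech recognition (stand-in for the real engine class)."""
    def recognize(self, path):
        return "recognized " + path

class WavFileRecognition:
    # Mediator: the GUI talks only to this class, which forwards each
    # request to the appropriate collaborator.
    def __init__(self):
        self.player = Player()
        self.recognizer = Recognizer()

    def handle(self, action, path):
        if action == "play":
            return self.player.play(path)
        if action == "recognize":
            return self.recognizer.recognize(path)
        raise ValueError("unknown action: " + action)
```

The GUI never calls the player or recognizer directly, which keeps the interface code decoupled from the processing classes.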
Figure 18 Sequence Diagram Play File
4.4.3 Speech recognition pre stage
In the speech recognition pre stage, the system is loaded with the configuration file and the input
signal. A recognizer is allocated through the configuration manager.
Figure 19 Sequence Diagram SR Pre Stage
4.4.4 Speech Recognition post stage
In the speech recognition post stage the input digital signal goes through fast Fourier
transformation, segmentation, and identification of dialects and phonemes. The classes
AudioFileDataSource and Recognizer provide the functionality to perform these tasks.
Figure 20 Sequence Diagram SR Post Stage
4.5 Class Diagrams
4.5.1 GUI and the system
The figure below shows the class diagram of the GUI and WavFileRecognizer.
Figure 21 Class Diagrams GUI & System
4.5.2 Speech recognition
Figure 22 Class Diagram SR System
4.5.3 Class Diagram for Speech search
Figure 23: Speech Search Class Diagram
4.6 Noise Filtering
Noise filtering has done using Matlab. Matlab support objects orientation, polymorphism
or inheritance. I have generated a code in C to tally the code in Matlab.
%ver 1.56
function noiseReduction
%----- user data -----
steps_1 = 512;
chunk = 2048;
coef = 0.01*chunk/2;
The three assignments above define the user data used in the MATLAB script. The
term chunk means a small segment of the input signal. The script below can be used
to filter the noise of any given input signal.
%Windowing Techniques
%w1 = .5*(1 - cos(2*pi*(0:chunk-1)'/(chunk))); %hanning
w1 = [.42 - .5*cos(2*pi*(0:chunk-1)/(chunk-1)) + .08*cos(4*pi*(0:chunk-1)/(chunk-1))]';
%Blackman
w2 = w1;
The Blackman window technique is used here to chop the signal into small segments. The
input signal is recursively split into small chunks; chunk is the technical term for a
segment in digital signal processing.
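As an illustration (a Python sketch, separate from the MATLAB script), the same Blackman window can be computed directly from the formula above; it tapers to zero at both ends and peaks near one at the centre:

```python
import math

def blackman(n):
    # w[k] = 0.42 - 0.5*cos(2*pi*k/(n-1)) + 0.08*cos(4*pi*k/(n-1)),
    # the same formula as the w1 line in the MATLAB script.
    return [0.42 - 0.5 * math.cos(2 * math.pi * k / (n - 1))
            + 0.08 * math.cos(4 * math.pi * k / (n - 1))
            for k in range(n)]

w = blackman(2048)   # same length as the chunk size used in the script
```

Multiplying each chunk by this window smooths the segment edges before the FFT, which reduces spectral leakage.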
% input wav file and extract required data
[input, FS, N] = wavread('input.wav');
L = length(input);
The input signal is extracted and rearranged into a matrix. L is the total number of
samples, i.e. the duration of the signal. The matrix mechanics are hidden by MATLAB.
% zero padding for input file
input = [zeros(chunk,1);input;zeros(chunk,1)]/ max(abs(input));
%the appended zeros to the back of the input sound file makes it so that the windowing
samples the complete sound file
%----- initializations -----
output = zeros(length(input),1);
count = 0;
% block by block fft algorithm
Normally a noise signal has higher frequency content. The loop below recursively takes
windowed segments, transforms them with the FFT and passes each spectrum to the denoise function.
while count<(length(input) - chunk)
grain = input(count+1:count+chunk).* w1; % windowing
f = fft(grain); % fft of window data
r = abs(f); % magnitude of window data
phi = angle(f); % phase of window data
ft = denoise(f,r,coef);
This function reduces the spectral amplitude of each chunk. A single chunk's FFT is taken as an
argument by the function.
grain = real(ifft(ft)).*w2; % take inverse fft of window data
output(count+1:count+chunk) = output(count+1:count+chunk) + grain; % append
data to output file
count = count + steps_1; % increment by hop size
end
output = output(1:L) / (4.75*max(abs(output))); %the 4.75*max(abs(output)) maintains
consistency between input and output volume
%soundsc(output, FS);
wavwrite(output, FS, 'output.wav');
As you can see there are no classes or interfaces. Equivalent code for the MATLAB script in the C
programming language is shown in section 4.7.
function ft = denoise(f,r,coef)
if abs(f) >= 0.001
ft = f.*(r./(r+coef));
else
ft = f.*(r./(r+sqrt(coef)));
end
Shown above is the denoise function. It scales each frequency bin of a chunk by the factor
r/(r+coef), where r is the bin's magnitude: strong components pass almost unchanged while
weak, noise-like components are suppressed. When the overall magnitude is very small, the
square root of the coefficient is used instead. Applied chunk by chunk, this pushes the
high-frequency noise clusters down to lower amplitudes.
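The per-bin gain applied by denoise can be sketched as follows (an illustrative Python re-implementation of the scaling factor, not the project code):

```python
def denoise_gain(r, coef):
    # Gain applied to one frequency bin: r/(r+coef), where r is the
    # bin's magnitude. Strong bins keep a gain near 1; weak bins shrink.
    return r / (r + coef)

coef = 0.01 * 2048 / 2                # the coefficient used in the MATLAB script

strong = denoise_gain(1000.0, coef)   # high-magnitude bin: gain stays close to 1
weak = denoise_gain(1.0, coef)        # low-magnitude bin: heavily attenuated
```

Because the gain rises with magnitude, speech energy survives while the low-level noise floor is pushed down.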
4.7 Code to filter noise in C Language
#include <stdio.h>
#include <string.h>
#include "mclmcr.h"
#ifdef __cplusplus
extern "C" {
#endif
extern const unsigned char __MCC_denoise2_public_data[];
extern const char *__MCC_denoise2_name_data;
extern const char *__MCC_denoise2_root_data;
extern const unsigned char __MCC_denoise2_session_data[];
extern const char *__MCC_denoise2_matlabpath_data[];
extern const int __MCC_denoise2_matlabpath_data_count;
extern const char *__MCC_denoise2_mcr_runtime_options[];
extern const int __MCC_denoise2_mcr_runtime_option_count;
extern const char *__MCC_denoise2_mcr_application_options[];
extern const int __MCC_denoise2_mcr_application_option_count;
#ifdef __cplusplus
}
#endif
static HMCRINSTANCE _mcr_inst = NULL;
static int mclDefaultPrintHandler(const char *s)
{
return fwrite(s, sizeof(char), strlen(s), stdout);
}
static int mclDefaultErrorHandler(const char *s)
{
int written = 0, len = 0;
len = strlen(s);
written = fwrite(s, sizeof(char), len, stderr);
if (len > 0 && s[ len-1 ] != '\n')
written += fwrite("\n", sizeof(char), 1, stderr);
return written;
}
bool denoise2InitializeWithHandlers(
mclOutputHandlerFcn error_handler,
mclOutputHandlerFcn print_handler
)
{
if (_mcr_inst != NULL)
return true;
if (!mclmcrInitialize())
return false;
if (!mclInitializeComponentInstance(&_mcr_inst,
__MCC_denoise2_public_data,
__MCC_denoise2_name_data,
__MCC_denoise2_root_data,
__MCC_denoise2_session_data,
__MCC_denoise2_matlabpath_data,
__MCC_denoise2_matlabpath_data_count,
__MCC_denoise2_mcr_runtime_options,
__MCC_denoise2_mcr_runtime_option_count,
true, NoObjectType, ExeTarget, NULL,
error_handler, print_handler))
return false;
return true;
}
bool denoise2Initialize(void)
{
return denoise2InitializeWithHandlers(mclDefaultErrorHandler,
mclDefaultPrintHandler);
}
void denoise2Terminate(void)
{
if (_mcr_inst != NULL)
mclTerminateInstance(&_mcr_inst);
}
int main(int argc, const char **argv)
{
int _retval;
if (!mclInitializeApplication(__MCC_denoise2_mcr_application_options,
__MCC_denoise2_mcr_application_option_count))
return 0;
if (!denoise2Initialize())
return -1;
_retval = mclMain(_mcr_inst, argc, argv, "denoise2", 0);
if (_retval == 0 /* no error */) mclWaitForFiguresToDie(NULL);
denoise2Terminate();
mclTerminateApplication();
return _retval; }
/*
* MATLAB Compiler: 4.0 (R14)
* Date: Sun Oct 04 09:55:11 2009
* Arguments: "-B" "macro_default" "-m" "-W" "main" "-T" "link:exe" "denoise2"
*/
#ifdef __cplusplus
extern "C" {
#endif
const unsigned char __MCC_denoise2_public_data[] = {'3', '0', '8', '1', '9',
'D', '3', '0', '0', 'D',
'0', '6', '0', '9', '2',
'A', '8', '6', '4', '8',
'8', '6', 'F', '7', '0',
'D', '0', '1', '0', '1',
'0', '1', '0', '5', '0',
'0', '0', '3', '8', '1',
'8', 'B', '0', '0', '3',
'0', '8', '1', '8', '7',
'0', '2', '8', '1', '8',
'1', '0', '0', 'C', '4',
'9', 'C', 'A', 'C', '3',
'4', 'E', 'D', '1', '3',
'A', '5', '2', '0', '6',
'5', '8', 'F', '6', 'F',
'8', 'E', '0', '1', '3',
'8', 'C', '4', '3', '1',
'5', 'B', '4', '3', '1',
'5', '2', '7', '7', 'E',
'D', '3', 'F', '7', 'D',
'A', 'E', '5', '3', '0',
'9', '9', 'D', 'B', '0',
'8', 'E', 'E', '5', '8',
'9', 'F', '8', '0', '4',
'D', '4', 'B', '9', '8',
'1', '3', '2', '6', 'A',
'5', '2', 'C', 'C', 'E',
'4', '3', '8', '2', 'E',
'9', 'F', '2', 'B', '4',
'D', '0', '8', '5', 'E',
'B', '9', '5', '0', 'C',
'7', 'A', 'B', '1', '2',
'E', 'D', 'E', '2', 'D',
'4', '1', '2', '9', '7',
'8', '2', '0', 'E', '6',
'3', '7', '7', 'A', '5',
'F', 'E', 'B', '5', '6',
'8', '9', 'D', '4', 'E',
'6', '0', '3', '2', 'F',
'6', '0', 'C', '4', '3',
'0', '7', '4', 'A', '0',
'4', 'C', '2', '6', 'A',
'B', '7', '2', 'F', '5',
'4', 'B', '5', '1', 'B',
'B', '4', '6', '0', '5',
'7', '8', '7', '8', '5',
'B', '1', '9', '9', '0',
'1', '4', '3', '1', '4',
'A', '6', '5', 'F', '0',
'9', '0', 'B', '6', '1',
'F', 'C', '2', '0', '1',
'6', '9', '4', '5', '3',
'B', '5', '8', 'F', 'C',
'8', 'B', 'A', '4', '3',
'E', '6', '7', '7', '6',
'E', 'B', '7', 'E', 'C',
'D', '3', '1', '7', '8',
'B', '5', '6', 'A', 'B',
'0', 'F', 'A', '0', '6',
'D', 'D', '6', '4', '9',
'6', '7', 'C', 'B', '1',
'4', '9', 'E', '5', '0',
'2', '0', '1', '1', '1'
, '0'};
const char *__MCC_denoise2_name_data = "denoise2";
const char *__MCC_denoise2_root_data = "";
const unsigned char __MCC_denoise2_session_data[] = {'7', '7', 'B', 'D', '1',
'6', '2', '3', '5', '5',
'4', '5', '0', 'A', 'B',
'1', '7', '3', '9', '0',
'4', 'D', '4', '6', '7',
'2', 'E', '3', '6', 'B',
'3', '2', '4', '7', '5',
'6', '1', '0', 'F', '3',
'5', '2', '8', 'D', '5',
'3', '8', '2', '3', '4',
'4', 'A', '6', 'B', '6',
'3', '8', 'E', '4', 'E',
'A', '8', '2', 'F', '9',
'4', '1', '8', 'E', '9',
'1', 'C', '1', 'F', '8',
'F', '7', '6', '0', '2',
'D', 'B', '3', 'B', 'F',
'3', '4', '9', 'B', 'C',
'2', '8', 'C', '6', 'A',
'9', '9', '6', '4', '9',
'6', '3', 'C', '6', '8',
'4', '1', '1', '8', '5',
'5', 'E', '2', '3', '5',
'B', '9', '7', '9', '7',
'0', '9', 'B', 'A', 'F',
'7', 'E', 'D', '0', 'C',
'0', '5', 'F', 'E', '2',
'C', '6', '3', '6', '6',
'D', 'F', 'B', '6', '0',
'F', '6', 'B', 'F', 'F',
'2', '9', '4', '4', '2',
'0', '3', 'C', 'C', 'C',
'8', 'E', '3', '7', 'F',
'A', '4', '5', 'A', '9',
'A', '5', 'B', '7', '2',
'0', '0', 'B', 'E', '3',
'F', 'E', '0', 'E', 'B',
'1', 'C', '0', '7', 'D',
'3', '9', 'D', 'F', '0',
'7', '4', '2', 'B', '9',
'E', '3', 'A', '2', 'F',
'3', '3', 'E', '9', '8',
'E', '5', 'C', '9', 'B',
'B', 'D', '3', '6', 'B',
'7', 'D', 'E', '8', '3',
'2', 'B', '9', '7', '5',
'F', '3', '0', '7', '7',
'D', 'F', '8', '1', 'F',
'A', '9', 'B', '4', 'F',
'E', '3', '5', '4', 'F',
'B', '1', '8', 'E', '1',
'D', '0'};
const char *__MCC_denoise2_matlabpath_data[] = {"denoise2/",
"toolbox/compiler/deploy/",
"$TOOLBOXMATLABDIR/general/",
"$TOOLBOXMATLABDIR/ops/",
"$TOOLBOXMATLABDIR/lang/",
"$TOOLBOXMATLABDIR/elmat/",
"$TOOLBOXMATLABDIR/elfun/",
"$TOOLBOXMATLABDIR/specfun/",
"$TOOLBOXMATLABDIR/matfun/",
"$TOOLBOXMATLABDIR/datafun/",
"$TOOLBOXMATLABDIR/polyfun/",
"$TOOLBOXMATLABDIR/funfun/",
"$TOOLBOXMATLABDIR/sparfun/",
"$TOOLBOXMATLABDIR/scribe/",
"$TOOLBOXMATLABDIR/graph2d/",
"$TOOLBOXMATLABDIR/graph3d/",
"$TOOLBOXMATLABDIR/specgraph/",
"$TOOLBOXMATLABDIR/graphics/",
"$TOOLBOXMATLABDIR/uitools/",
"$TOOLBOXMATLABDIR/strfun/",
"$TOOLBOXMATLABDIR/imagesci/",
"$TOOLBOXMATLABDIR/iofun/",
"$TOOLBOXMATLABDIR/audiovideo/",
"$TOOLBOXMATLABDIR/timefun/",
"$TOOLBOXMATLABDIR/datatypes/",
"$TOOLBOXMATLABDIR/verctrl/",
"$TOOLBOXMATLABDIR/codetools/",
"$TOOLBOXMATLABDIR/helptools/",
"$TOOLBOXMATLABDIR/winfun/",
"$TOOLBOXMATLABDIR/demos/",
"toolbox/local/",
"toolbox/compiler/"};
const int __MCC_denoise2_matlabpath_data_count = 32;
const char *__MCC_denoise2_mcr_application_options[] = { "" };
const int __MCC_denoise2_mcr_application_option_count = 0;
const char *__MCC_denoise2_mcr_runtime_options[] = { "" };
const int __MCC_denoise2_mcr_runtime_option_count = 0;
#ifdef __cplusplus
}
#endif
CHAPTER 5
5.0 Implementation
The Agile development process was chosen for the development. The system went through three
iterations. In the first iteration the basic objective was to build a speech recognition engine.
Various methods were tested out, and by the end of the first iteration the speech recognition
engine was built.
Figure 24: SR Engine
The figure below shows the functionality of the speech recognition engine. It can open a .wav file
to play it or to recognize speech.
Figure 25 Open file
Once a file is selected for recognition, the user can press the Start button to begin the
recognition process. The recognized output can be viewed in the text output section.
Figure 26: Text output
The noise filtering process was done in the second iteration, entirely using MATLAB.
It does not have a user interface. In the first build the noise filtering engine was not
very efficient; there were many isolated noise packets in the spectrum. In the second
build the system achieved a remarkable improvement in performance.
We have to input a noisy speech file, and when we run the program it produces a
noise-filtered .wav file.
The search engine was built in the third phase. The user has to run the search engine, which
accesses the local database and returns the search results.
Figure 27 Speech Search Engine
CHAPTER 6
6.0 Test Plan
6.1 Background
The system built for the research project comprises three main parts. The
speech recognition section is the key part of this application. The noise filtering section is
another key area taken into account. There is a text search in the system which provides
the facility to search a speech by content. Because this is a technical project, and
considering its nature, the testing criteria do not look the same as those of
other projects.
6.2 Introduction
As for the test plan, the testing criteria are based on the input speech signals for speech
recognition and noise filtering, and on the searching criteria. Due to the nature of this
project we cannot make use of industrial test plans; the project is not a commercial project.
For the speech recognition testing criteria, a speech in digital format is used. Speech
recognition projects are still at the research stage, so it is not advisable to implement a
standard heavyweight test plan. Basic test plans are sufficient to assess the testing criteria
mentioned in the project.
6.3 Assumptions
Before declaring any assumptions it is advisable to understand the nature of the project.
Within the project scope we assume that the speech recognition engine works only on
noiseless speech inputs. The speech recognition system works on a pure English
accent only. Noisy speech is not used as input because the speech engine cannot directly
identify the noise factor and filter it.
The system can only identify the most commonly spoken words. It is possible to add a large
vocabulary, but the system has not been designed for high-level language identification and
processing.
Noise filtering can be done on the ".wav" format only. The system cannot eliminate the noise
factor completely.
It is not possible to use a noise-filtered file for recognition, because the speech
recognition system works on a noiseless accent only.
6.4 Features to be tested
For the speech recognition system, a noiseless speech input in .wav format is tested to
verify the continuous speech recognition capabilities. Continuous speech recognition
capability is a distinctive feature of modern speech recognition systems.
A noisy speech file is uploaded to the noise filtering system, which produces a noise-filtered [up
to a reasonable level] output file. The efficiency of the noise filtering system could be
measured by the amount of processing time it takes, but that is not addressed in this project.
For the speech searching part the system uses a file search. The search mechanism
includes an efficient file searching and text matching mechanism. Once the user has typed a
phrase, the system shows the name of the file that contains it most.
The system can play a wav file before it is uploaded for the recognition process.
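The file searching and text matching mechanism described above can be sketched as follows (an illustrative Python sketch; the file names and transcripts are invented, and in the real system the recognized text is stored in the database):

```python
# Hypothetical transcripts keyed by file name; in the real system this
# text comes from the speech recognition output stored in the database.
transcripts = {
    "lecture1.wav": "today we discuss digital signal processing",
    "speech2.wav": "freedom and justice for all",
}

def search(phrase):
    # Count occurrences of the phrase in each recognized transcript and
    # return matching file names, most occurrences first.
    needle = phrase.lower()
    hits = [(name, text.count(needle)) for name, text in transcripts.items()]
    return [name for name, count in sorted(hits, key=lambda h: -h[1]) if count > 0]
```

For example, `search("justice")` returns only the file whose recognized content contains that word.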
6.5 Suspension and resumption criteria
While the system testing process is running and defects appear, there can be reasons to
suspend the process. Suspension criteria denote what those reasons are. According to Anon.
(nd), Suspension criteria & resumption requirements,
the suspension criteria are as follows
 Unavailability of external dependent systems during execution.
 When a defect is introduced that cannot allow any further testing.
 Critical path deadline is missed so that the client will not accept delivery even if all
testing is completed.
 A specific holiday shuts down both development and testing
The resumption criteria are as follows
 When the external dependent systems become available again.
 When a fix is successfully implemented and the Testing Team is notified to continue
testing.
 The contract is renegotiated with the client to extend delivery.
 The holiday period ends.
According to Anon. (nd), Suspension criteria & resumption requirements:
Suspension criteria assume that testing cannot go forward and that going backward is also not
possible. A failed build would not suffice, as you could generally continue to use the previous
build. Most major or critical defects would also not constitute suspension criteria, as other
areas of the system could continue to be tested.
6.6 Environmental needs
There are a few environmental needs to be met before testing the system. The environmental
needs can be classified as software needs, hardware needs and legal needs. There are no legal
needs because the system does not have any links with legal situations.
The list of software needs is as follows
 Java runtime environment
 Matlab development software
 NetBeans 6.5 or greater
 Sound driver software
 Windows XP operating system
The hardware needs are
 A computer [hardware requirements were specified in another chapter under system
requirements]
 Multimedia devices
6.7 System testing
Speeches and lectures with different accents [English only: USA and British]: In order to test
the speech recognition engine's accuracy, it is tested against different accents. The expected
results must show a minimal difference, with minimal errors.
Content search: When the user tries to search the content by typing a word or a phrase,
the appropriate search result is displayed. The speech or the lecture containing the
specified word or phrase is displayed.
6.8 Unit testing
The initial test covered the initial user interface. At first glance the system loads only
the basic interactions with the user. The system does not load any calculation or extraction
functionality before the user provides a correct input to the system.
Test Case Test Case One
Description The user runs the Speech recognition System for the first time
Expected Output Open, Start and Open Speech buttons set enabled.
Encode To wav, Noise Filter buttons remain disabled.
The area below open a speech file shows blank.
Text output must show blank.
Actual Output Open, Start and Open Speech buttons set enabled.
Encode To wav, Noise Filter buttons remain disabled.
The area below open a speech file shows blank.
Text output must show blank.
Actual output acquired.
Table 9 Test Case 1
On the initial run the speech recognition system is not loaded with any algorithms. After an
input is given, the system loads the necessary components for processing. This mechanism
conserves system resources.
The second testing criterion begins when the user provides an input to the system. This test
case interacts with the speech recognition system's input. The input can be a .wav file.
Test Case Test Case Two
Description The user opens a file to feed the speech recognition system.
The user provides the system with a .wav file.
The first input speech contains digits in the range of one to nine in a
British accent.
The file must be a noise-free file.
Expected Output The identified names of the digits need to be displayed in the text output area.
Actual Output Due to variations in dialect the actual results are not exactly the same.
Within the range of one to nine the system identifies the digits and
displays the output.
Table 10 Test Case 2
The identification of digits can be extended beyond ten. As the name of the digit to be
identified becomes longer, the system identifies the digit with a higher error rate.
The third testing criterion was based on the user inputting a file with noise for identification.
The system does not work for files containing noise.
Test Case Test Case Three
Description The user provides the system with a .wav file containing noise
Expected Output The system throws an error or shows no results
Actual Output The actual output varies with the noise level. If the density of
the noise lies within a higher range the system produces an error. The
error can be "severe null".
The system returns blank results when the words are barely in an
identifiable state.
Table 11 Test Case 3
The system does not have any functionality to measure noise levels; the project scope
does not cater for in-depth noise analysis. The noise levels mentioned above were judged
by user experience.
The system assumes that users do not upload files with noise, and this rule is clearly
mentioned in the assumptions.
The fourth testing criterion checks the system's speech recognition capabilities with
words.
Test Case Test Case Four
Description The user provides the system with a .wav file containing basic
words.
The input does not contain any noise.
Expected Output The system identifies all the words and shows the output in a
precise manner.
Actual Output The system identifies words with an error rate. The error rate
fluctuates between 20% and 35%.
Not all words are identified by the system.
Table 12 Test Case 4
The system does not identify all the words. The identification process depends on the
utterance rate and the intensity of the phonemes. Higher phoneme intensities help any
speech recognition system achieve more precise results.
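An error rate of this kind is usually quantified as the word error rate (WER). The project does not compute this automatically; a minimal sketch of the measure (illustrative, not part of the system) is:

```python
def word_error_rate(reference, hypothesis):
    # Word-level Levenshtein distance (substitutions + insertions +
    # deletions) divided by the number of reference words.
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)
```

Comparing the recognizer's text output against a hand-made transcript with this function would give a repeatable figure for the 20% to 35% range quoted above.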
Test case five tests the performance of noise filtering. The noise filtering system was built in
MATLAB.
Test Case Test Case Five
Description The user provides the noise filtering system with a noisy file.
The input file must be in .wav format.
The user has to open the MATLAB scripts, import them to the working
directory and run them.
The input file needs to be in the same directory.
Expected Output An output file should be created in the working folder with the name
"output.wav".
The "output.wav" file contains the noise-filtered version of the input file.
The amplitude of the output file should not differ in a way a human can
identify.
Actual Output The output file is created in the working folder.
The output file has lowered noise relative to the input file.
The output file is not noise free.
The amplitude has a difference which the human ear can identify.
Table 13 Test Case 5
There is still no mechanism to remove the noise 100%. The system works on
predefined algorithms.
Test case six tests the criterion for the search functionality. The search functionality acts as a
speech search engine.
Test Case Test Case Six
Description The user has to run the search engine.
Port 8080 must be free.
Expected Output The user types a phrase into the search engine and presses the Search
button.
If there is a match in the database the result shows true.
If there is no match the result shows false.
Actual Output If a match was found, "true" is displayed in the results.
If no match was found, "false" is displayed in the results.
Table 14 Test Case 6
The system was not built as an actual search engine; it only demonstrates how a speech
search engine works. As a future enhancement it is possible to build an actual search
engine.
6.9 Performance Testing
The system's performance was tested on different operating systems, including virtual
operating environments.
Microsoft Windows XP was taken as the baseline operating system for the
measurements.
Operating System: Microsoft Windows XP
Speech recognition engine configuring time: between 0.5 seconds and 1 second
Efficiency of speech recognition: an input signal with greater phoneme intensity, free
from noise, a duration of less than 10 seconds and a low word density takes around
1 to 12 seconds. Input signals containing many words take longer.
Efficiency of noise filtering and MATLAB: the noise filtering system generates the
output in less than 200 milliseconds for .wav clips with a duration between 2 and 10
seconds.
Performance of the speech search engine: startup time for the speech search averages
8 to 15 seconds.
Table 15: Performance testing Windows XP
The performance of the speech search engine depends heavily on the operating system. For
example, Windows operating systems use many more resources than UNIX-based
operating systems.
The speech search system runs on the Glassfish server, which performs better on
UNIX-based operating systems. In Windows environments the speech search
engine suffers many deadlocks.
Operating System: Ubuntu 9.04
Speech recognition engine configuring time: between 0.2 seconds and 0.8 seconds
Efficiency of speech recognition: an input signal with greater phoneme intensity, free
from noise, a duration of less than 5 seconds and a low word density takes around
1 to 5 seconds. The system performs noticeably better in Ubuntu environments.
Efficiency of noise filtering and MATLAB: noise filtering was efficient compared with
the Windows environment.
Performance of the speech search engine: the search and the startup time of the search
engine were efficient compared with Windows XP.
Table 16: Performance Testing on Ubuntu
Once the search engine has run many times in a Windows environment it has a higher potential
of crashing and may not provide correct results.
When the system performs speech recognition several times, the efficiency of the
recognition slows down.
Java runs on a virtual machine and the recognition process needs higher processing
power. Due to those factors the efficiency of the system degrades with repeated use.
6.10 Integration Testing
Integration testing is a logical extension of unit testing. Integration testing identifies
drawbacks that appear when the parts of a system are combined. The system comprises
different sub-systems with different functionalities.
It is not possible to combine the noise filtering system with the speech recognition system
or the search engine.
An overall test mechanism was used for integration testing, because the system
comprises sub-systems which have only an indirect connection with each other.
Big Bang testing
Big Bang testing is the process of taking all the unit-tested parts of a system and tying
them together. This approach is mostly suitable for small systems and may reveal many
unidentified errors at the testing stage. If the developer has done unit testing correctly, Big Bang
testing helps to uncover more errors and saves money and time.
After performing Big Bang testing on the system, the following faults were discovered.
 The continuous functionality of the search engine cannot be guaranteed.
 If the input signal is long, there is a system out-of-memory error.
The disadvantages of Big Bang testing are
 Integration testing cannot start until all the modules have been successfully
evaluated.
 It is harder to track down the causes of errors.
Incremental Testing
Incremental testing allows you to compare and contrast two functionalities while you are
testing. You can add and test other modules within the testing time.
Incremental testing cannot be performed on this system because there are no parallel
functionalities within the system which interact with each other.
CHAPTER 7
CRITICAL EVALUATION AND FUTURE ENHANCEMENTS
7.1Critical evaluation
The entire project was about speech recognition, taking a digital signal as input, and search by
content. The project is a union of several other research areas. At the initial stage the research
was focused on speech recognition.
The barriers met in the initial stage
 Human speech recognition
At the beginning there was no clear way to explain the speech recognition process,
the mechanisms behind it and how it is performed.
 Speech recognition engine
The study of a speech recognition engine was a crucial part of the design phase.
There was no speech recognition engine available to analyze or to study.
In order to overcome those two factors, an understanding of the functionality of speech
recognition was essential first of all. After understanding the system and completing a basic
sketch of the flow diagram, there was a sufficient starting point for the development.
It is easy to record a human voice spoken over a microphone. After recording, the
human voice is no longer in analogue format. The obvious digital format was a .wav file, so
the system performs speech recognition on files in .wav format.
In the development phase the study of the speech recognition system did not help much with
further progress, because when it comes to audio formats the digital signal processing part is
hidden. The system has to address the DSP part in a reasonable manner. Digital signal
processing is concerned with the representation of signals by sequences of
numbers or symbols and the processing of those signals.
Within the course content we studied there was not a single module that taught us about
interface programming, microcontroller programming or digital signal processing.
Building the functionality to handle the digital signal processing part from scratch was a
tedious job. The knowledge we had for building such functionality was not sufficient
given the time available.
At the initial stage the plan was to develop the entire system in Java, but Java did not have a
proper built-in API or facilities to handle digital signal processing. However, there were some
reliable third-party components that just about managed to perform the task.
Plugging in the third-party tools was another issue, but finally code was found to
accomplish the task.
A few speech recognition systems had been built using Java, but they were not built for
continuous speech recognition or for noise reduction.
There were many issues in the first place. We had to define a grammar format, and there were
two options. One was to go with JVXML. Java VoiceXML is a technology which provides
speech synthesis and recognition capabilities; we can embed voice commands in
web sites using VoiceXML.
For the project I chose JSGF, the Java Speech Grammar Format. JSGF
supports built-in dictionaries capable of supporting digits and words, and
multi-language capabilities can be plugged in.
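For illustration, a minimal JSGF grammar covering the digit range used in testing might look like this (an illustrative sketch, not the project's actual grammar file):

```jsgf
#JSGF V1.0;

grammar digits;

// A public rule that matches a single spoken digit name.
public <digit> = one | two | three | four | five | six | seven | eight | nine;
```

The recognizer is pointed at such a grammar through its configuration, which constrains the search space to the listed words.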
When developing systems using Java it is always advisable to use components that
easily support the Java platform's capabilities.
The speech recognition system can be split into two subsystems. The Java virtual machine
allocates a maximum of 128 MB of memory for NetBeans, and the heap size of the
virtual machine cannot be set explicitly when running inside NetBeans.
The digit recognition part could be implemented and run within the NetBeans development
environment. However, it was not possible to free enough memory to recognize speech
containing words, which requires a larger heap from the virtual machine. Because of this,
the word recognition had to be run explicitly from the command prompt with
"java -mx256m -jar", which allocates a 256 MB heap for the speech recognition.
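The invocation above can be sketched as follows; `SpeechSearch.jar` is a placeholder name for the word-recognition jar, and `-Xmx` is the documented spelling of the legacy `-mx` flag:

```shell
# Start the word-recognition component with a 256 MB maximum heap.
# -mx256m (legacy) and -Xmx256m (documented) request the same limit.
java -Xmx256m -jar SpeechSearch.jar
```

Running the jar this way sidesteps the IDE's fixed 128 MB allocation, since the heap limit is set directly on the JVM that hosts the recognizer.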
Noise filtering was another unsolved issue that the system had to answer. There was no
proper support for noise filtering in Java; for a technical project of this kind, it is
essential to develop in languages such as C or assembly.
Speech Recognition , Noise Filtering and  Content Search Engine , Research Document

Speech Recognition , Noise Filtering and Content Search Engine , Research Document

  • 1. Ultimate Speech Search Page i ABSTRACT In the modern era people tends to find information where ever they can in a more efficient way. They search for the knowledge from past events so does the present events. Searching for a particular thing evolves a search engine and the necessary information. When they want to learn out of speeches or lectures done by any one they are going for a desperate search without knowing the actual results. If they have a luxury of a search engine that would give the required results that would be a blessing for their work. This project totally aims for build a search engine that will able to search for speeches and lectures by their content. Every search engine supports the feature of searching, but the results may be a jargon. The user has to go one by one and sometimes at the end of the day they will end up will a null result. The main goal of this project is to provide a search facility by the content. This research covers converting a speech in to text with a bit of noise analysis, maintaining a database with clustered indexing and a simple search facility by the content. The system that would build operates on a limited data such as speeches and lectures in a low noisy environment and as for the future enhancement it would be able to search for music or any other sound stream by the analysis of the spectrum with user friendly search facility. KEY WORDS Search Engine, Speeches, Lectures, Noise Analysis, Content, Spectrum
  • 2. Ultimate Speech Search Page ii ACKNOWLEDGEMENTS My sincere gratitude goes to my grandfather who taught me the ways of life and who raised me up from my childhood to a teenager and left me in a May. I would like to thank to my friends those who help me in my difficult times and praised me in my good times. I would like to thank to my college teachers who beat me from canes to make me a good man and gave me the knowledge to face the society. I would like to thank for my sister who always be a mother to me and I would like to show my gratitude for my supervisor Mrs. Nadeera Ahangama who guide to throughout the project. Finally I would like to thank to the APIIT staffs who provide us with necessary facilities to achieve our higher education and make it a success.
  • 3. Ultimate Speech Search Page iii Table of Contents ABSTRACT................................................................................................................................i ACKNOWLEDGEMENTS.......................................................................................................ii List of Figures..........................................................................................................................vii List of Equations.................................................................................................................... viii List of Tables ............................................................................................................................ix INTRODUCTION .....................................................................................................................1 1.1 Project Background.....................................................................................................1 1.2 Problem Description....................................................................................................2 1.3 Project Overview.........................................................................................................4 1.3.1 Noise analysis ......................................................................................................4 1.3.2 Speech recognition...............................................................................................4 1.3.3 Speech to text conversion ....................................................................................4 1.3.4 The database.........................................................................................................5 1.3.5 The search engine ......................................................................................................5 1.4 Project Scope...............................................................................................................6 1.5 
Project Objectives .......................................................................................................7 RESEARCH...............................................................................................................................8 2.1 Speech Recognition..........................................................................................................8 2.2 Speech recognition methods...........................................................................................13 2.2.1 Hidden Markov methods in speech recognition......................................................13 2.2.2 Client side speech recognition.................................................................................16 2.2.5 Continuous speech recognition................................................................................18 2.2.6 Direct Speech Recognition ......................................................................................18 2.3 Speaker Characteristics ..................................................................................................19 2.3.1 Speaker Dependent..................................................................................................19 2.3.2 Speaker Independent................................................................................................19 2.3.3 Conclusion...................................................................................................................20 2.4 Speech Recognition mechanisms...................................................................................21 2.4.1 Isolated word recognition........................................................................................21
  • 4. Ultimate Speech Search Page iv 2.4.2 Continuous speech recognition................................................................................22 2.4.3 Conclusion...............................................................................................................23 2.5 Vocabulary Size .............................................................................................................24 2.5.1 Limited Vocabulary.................................................................................................24 2.5.2 Large Vocabulary ....................................................................................................24 2.5.3 Conclusion...............................................................................................................24 2.6 Speech recognition API‟s...............................................................................................25 2.6.1 Microsoft Speech API 5.3 .......................................................................................25 2.6.2 Java Speech API......................................................................................................26 2.7 Speech Recognition Algorithms ....................................................................................31 1. 
8 Noise Filtering ...........................................................................................................32 1.8.1 Weiner filtering..................................................................................................33 1.8.2 Conclusion .........................................................................................................33 2.9 Database and data structure............................................................................................34 2.9.1 Conclusion...............................................................................................................34 2.10 Search Engine...............................................................................................................35 2.11 MATLAB.....................................................................................................................36 ANALYSIS..............................................................................................................................37 3.0 System requirements .................................................................................................37 3.11 Functional requirements ........................................................................................37 3.1.2 Non functional requirements ...................................................................................37 3.1.3 Software Requirements............................................................................................38 3.1.4 Hardware requirements............................................................................................39 3.2 System Development Methodologies.............................................................................40 3.2.1 Rational Unified Process .........................................................................................40 3.2.2 Agile Development Method ....................................................................................43 
3.2.3Scrum Development Methodology...........................................................................45 3.3 Test Plan.........................................................................................................................47 3.3.1System testing...........................................................................................................47 SYSTEM DESIGN..................................................................................................................48 4.1 Use Case Diagram.....................................................................................................48
  • 5. Ultimate Speech Search Page v 4.2 Use case description.......................................................................................................50 4.2.1 Use case description for file upload ........................................................................50 4.2.2Use Case description for play an audio file..............................................................51 4.2.3 Use Case description for search...............................................................................52 4.2.4 Use Case description for noise reduced output .......................................................53 4.2.5 Use Case description for noise filtering ..................................................................54 4.3 Activity Diagrams ..........................................................................................................55 4.3.1Activity Diagram for Speech Recognition System...................................................55 4.3.2 Activity Diagram for Noise filtering .......................................................................56 4.4 Sequence Diagrams........................................................................................................57 4.4.1 Select a file ..............................................................................................................57 4.4.2 Play wav file ............................................................................................................58 4.4.3Speech recognition pre stage ....................................................................................59 4.4.4Speech Recognition post stage .................................................................................60 4.5 Class Diagrams...............................................................................................................61 4.5.1 GUI and the system .................................................................................................61 4.5.2 Speech 
recognition ..................................................................................................62 4.6 Noise Filtering................................................................................................................64 4.7 Code to filter noise in C Language.................................................................................67 CHAPTER 5 ............................................................................................................................73 5.0 Implementation ..................................................................................................................73 CHAPTER 6 ............................................................................................................................78 6.0 Test Plan.............................................................................................................................78 6.1 Background ....................................................................................................................78 6.2 Introduction....................................................................................................................78 6.3 Assumptions...................................................................................................................79 6.4 Features to be tested .......................................................................................................79 6.5 Suspension and resumption criteria ...............................................................................80 6.6 Environmental needs......................................................................................................81 6.7 System testing ................................................................................................................82 6.8 Unit testing.....................................................................................................................83
  • 6. Ultimate Speech Search Page vi 6.9 Performance Testing ......................................................................................................89 6.10 Integration Testing .......................................................................................................92 CHAPTER 7 ............................................................................................................................94 CRITICAL EVALUATION AND FUTURE ENHANCEMENTS ........................................94 7.1Critical evaluation ...........................................................................................................94 7.2 Suggestions for future enhancements.............................................................................99 8.0 Conclusion ..................................................................................................................101 REFERENCES ......................................................................................................................102 BIBLIOGRAPHY..................................................................................................................106 APPENDIX A........................................................................................................................107 APPENDIX B........................................................................................................................114 Gantt chart..........................................................................................................................114
  • 7. Ultimate Speech Search Page vii List of Figures Figure 1: Overview of Steps in Speech Recognition.................................................................8 Figure 2 : Graphical Overview of the Recognition Process ....................................................10 Figure 3: Components of a typical speech recognition system................................................12 Figure 4 : example of HMM for word “Yes” on an utterance.................................................15 Figure 5: Overview of Microsoft Speech Recognition API ...................................................25 Figure 6 : Java Sound API Architecture ..................................................................................29 Figure 7 : JSGF Architecture...................................................................................................30 Figure 8: Noise in Speech........................................................................................................32 Figure 9 : Database Indexing...................................................................................................34 Figure 10 : Google Architecture ..............................................................................................35 Figure 11 Phases in RUP .........................................................................................................41 Figure 12 : Overview of Agile.................................................................................................43 Figure 13 : Scrum Overview....................................................................................................46 Figure 15 : Use Case Diagram for System...............................................................................48 Figure 16 Speech Recognition.................................................................................................55 Figure 17 Activity Diagram Noise 
Filtering...........................................................................56 Figure 18 Sequence Diagram Select a file...............................................................................57 Figure 19 Sequence Diagram Play File ...................................................................................58 Figure 20 Sequence Diagram SR Pre Stage............................................................................59 Figure 21 Sequence Diagram SR Post Stage..........................................................................60 Figure 22 Class Diagrams GUI & System...............................................................................61 Figure 23 Class Diagram SR System.......................................................................................62 Figure 24 : Speech Search Class Diagram...............................................................................63 Figure 25: SR Engine...............................................................................................................73 Figure 26 Open file..................................................................................................................74 Figure 27: Text output .............................................................................................................75 Figure 28 Speech Search Engine .............................................................................................77
  • 8. Ultimate Speech Search Page viii List of Equations Equation 1 : First order Markov chain.....................................................................................13 Equation 2: Stationary states Transition ..................................................................................14 Equation 3: Observations independence..................................................................................14 Equation 4: observation sequence............................................................................................14 Equation 5 : Left Right topology constraints...........................................................................15 Equation 6: CSR Equations .....................................................................................................22
List of Tables
Table 1: Typical parameters used to characterize the capability of a speech recognition system .......... 9
Table 2: Comparison of different techniques in speech recognition .......... 17
Table 3: Isolated word recognition .......... 21
Table 4: Use case description: file upload .......... 50
Table 5: Use case description: play audio .......... 51
Table 6: Use case description: search .......... 52
Table 7: Use case description: noise reduction .......... 53
Table 8: Use case description: noise process .......... 54
Table 9: Test case 1 .......... 83
Table 10: Test case 2 .......... 84
Table 11: Test case 3 .......... 85
Table 12: Test case 4 .......... 86
Table 13: Test case 5 .......... 87
Table 14: Test case 6 .......... 88
Table 15: Performance testing on Windows XP .......... 89
Table 16: Performance testing on Ubuntu .......... 90
CHAPTER 1 INTRODUCTION

1.1 Project Background

Throughout the history of human civilization, time has played a key role. Humans achieved technological advancement and scientific breakthroughs, and unfortunately suffered setbacks, within certain time goals; in many cases these time goals were set by nature. Truthfully, we now live in an advanced era compared to prehistoric times, and we are all actors in another part of a chronicle play in our own time. Due to globalization, distances on this planet are narrowing. Within ever shorter time limits, people are forced to accomplish objectives and goals, and most of the time they lack the time needed to make them a success. When part of society is asked to accomplish a goal, they may turn to research, interviews or various other fact-finding techniques. Imagine that they need to find certain information in lectures and speeches. Can they find the appropriate resource materials in minimum time and with minimum effort? They have to go through many search results and commit most of their valuable time to a fruitless task. If there were a way to find lectures and speeches by searching their content, we could guarantee a respectable saving of that valuable time, which could then be invested in better deeds for the sake of the planet.
1.2 Problem Description

The problem is to provide users with a search engine for searching lectures and speeches by their content, for various purposes. To do this we have to come up with fair solutions to the challenges met throughout the process, which are as follows.

Noise analysis: we have to analyze the nature of the speech or lecture. Speeches and lectures may be recorded under various environmental conditions, which directly affect the vocal part of the speech, so we have to reduce the noise as much as possible.

Speech recognition: speech recognition is a vast area. Speeches may be given by many personalities with different accents; each individual has his or her own accent when speaking English or any other language. To recognize the words spoken, we have to do deep research in order to build a speech recognition server that overcomes this challenge.

Speech-to-text conversion: speech-to-text conversion is one of the key areas of this project, because it is the key to building the database that contains the text versions of speeches and lectures.

The database: all converted versions of the speeches and lectures will be saved in the database.

The search engine: this is another challenging area of the project. The search engine will show the appropriate search results from the database. I need to find suitable searching
mechanisms and methods that give the user efficient and accurate results. The database and the search engine are two parallel problems that need to be developed precisely; without a proper structure for the database it is tedious to implement the search functionality.
1.3 Project Overview

The main challenge of this project is to build the database containing the text versions of speeches and lectures. To accomplish this, several tasks must be performed.

1.3.1 Noise analysis
A noise analysis will be performed to ensure efficient speech-to-text conversion. This enables us to isolate the human voice and remove the background environment in the audio file, including background noise such as tape hiss, electric fans or hums, etc.

1.3.2 Speech recognition
Speech recognition comes in two flavors: speaker independent and speaker dependent. Because the voice of the speaker or lecturer changes from recording to recording, the project uses speaker-independent speech recognition.

1.3.3 Speech-to-text conversion
The system converts the speech into text format in order to build the database, which consists of the converted text versions of the speeches and lectures.
1.3.4 The database
The database consists of two parts: the converted (speech-to-text) speech or lecture files, and the actual source files containing the audio.

1.3.5 The search engine
The search engine searches the content of the speeches and lectures in the database and returns the actual results. We might need to do something like summarizing, so that the user can search the content more easily by typing a sentence or a word.
1.4 Project Scope

Existing search engines do not facilitate searching for a speech by its content; this system provides that facility. The system contains data about English speeches and lectures. These speeches and lectures were recorded in a low-noise environment, because the system performs only limited noise analysis. The system will not store music, because the amount of noise analysis required is higher than for a low-noise environment. The speech recognition engine to be built supports only English speeches and lectures, and the noise analysis likewise supports only English speeches and lectures. The system will convert speeches and lectures (low noise) to text format. After the development process, users will be able to search from anywhere on the planet for a required result. Speaker-independent speech recognition will be used because the system deals with different types of speeches performed by different persons with different accents.
1.5 Project Objectives

1.0 Noise analysis and reduction
The system will perform noise filtering, which helps the speech recognition process. The noisy signal channel will be analyzed and split into two parts, and the amplitude of the noisy channel will be set to a low value. An efficient noise filtering mechanism will be used.

2.0 Continuous speech recognition system
To develop an efficient speech recognition engine that converts speeches and lectures to text format. Speeches performed by various persons will be translated into text.

3.0 The database
Converted versions of the speeches and lectures will be stored in the database in text format, and the corresponding speech or lecture audio will be stored in another database.

4.0 The search engine
The search engine searches the content of the speeches and lectures in the database and returns the actual results. We might need to do something like summarizing, so that the user can search the content more easily by typing a sentence or a word.
CHAPTER 2 RESEARCH

2.1 Speech Recognition

The process of converting an acoustic signal, captured by a phone, a microphone or any other audio device, into a set of words is called speech recognition. Speech recognition is used in command-based applications such as data entry control systems, document preparation and the automation of telephone relay systems, in mobile devices such as mobile phones, and to help people with hearing disabilities. According to Professor Todd Austin (2007), speech recognition is the task of translating an acoustic waveform representing human speech into its corresponding textual representation. Source: Austin, T. (2007). Speech Recognition. Available: http://cccp.eecs.umich.edu/research/speech.php. Last accessed 17 July 2009.

Figure 1: Overview of Steps in Speech Recognition
Applications that support speech recognition are "introduced on a weekly basis and speech technology [is] rapidly entering new technical domains and new markets" (Java Speech API Programmer's Guide, 1998). According to Zue et al. (2003), speech recognition is a process that converts an acoustic signal, which can be captured by a microphone, into a set of words. Speech recognition systems can be categorized by many parameters:

Parameter: Range
Speaking mode: isolated words to continuous speech
Speaking style: read speech to spontaneous speech
Enrollment: speaker dependent to speaker independent
Vocabulary: small (<20 words) to large (>20,000 words)
Language model: finite state to context sensitive
Perplexity: small (<10) to large (>100)
SNR: high (>30 dB) to low (<10 dB)
Transducer: voice-cancelling microphone to telephone

Table 1: Typical parameters used to characterize the capability of a speech recognition system
According to Hosom et al. (2003), "The dominant technology used in Speech Recognition is called the Hidden Markov Model (HMM)". There are four basic steps in performing speech recognition; they can be seen in the figure below. [Source: Hosom et al., 1999]

Figure 2: Graphical Overview of the Recognition Process
During the past few years speech recognition systems have achieved remarkable success, with recognition accuracy rates sometimes exceeding 98 percent. However, such accuracy rates were achieved in quiet environments and using sample words seen in training. It has been said that a good speech recognition system must be able to achieve good performance in many circumstances, such as a noisy environment. Noise comes in many flavors: air conditioners, fans, radios, coughs, tape hiss, cross talk, channel distortion, lip smacks, breath noise, pops and sneezes are the basic factors that make up a noisy environment. The typical components of a speech recognition system are training data, an acoustic model, a language model, a training model, a lexical model, the speech signal, its representation, model classification, search, and the recognized words.
The figure below shows the geometry of these components in a speech recognition system.

Figure 3: Components of a typical speech recognition system.
2.2 Speech Recognition Methods

Only a few speech recognition methods prevail. They can be categorized as methods for mobile devices and methods for standalone applications.

2.2.1 Hidden Markov methods in speech recognition

Andrei Markov is the founder of the Markov process. A Markov model involves probabilities over a finite set, usually called its states. When a state transition occurs, it generates a character from the process. The model consists of a finite-state Markov chain and a finite set of output probability distributions.

Hidden Markov constraints for speech recognition systems:

1 – First-order Markov chain. This follows from the assumption that the probability of a transition to a state depends only on the current state:

$$P(q_{t+1} = S_j \mid q_t = S_i, q_{t-1} = S_k, q_{t-2} = S_w, \ldots, q_{t-n} = S_z) = P(q_{t+1} = S_j \mid q_t = S_i)$$

Equation 1: First-order Markov chain
2 – Stationary state transitions. This assumption holds that the state transition probabilities are independent of time:

$$a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i)$$

Equation 2: Stationary state transitions

3 – Observation independence. This assumption holds that the observations depend only on the underlying Markov chain; however, this assumption has since been relaxed:

$$P(O_t \mid O_{t-1}, O_{t-2}, \ldots, O_{t-p}, q_t, q_{t-1}, q_{t-2}, \ldots, q_{t-p}) = P(O_t \mid q_t, q_{t-1}, q_{t-2}, \ldots, q_{t-p})$$

Equation 3: Observation independence

where p represents the considered history of the observation sequence.

$$b_j(O_t) = P(O_t \mid q_t = j)$$

Equation 4: Observation sequence
4 – Left-right topology constraint:

$$a_{ij} = 0 \quad \text{for all } j > i + 2 \text{ and } j < i$$

$$\pi_i = P(q_1 = S_i) = \begin{cases} 1 & \text{for } i = 1 \\ 0 & \text{for } 1 < i \le N \end{cases}$$

Equation 5: Left-right topology constraints

The figure below shows an example of an HMM for the word "Yes" in an utterance.

Figure 4: Example of an HMM for the word "Yes" in an utterance
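The constraints above can be made concrete with a small numerical sketch. The following Python snippet (all probabilities and the two-state left-right topology are invented for illustration; this is not the project's model) evaluates the likelihood of an observation sequence with the standard forward algorithm:

```python
# Minimal forward-algorithm sketch for a 2-state left-right HMM.
# All probabilities below are invented for illustration only.

A = [[0.6, 0.4],   # a_ij: state transition probabilities (left-right: no j < i)
     [0.0, 1.0]]
B = [[0.7, 0.3],   # b_j(o): probability of emitting observation o in state j
     [0.2, 0.8]]
pi = [1.0, 0.0]    # left-right constraint: the model starts in state 0

def forward(obs):
    """Return P(obs | model) by summing over all state paths."""
    alpha = [pi[j] * B[j][obs[0]] for j in range(2)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(2)) * B[j][o]
                 for j in range(2)]
    return sum(alpha)

print(forward([0, 0, 1]))  # P(obs | model)
```

Note how the first-order and stationarity constraints appear directly in the code: the update for each state j consults only the previous alpha values and a time-independent matrix A.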
2.2.2 Client-side speech recognition

According to Hosom et al. (2003), client-side speech recognition is technology that allows a computer to identify the words that a person speaks into a microphone or telephone. The basic advantage of client-side speech recognition is a faster response time, because all processing is handled on the client side; another advantage is that it does not use any network connection such as GPRS. According to Hagen et al. (2003, p.66), the problems of client-side speech recognition are recognition accuracy and running time (power consumption).

2.2.3 Dynamic Time Warping based speech recognition

This method was used in past decades but has now been superseded. The algorithm measures similarity between two sequences which may vary in time or speed. A number of templates are used to perform automatic speech recognition in Dynamic Time Warping based recognition. The process involves normalization of distortion, and the template with the lowest normalized distortion is identified as the word.

2.2.4 Artificial Neural Networks

The mechanism inside an ANN is to filter the human speech frequencies from the other frequencies, exploiting the fact that non-speech sound covers a higher frequency range than speech.
The table below shows a comparison between different speech recognition mechanisms. Source: anon. (n.d.). School of Electrical, Computer and Telecommunications Engineering. Available: http://www.elec.uow.edu.au/staff/wysocki/dspcs/papers/004.pdf. Last accessed 23rd August 2009.

Table 2: Comparison of different techniques in speech recognition
2.2.5 Continuous speech recognition

Continuous speech recognition is used when a speaker pronounces words, sentences or phrases in a series or specific order, dependent on each other as if linked together. It operates on the premise that words are connected to each other and not separated by pauses. Because there is a greater variety of effects, continuous speech is a tedious task to handle. Coarticulation is another serious issue in continuous speech recognition: the effect of the surrounding phonemes on a single phoneme is high. The starting and ending sounds of words are affected by the following words and also by the speed of the speech, and fast speech is harder to track. Two algorithms are usually involved in continuous speech recognition: the Viterbi algorithm and the Baum-Welch algorithm.

2.2.6 Direct speech recognition

This process identifies speech word by word, where each word is followed by a pause.
2.3 Speaker Characteristics

2.3.1 Speaker dependent

Speaker-dependent speech recognition systems are developed for a single user only; no other user can use the system. These systems must be trained by the user before they can function. One advantage is that such systems support a larger vocabulary than speaker-independent systems; the disadvantage is the limitation on the type of users. This technology is used in stenomasks.

2.3.2 Speaker independent

Speaker-independent speech recognition systems are harder to implement than speaker-dependent ones, because the system needs to recognize the patterns and different accents spoken by many users. The advantage of this kind of system is that it can be used by many users without training. The most important step in building a speaker-independent system is to identify which parts of speech are generic and which vary from person to person. Speaker-independent speech recognition can thus serve many users, despite being harder to implement.
2.3.3 Conclusion

A speaker-independent speech recognition system has been selected for the project because the system has to deal with many speeches given by many speakers. Accent and phoneme patterns differ from speaker to speaker, and it is not possible to perform individual training for each and every speaker. The Java Speech API supports only speaker-independent speech recognition, which is another reason for this choice.
2.4 Speech Recognition Mechanisms

2.4.1 Isolated word recognition

This identifies a single word at a time, with pauses between words. Isolated word recognition is the primary stage of speech recognition and is widely used in command-based applications. It needs less processing power, and only primitive pattern-matching algorithms are involved.

Table 3: Isolated word recognition
2.4.2 Continuous speech recognition

According to Hunt, A. (1997), continuous speech is more difficult to handle because it is difficult to find the start and end points of words, and because of coarticulation: the production of each phoneme is affected by the production of surrounding phonemes. According to Peinado & Segura (2006, p.9), there are three types of errors in continuous speech recognition systems:

Substitutions - the recognized sentence has different words substituting original words.
Deletions - the recognized sentence has missing words.
Insertions - the recognized sentence has new/extra words.

Error rate calculation in continuous speech recognition, from Stephen et al. (2003, p.2), where H is the number of words correctly recognized, N the total number of words in the actual speech, S the substitutions, D the deletions and I the insertions:

$$\text{Correct Percentage} = \frac{H}{N} \times 100\%$$

$$\text{Accuracy} = \frac{N - D - S - I}{N} \times 100\%$$

$$\text{Accuracy} = \frac{H - I}{N} \times 100\%$$

$$\text{Word Error Rate} = \frac{S + D + I}{N} \times 100\%$$

Equation 6: CSR equations
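The word error rate in Equation 6 can be sketched in code. Below is a minimal Python illustration (not part of the project's implementation) that obtains S + D + I as a word-level edit distance and divides by N:

```python
def wer(reference, hypothesis):
    """Word error rate = (S + D + I) / N via word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,  # substitution / match
                          d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1)        # insertion
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the speech was clear", "the speech is clear"))  # one substitution in four words
```

Because substitutions, deletions and insertions are all counted, the WER can exceed 100% when the recognizer inserts many spurious words.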
2.4.3 Conclusion

The continuous speech recognition mechanism has been chosen for the project because the system is going to deal with continuous speeches in order to build the database, and the back end of the system serves as a standalone application.
2.5 Vocabulary Size

Vocabulary is the number of words a person knows: the greater the vocabulary size, the greater the depth of understanding. The same rule applies to speech recognition systems.

2.5.1 Limited vocabulary

Limited-vocabulary systems have a limited number of words, which can vary from 100 to 10,000. These systems need less processing power and are more suitable for mobile devices.

2.5.2 Large vocabulary

A large vocabulary for a speech recognition system is mainly used in servers or standalone applications and involves more processing power. Such a system will identify almost every word spoken by a person. This vocabulary has more than 10,000 words.

2.5.3 Conclusion

A large vocabulary has been chosen for the project because the project's main processes are handled by standalone applications and it has to deal with many speeches.
2.6 Speech Recognition APIs

2.6.1 Microsoft Speech API 5.3

The Microsoft Speech API reduces the coding overhead for programmers. It is equipped with both speech-to-text and text-to-speech capabilities. The API requires a .NET based development environment and has to be purchased. The scope of the Speech Application Programming Interface, or SAPI, lies within Windows environments: it allows the use of speech recognition and speech synthesis within Windows applications. Applications that use SAPI include Microsoft Office, Microsoft Agent and Microsoft Speech Server. In general, SAPI defines a set of interfaces and classes for developing dynamic speech recognition systems. SAPI uses two libraries, one for its front end and one for its back end: the "FastFormat" library for the front end and "Pantheios" for the back end. Both are open-source C++ libraries.

Figure 5: Overview of Microsoft Speech Recognition API
2.6.2 Java Speech API

The Java Speech API provides both speech recognition and synthesis capabilities and is freely available. JSAPI supports multi-platform development and both open-source and non-open-source third-party tools. The JSAPI package comprises javax.speech, javax.speech.recognition and javax.speech.synthesis. Sun Microsystems built JSAPI in collaboration with:

- Apple Computer, Inc.
- AT&T
- Dragon Systems, Inc.
- IBM Corporation
- Novell, Inc.
- Philips Speech Processing
- Texas Instruments Incorporated

It supports speaker-independent speech recognition and W3C standards.

Speech recognizer capabilities:
- Built-in grammars (device specific)
- Application-defined grammars

Speech synthesizer capabilities:
- Formant synthesis
- Concatenative synthesis
The Java Speech API specifies a cross-platform interface to support command-and-control recognizers, dictation systems and speech synthesizers. It covers two technologies: speech synthesis and speech recognition. Speech synthesis provides the reverse process, producing synthetic speech from text generated by an application, an applet, or a user. With the synthesis capabilities, developers can build applications that generate speech from text. There are two primary steps in producing speech from text:

Structure analysis: processes the input text to determine where paragraphs, sentences, and other structures start and end. For most languages, punctuation and formatting data are used in this stage.

Text pre-processing: analyzes the input text for special constructs of the language. In English, special treatment is required for abbreviations, acronyms, dates, times, numbers, currency amounts, e-mail addresses, and many other forms. Other languages need special processing for these forms, and most languages have other specialized requirements.

Speech recognition gives the computer the ability to listen to human speech, understand and recognize it, and convert it into text.
There are several steps in building a speech recognition system:

- Grammar design: defines the words that may be spoken by a user and the patterns in which they may be spoken.
- Signal processing: analyzes the spectrum characteristics of the incoming audio.
- Phoneme recognition: compares the spectrum patterns to the patterns of the phonemes of the language being recognized.
- Word recognition: compares the sequence of likely phonemes against the words and patterns of words specified by the active grammars.
- Result generation: provides the application with information about the words the recognizer has detected in the incoming audio.

Alongside JSAPI we need two other Java APIs: the Java Sound API and the Java Media Framework. The Java Sound API has sound-handling capabilities and is equipped with a rich set of classes and interfaces that deal directly with incoming sound signals. The Java Sound API is widely used in the following areas and industries:

- Communication frameworks, such as conferencing and telephony
- End-user content delivery systems, such as media players and music using streamed content
- Interactive application programs, such as games and Web sites that use dynamic content
- Content creation and editing
- Tools, toolkits, and utilities

The Java Sound API uses a hardware-independent architecture. It is designed to allow different sorts of audio components to be installed on a system and accessed by the API.
With the Java Sound API we can process both MIDI (Musical Instrument Digital Interface) and WAV sound formats. The Java Media Framework is a more recently developed framework which can be used to build dynamic multimedia applications.

Figure 6: Java Sound API Architecture
2.6.2.1 Java Speech Grammar Format

JSGF, the Java Speech Grammar Format, was created by Sun Microsystems. It defines the set of rules and words for speech recognition. JSGF is a platform-independent specification derived from the Speech Recognition Grammar Specification. The Java Speech Grammar Format has been developed for use with recognizers that implement the Java Speech API; however, it may also be used by other speech recognizers and in other types of applications. A typical grammar rule is a composition of what is to be spoken, the text to be spoken and references to other grammar rules. A JSGF file comes in a plain text format or in XML format. Source: anon. (n.d.). JSGF Architecture. Available: http://www.cs.cmu.edu/. Last accessed 24th July 2009.

Figure 7: JSGF Architecture
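For illustration, a minimal JSGF grammar for a hypothetical playback command set might look like this (the grammar name, rules and words are invented; the syntax follows the JSGF specification):

```
#JSGF V1.0;
grammar playback;

// A command is an action optionally followed by "the", then an object.
public <command> = <action> [the] <object>;
<action> = play | pause | stop;
<object> = speech | lecture;
```

A recognizer loading this grammar would accept utterances such as "play the lecture" or "stop speech", and reject anything outside the rule set.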
2.7 Speech Recognition Algorithms

The Viterbi algorithm is widely used in speech recognition. It is based on dynamic programming and deals directly with hidden Markov models. The Baum-Welch algorithm is another algorithm used in this process; it involves probability and maximum likelihood estimation. The forward-backward algorithm also deals directly with hidden Markov models. There are three steps in this algorithm:

- Computing forward probabilities
- Computing backward probabilities
- Computing smoothed values

A customized combination of the above algorithms will be used in the project.
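As a sketch of how the Viterbi algorithm recovers the most likely hidden state sequence by dynamic programming, here is a toy Python illustration (the two-state model parameters are invented for demonstration and are not the project's model):

```python
def viterbi(obs, A, B, pi):
    """Return the most likely state path for an observation sequence."""
    n = len(pi)
    # delta[j] = probability of the best partial path ending in state j
    delta = [pi[j] * B[j][obs[0]] for j in range(n)]
    back = []                            # backpointers for path recovery
    for o in obs[1:]:
        prev = [max(range(n), key=lambda i: delta[i] * A[i][j])
                for j in range(n)]
        delta = [delta[prev[j]] * A[prev[j]][j] * B[j][o] for j in range(n)]
        back.append(prev)
    # trace the best final state back through the stored pointers
    state = max(range(n), key=lambda j: delta[j])
    path = [state]
    for prev in reversed(back):
        state = prev[state]
        path.append(state)
    return path[::-1]

# Invented 2-state model: state 0 tends to emit symbol 0, state 1 symbol 1.
A = [[0.7, 0.3], [0.3, 0.7]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.5, 0.5]
print(viterbi([0, 0, 1, 1], A, B, pi))
```

Unlike the forward algorithm, which sums over all paths, Viterbi keeps only the single best predecessor per state, which is why backpointers suffice to recover the full path.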
2.8 Noise Filtering

Noise can be introduced into a speech recording by tape hiss, clapping, coughs or other environmental or machinery factors. Noise plays a major role in speech recognition. Source: anon. (n.d.). Departement Elektrotechniek. Available: http://www.esat.kuleuven.be/psi/spraak/theses/08-09-en/clp_lp_mask.png. Last accessed 22 September 2009.

Figure 8: Noise in Speech
According to Khan, E., and Levinson, R. (1998), speech recognition has achieved quite remarkable progress in the past years:

- Many speech recognition systems are capable of producing very high recognition accuracies (over 98%).
- But such recognition accuracy only applies in a quiet environment (very low noise) and for speakers whose sample words were used during training.

Spectral subtraction and Wiener filtering are the two most popular noise reduction methods, because they are straightforward to implement.

2.8.1 Wiener filtering

Wiener filtering is a common model for filtering noise. The observed signal z(k) is a signal s(k) plus additive noise n(k) that is uncorrelated with the signal:

$$z(k) = s(k) + n(k)$$

If the noise is also stationary, then the power spectra of the signal and noise add:

$$P_z(\omega) = P_s(\omega) + P_n(\omega)$$

2.8.2 Conclusion

The Wiener filtering method has been chosen for the project because it is a widely accepted method and easy to implement.
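As an illustration only (using NumPy; the frame handling and noise estimate are assumptions, e.g. a noise power spectrum estimated from a speech-free segment), the frequency-domain Wiener gain H(w) = P_s(w) / (P_s(w) + P_n(w)) implied by the relation above could be applied per frame like this:

```python
import numpy as np

def wiener_filter(noisy, noise_psd):
    """Apply a frequency-domain Wiener gain H = Ps / (Ps + Pn).

    noisy:     one frame of the noisy signal z(k) = s(k) + n(k)
    noise_psd: estimate of the noise power spectrum Pn(w),
               e.g. averaged over speech-free frames (assumed known here)
    """
    Z = np.fft.rfft(noisy)
    Pz = np.abs(Z) ** 2
    Ps = np.maximum(Pz - noise_psd, 0.0)   # Ps(w) = Pz(w) - Pn(w), floored at 0
    H = Ps / (Ps + noise_psd + 1e-12)      # Wiener gain, guarded against /0
    return np.fft.irfft(H * Z, n=len(noisy))
```

With a zero noise estimate the gain is approximately 1 and the frame passes through unchanged; with a very large noise estimate the gain collapses toward 0 and the frame is suppressed.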
2.9 Database and Data Structure

The database contains the text versions of the speeches and their locations. A sample database is maintained on the hard disk and the locations are saved in a file. Database indexing is used for efficient search results; indexing improves the speed of data retrieval. Indexing can be divided into two types: clustered and non-clustered. Non-clustered indexing does not reorder the actual records, which results in additional input and output operations to reach the actual results. Clustered indexing reorders the data blocks according to their indexes, which is more efficient for searching purposes.

2.9.1 Conclusion

Clustered indexing has been chosen for the project because the system revolves around search operations over speeches.

Figure 9: Database Indexing
2.10 Search Engine

The search engine acts as the terminal for searching speeches and lectures. It checks for search results in a locally deployed database that contains the text versions of the speeches and lectures. A search engine operates in the order of crawling, indexing and searching. Source: Brin, S. and Page, L. (n.d.). The Anatomy of a Large-Scale Hypertextual Web Search Engine. Available: http://infolab.stanford.edu/~backrub/google.html. Last accessed 24 March 2009.

Figure 10: Google Architecture
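The index-then-search order described above can be illustrated with a minimal inverted index over transcript text (a Python sketch; the transcript ids and contents are invented, and real systems would add stemming, stop-word handling and ranking):

```python
from collections import defaultdict

def build_index(transcripts):
    """Map each word to the set of transcript ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in transcripts.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return ids of transcripts containing every word of the query."""
    sets = [index.get(w, set()) for w in query.lower().split()]
    return set.intersection(*sets) if sets else set()

# Invented example transcripts standing in for converted speeches.
transcripts = {
    "speech1": "ladies and gentlemen welcome to the lecture",
    "speech2": "the lecture today covers speech recognition",
}
index = build_index(transcripts)
print(search(index, "the lecture"))
```

Looking words up in the index rather than scanning every transcript is what makes content search scale as the database of converted speeches grows.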
2.11 MATLAB

MATLAB was developed by MathWorks, a privately held multinational company specializing in technical software. MATLAB is a multi-platform, fourth-generation programming language. Like many other languages, MATLAB supports the following features:

- Matrix manipulation
- Plotting of functions and data
- Algorithm implementation
- Creation of graphical user interfaces
- Interfacing with other programming languages

Most MATLAB code is numerical in nature. Regardless of that, using MATLAB we can build systems precisely, and the lines of code required to build a system are relatively few compared with other languages such as Java or C#. Like other object-oriented languages, MATLAB supports classes, interfaces and functions, which are used in high-level MATLAB programming. MATLAB directly supports both analogue and digital signal processing, with a rich feature set: signal transforms and spectral analysis, digital system design, digital filtering, adaptive filtering, and coding and compression algorithms.
CHAPTER 3 ANALYSIS

3.1 System Requirements

3.1.1 Functional requirements

1. The application must convert the speech or lecture to text format.
2. The converted text should be visible to the user.
3. If the speech or lecture has noise, it must be reduced to a level suitable for the speech recognition process.
4. Speeches with different accents need to be identified reasonably well by the system.
5. The search results must be efficient and reliable.

3.1.2 Non-functional requirements

1. The search algorithm needs to be efficient.
2. It should not produce duplicate search results.
3. Searching should not take too much time.
4. Speech-to-text conversion must be efficient and accurate.
5. Noise reduction must maintain fair performance.
3.1.3 Software requirements

Java JDK 1.6: JDK 1.6 is equipped with state-of-the-art technology and much functionality. The newest version of the Java Sound API is required.

NetBeans IDE 6.5: an open-source IDE equipped with PHP, JavaScript and Ajax editing features, improved support for the Java Persistence API, and tighter GlassFish v3 and MySQL integration. It also provides features for the architectural drawing of the system and powerful J2EE components that are essential for building the search engine. Third-party components used by the system can be integrated without much effort, the IDE offers code generation, and many non-open-source plugins support it.

Windows XP or an equivalent operating system: Windows XP supports both open-source and commercial components, so everything essential for the project can be deployed. Windows XP is a robust, user-friendly operating system compared to other Windows operating systems.

Apache Tomcat server, version 5.5.27 (available at http://tomcat.apache.org/): a freely available, robust, open-source server on which web programs can run. It has many third-party components essential for integrating standalone, mobile and web-based applications with each other. This server comes with the NetBeans IDE.

XML database: a popular open-source database that directly supports the Apache Tomcat server and the NetBeans IDE, with a lower crash rate compared to other databases used with web services.

Proper sound driver software is required in order to achieve the best results, and MATLAB is required to perform the noise filtering.
3.1.4 Hardware requirements

- 32-bit Intel dual-core processor or greater: during the development phase of the project a massive amount of processing power is required for speech recognition, noise analysis, speech-to-text conversion and search. It is advisable to have a high-end machine in order to prevent deadlocks.
- 64-bit PCI sound card: a high-end sound card is required to process digital audio signals.
- A minimum of 1 GB DDR3 RAM is required, and 2 GB of virtual memory must be present in the system.
- The default components of a personal computer are required.
- A modem or a router is required in order to test searching between many users.
- A 1 Mb ADSL internet connection or greater is required for data gathering.
- A microphone is needed for future enhancements, so that users can store their own speeches and anyone can later search for particular speeches and lectures.
- A 20 GB hard disk with a rotation rate of 7,200 rpm or more is required, because the system maintains the database on the development machine.

Note: at least a dual-core processor is required because the speech recognition process needs massive processing power.
  • 49. Ultimate Speech Search Page 40 3.2 System Development Methodologies All the methodologies compared here are extended versions of previously common methodologies. 3.2.1 Rational Unified Process The Rational Unified Process is a development methodology created by the Rational Software division of IBM in 2003. It is an iterative system development process. RUP explains in detail how specific goals are achieved, and it is a methodology for managing object oriented software development. According to Kroll and Kruchten (2003), "The RUP is a software development approach that is iterative, architecture-centric, and use-case-driven." RUP has extensible features, as follows.  Iterative Development  Requirements Management  Component-Based Architectural Vision  Visual Modeling of Systems  Quality Management  Change Control Management
  • 50. Ultimate Speech Search Page 41 The figure below shows a basic overview of its phases. Source: (anon. (n.d.). Department of Computer Science. Available: people.cs.uchicago.edu/~matei/CSPP523/lect4.ppt. Last accessed 24 March 2009.) Figure 11 Phases in RUP
  • 51. Ultimate Speech Search Page 42 Advantages of RUP  It is a well-defined and well-structured software engineering process.  It supports changing requirements and provides the means to manage change and the related risks.  It promotes a higher level of code reuse.  It reduces integration time and effort, as the development model is iterative.  It allows systems to run earlier than other processes do, which is essential for this system.  The risk management feature allows risks to be identified before the development process.  It has the unique "plan a little, design a little, code a little" feature.  RUP is an idea-driven, principle-based methodology.  RUP is a worldwide commercial standard. Disadvantages of RUP  For most projects RUP is an excessively heavy methodology.  The processes need to be customized for various situations.  It has poor usability support.  The process is relatively complex and heavyweight.
  • 52. Ultimate Speech Search Page 43 3.2.2 Agile Development Method Agile development is an iterative process. Agile has short iterations and therefore minimal risk. Agile breaks tasks into small increments with minimal planning and does not directly involve long term planning. Agile strongly supports object oriented development. Above all, Agile has the distinctive practice of extreme programming, now widely used in software development. According to Ambler (2005), Agile is an iterative and incremental (evolutionary) approach to software development which is performed in a highly collaborative manner by self-organizing teams within an effective governance framework that produces high quality software in a cost effective and timely manner which meets the changing needs of its stakeholders. Figure 12 : Overview of Agile
  • 53. Ultimate Speech Search Page 44 Advantages in Agile Software Development  Increased Control  Rapid Learning  Early Return on Investment  Satisfied Stakeholders  Responsiveness to Change Disadvantages in Agile Software Development  Agile avoids heavy documentation.  Agile requirements are often barely sufficient for the project.  It is not a strictly organized methodology.  Because testing is integrated throughout the development, the development cost is relatively high.  Too much user involvement may spoil the project.
  • 54. Ultimate Speech Search Page 45 3.2.3 Scrum Development Methodology According to Mikneus, S. and Akinde, A. (2003):  Scrum is an agile software development process.  Scrum is not an acronym.  The name is taken from the sport of rugby, where everyone in the team pack acts together to move the ball down the field.  The analogy to development is that the team works together to successfully develop quality software. According to Jeff Sutherland (2003), "Scrum assumes that the systems development process is an unpredictable, complicated process that can only be roughly described as an overall progression" and "Scrum is an enhancement of the commonly used iterative/incremental object-oriented development cycle". Scrum principles include:  Quality work: empowers everyone involved to feel good about their job.  Assume simplicity: Scrum is a way to detect and cause removal of anything that gets in the way of development.  Embracing change: a team based approach to development where requirements are rapidly changing.  Incremental changes: Scrum makes these possible using sprints, where a team is able to deliver a product (iteration) deliverable within 30 days.
  • 55. Ultimate Speech Search Page 46 Advantages in Scrum  Scrum has the ability to respond to unforeseen software development risks.  It is a specialized process for commercial application development.  It gives developers the facility to deliver a functional application to the clients. Disadvantages in Scrum  Not suitable for research based software development. Source: [anon. (n.d.). Available: http://www.methodsandtools.com/archive/scrum1.gif. Last accessed 26 March 2009.] 3.2.4 Conclusion The Agile software development methodology was chosen for the development process because it supports object oriented development, has short iterations and supports extreme programming. Figure 13 : Scrum Overview
  • 56. Ultimate Speech Search Page 47 3.3 Test Plan The system's main functionalities are noise analysis, speech recognition, database indexing (which directly affects the search) and the search engine. The system takes data (speeches and lectures) recorded under various conditions with low noise, but it cannot guarantee the effect of the noise factor. For that reason noise analysis is performed to try to reduce the noise; otherwise it will affect the speech recognition process. 3.3.1 System testing Speeches and lectures with different accents [English only, US and British]:- in order to test the speech recognition engine's accuracy, it will be tested against different accents. The expected results must show a minimal difference, with minimal errors. Content search:- when the user searches by content by typing a word or a phrase, the appropriate search result will be displayed: the speech or lecture containing the specified word or phrase.
  • 57. Ultimate Speech Search Page 48 CHAPTER 4 SYSTEM DESIGN 4.1 Use Case Diagram The noise filtering functionality is implemented separately from the speech recognition system, so the noise filtering system is represented as an actor. Figure 14 : Use Case Diagram for System
  • 58. Ultimate Speech Search Page 49 The figure above shows the use case diagram for the entire system. The system mainly consists of two actors. A user can upload a speech file in wav format to perform speech recognition. Noise filtering is handled by a separate system: the user uploads a noisy speech file and the noise filtering system produces a file with lowered noise.
  • 59. Ultimate Speech Search Page 50 4.2 Use case description 4.2.1 Use case description for file upload Use Case Use Case One Description User uploads a file Actors User Assumptions User uploads a file in .wav format. The user has to upload a file without noise. Steps User has to run the system, press the open button and select a file Variations A user may upload a file with or without noise Non functional requirements All the necessary hardware configuration must be met. Issues None Table 4 : Use Case description file upload
  • 60. Ultimate Speech Search Page 51 4.2.2 Use Case description for playing an audio file Use Case Use Case Two Description User plays a .wav file Actors User Assumptions User can only play a file in wav format Steps User has to open a file; the play button then gets enabled and the user presses it. Variations No variations; only files in wav format can be played Non functional requirements All the necessary hardware configuration must be met. Issues None Table 5 Use Case description play audio
  • 61. Ultimate Speech Search Page 52 4.2.3 Use Case description for search Use Case Use Case Three Description User searches for a speech by content Actors User Assumptions User can search for a speech by typing a sentence Steps User has to run the speech search program, type the phrase he/she wants to search for and press search Variations No variations Non functional requirements All the necessary hardware configuration must be met. Issues None Table 6 Use Case description search
  • 62. Ultimate Speech Search Page 53 4.2.4 Use Case description for noise reduced output Use Case Use Case Four Description Noise reduced output produced by the system Actors Noise filtering system Assumptions Complete elimination of the noise is unachievable. User uploads a noisy file in wav format Steps User has to run the noise filtering program in MATLAB and input a file which includes the noise Variations No variations Non functional requirements All the necessary hardware configuration must be met. Issues None Table 7 Use Case description noise reduction
  • 63. Ultimate Speech Search Page 54 4.2.5 Use Case description for noise filtering Use Case Use Case Five Description The process of filtering noise Actors Noise filtering system Assumptions Complete elimination of the noise is unachievable. User uploads a noisy file in wav format. The chosen mechanism for noise filtering is the most suitable one Steps User has to run the noise filtering program in MATLAB and input a file which includes the noise Variations No variations Non functional requirements All the necessary hardware configuration must be met. Issues None Table 8 Use Case description noise process
  • 64. Ultimate Speech Search Page 55 4.3 Activity Diagrams 4.3.1Activity Diagram for Speech Recognition System Figure 15 Speech Recognition
  • 65. Ultimate Speech Search Page 56 4.3.2 Activity Diagram for Noise filtering Figure 16 Activity Diagram Noise Filtering
  • 66. Ultimate Speech Search Page 57 4.4 Sequence Diagrams 4.4.1 Select a file Figure 17 Sequence Diagram Select a file
  • 67. Ultimate Speech Search Page 58 4.4.2 Play wav file The system can play a file. Two main control classes are involved in this process. The WavFileRecognition class acts as a mediator which passes messages between functionality in the other classes. Figure 18 Sequence Diagram Play File
  • 68. Ultimate Speech Search Page 59 4.4.3 Speech recognition pre stage In the speech recognition pre stage, the system is loaded with the configuration file and the input signal. A recognizer is allocated through the configuration manager. Figure 19 Sequence Diagram SR Pre Stage
  • 69. Ultimate Speech Search Page 60 4.4.4 Speech recognition post stage In the speech recognition post stage the input digital signal goes through fast Fourier transformation, segmenting, and identification of dialects and phonemes. The classes AudioFileDataSource and Recognizer provide the functionality to perform these tasks. Figure 20 Sequence Diagram SR Post Stage
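The segmentation step described above can be sketched independently of the project's classes. Below is a minimal Java illustration of splitting a signal into fixed-size frames advanced by a hop size, as happens before the FFT; the `Framer` class and its names are hypothetical, not part of the actual system.

```java
// Hypothetical sketch: split a signal into overlapping fixed-size frames.
// chunk = frame length in samples, hop = advance between frame starts.
public class Framer {
    public static double[][] frames(double[] signal, int chunk, int hop) {
        int n = (signal.length - chunk) / hop + 1;   // number of full frames
        double[][] out = new double[n][chunk];
        for (int i = 0; i < n; i++) {
            // copy chunk samples starting at the i-th hop offset
            System.arraycopy(signal, i * hop, out[i], 0, chunk);
        }
        return out;
    }
}
```

With a hop smaller than the frame length, consecutive frames overlap, which is what allows the later inverse transform to be recombined smoothly.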
  • 70. Ultimate Speech Search Page 61 4.5 Class Diagrams 4.5.1 GUI and the system The figure below shows the class diagram of the GUI and WavFileRecognizer. Figure 21 Class Diagrams GUI & System
  • 71. Ultimate Speech Search Page 62 4.5.2 Speech recognition Figure 22 Class Diagram SR System
  • 72. Ultimate Speech Search Page 63 Class Diagram for Speech search Figure 23 : Speech Search Class Diagram
  • 73. Ultimate Speech Search Page 64 4.6 Noise Filtering Noise filtering was done using MATLAB. The script does not use object orientation, polymorphism or inheritance; equivalent code generated in C to match the MATLAB code is shown later. %ver 1.56 function noiseReduction %----- user data ----- steps_1 = 512; chunk = 2048; coef = 0.01*chunk/2; The three code segments above define the user data that the MATLAB script is going to use. The term chunk means a small segment of the input signal. The script below can be used to filter the noise of any given input signal. %Windowing Techniques %w1 = .5*(1 - cos(2*pi*(0:chunk-1)'/(chunk))); %hanning w1 = [.42 - .5*cos(2*pi*(0:chunk-1)/(chunk-1)) + .08*cos(4*pi*(0:chunk-1)/(chunk-1))]'; %Blackman w2 = w1; The Blackman window technique is used here to chop the signal into small segments: the input signal is iteratively split into small chunks, chunk being the technical term for a segment in digital signal processing. % input wav file and extract required data [input, FS, N] = wavread('input.wav'); L = length(input); The input signal is extracted and rearranged into a matrix. L is the total playing length of the signal. The matrix mechanics are hidden by MATLAB.
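The Blackman window defined by the w1 line above can be reproduced outside MATLAB. The following Java sketch (illustrative only; the class name is an assumption) computes the same coefficients, .42 - .5*cos(2πn/(N-1)) + .08*cos(4πn/(N-1)):

```java
// Blackman window coefficients, matching the MATLAB w1 expression above.
public class BlackmanWindow {
    public static double[] window(int n) {
        double[] w = new double[n];
        for (int i = 0; i < n; i++) {
            double x = (double) i / (n - 1);  // normalized position in [0,1]
            w[i] = 0.42 - 0.5 * Math.cos(2 * Math.PI * x)
                        + 0.08 * Math.cos(4 * Math.PI * x);
        }
        return w;
    }
}
```

The window tapers to (almost) zero at both ends and peaks at 1.0 in the middle, which reduces the spectral leakage of each FFT chunk.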
  • 74. Ultimate Speech Search Page 65 % zero padding for input file input = [zeros(chunk,1);input;zeros(chunk,1)]/ max(abs(input)); %the appended zeros at the ends of the input sound file make the windowing sample the complete sound file %----- initializations ----- output = zeros(length(input),1); count = 0; % block by block fft algorithm A noise signal normally has a higher frequency. The system derives a reference value from the noise factor, and the loop below iteratively takes segments and analyzes their magnitudes. while count<(length(input) - chunk) grain = input(count+1:count+chunk).* w1; % windowing f = fft(grain); % fft of window data r = abs(f); % magnitude of window data phi = angle(f); % phase of window data ft = denoise(f,r,coef); This function reduces the amplitude of each chunk; a single chunk is passed as an argument to the function. grain = real(ifft(ft)).*w2; % take inverse fft of window data output(count+1:count+chunk) = output(count+1:count+chunk) + grain; % append data to output file count = count + steps_1; % increment by hop size end output = output(1:L) / (4.75*max(abs(output))); %the 4.75*max(abs(output)) maintains consistency between input and output volume %soundsc(output, FS); wavwrite(output, FS, 'output.wav'); As you can see there are no classes or interfaces. Equivalent code for the MATLAB script in the C programming language is shown below.
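The while loop above reconstructs the output by overlap-add: each inverse-transformed grain is added into the output buffer at its hop offset. A stripped-down Java sketch of just that accumulation step (class and method names are hypothetical):

```java
// Overlap-add: accumulate each processed frame back into the output
// buffer at multiples of the hop size, as the MATLAB while loop does.
public class OverlapAdd {
    public static double[] reconstruct(double[][] grains, int hop, int outLen) {
        double[] out = new double[outLen];
        for (int i = 0; i < grains.length; i++) {
            int off = i * hop;  // start position of this grain
            for (int j = 0; j < grains[i].length && off + j < outLen; j++) {
                out[off + j] += grains[i][j];
            }
        }
        return out;
    }
}
```

Where frames overlap, their windowed samples sum; with a suitable window and hop size the sums approximately restore the original signal envelope.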
  • 75. Ultimate Speech Search Page 66 function ft = denoise(f,r,coef) if abs(f) >= 0.001 ft = f.*(r./(r+coef)); else ft = f.*(r./(r+sqrt(coef))); end Shown above is the denoise function. The function compares each chunk's bin magnitudes against a small threshold, then attenuates each frequency bin by a factor based on the coefficient (or its square root for very weak bins). Applied chunk by chunk, this process reduces the high frequency noise clusters to lower amplitudes.
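The attenuation inside denoise amounts to multiplying each FFT bin by the factor r/(r+coef): bins whose magnitude r is large relative to coef pass almost unchanged, while weak (mostly noise) bins are scaled down heavily. A small sketch of that gain rule (illustrative, not the project's actual code):

```java
// Spectral gain used by the denoise step: r/(r+coef).
// Strong bins (speech) get a gain near 1; weak bins (noise) are suppressed.
public class SpectralGain {
    public static double gain(double r, double coef) {
        return r / (r + coef);
    }
}
```

For example, with coef = 10 a bin of magnitude 1000 keeps about 99% of its value, while a bin of magnitude 1 keeps only about 9%, which is why the low-level noise floor drops while speech energy survives.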
  • 76. Ultimate Speech Search Page 67 4.7 Code to filter noise in C Language #include <stdio.h> #include <string.h> #include "mclmcr.h" #ifdef __cplusplus extern "C" { #endif extern const unsigned char __MCC_denoise2_public_data[]; extern const char *__MCC_denoise2_name_data; extern const char *__MCC_denoise2_root_data; extern const unsigned char __MCC_denoise2_session_data[]; extern const char *__MCC_denoise2_matlabpath_data[]; extern const int __MCC_denoise2_matlabpath_data_count; extern const char *__MCC_denoise2_mcr_runtime_options[]; extern const int __MCC_denoise2_mcr_runtime_option_count; extern const char *__MCC_denoise2_mcr_application_options[]; extern const int __MCC_denoise2_mcr_application_option_count; #ifdef __cplusplus } #endif static HMCRINSTANCE _mcr_inst = NULL; static int mclDefaultPrintHandler(const char *s) { return fwrite(s, sizeof(char), strlen(s), stdout); } static int mclDefaultErrorHandler(const char *s) { int written = 0, len = 0; len = strlen(s); written = fwrite(s, sizeof(char), len, stderr); if (len > 0 && s[ len-1 ] != '\n') written += fwrite("\n", sizeof(char), 1, stderr); return written; } bool denoise2InitializeWithHandlers( mclOutputHandlerFcn error_handler, mclOutputHandlerFcn print_handler ) { if (_mcr_inst != NULL) return true;
  • 77. Ultimate Speech Search Page 68 if (!mclmcrInitialize()) return false; if (!mclInitializeComponentInstance(&_mcr_inst, __MCC_denoise2_public_data, __MCC_denoise2_name_data, __MCC_denoise2_root_data, __MCC_denoise2_session_data, __MCC_denoise2_matlabpath_data, __MCC_denoise2_matlabpath_data_count, __MCC_denoise2_mcr_runtime_options, __MCC_denoise2_mcr_runtime_option_count, true, NoObjectType, ExeTarget, NULL, error_handler, print_handler)) return false; return true; } bool denoise2Initialize(void) { return denoise2InitializeWithHandlers(mclDefaultErrorHandler, mclDefaultPrintHandler); } void denoise2Terminate(void) { if (_mcr_inst != NULL) mclTerminateInstance(&_mcr_inst); } int main(int argc, const char **argv) { int _retval; if (!mclInitializeApplication(__MCC_denoise2_mcr_application_options, __MCC_denoise2_mcr_application_option_count)) return 0; if (!denoise2Initialize()) return -1; _retval = mclMain(_mcr_inst, argc, argv, "denoise2", 0); if (_retval == 0 /* no error */) mclWaitForFiguresToDie(NULL); denoise2Terminate(); mclTerminateApplication(); return _retval; } /* * MATLAB Compiler: 4.0 (R14) * Date: Sun Oct 04 09:55:11 2009 * Arguments: "-B" "macro_default" "-m" "-W" "main" "-T" "link:exe" "denoise2"
  • 78. Ultimate Speech Search Page 69 */ #ifdef __cplusplus extern "C" { #endif const unsigned char __MCC_denoise2_public_data[] = {'3', '0', '8', '1', '9', 'D', '3', '0', '0', 'D', '0', '6', '0', '9', '2', 'A', '8', '6', '4', '8', '8', '6', 'F', '7', '0', 'D', '0', '1', '0', '1', '0', '1', '0', '5', '0', '0', '0', '3', '8', '1', '8', 'B', '0', '0', '3', '0', '8', '1', '8', '7', '0', '2', '8', '1', '8', '1', '0', '0', 'C', '4', '9', 'C', 'A', 'C', '3', '4', 'E', 'D', '1', '3', 'A', '5', '2', '0', '6', '5', '8', 'F', '6', 'F', '8', 'E', '0', '1', '3', '8', 'C', '4', '3', '1', '5', 'B', '4', '3', '1', '5', '2', '7', '7', 'E', 'D', '3', 'F', '7', 'D', 'A', 'E', '5', '3', '0', '9', '9', 'D', 'B', '0', '8', 'E', 'E', '5', '8', '9', 'F', '8', '0', '4', 'D', '4', 'B', '9', '8', '1', '3', '2', '6', 'A', '5', '2', 'C', 'C', 'E', '4', '3', '8', '2', 'E', '9', 'F', '2', 'B', '4', 'D', '0', '8', '5', 'E', 'B', '9', '5', '0', 'C', '7', 'A', 'B', '1', '2', 'E', 'D', 'E', '2', 'D', '4', '1', '2', '9', '7', '8', '2', '0', 'E', '6', '3', '7', '7', 'A', '5', 'F', 'E', 'B', '5', '6', '8', '9', 'D', '4', 'E', '6', '0', '3', '2', 'F', '6', '0', 'C', '4', '3',
  • 79. Ultimate Speech Search Page 70 '0', '7', '4', 'A', '0', '4', 'C', '2', '6', 'A', 'B', '7', '2', 'F', '5', '4', 'B', '5', '1', 'B', 'B', '4', '6', '0', '5', '7', '8', '7', '8', '5', 'B', '1', '9', '9', '0', '1', '4', '3', '1', '4', 'A', '6', '5', 'F', '0', '9', '0', 'B', '6', '1', 'F', 'C', '2', '0', '1', '6', '9', '4', '5', '3', 'B', '5', '8', 'F', 'C', '8', 'B', 'A', '4', '3', 'E', '6', '7', '7', '6', 'E', 'B', '7', 'E', 'C', 'D', '3', '1', '7', '8', 'B', '5', '6', 'A', 'B', '0', 'F', 'A', '0', '6', 'D', 'D', '6', '4', '9', '6', '7', 'C', 'B', '1', '4', '9', 'E', '5', '0', '2', '0', '1', '1', '1' , '0'}; const char *__MCC_denoise2_name_data = "denoise2"; const char *__MCC_denoise2_root_data = ""; const unsigned char __MCC_denoise2_session_data[] = {'7', '7', 'B', 'D', '1', '6', '2', '3', '5', '5', '4', '5', '0', 'A', 'B', '1', '7', '3', '9', '0', '4', 'D', '4', '6', '7', '2', 'E', '3', '6', 'B', '3', '2', '4', '7', '5', '6', '1', '0', 'F', '3', '5', '2', '8', 'D', '5', '3', '8', '2', '3', '4', '4', 'A', '6', 'B', '6', '3', '8', 'E', '4', 'E', 'A', '8', '2', 'F', '9', '4', '1', '8', 'E', '9', '1', 'C', '1', 'F', '8', 'F', '7', '6', '0', '2', 'D', 'B', '3', 'B', 'F', '3', '4', '9', 'B', 'C',
  • 80. Ultimate Speech Search Page 71 '2', '8', 'C', '6', 'A', '9', '9', '6', '4', '9', '6', '3', 'C', '6', '8', '4', '1', '1', '8', '5', '5', 'E', '2', '3', '5', 'B', '9', '7', '9', '7', '0', '9', 'B', 'A', 'F', '7', 'E', 'D', '0', 'C', '0', '5', 'F', 'E', '2', 'C', '6', '3', '6', '6', 'D', 'F', 'B', '6', '0', 'F', '6', 'B', 'F', 'F', '2', '9', '4', '4', '2', '0', '3', 'C', 'C', 'C', '8', 'E', '3', '7', 'F', 'A', '4', '5', 'A', '9', 'A', '5', 'B', '7', '2', '0', '0', 'B', 'E', '3', 'F', 'E', '0', 'E', 'B', '1', 'C', '0', '7', 'D', '3', '9', 'D', 'F', '0', '7', '4', '2', 'B', '9', 'E', '3', 'A', '2', 'F', '3', '3', 'E', '9', '8', 'E', '5', 'C', '9', 'B', 'B', 'D', '3', '6', 'B', '7', 'D', 'E', '8', '3', '2', 'B', '9', '7', '5', 'F', '3', '0', '7', '7', 'D', 'F', '8', '1', 'F', 'A', '9', 'B', '4', 'F', 'E', '3', '5', '4', 'F', 'B', '1', '8', 'E', '1', 'D', '0'}; const char *__MCC_denoise2_matlabpath_data[] = {"denoise2/", "toolbox/compiler/deploy/", "$TOOLBOXMATLABDIR/general/", "$TOOLBOXMATLABDIR/ops/", "$TOOLBOXMATLABDIR/lang/", "$TOOLBOXMATLABDIR/elmat/", "$TOOLBOXMATLABDIR/elfun/", "$TOOLBOXMATLABDIR/specfun/", "$TOOLBOXMATLABDIR/matfun/", "$TOOLBOXMATLABDIR/datafun/", "$TOOLBOXMATLABDIR/polyfun/",
  • 81. Ultimate Speech Search Page 72 "$TOOLBOXMATLABDIR/funfun/", "$TOOLBOXMATLABDIR/sparfun/", "$TOOLBOXMATLABDIR/scribe/", "$TOOLBOXMATLABDIR/graph2d/", "$TOOLBOXMATLABDIR/graph3d/", "$TOOLBOXMATLABDIR/specgraph/", "$TOOLBOXMATLABDIR/graphics/", "$TOOLBOXMATLABDIR/uitools/", "$TOOLBOXMATLABDIR/strfun/", "$TOOLBOXMATLABDIR/imagesci/", "$TOOLBOXMATLABDIR/iofun/", "$TOOLBOXMATLABDIR/audiovideo/", "$TOOLBOXMATLABDIR/timefun/", "$TOOLBOXMATLABDIR/datatypes/", "$TOOLBOXMATLABDIR/verctrl/", "$TOOLBOXMATLABDIR/codetools/", "$TOOLBOXMATLABDIR/helptools/", "$TOOLBOXMATLABDIR/winfun/", "$TOOLBOXMATLABDIR/demos/", "toolbox/local/", "toolbox/compiler/"}; const int __MCC_denoise2_matlabpath_data_count = 32; const char *__MCC_denoise2_mcr_application_options[] = { "" }; const int __MCC_denoise2_mcr_application_option_count = 0; const char *__MCC_denoise2_mcr_runtime_options[] = { "" }; const int __MCC_denoise2_mcr_runtime_option_count = 0; #ifdef __cplusplus } #endif
  • 82. Ultimate Speech Search Page 73 CHAPTER 5 5.0 Implementation The Agile development process was chosen for the development, and the system went through three iterations. In the first iteration the basic objective was to build a speech recognition engine; various methods were tested out, and by the end of the iteration the speech recognition engine was built. Figure 24: SR Engine
  • 83. Ultimate Speech Search Page 74 The figure below shows the functionalities in the speech recognition engine. It can open a .wav file to play or to recognize speech. Figure 25 Open file
  • 84. Ultimate Speech Search Page 75 Once a file is selected for recognition, the user can press the start button to begin the recognition process. The recognized output can be viewed in the text output section. Figure 26: Text output
  • 85. Ultimate Speech Search Page 76 The noise filtering process was done in the second iteration, entirely in MATLAB, and it doesn't have a user interface. In the first development the noise filtering engine was not very efficient: there were many isolated noise packets in the spectrum. In the second development the system achieved remarkable performance. We have to input a noisy speech file, and when we run the program it produces a noise filtered .wav file.
  • 86. Ultimate Speech Search Page 77 The search engine was built in the third iteration. The user has to run the search engine, which accesses the local database and gives the search results. Figure 27 Speech Search Engine
  • 87. Ultimate Speech Search Page 78 CHAPTER 6 6.0 Test Plan 6.1 Background The system built for the research project comprises three main parts. The speech recognition section is the key part of this application; the noise filtering section is another key area taken into account. There is also a text search in the system which provides the facility to search speeches by content. Because this is a technical research project, the testing criteria do not look the same as in other projects. 6.2 Introduction The testing criteria are based on the input speech signals for speech recognition and noise filtering, and on the searching criteria. Due to the nature of this project we cannot make use of industrial test plans; the project is not a commercial project. For the speech recognition testing criterion, a speech in digital format is used. Speech recognition projects are still at the research stage, so it is not advisable to implement a standard heavyweight test plan; basic test plans are sufficient to assess the testing criteria mentioned in the project.
  • 88. Ultimate Speech Search Page 79 6.3 Assumptions Before declaring any assumptions it is advisable to understand the nature of the project. Within the project scope we assume that the speech recognition engine only works for noiseless speech inputs, and only on clear English accents. Noisy speech is not used as input because the speech engine cannot directly identify the noise factor and filter it out. The system can only identify the most commonly spoken words; it is possible to add a larger vocabulary, but the system has not been designed for high level language identification and processing. Noise filtering can be done on the ".wav" format only, and the system cannot eliminate the noise factor completely. It is not possible to use a noise filtered file for recognition, because the speech recognition system works on noiseless speech only. 6.4 Features to be tested For the speech recognition system a noiseless speech input in .wav format will be tested to identify the continuous speech recognition capabilities. Continuous speech recognition is a distinctive feature of modern speech recognition systems. A noisy speech file will be uploaded to the noise filtering system, which produces a noise filtered [up to a reasonable level] output file. It is possible to measure the efficiency of the noise filtering system by measuring the amount of time it takes for processing, but that is not addressed in the project. For the speech searching part the system uses a file search. The search mechanism includes an efficient file searching and text matching mechanism: once the user types a phrase, the system shows the name of the file that best matches. The system can also play a wav file before uploading it for the recognition process.
  • 89. Ultimate Speech Search Page 80 6.5 Suspension and resumption criteria While the system testing process is running, defects may give reasons to suspend the process; the suspension criteria denote what those reasons are. According to Anon. (n.d.), Suspension criteria & resumption requirements, the suspension criteria are as follows:  Unavailability of external dependent systems during execution.  A defect is introduced that does not allow any further testing.  The critical path deadline is missed, so that the client will not accept delivery even if all testing is completed.  A specific holiday shuts down both development and testing. The resumption criteria are as follows:  The external dependent systems become available again.  A fix is successfully implemented and the testing team is notified to continue testing.  The contract is renegotiated with the client to extend delivery.  The holiday period ends. According to the same source, suspension criteria assume that testing cannot go forward and that going backward is also not possible. A failed build would not suffice, as one could generally continue to use the previous build. Most major or critical defects would also not constitute suspension criteria, as other areas of the system could continue to be tested.
  • 90. Ultimate Speech Search Page 81 6.6 Environmental needs There are a few environmental needs to be met before testing the system. The environmental needs can be classified as software needs, hardware needs and legal needs. There are no legal needs because the system does not have any links to legal situations. The software needs can be listed as below:  Java runtime environment  MATLAB development software  NetBeans 6.5 or greater  Sound driver software  Windows XP operating system The hardware needs are:  A computer [hardware requirements were specified in another chapter under system requirements]  Multimedia devices
  • 91. Ultimate Speech Search Page 82 6.7 System testing Speeches and lectures with different accents [English only, US and British]:- in order to test the speech recognition engine's accuracy, it will be tested against different accents. The expected results must show a minimal difference, with minimal errors. Content search:- when the user searches by content by typing a word or a phrase, the appropriate search result will be displayed: the speech or lecture containing the specified word or phrase.
  • 92. Ultimate Speech Search Page 83 6.8 Unit testing The initial test concerned the initial user interface. At first glance the system loads only the basic interactions with the user; it does not load any calculation or extraction functionality before the user provides a correct input. Test Case Test Case One Description The user runs the speech recognition system for the first time Expected Output Open, Start and Open Speech buttons are enabled. Encode To Wav and Noise Filter buttons remain disabled. The area below "open a speech file" shows blank. Text output shows blank. Actual Output Open, Start and Open Speech buttons are enabled. Encode To Wav and Noise Filter buttons remain disabled. The area below "open a speech file" shows blank. Text output shows blank. The expected output was acquired. Table 9 Test Case 1 On the initial run the speech recognition system is not loaded with any algorithms; after an input is given, the system loads the necessary components for processing. This mechanism conserves system resources.
  • 93. Ultimate Speech Search Page 84 The second testing criterion begins when the user provides an input to the system. This test case concerns the speech recognition system's input, which can be a .wav file. Test Case Test Case Two Description The user opens a file to feed the speech recognition system. The user provides the system with a .wav file. The first input speech contains digits in the range of one to nine in a British accent. The file must be noise free. Expected Output Identified names of the digits need to be displayed in the text output area. Actual Output Due to variations in dialect the results are not exactly as expected. Within the range of one to nine the system identifies the digits and displays the output. Table 10 Test Case 2 The identification of digits can be extended beyond ten. As the name of the digit to be identified becomes longer, the system identifies the digits with an error rate.
  • 94. Ultimate Speech Search Page 85 The third testing criterion is based on the user inputting a file with noise for identification. The system does not work for files containing noise. Test Case Test Case Three Description The user provides the system with a .wav file with noise Expected Output The system throws an error or shows no results Actual Output The actual output varies due to different noise levels. If the density of the noise lies within a higher range the system raises an error, which can be a "severe null". The system gives blank results when the words are barely in an identifiable state. Table 11 Test Case 3 The system doesn't have any functionality to measure noise levels; the project scope doesn't cater for in-depth noise analysis. The noise levels mentioned above were judged from user experience. The system assumes that users will not upload files with noise, and this rule is clearly mentioned in the assumptions.
  • 95. Ultimate Speech Search Page 86 The fourth testing criterion is to check the system's speech recognition capabilities with words. Test Case Test Case Four Description The user provides the system with a .wav file containing basic words. The input doesn't contain any noise. Expected Output The system identifies all the words and shows the output in a precise manner. Actual Output The system identified words with an error rate fluctuating between 20% and 35%. Not all the words are identified by the system. Table 12 Test Case 4 The system doesn't identify all the words. The identification process depends on the speed of the utterance and the intensity of the phonemes; higher phoneme intensities help any speech recognition system achieve more precise results.
Test case five tests the performance of noise filtering. The noise filtering system was built in MATLAB.

Test Case: Test Case Five
Description: The user provides the noise filtering system with a noisy file. The input file must be in .wav format. The user has to open the MATLAB scripts, import them into the working directory and run them. The input file needs to be in the same directory.
Expected Output: An output file named "output.wav" is created in the working folder; it contains the noise-filtered version of the input file. The amplitude of the output file should not differ from the input by an amount a human can perceive.
Actual Output: The output file is created in the working folder. The output file has lower noise relative to the input file, but it is not noise free. The amplitude differs by an amount a human ear can perceive.
Table 13: Test Case 5

There is still no mechanism to remove 100% of the noise. The system works on predefined algorithms.
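One of the predefined algorithms such a filter can rely on is a simple moving-average low-pass filter over the sample array. The sketch below is not the project's MATLAB script but a hypothetical Java rendering of that idea; the class and method names are invented for illustration.

```java
// Illustrative sketch only: a moving-average low-pass filter, one common
// predefined algorithm for attenuating high-frequency noise in a signal.
// Class and method names are hypothetical, not taken from the project code.
public class NoiseFilterSketch {
    // Smooths the signal with a centered moving average of the given window
    // size; rapid sample-to-sample noise is attenuated, slow speech energy kept.
    static double[] movingAverage(double[] signal, int window) {
        double[] out = new double[signal.length];
        int half = window / 2;
        for (int i = 0; i < signal.length; i++) {
            double sum = 0;
            int count = 0;
            for (int j = i - half; j <= i + half; j++) {
                if (j >= 0 && j < signal.length) {
                    sum += signal[j];
                    count++;
                }
            }
            out[i] = sum / count;
        }
        return out;
    }

    public static void main(String[] args) {
        // An alternating (noisy) signal is pulled toward its mean.
        double[] noisy = {1, -1, 1, -1, 1, -1};
        double[] smoothed = movingAverage(noisy, 3);
        System.out.println("filtered length: " + smoothed.length); // prints "filtered length: 6"
    }
}
```

A full filter along these lines would read the samples from the input .wav file (for example with the Java Sound API), apply the smoothing, and write the result back out as output.wav.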
Test case six covers the search functionality. The search functionality acts as a speech search engine.

Test Case: Test Case Six
Description: The user has to run the search engine. Port 8080 must be free.
Expected Output: The user types a phrase into the search engine and presses the search button. If there is a match in the database the result shows as true; if there is no match the result shows as false.
Actual Output: If a match is found, "true" is displayed in the results; if no match is found, "false" is displayed.
Table 14: Test Case 6

The system was not built as a full speech search engine; it only demonstrates how a speech search engine works. As a future enhancement it would be possible to build a complete search engine.
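The true/false behaviour described above can be modelled as a phrase lookup over transcribed speech. The class below is a minimal sketch, not the project's Glassfish-hosted code; the transcript data and all names are invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the engine's true/false result: search a store of
// speech transcripts for a phrase. Names are hypothetical, not project code.
public class SpeechSearchSketch {
    private final Map<String, String> transcripts = new HashMap<>();

    // Register a transcript for an audio file, normalised to lower case.
    void addTranscript(String file, String text) {
        transcripts.put(file, text.toLowerCase());
    }

    // Returns true when any stored transcript contains the phrase,
    // mirroring the "true"/"false" text shown in the results area.
    boolean search(String phrase) {
        String needle = phrase.toLowerCase();
        for (String text : transcripts.values()) {
            if (text.contains(needle)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        SpeechSearchSketch engine = new SpeechSearchSketch();
        engine.addTranscript("lecture1.wav", "one two three four five");
        System.out.println(engine.search("three four")); // prints true
        System.out.println(engine.search("nine"));       // prints false
    }
}
```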
6.9 Performance Testing

The system's performance was tested on different operating systems, including virtual operating environments. Microsoft Windows XP was taken as the baseline operating system for the measurements.

Operating System: Microsoft Windows XP
Speech recognition engine configuration time: between 0.5 and 1 second.
Efficiency of speech recognition: an input signal with high phoneme intensity, free from noise, shorter than 10 seconds and with low word density takes around 1 to 12 seconds. Input signals containing many words take longer.
Efficiency of noise filtering (MATLAB): the noise filtering system generates its output in under 200 milliseconds for .wav clips between 2 and 10 seconds long.
Performance of the speech search engine: startup time averages 8 to 15 seconds.
Table 15: Performance testing on Windows XP
The performance of the speech search engine depends heavily on the operating system. For example, Windows operating systems use far more resources than UNIX-based operating systems. The speech search system runs on the Glassfish server, which performs better on UNIX-based operating systems; in Windows environments the speech search engine hits many deadlocks.

Operating System: Ubuntu 9.04
Speech recognition engine configuration time: between 0.2 and 0.8 seconds.
Efficiency of speech recognition: an input signal with high phoneme intensity, free from noise, shorter than 5 seconds and with low word density takes around 1 to 5 seconds. The system performs noticeably better in Ubuntu environments.
Efficiency of noise filtering (MATLAB): noise filtering was efficient compared with the Windows environment.
Performance of the speech search engine: both search and startup times were efficient compared with Windows XP.
Table 16: Performance testing on Ubuntu
Once the search engine has been run many times in a Windows environment it has a higher chance of crashing, and it will not provide correct results. When the system performs speech recognition several times in a row, recognition efficiency slows down: Java runs on a virtual machine, and the recognition process needs substantial processing power. Because of those factors, the efficiency of the system degrades with repeated use.
6.10 Integration Testing

Integration testing is a logical extension of unit testing; it identifies drawbacks that appear when a system's parts are combined. This system comprises several subsystems with different functionalities, and it is not possible to combine the noise filtering system directly with the speech recognition system or the search web front end. An overall test mechanism was therefore used for integration testing, since the subsystems are only indirectly connected to each other.

Big Bang testing

Big Bang testing takes all the unit-tested parts of a system and ties them together at once. This approach suits small systems, but it may leave many errors unidentified at the testing stage. If the developer has done unit testing correctly, Big Bang testing helps uncover further errors and saves money and time. After performing Big Bang testing on this system, the following faults were found:
- The continuous functioning of the search engine cannot be guaranteed.
- If the input signal is long, the system runs out of memory.
The disadvantages of Big Bang testing are:
- Integration testing cannot start until all the modules have been successfully evaluated.
- It is harder to track down the causes of errors.
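A Big Bang run of this kind of system can be pictured as one end-to-end harness that chains the three subsystems together at once. The code below uses stand-in stubs; none of these classes or methods come from the project sources, and the transcript returned is hard-coded purely for illustration.

```java
// Illustrative Big Bang harness: exercise stand-ins for the three subsystems
// (noise filter, recognizer, search) together in a single end-to-end run.
// All names and return values here are hypothetical stubs, not project code.
public class BigBangHarness {
    // Stub: pretend to noise-filter an input file.
    static String filterNoise(String wavFile) { return wavFile + " (filtered)"; }

    // Stub: pretend to transcribe the filtered audio.
    static String recognize(String filtered) { return "one two three"; }

    // Stub: pretend to search the transcript for a phrase.
    static boolean search(String transcript, String phrase) {
        return transcript.contains(phrase);
    }

    public static void main(String[] args) {
        // Tie everything together at once, Big Bang style.
        String filtered = filterNoise("lecture.wav");
        String transcript = recognize(filtered);
        boolean found = search(transcript, "two");
        System.out.println("end-to-end match: " + found); // prints "end-to-end match: true"
    }
}
```

Failures such as the out-of-memory fault listed above only surface in a run like this, when real subsystems replace the stubs, which is why they went undetected during unit testing.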
Incremental Testing

Incremental testing lets you compare and contrast two functionalities while you are testing, and further modules can be added and tested within the testing period. Incremental testing could not be applied to this system because it has no parallel functionalities that interact with each other.
CHAPTER 7
CRITICAL EVALUATION AND FUTURE ENHANCEMENTS

7.1 Critical Evaluation

The project was about speech recognition, taking a digital signal as input and searching by content; it is a union of several research areas. At the initial stage the research focused on speech recognition. The barriers met in the initial stage were:
- Human speech recognition: at the beginning there was no way to explain the speech recognition process, the mechanisms behind it, or how it is performed.
- Speech recognition engine: studying a speech recognition engine was crucial for the design phase, but there was no engine available to analyse or study.
To overcome those two factors, understanding the functionality of speech recognition came first. After understanding a system and completing a basic sketch of the flow diagram, that seemed a sufficient starting point for development. It is easy to record a human voice spoken over a microphone, and after recording the voice is no longer in analogue format. The obvious digital format was a .wav file, so the system performs speech recognition on files in .wav format.
In the development phase, the study of speech recognition systems did not help much with further progress, because when it comes to audio formats the digital signal processing part is hidden. The system had to address the DSP part in a reasonable manner. Digital signal processing is concerned with representing signals as sequences of numbers or symbols and with processing those signals. Within the course content we studied, there was not a single module that taught us interface programming, microcontroller programming or digital signal processing. Building functionality to handle the digital signal processing from scratch was a tedious job, and the knowledge needed to build it was not sufficient given the time available.

At the initial stage the plan was to develop the entire system in Java, but Java did not have a proper built-in API for digital signal processing. However, there were some reliable third-party components that could just about perform the task. Plugging in the third-party tools was another issue, but in the end code was found to accomplish the task. A few speech recognition systems had been built in Java, but they were not built for continuous speech recognition or for noise reduction.

There were many issues at the start. We had to define a grammar format, and there were two options. One was JVXML: Java VoiceXML is a technology that provides speech synthesis and recognition capabilities, and voice commands can be embedded in web sites using VoiceXML. For this project I chose JSGF, the Java Speech Grammar Format. JSGF supports built-in dictionaries capable of covering digits and words, and multi-language capabilities can be plugged in. When developing systems in Java, it is always advisable to use components that easily support the Java platform.
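A digit grammar in JSGF can be very small. The fragment below is a minimal sketch of what such a grammar might look like; the grammar and rule names are illustrative, not taken from the project sources.

```
#JSGF V1.0;

grammar digits;

public <digit> = one | two | three | four | five | six | seven | eight | nine;
```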
The speech recognition system can be split into two parts. The Java virtual machine allocates a maximum of 128 MB of memory for NetBeans, and the heap size of the virtual machine cannot be defined explicitly from within NetBeans. The digit recognition part can be implemented and run in the NetBeans development environment, but it is not possible to free enough memory to recognise speech that contains words; that requires more memory from the virtual machine. Because of this, speech recognition for words had to be run from the command prompt explicitly as "java -mx256m -jar". This command allocates 256 MB of heap memory for speech recognition. Noise filtering was another unsolved issue that had to be answered through the system: there was no proper support for noise filtering in Java. For a technical project of this kind it is essential to develop in fourth-generation languages or in languages like C or assembly.
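The effect of the -mx flag (an older spelling of the standard -Xmx heap option) can be confirmed from inside the program. The snippet below is a small check written for this discussion, not part of the project code:

```java
// Prints the maximum heap the JVM will use. Launched as
//   java -mx256m HeapCheck   (equivalently: java -Xmx256m HeapCheck)
// the reported value is roughly 256 MB.
public class HeapCheck {
    public static void main(String[] args) {
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.println("max heap: " + (maxBytes / (1024 * 1024)) + " MB");
    }
}
```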