Speech Recognition, Noise Filtering and Content Search Engine: Research Document
1. Ultimate Speech Search
Page i
ABSTRACT
In the modern era people tend to find information wherever they can, as efficiently as possible. They search for knowledge about past events as well as present ones. Searching for a particular item involves a search engine and the necessary information. When people want to learn from speeches or lectures given by someone, they resort to a desperate search without knowing what results to expect. If they had the luxury of a search engine that returned the required results, it would be a blessing for their work.
This project aims to build a search engine that is able to search for speeches and lectures by their content. Every search engine supports the feature of searching, but the results may be jargon; the user has to go through them one by one, and sometimes at the end of the day they end up with a null result. The main goal of this project is to provide a search facility based on content.
This research covers converting speech into text with some noise analysis, maintaining a database with clustered indexing, and providing a simple content-based search facility. The system to be built operates on limited data, namely speeches and lectures recorded in a low-noise environment. As a future enhancement, it would be able to search music or any other sound stream by analysing the spectrum, with a user-friendly search facility.
KEY WORDS: Search Engine, Speeches, Lectures, Noise Analysis, Content, Spectrum
ACKNOWLEDGEMENTS
My sincere gratitude goes to my grandfather, who taught me the ways of life, who raised me from childhood to my teenage years, and who left me one May.
I would like to thank my friends, who helped me in my difficult times and praised me in my good times. I would like to thank my college teachers, who caned me to make me a good man and gave me the knowledge to face society.
I would like to thank my sister, who has always been a mother to me, and I would like to show my gratitude to my supervisor, Mrs. Nadeera Ahangama, who guided me throughout the project.
Finally, I would like to thank the APIIT staff, who provided us with the necessary facilities to achieve our higher education and make it a success.
Table of Contents
ABSTRACT................................................................................................................................i
ACKNOWLEDGEMENTS.......................................................................................................ii
List of Figures..........................................................................................................................vii
List of Equations.................................................................................................................... viii
List of Tables ............................................................................................................................ix
INTRODUCTION .....................................................................................................................1
1.1 Project Background.....................................................................................................1
1.2 Problem Description....................................................................................................2
1.3 Project Overview.........................................................................................................4
1.3.1 Noise analysis ......................................................................................................4
1.3.2 Speech recognition...............................................................................................4
1.3.3 Speech to text conversion ....................................................................................4
1.3.4 The database.........................................................................................................5
1.3.5 The search engine ......................................................................................................5
1.4 Project Scope...............................................................................................................6
1.5 Project Objectives .......................................................................................................7
RESEARCH...............................................................................................................................8
2.1 Speech Recognition..........................................................................................................8
2.2 Speech recognition methods...........................................................................................13
2.2.1 Hidden Markov methods in speech recognition......................................................13
2.2.2 Client side speech recognition.................................................................................16
2.2.3 Dynamic Time Warping based speech recognition.........................................16
2.2.4 Artificial Neural Networks........................................................................16
2.2.5 Continuous speech recognition................................................................18
2.2.6 Direct Speech Recognition ......................................................................................18
2.3 Speaker Characteristics ..................................................................................................19
2.3.1 Speaker Dependent..................................................................................................19
2.3.2 Speaker Independent................................................................................................19
2.3.3 Conclusion...................................................................................................................20
2.4 Speech Recognition mechanisms...................................................................................21
2.4.1 Isolated word recognition........................................................................................21
4. Ultimate Speech Search
Page iv
2.4.2 Continuous speech recognition................................................................................22
2.4.3 Conclusion...............................................................................................................23
2.5 Vocabulary Size .............................................................................................................24
2.5.1 Limited Vocabulary.................................................................................................24
2.5.2 Large Vocabulary ....................................................................................................24
2.5.3 Conclusion...............................................................................................................24
2.6 Speech recognition APIs...............................................................................25
2.6.1 Microsoft Speech API 5.3 .......................................................................................25
2.6.2 Java Speech API......................................................................................................26
2.7 Speech Recognition Algorithms ....................................................................................31
2.8 Noise Filtering ...........................................................................................32
2.8.1 Wiener filtering..................................................................................33
2.8.2 Conclusion .........................................................................................33
2.9 Database and data structure............................................................................................34
2.9.1 Conclusion...............................................................................................................34
2.10 Search Engine...............................................................................................................35
2.11 MATLAB.....................................................................................................................36
ANALYSIS..............................................................................................................................37
3.1 System requirements .................................................................................37
3.1.1 Functional requirements ........................................................................37
3.1.2 Non-functional requirements ...................................................................37
3.1.3 Software Requirements............................................................................................38
3.1.4 Hardware requirements............................................................................................39
3.2 System Development Methodologies.............................................................................40
3.2.1 Rational Unified Process .........................................................................................40
3.2.2 Agile Development Method ....................................................................................43
3.2.3 Scrum Development Methodology...........................................................45
3.3 Test Plan.........................................................................................................................47
3.3.1 System testing...........................................................................................47
SYSTEM DESIGN..................................................................................................................48
4.1 Use Case Diagram.....................................................................................................48
4.2 Use case description.......................................................................................................50
4.2.1 Use case description for file upload ........................................................................50
4.2.2 Use Case description for play an audio file..............................................51
4.2.3 Use Case description for search...............................................................................52
4.2.4 Use Case description for noise reduced output .......................................................53
4.2.5 Use Case description for noise filtering ..................................................................54
4.3 Activity Diagrams ..........................................................................................................55
4.3.1 Activity Diagram for Speech Recognition System...................................55
4.3.2 Activity Diagram for Noise filtering .......................................................................56
4.4 Sequence Diagrams........................................................................................................57
4.4.1 Select a file ..............................................................................................................57
4.4.2 Play wav file ............................................................................................................58
4.4.3 Speech recognition pre stage ....................................................................59
4.4.4 Speech Recognition post stage .................................................................60
4.5 Class Diagrams...............................................................................................................61
4.5.1 GUI and the system .................................................................................................61
4.5.2 Speech recognition ..................................................................................................62
4.6 Noise Filtering................................................................................................................64
4.7 Code to filter noise in C Language.................................................................................67
CHAPTER 5 ............................................................................................................................73
5.0 Implementation ..................................................................................................................73
CHAPTER 6 ............................................................................................................................78
6.0 Test Plan.............................................................................................................................78
6.1 Background ....................................................................................................................78
6.2 Introduction....................................................................................................................78
6.3 Assumptions...................................................................................................................79
6.4 Features to be tested .......................................................................................................79
6.5 Suspension and resumption criteria ...............................................................................80
6.6 Environmental needs......................................................................................................81
6.7 System testing ................................................................................................................82
6.8 Unit testing.....................................................................................................................83
List of Figures
Figure 1: Overview of Steps in Speech Recognition.................................................................8
Figure 2 : Graphical Overview of the Recognition Process ....................................................10
Figure 3: Components of a typical speech recognition system................................................12
Figure 4 : example of HMM for word “Yes” on an utterance.................................................15
Figure 5: Overview of Microsoft Speech Recognition API ...................................................25
Figure 6 : Java Sound API Architecture ..................................................................................29
Figure 7 : JSGF Architecture...................................................................................................30
Figure 8: Noise in Speech........................................................................................................32
Figure 9 : Database Indexing...................................................................................................34
Figure 10 : Google Architecture ..............................................................................................35
Figure 11 Phases in RUP .........................................................................................................41
Figure 12 : Overview of Agile.................................................................................................43
Figure 13 : Scrum Overview....................................................................................................46
Figure 15 : Use Case Diagram for System...............................................................................48
Figure 16 Speech Recognition.................................................................................................55
Figure 17 Activity Diagram Noise Filtering...........................................................................56
Figure 18 Sequence Diagram Select a file...............................................................................57
Figure 19 Sequence Diagram Play File ...................................................................................58
Figure 20 Sequence Diagram SR Pre Stage............................................................................59
Figure 21 Sequence Diagram SR Post Stage..........................................................................60
Figure 22 Class Diagrams GUI & System...............................................................................61
Figure 23 Class Diagram SR System.......................................................................................62
Figure 24 : Speech Search Class Diagram...............................................................................63
Figure 25: SR Engine...............................................................................................................73
Figure 26 Open file..................................................................................................................74
Figure 27: Text output .............................................................................................................75
Figure 28 Speech Search Engine .............................................................................................77
List of Equations
Equation 1 : First order Markov chain.....................................................................................13
Equation 2: Stationary states Transition ..................................................................................14
Equation 3: Observations independence..................................................................................14
Equation 4: observation sequence............................................................................................14
Equation 5 : Left Right topology constraints...........................................................................15
Equation 6: CSR Equations .....................................................................................................22
List of Tables
Table 1: Typical parameters used to characterize the capability of a speech recognition system........9
Table 2 : Comparison in different techniques in speech recognition.......................................17
Table 3: Isolated word recognition ..........................................................................................21
Table 4 : Use Case description file upload ..............................................................................50
Table 5 Use Case description play audio.................................................................................51
Table 6 Use Case description search .......................................................................................52
Table 7 Use Case description noise reduction .........................................................................53
Table 8 Use Case description noise process ............................................................................54
Table 9 Test Case 1..................................................................................................................83
Table 10 Test Case 2................................................................................................................84
Table 11 Test Case 3................................................................................................................85
Table 12 Test Case 4................................................................................................................86
Table 13 Test Case 5................................................................................................................87
Table 14 Test Case 6................................................................................................................88
Table 15: Performance testing on Windows XP.........................................................89
Table 16: Performance testing on Ubuntu ...............................................................90
CHAPTER 1
INTRODUCTION
1.1 Project Background
Throughout the history of human civilization, time has played a key role. Humans achieved technological advancements, scientific breakthroughs and, unfortunately, setbacks within certain time limits. In many cases these time limits were set by nature.
In truth, we now live in an advanced era compared to prehistoric times. We are all actors in another part of the chronicle play of our time. Due to globalization, distances on this planet are narrowing. People are forced to accomplish objectives and goals within ever shorter time limits, and most of the time they lack the amount of time needed to make them a success.
When some part of society is asked to accomplish a goal, they may turn to research, interviews or various other fact-finding techniques. Imagine that they need to find certain information in lectures and speeches. Can they find the appropriate resource materials in a minimum of time and with a minimum of effort?
They have to go through many search results and commit most of their valuable time to a tedious task. If there were a way to find lectures and speeches by searching their content, we could guarantee a respectable saving of valuable time, which could then be invested in deeds for the sake of the planet.
1.2 Problem Description
The problem is to provide with the users with a search engine in order to search lectures and
speeches by their content for various purposes.
In order to do this, we have to come up with sound solutions to the challenges met throughout this process, which are as follows.
Noise analysis: we have to analyze the nature of the speech or the lecture. Speeches and lectures may be recorded under various environmental conditions, and this directly affects the vocal part of the speech, so we have to reduce the noise as much as possible.
Speech recognition: speech recognition is a vast area. Speeches may be given by many personalities with different accents; each individual has his or her own accent when speaking English or any other language. In order to recognize the words spoken, we have to research deeply to build a speech recognition engine that overcomes this challenge.
Speech-to-text conversion: speech-to-text conversion is one of the key areas of this project, because it is the basis for building the database that contains the text versions of speeches and lectures.
The database: all the converted versions of the speeches and lectures will be saved in the database.
The search engine: this is another challenging area of the project. The search engine will show the appropriate search results from the database. I need to find suitable search mechanisms and methods in order to give the user efficient and accurate results.

The database and the search engine are two parallel problems that need to be developed precisely. Without a proper structure for the database, it is tedious to implement the search functionality.
1.3 Project Overview
The main challenge area of this project is to build the database containing the text version of
speeches and lectures. In order to accomplish these phenomena we have to perform some
tasks.
1.3.1 Noise analysis
A noise analysis will be performed in order to ensure an efficient speech-to-text conversion. This enables us to isolate the human voice and remove the background environment from the audio file. This may include background noise such as tape hiss, electric fans or hums, etc.
1.3.2 Speech recognition
Speech recognition comes in two flavours: speaker independent and speaker dependent. The voice of the speaker or the lecturer may change; because of that, the project uses speaker-independent speech recognition.
1.3.3 Speech to text conversion
The system converts the speech into text format in order to build the database. The database consists of the converted text versions of the speeches and the lectures.
1.3.4 The database
The database consists of two parts: the converted (speech-to-text) speech or lecture files, and the actual source files containing the audio.
1.3.5 The search engine
The search engine searches the content of a speech or a lecture in the database and returns the actual results. We might need to do something like summarizing, so that the user can search the content more easily by typing a sentence or a word.
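The content search described above can be sketched with a minimal inverted index. The file names and transcript text below are hypothetical placeholders; in the actual system the transcripts would be read from the database.

```python
# Minimal inverted-index sketch of searching transcripts by content.
# The transcripts dictionary stands in for the speech-to-text database.

from collections import defaultdict

transcripts = {
    "lecture_01.wav": "speech recognition converts an acoustic signal to text",
    "speech_02.wav": "noise filtering improves recognition in a noisy signal",
}

# Map each word to the set of files whose transcript contains it.
index = defaultdict(set)
for name, text in transcripts.items():
    for word in text.split():
        index[word].add(name)

def search(query):
    """Return files whose transcript contains every word of the query."""
    words = query.lower().split()
    hits = [index.get(w, set()) for w in words]
    return sorted(set.intersection(*hits)) if hits else []
```

A query of several words is treated as a conjunction, which is one simple way to let users type a sentence rather than a single keyword.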
1.4 Project Scope
Existing search engines do not facilitate searching for a speech by its content. This system gives you the facility to search a speech by its content. The system contains data about English speeches and lectures.
These speeches and lectures were delivered in a low-noise environment, because the system performs only limited noise analysis. The system will not store music, because the amount of noise analysis required is higher than for a low-noise environment.
The speech recognition engine to be built supports only English speeches and lectures, and the noise analysis will likewise support only English speeches and lectures.
The system will convert speeches and lectures (low noise) to text format. After the development process, users will be able to search from anywhere on the planet for a required result.
Speaker-independent speech recognition will be used, because the system deals with different types of speeches performed by different persons with different accents.
1.5 Project Objectives
1.0 Noise analysis and reduction
The system will perform noise filtering, which helps the speech recognition process. The noisy signal channel will be analyzed and split into two parts, and the amplitude of the noisy channel set to a low value. An efficient noise filtering mechanism will be used.
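The amplitude-reduction idea above can be illustrated with a toy noise gate. The threshold and attenuation factor are illustrative assumptions, not figures from the project; a practical filter would operate on the spectrum rather than on raw samples.

```python
# Toy noise gate: attenuate quiet samples, keep loud (presumably voiced) ones.
# Threshold and attenuation are illustrative values only.

def noise_gate(samples, threshold=0.1, attenuation=0.2):
    """Scale down samples whose magnitude falls below the threshold,
    leaving louder samples untouched."""
    return [s if abs(s) >= threshold else s * attenuation for s in samples]
```

This is the crudest form of the "set the noisy channel amplitude low" objective; Wiener filtering, discussed later in the research chapter, is a more principled approach.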
2.0 Continuous speech recognition system
To develop an efficient speech recognition engine that converts speeches and lectures to text format. Speeches performed by various persons will be translated into text format.
3.0 The Database
Database implementation: the converted versions of the speeches and lectures will be stored in the database in text format, and the relevant speech or lecture audio will be stored in another database.
4.0 The search engine
The search engine searches the content of a speech or a lecture in the database and returns the actual results. We might need to do something like summarizing, so that the user can search the content more easily by typing a sentence or a word.
CHAPTER 2
RESEARCH
2.1 Speech Recognition
The process of converting a phonic signal, captured by a phone, a microphone or any other audio device, into a set of words is called speech recognition. Speech recognition is used in command-based applications such as data entry and control systems, document preparation, automation of telephone relay systems, in mobile devices such as mobile phones, and to help people with hearing disabilities.
According to Professor Todd Austin (2007), speech recognition is the task of translating an acoustic waveform representing human speech into its corresponding textual representation.
Source: Austin, T. (2007). Speech Recognition. Available: http://cccp.eecs.umich.edu/research/speech.php. Last accessed 17 July 2009.
Figure 1: Overview of Steps in Speech Recognition
Applications that support speech recognition are “introduced on a weekly basis and speech technology is rapidly entering new technical domains and new markets” (Java Speech API Programmer's Guide, 1998).
According to Zue et al. (2003), speech recognition is a process that converts an acoustic signal, which can be captured by a microphone, into a set of words. Speech recognition systems can be categorized by many parameters.
Parameter         Range
Speaking mode     Isolated words to continuous speech
Speaking style    Read speech to spontaneous speech
Enrolment         Speaker dependent to speaker independent
Vocabulary        Small (<20 words) to large (>20,000 words)
Language model    Finite state to context sensitive
Perplexity        Small (<10) to large (>100)
SNR               High (>30 dB) to low (<10 dB)
Transducer        Noise-cancelling microphone to telephone
Table 1: Typical parameters used to characterize the capability of a speech recognition system
According to Hosom et al. (2003), “The dominant technology used in Speech Recognition is called the Hidden Markov Model (HMM)”. There are four basic steps in performing speech recognition, which can be seen in the figure below.
[Source: Hosom et al., 1999]
Figure 2 : Graphical Overview of the Recognition Process
During the past few years speech recognition systems have achieved remarkable success, with recognition accuracy rates sometimes over 98 percent. However, such accuracy rates were achieved in quiet environments and by using sample words in training. It has been said that a good speech recognition system must be able to achieve good performance in many circumstances, such as a noisy environment. Noise comes in many flavours.
Air conditioners, fans, radios, coughs, tape hiss, cross talk, channel distortions, lip smacks, breath noise, pops and sneezes are the basic factors that create a noisy environment.
The typical components of a speech recognition system are training data, an acoustic model, a language model, a training model, a lexical model, the speech signal, its representation, model classification, search, and the recognized words.
The figure below shows the arrangement of these components in a speech recognition system.
Figure 3: Components of a typical speech recognition system.
2.2 Speech recognition methods
Only a few speech recognition methods prevail. They are categorized as methods for mobile devices and methods for standalone applications.
2.2.1 Hidden Markov methods in speech recognition
Andrei Markov is the founder of the Markov process. A Markov model involves probabilities defined over a finite set, usually called its states. When a state transition occurs, it generates a character from the process. The model has a finite-state Markov chain and a finite set of output probability distributions. The Hidden Markov constraints for speech recognition systems are as follows.
1 – First order Markov chain.
This follows from the assumption that the probability of a transition to a state depends only on the current state:
P(q_{t+1} = S_j \mid q_t = S_i, q_{t-1} = S_k, q_{t-2} = S_w, \ldots, q_{t-n} = S_z) = P(q_{t+1} = S_j \mid q_t = S_i)
Equation 1 : First order Markov chain
2 – Stationary state transitions.
This assumption states that the transition probabilities do not depend on time:
a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i)
Equation 2: Stationary states Transition
3 – Observation independence.
This assumption states that the observations depend only on the underlying Markov chain. However, this assumption has been deprecated:
P(O_t \mid O_{t-1}, O_{t-2}, \ldots, O_{t-p}, q_t, q_{t-1}, q_{t-2}, \ldots, q_{t-p}) = P(O_t \mid q_t, q_{t-1}, q_{t-2}, \ldots, q_{t-p})
Equation 3: Observations independence
where p represents the considered history of the observation sequence.
b_j(O_t) = P(O_t \mid q_t = j)
Equation 4: Observation sequence
4 – Left-Right topology constraint:
a_{ij} = 0 \quad \text{for all } j > i + 2 \text{ and } j < i

P(q_1 = S_i) = \begin{cases} 1 & i = 1 \\ 0 & 1 < i \le N \end{cases}
Equation 5 : Left Right topology constraints
The figure below shows an example of HMM for word “Yes” on an utterance.
Figure 4 : example of HMM for word “Yes” on an utterance
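A minimal sketch of how the constraints above combine in practice is the forward algorithm, which scores an observation sequence against a left-right HMM. The three-state model and all probabilities below are illustrative only, not values from any real recognizer.

```python
# Forward algorithm for a small left-right HMM (illustrative values only).
# a[i][j] = 0 for j < i, matching the left-right topology constraint above.

a = [  # state transition probabilities
    [0.6, 0.4, 0.0],
    [0.0, 0.7, 0.3],
    [0.0, 0.0, 1.0],
]
b = [  # b[j][k] = P(symbol k | state j), two output symbols
    [0.9, 0.1],
    [0.5, 0.5],
    [0.1, 0.9],
]
pi = [1.0, 0.0, 0.0]  # a left-right model starts in its first state

def forward(obs):
    """Return P(obs | model) by summing over all state paths."""
    alpha = [pi[j] * b[j][obs[0]] for j in range(3)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * a[i][j] for i in range(3)) * b[j][o]
                 for j in range(3)]
    return sum(alpha)
```

In word recognition, one such model is trained per word (e.g. "Yes"), and the model giving the highest forward probability for the utterance wins.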
2.2.2 Client side speech recognition
According to Hosom et al. (2003), client-side speech recognition is technology that allows
a computer to identify the words that a person speaks into a microphone or telephone. The
basic advantage of client-side speech recognition is that it assures a faster response time,
because all the processing is handled on the client side. Another advantage is that it does not
need any network connection such as GPRS. According to Hagen et al. (2003, p.66), the main
problems of client-side speech recognition are recognition accuracy and running time (power
consumption).
2.2.3 Dynamic Time Warping based speech recognition
This method was used in past decades but has since been deprecated. The algorithm
measures the similarity between two sequences which may vary in time or speed. A number of
stored templates are used to perform automatic speech recognition: the input is compared
against each template, the comparison yields a normalized distortion, and the template with the
lowest normalized distortion is identified as the spoken word.
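As a hedged sketch of the similarity measure described above (not the project's implementation), the classic DTW recurrence can be written in a few lines of Python; the two example contours are invented.

```python
def dtw_distance(a, b):
    """Dynamic-time-warping distance between two 1-D sequences.

    Each cell holds the minimal cumulative |a[i]-b[j]| cost over all
    monotonic alignments; stretching the time axis is what lets
    utterances of different speeds match the same template.
    """
    n, m = len(a), len(b)
    INF = float("inf")
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # insertion
                                 d[i][j - 1],      # deletion
                                 d[i - 1][j - 1])  # match
    return d[n][m]

# A slow and a fast rendition of the same contour align at zero cost:
slow = [0, 1, 1, 2, 2, 3, 3, 2, 1, 0]
fast = [0, 1, 2, 3, 2, 1, 0]
```

In a template-based recognizer, `dtw_distance` would be evaluated against every stored template and the word with the lowest normalized distortion chosen.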
2.2.4 Artificial Neural Networks
The mechanism inside an ANN-based recognizer is to filter the human speech frequencies from the
other frequencies, exploiting the fact that non-speech sound covers a higher frequency range than
speech.
The table below shows a comparison between different speech recognition mechanisms.
Source: anon. (n.d.). School of Electrical, Computer and Telecommunications Engineering.
Available: http://www.elec.uow.edu.au/staff/wysocki/dspcs/papers/004.pdf. Last accessed 23rd August 2009.
Table 2 : Comparison in different techniques in speech recognition
2.2.5 Continuous speech recognition
Continuous speech recognition is used when a speaker pronounces words, sentences or
phrases in a series or a specific order, dependent on each other as if linked
together. It assumes that words are connected to each other rather than
separated by pauses.
Because a greater variety of effects is present, continuous speech is tedious to handle.
Co-articulation is another serious issue in continuous speech recognition: the effect of the
surrounding phonemes on a single phoneme is high. The start and end of a word are affected by
the neighbouring words and also by the speed of the speech; fast speech is harder to track.
Two algorithms are usually involved in continuous speech recognition: the Viterbi algorithm
and the Baum-Welch algorithm.
2.2.6 Direct Speech Recognition
This process identifies speech word by word, with each word followed by a pause.
2.3 Speaker Characteristics
2.3.1 Speaker Dependent
Speaker dependent speech recognition systems are developed for a single user only; no other
user can use the system. These systems must be trained by that user before they can function.
One advantage is that such systems support a larger vocabulary than speaker independent
systems; the disadvantage is the restriction on who can use them. This technology is used
in stenomasks.
2.3.2 Speaker Independent
Speaker independent speech recognition systems are harder to implement than speaker
dependent ones, because the system needs to recognize the patterns and different accents
spoken by many users. The advantage is that such a system can be used by many users
without training.
The most important step in building a speaker independent system is to identify which
parts of speech are generic and which vary from person to person. Speaker independent
speech recognition systems can serve many users despite being harder to implement.
2.3.3 Conclusion
A speaker independent speech recognition system has been selected for the project because
the system has to deal with many speeches by many speakers.
Accent and phoneme patterns differ from speaker to speaker, and it is not possible to
perform individual training for each and every speaker.
The Java Speech API only supports speaker independent speech recognition, which is
another reason for this choice.
2.4 Speech Recognition mechanisms
2.4.1 Isolated word recognition
This identifies a single word at a time, with pauses between words. Isolated word
recognition is the primary stage of speech recognition and is widely used in command-based
applications.
Isolated word recognition needs less processing power, and only elementary pattern matching
algorithms are involved.
Table 3: Isolated word recognition
2.4.2 Continuous speech recognition
According to Hunt, A. (1997), continuous speech is more difficult to handle because it is
difficult to find the start and end points of words, and because of co-articulation: the
production of each phoneme is affected by the production of surrounding phonemes.
According to Peinado & Segura (2006, p.9), there are three types of errors in continuous
speech recognition systems:
Substitutions - the recognized sentence has different words substituting original words.
Deletions - the recognized sentence is missing words.
Insertions - the recognized sentence has new or extra words.
Error rate calculation in continuous speech recognition is given by Stephen et al. (2003, p.2):
Correct Percentage = (H / N) x 100%
Accuracy = ((N - D - S - I) / N) x 100%
Accuracy = ((H - I) / N) x 100%
Word Error Rate = ((S + D + I) / N) x 100%
Equation 6: CSR Equations
where H is the number of words correctly recognized, N is the total number of words in the
actual speech, D the number of deletions, S the number of substitutions, and I the number
of insertions.
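The quantities S, D and I are normally obtained from a word-level edit-distance alignment of the recognized sentence against the reference. As an illustrative sketch (not from the project's code, and with invented example sentences), the word error rate of Equation 6 can be computed as follows.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate via Levenshtein distance over word tokens.

    Returns (S + D + I) / N, matching Equation 6, where N is the number
    of words in the reference (actual) speech and the edit distance
    counts the minimal mix of substitutions, deletions and insertions.
    """
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)
    # d[i][j] = minimal edits turning ref[:i] into hyp[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                            # i deletions
    for j in range(m + 1):
        d[0][j] = j                            # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[n][m] / n
```

For example, recognizing "the search engine work" against the reference "the speech search engine works" costs one deletion and one substitution, giving a WER of 2/5.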
2.4.3 Conclusion
For the project, the continuous speech recognition mechanism has been chosen because the
system will deal with continuous speeches in order to build its database, and the back end
of the system runs as a standalone application.
2.5 Vocabulary Size
Vocabulary is the set of words known by a person: the greater the vocabulary, the greater
the depth of what can be expressed. The same rule applies to speech recognition systems.
2.5.1 Limited Vocabulary
Limited vocabulary systems know only a limited number of words, typically between 100 and
10,000. These systems need less processing power and are more suitable for mobile devices.
2.5.2 Large Vocabulary
A large vocabulary speech recognition system is mainly used on servers or in standalone
applications and involves more processing power. It will identify almost every word spoken
by a person. Such a vocabulary holds more than 10,000 words.
2.5.3 Conclusion
A large vocabulary has been chosen for the project because the project's main processes are
handled by standalone applications and it has to work with many speeches.
2.6 Speech recognition APIs
2.6.1 Microsoft Speech API 5.3
The Microsoft Speech API reduces the coding overhead for programmers. It is equipped with
both speech-to-text and text-to-speech capabilities.
This API requires a .NET-based build environment and has to be purchased. The scope of the
Speech Application Programming Interface, or SAPI, lies within Windows environments: it
allows the use of speech recognition and speech synthesis within Windows applications.
Applications that use SAPI include Microsoft Office, Microsoft Agent and Microsoft Speech
Server.
In general, SAPI defines a set of interfaces and classes for developing dynamic speech
recognition systems. SAPI uses two libraries, one for the front end and one for the back
end: the front end uses the "FastFormat" library and the back end uses "Pantheios". Both
are open-source C++ libraries.
Figure 5: Overview of Microsoft Speech Recognition API
2.6.2 Java Speech API
The Java Speech API provides both speech recognition and synthesis capabilities, and it is
freely available. JSAPI supports multi-platform development as well as open-source and
non-open-source third-party tools. The JSAPI package comprises javax.speech,
javax.speech.recognition and javax.speech.synthesis.
Sun Microsystems built JSAPI in collaboration with:
Apple Computer, Inc.
AT&T
Dragon Systems, Inc.
IBM Corporation
Novell, Inc.
Philips Speech Processing
Texas Instruments Incorporated
It supports speaker independent speech recognition and W3C standards.
Speech recognizer's capabilities:
Built-in grammars (device specific)
Application-defined grammars
Speech synthesizer's capabilities:
Formant synthesis
Concatenative synthesis
The Java Speech API specifies a cross-platform interface to support command-and-control
recognizers, dictation systems and speech synthesizers. It covers two technologies: speech
synthesis and speech recognition. Speech synthesis is the reverse process, producing
synthetic speech from text generated by an application, an applet, or a user.
With the synthesis capabilities, developers can build applications that generate speech
from text.
There are two primary steps in producing speech from text.
Structure analysis: Processes the input text to determine where paragraphs, sentences, and
other structures start and end. For most languages, punctuation and formatting data are used
in this stage.
Text pre-processing: Analyzes the input text for special constructs of the language. In
English, special treatment is required for abbreviations, acronyms, dates, times, numbers,
currency amounts, e-mail addresses, and many other forms. Other languages need special
processing for these forms, and most languages have other specialized requirements.
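As a toy illustration of this text pre-processing stage (the lookup tables, function name and example sentence below are all invented for the sketch; a real synthesizer's normalizer handles dates, currency and multi-digit numbers far more carefully):

```python
import re

# Invented, minimal lookup tables for illustration only.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street", "etc.": "et cetera"}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

def preprocess(text: str) -> str:
    """Expand abbreviations and spell out digits before synthesis."""
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # Spell each digit of a number individually ("42" -> "four two").
    return re.sub(r"\d+",
                  lambda m: " ".join(DIGITS[int(c)] for c in m.group()),
                  text)
```

For instance, "Dr. Smith lives at 42 Main St." becomes "Doctor Smith lives at four two Main Street".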
Speech recognition enables the computer to listen to human speech, understand and recognize
it, and convert it into text.
The following steps are involved in building a speech recognition system.
Grammar design: Defines the words that may be spoken by a user and the patterns in
which they may be spoken.
Signal processing: Analyzes the spectrum characteristics of the incoming audio.
Phoneme recognition: Compares the spectrum patterns to the patterns of the
phonemes of the language being recognized.
Word recognition: Compares the sequence of likely phonemes against the words and
patterns of words specified by the active grammars.
Result generation: Provides the application with information about the words the
recognizer has detected in the incoming audio.
Alongside JSAPI we need two further Java APIs: the Java Sound API and the Java Media
Framework. The Java Sound API can handle sounds and is equipped with a rich set of classes
and interfaces that deal directly with incoming sound signals. The Java Sound API is widely
used in the following areas and industries.
Communication frameworks, such as conferencing and telephony
End-user content delivery systems, such as media players and music using streamed
content
Interactive application programs, such as games and Web sites that use dynamic
content
Content creation and editing
Tools, toolkits, and utilities
The Java Sound API uses a hardware-independent architecture. It is designed to allow
different sorts of audio components to be installed on a system and accessed through the API.
With the Java Sound API we can process both MIDI (Musical Instrument Digital Interface)
and WAV sound formats.
The Java Media Framework is a more recently developed framework which can be used to build
dynamic multimedia applications.
Figure 6 : Java Sound API Architecture
2.6.2.1 Java Speech and Grammar format
JSGF, or the Java Speech Grammar Format, was created by Sun Microsystems. It defines the
set of rules and words for speech recognition. JSGF is a platform-independent specification
derived from the Speech Recognition Grammar Specification.
The Java Speech Grammar Format has been developed for use with recognizers that
implement the Java Speech API. However, it may also be used by other speech recognizers
and in other types of applications.
A typical grammar rule is a composition of the text to be spoken and references to other
grammar rules. A JSGF file comes in either a plain-text format or an XML format.
Source: anon. (n.d.). JSGF Architecture. Available: http://www.cs.cmu.edu/. Last accessed
24th July 2009.
Figure 7 : JSGF Architecture
2.7 Speech Recognition Algorithms
The Viterbi algorithm is widely used in speech recognition. It is a dynamic programming
algorithm that deals directly with hidden Markov models. The Baum-Welch algorithm is
another algorithm used in this process; it involves probability and maximum likelihood
estimation. The forward-backward algorithm also deals directly with hidden Markov models.
There are three steps in this algorithm:
Computing forward probabilities
Computing backward probabilities
Computing smoothed values
A customized combination of the above algorithms will be used in the project.
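As a hedged sketch of how the Viterbi algorithm recovers the most likely state path through an HMM (the two-phoneme model and all its probabilities below are invented for the example; this is not the project's recognizer):

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden-state path for an observation sequence.

    Dynamic programming over the HMM: V[t][s] is the probability of
    the best path ending in state s after observing obs[:t+1].
    """
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states)
            V[t][s] = prob
            back[t][s] = prev
    # Trace the best final state back to the start.
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path

# Toy two-phoneme model with invented probabilities:
states = ["/y/", "/eh/"]
start_p = {"/y/": 0.8, "/eh/": 0.2}
trans_p = {"/y/": {"/y/": 0.6, "/eh/": 0.4},
           "/eh/": {"/y/": 0.1, "/eh/": 0.9}}
emit_p = {"/y/": {"hi": 0.7, "lo": 0.3},
          "/eh/": {"hi": 0.2, "lo": 0.8}}
```

Given the toy observation sequence ["hi", "lo", "lo"], this decodes the path /y/ → /eh/ → /eh/.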
2.8 Noise Filtering
Noise can appear in a speech through tape hiss, clapping, coughing or other environmental
or machinery factors, and it plays a major role in speech recognition.
Source (anon. (nd). Departement Elektrotechniek. Available:
http://www.esat.kuleuven.be/psi/spraak/theses/08-09-en/clp_lp_mask.png. Last accessed 22
September 2009)
Figure 8: Noise in Speech
According to Khan, E. and Levinson, R. (1998), speech recognition has achieved quite
remarkable progress in recent years:
Many speech recognition systems are capable of producing very high recognition
accuracies (over 98%).
But such recognition accuracy only applies in a quiet environment (very low noise)
and for speakers whose sample words were used during training.
Spectral subtraction and Wiener filtering are the two most popular noise reduction methods
because they are straightforward to implement.
2.8.1 Wiener filtering
Wiener filtering uses a common model for noisy signals: the observed signal z(k) is a clean
signal s(k) plus additive noise n(k) that is uncorrelated with the signal, z(k) = s(k) + n(k).
If the noise is also stationary, the power spectra of the signal and noise add:
Pz(w) = Ps(w) + Pn(w).
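Under this model, one common formulation attenuates each frequency bin by the gain H(w) = Ps(w) / (Ps(w) + Pn(w)). The sketch below illustrates that idea with invented per-bin power values; it is not the MATLAB code used in the project.

```python
def wiener_gain(signal_power, noise_power):
    """Per-bin Wiener gain H(w) = Ps(w) / (Ps(w) + Pn(w)).

    Since Pz(w) = Ps(w) + Pn(w) for uncorrelated, stationary noise,
    the gain can equivalently be estimated as (Pz - Pn) / Pz from the
    observed spectrum and a noise estimate.
    """
    return [ps / (ps + pn) if ps + pn > 0 else 0.0
            for ps, pn in zip(signal_power, noise_power)]

def apply_gain(spectrum, gain):
    """Attenuate each complex spectrum bin by its Wiener gain."""
    return [z * g for z, g in zip(spectrum, gain)]

# Bins where speech dominates are kept; noise-only bins are suppressed:
Ps = [9.0, 4.0, 0.0]   # invented speech power per bin
Pn = [1.0, 4.0, 2.0]   # invented noise power per bin
```

With these invented values the gains are 0.9, 0.5 and 0.0: the bin with no speech power is silenced entirely.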
2.8.2 Conclusion
The Wiener filtering method has been chosen for the project because it is a widely accepted
method and easy to implement.
2.9 Database and data structure
The database contains the text versions of the speeches and their locations. The sample
database is maintained on the hard disk, and the locations are saved in a file. Database
indexing is used for efficient search results.
Database indexing improves the speed of lookups over the data structure. Indexing can be
divided into two kinds: clustered and non-clustered.
Non-clustered indexing does not preserve the order of the actual records, which results in
additional input and output operations to fetch the actual results.
Clustered indexing reorders the data blocks according to their index keys, which is more
efficient for searching.
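The difference can be sketched in a few lines: keeping records physically sorted by their index key lets a binary search land directly on the data, whereas an unordered layout forces extra lookups. The record names below are invented for the example.

```python
import bisect

# Invented sample records: (keyword, location of the speech text file).
records = [("einstein", "speech_07.txt"), ("apollo", "lecture_02.txt"),
           ("gravity", "lecture_11.txt"), ("relativity", "speech_07.txt")]

# "Clustered" layout: records physically reordered by their index key,
# so a binary search over the keys lands directly on the data block.
clustered = sorted(records)
keys = [k for k, _ in clustered]

def lookup(keyword):
    """O(log n) search over the clustered (key-ordered) records."""
    i = bisect.bisect_left(keys, keyword)
    if i < len(keys) and keys[i] == keyword:
        return clustered[i][1]
    return None
```

A non-clustered index would instead map each key to a position in the original unordered list, adding an extra indirection per lookup.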
2.9.1 Conclusion
Clustered indexing has been chosen for the project because the system revolves around search
operations over speeches.
Figure 9 : Database Indexing
2.10 Search Engine
The search engine acts as the terminal for searching speeches and lectures. It checks for
results in a locally deployed database that contains the text versions of speeches and
lectures. A search engine operates in the order of web crawling, indexing and searching.
Source: Brin, S. and Page, L. (n.d.). The Anatomy of a Large-Scale Hypertextual Web Search
Engine. Available: http://infolab.stanford.edu/~backrub/google.html. Last accessed 24 March
2009.
Figure 10 : Google Architecture
2.11 MATLAB
MATLAB was developed by MathWorks, a privately held multinational company specializing in
technical software.
MATLAB is a multi-platform, fourth-generation programming language. Like many other
languages, MATLAB supports the following features:
Matrix manipulation
Plotting of functions and data
Algorithm implementation
Create Graphical user interfaces
Interfacing with other programming languages
Most MATLAB code has a numerical character. Even so, MATLAB lets us build systems
precisely, and the number of lines of code required to build a system is relatively small
compared with languages such as Java or C#.
Like other object-oriented languages, MATLAB supports classes, interfaces and functions,
which are used in high-level MATLAB programming.
MATLAB directly supports both analogue and digital signal processing, with a rich feature
set: signal transforms and spectral analysis, digital system design, digital filtering,
adaptive filtering, and coding and compression algorithms.
CHAPTER 3
ANALYSIS
3.0 System requirements
3.1.1 Functional requirements
1. The application must convert the speech or lecture to a text format.
2. The converted text should be visible to the user.
3. If the speech or lecture contains noise, it must be reduced to a level suitable for the
speech recognition process.
4. Speeches with different accents need to be identified reasonably well by the system.
5. The search results must be efficient and reliable.
3.1.2 Non functional requirements
1. The search algorithm needs to be efficient.
2. It should not return duplicate search results.
3. Searching should not take excessive time.
4. Speech-to-text conversion must be efficient and accurate.
5. Noise reduction must maintain fair performance.
3.1.3 Software Requirements
Java JDK 1.6: JDK 1.6 is equipped with state-of-the-art technology and a great deal of
functionality. The newest version of the Java Sound API is required.
NetBeans IDE 6.5: an open-source IDE equipped with PHP, JavaScript and Ajax editing
features, the Java Persistence API, and tighter GlassFish v3 and MySQL integration. It also
provides features for architectural drawings of the system and powerful Java EE components
that are essential for building the search engine. Any third-party component used by the
system can be integrated without much effort, it offers code generation, and many
non-open-source plugins support this IDE.
Windows XP or an equivalent operating system: Windows XP supports both open-source and
commercial components, and everything essential for the project can be deployed on it.
Windows XP is a robust, user-friendly operating system compared with other Windows
operating systems.
Apache Tomcat server, version 5.5.27 (available at http://tomcat.apache.org/): a freely
available, robust, open-source server on which web programs can run. It has many third-party
components that are essential for integrating standalone, mobile and web-based applications
with each other. This server comes with the NetBeans IDE.
XML database: a widely used open-source choice. It directly supports the Apache Tomcat
server and the NetBeans IDE, and its crash rate is lower compared with other databases used
with web services.
Proper sound driver software is required in order to achieve the best results.
MATLAB is required to perform the noise filtering.
3.1.4 Hardware requirements
32-bit Intel dual-core processor or greater: during the development phase of the project a
large amount of processing power is required for speech recognition, noise analysis,
speech-to-text conversion and the search. A high-end machine is advisable in order to
prevent deadlocks.
64-bit PCI sound card: a high-end sound card is required to process digital audio signals.
A minimum of 1 GB of DDR3 RAM is required, and 2 GB of virtual memory must be present
in the system.
The default components of a personal computer are required.
A modem or router is required in order to test searching between many users.
A 1 Mb ADSL internet connection or faster is required for data gathering.
A microphone is needed for the future enhancements, so that users can store their own
speech and anyone can later search for particular speeches and lectures.
A 20 GB hard disk with a rotation rate of 7200 rpm or more is required, because the system
maintains the database on the development machine.
Note: at least a dual-core processor is required because the speech recognition process
needs massive processing power.
3.2 System Development Methodologies
All the methodologies compared here are extended versions of previously common
methodologies.
3.2.1 Rational Unified Process
The Rational Unified Process is a development methodology created by the Rational Software
division of IBM in 2003. It is an iterative system development process, and it explains in
detail how specific goals are achieved.
RUP is a methodology for managing object-oriented software development. According to
Kroll and Kruchten (2003), "The RUP is a software development approach that is iterative,
architecture-centric, and use-case-driven." RUP's extensible features are as
follows.
Iterative Development
Requirements Management
Component-Based Architectural Vision
Visual Modeling of Systems
Quality Management
Change Control Management
The figure below shows a basic overview of its phases.
Source: (anon. (nd). Department of Computer Science. Available:
people.cs.uchicago.edu/~matei/CSPP523/lect4.ppt. Last accessed 24 march 2009.)
Figure 11 Phases in RUP
Advantages of RUP
It is a well-defined and well-structured software engineering process.
It supports changing requirements and provides means to manage the change and
related risks
It promotes higher level of code reuse.
It reduces integration time and effort, as the development model is iterative.
It allows systems to run earlier than with other processes, which is essential for this
system.
The risk management feature allows risks to be identified before the development process begins.
It has the unique feature of "plan a little, design a little, code a little".
RUP is an idea driven, principle based methodology.
RUP methodology is a worldwide commercial standard.
Disadvantages of RUP
For most projects RUP is an insufficient methodology on its own.
The processes need to be customized for various situations.
It has poor usability support.
The process is relatively complex and heavyweight.
3.2.2 Agile Development Method
The Agile development methodology is an iterative process. Agile has short iterations and
therefore minimal risk. Agile breaks tasks into small increments with minimal planning and
does not directly involve long-term planning. Agile strongly supports object-oriented
development.
Above all, Agile includes the practice of Extreme Programming, now widely used in the
software development process.
According to Ambler (2005), Agile is an iterative and incremental (evolutionary) approach to
software development which is performed in a highly collaborative manner by self-organizing
teams within an effective governance framework that produces high quality software in a
cost effective and timely manner which meets the changing needs of its stakeholders.
Figure 12 : Overview of Agile
Advantages in Agile Software Development
Increased Control
Rapid Learning
Early Return on Investment
Satisfied Stakeholders
Responsiveness to Change
Disadvantages in Agile Software Development
Agile involves heavy documentation.
Agile requirements are barely sufficient for some projects.
It is not an organized methodology.
Because testing is integrated throughout development, the development cost is
relatively high.
Too much user involvement may spoil the project.
3.2.3 Scrum Development Methodology
According to Mikneus, S. and Akinde, A. (2003):
Scrum is an Agile Software Development Process.
Scrum is not an acronym
Name taken from the sport of Rugby, where everyone in the team pack acts together
to move the ball down the field
Analogy to development is the team works together to successfully develop quality
software
According to Jeff Sutherland (2003), "Scrum assumes that the systems development process is
an unpredictable, complicated process that can only be roughly described as an overall
progression." "Scrum is an enhancement of the commonly used iterative/incremental
object-oriented development cycle." Scrum principles include:
Quality work: empowers everyone involved to be feeling good about their job.
Assume Simplicity: Scrum is a way to detect and cause removal of anything that gets
in the way of development.
Embracing Change: Team based approach to development where requirements are
rapidly changing.
Incremental changes: Scrum makes these possible using sprints where a team is able
to deliver a product (iteration) deliverable within 30 days.
Advantages in Scrum
Scrum has the ability to respond to unforeseen software development risks.
It is a specialized process for commercial application development.
It gives developers the facility to deliver a functional application to clients.
Disadvantages in Scrum
Not suitable for research-based software development.
Source: anon. (n.d.). Available: http://www.methodsandtools.com/archive/scrum1.gif. Last
accessed 26th March 2009.
3.2.4 Conclusion
The Agile software development methodology has been chosen for the development process
because it supports object-oriented development, has short iterations and supports Extreme
Programming.
Figure 13 : Scrum Overview
3.3 Test Plan
The system's main functionalities are noise analysis, speech recognition, database indexing
(which directly affects the search) and the search engine.
The system takes data (speeches and lectures) recorded under various conditions with lowered
noise, but it cannot guarantee the absence of noise. For that reason we perform noise
analysis and try to reduce the noise; otherwise it will affect the speech recognition process.
3.3.1 System testing
Speeches and lectures with different accents [English only: US and British]: in order to
test the speech recognition engine's accuracy, it will be tested against different accents.
The expected results must show minimal differences and minimal errors.
Content search: when the user searches by content by typing a word or a phrase, the
appropriate search results will be displayed; the speech or lecture containing the
specified word or phrase will be shown.
CHAPTER 4
SYSTEM DESIGN
4.1 Use Case Diagram
The noise filter's functionality is implemented separately from the speech recognition
system. The noise filtering system is represented as an "Actor".
Figure 14 : Use Case Diagram for System
The figure above shows the use case diagram for the entire system, which mainly consists of
two actors. A user can upload a speech file in WAV format to perform speech recognition.
Noise filtering is handled by a separate system: the user uploads a noisy speech file, and
the noise filtering system produces a file with lowered noise.
4.2 Use case description
4.2.1 Use case description for file upload
Use Case Use Case One
Description User uploads a file
Actors user
Assumptions User uploads a file in .wav format; the file should be free of noise.
Steps User runs the system, presses the Open button and selects a file.
Variations A user may upload a file with or without noise.
Non functional
requirements
All the necessary hardware configuration must be met.
Issues None
Table 4 : Use Case description file upload
4.2.2 Use Case description for playing an audio file
Use Case Use Case Two
Description User plays a .wav file
Actors User
Assumptions User can only play a file in wav format
Steps User opens a file; the Play button then becomes enabled, and the user presses it.
Variations No variations; only files in WAV format can be played.
Non functional
Requirements
All the necessary hardware configuration must be met.
Issues None
Table 5 Use Case description play audio
4.2.3 Use Case description for search
Use Case Use Case Three
Description User searches for a speech by content
Actors User
Assumptions User can search for a speech by typing a sentence
Steps User runs the speech search program, types what he/she wants to search for and
presses Search.
Variations No variations
Non functional All the necessary hardware configuration must be met.
Issues None
Table 6 Use Case description search
4.2.4 Use Case description for noise reduced output
Use Case Use Case Four
Description Noise reduction output produced by the system
Actors Noise filtering system
Assumptions Permanent elimination of the noise is unattainable.
User uploads a noisy file in WAV format
Steps
User runs the noise filtering program in MATLAB
User inputs a file which includes noise
Variations No variations
Non functional All the necessary hardware configuration must be met.
Issues None
Table 7 Use Case description noise reduction
4.2.5 Use Case description for noise filtering
Use Case Use Case Five
Description The process of filtering noise
Actors Noise filtering system
Assumptions Permanent elimination of the noise is unattainable.
User uploads a noisy file in WAV format
The chosen mechanism for noise filtering is the most suitable one
Steps
User runs the noise filtering program in MATLAB
User inputs a file which includes noise
Variations No variations
Non functional All the necessary hardware configuration must be met.
Issues None
Table 8 Use Case description noise process
4.3 Activity Diagrams
4.3.1Activity Diagram for Speech Recognition System
Figure 15 Speech Recognition
4.4 Sequence Diagrams
4.4.1 Select a file
Figure 17 Sequence Diagram Select a file
4.4.2 Play wav file
The system can play a file. Two main control classes are involved in this process. The
WavFileRecognition class acts as a mediator, passing messages between functionalities in
the other classes.
Figure 18 Sequence Diagram Play File
4.4.3Speech recognition pre stage
In the speech recognition pre-stage, the system is loaded with the configuration file and
the input signal. A recognizer is allocated through the configuration manager.
Figure 19 Sequence Diagram SR Pre Stage
4.4.4Speech Recognition post stage
In the speech recognition post-stage, the input digital signal goes through fast Fourier
transformation, segmentation, and identification of dialects and phonemes. The classes
AudioFileDataSource and Recognizer provide the functionality to perform these tasks.
Figure 20 Sequence Diagram SR Post Stage
4.5 Class Diagrams
4.5.1 GUI and the system
The figure below shows the class diagram of the GUI and WavFileRecognizer.
Figure 21 Class Diagrams GUI & System
4.6 Noise Filtering
Noise filtering was done using MATLAB. Although MATLAB supports object orientation,
polymorphism and inheritance, the filtering script uses none of them. I have generated
equivalent code in C to tally with the MATLAB code.
%ver 1.56
function noiseReduction
%----- user data -----
steps_1 = 512;
chunk = 2048;
coef = 0.01*chunk/2;
The three definitions above set up the user data used by the MATLAB script. The
term chunk means a small segment of the input signal. The script below can be used
to filter the noise of any given input signal.
%Windowing Techniques
%w1 = .5*(1 - cos(2*pi*(0:chunk-1)'/(chunk))); %hanning
w1 = [.42 - .5*cos(2*pi*(0:chunk-1)/(chunk-1)) + .08*cos(4*pi*(0:chunk-1)/(chunk-1))]';
%Blackman
w2 = w1;
The Blackman window technique is used here to chop the signal into small segments. The
input signal is repeatedly split into small chunks; chunk is the technical term for a
segment in digital signal processing.
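The windowing and chunking step can be sketched in plain Python as follows. This is an illustrative sketch, not the project's code; the function names `blackman` and `windowed_chunks` are mine, and the defaults mirror the `chunk = 2048` and `steps_1 = 512` values above.

```python
import math

def blackman(n):
    """Blackman window of length n, same formula as the MATLAB w1 above."""
    return [0.42 - 0.5 * math.cos(2 * math.pi * k / (n - 1))
            + 0.08 * math.cos(4 * math.pi * k / (n - 1))
            for k in range(n)]

def windowed_chunks(signal, chunk=2048, hop=512):
    """Yield overlapping chunks of the signal, each multiplied by the window."""
    w = blackman(chunk)
    count = 0
    while count < len(signal) - chunk:
        yield [s * wi for s, wi in zip(signal[count:count + chunk], w)]
        count += hop
```

With chunk = 2048 and hop = 512, consecutive chunks overlap by 75%, which is what makes the later overlap-add reconstruction smooth.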
% input wav file and extract required data
[input, FS, N] = wavread('input.wav');
L = length(input);
The input signal is extracted and rearranged into a matrix. L is the total number of
samples in the signal. The matrix mechanics are hidden by MATLAB.
% zero padding for input file
input = [zeros(chunk,1);input;zeros(chunk,1)]/ max(abs(input));
% the zeros appended around the input sound file let the windowing
% sample the complete sound file
%----- initializations -----
output = zeros(length(input),1);
count = 0;
% block by block fft algorithm
Normally a noise signal has higher frequencies. The system first obtains a median value for
the noise factor; the loop below repeatedly takes segments and analyzes their mean value.
while count<(length(input) - chunk)
grain = input(count+1:count+chunk).* w1; % windowing
f = fft(grain); % fft of window data
r = abs(f); % magnitude of window data
phi = angle(f); % phase of window data
ft = denoise(f,r,coef);
This function reduces the amplitude of each chunk. A single chunk is taken as an
argument by the function.
grain = real(ifft(ft)).*w2; % take inverse fft of window data
output(count+1:count+chunk) = output(count+1:count+chunk) + grain; % append data to output file
count = count + steps_1; % increment by hop size
end
output = output(1:L) / (4.75*max(abs(output))); % the 4.75*max(abs(output)) maintains
% consistency between input and output volume
%soundsc(output, FS);
wavwrite(output, FS, 'output.wav');
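The block-by-block structure of the MATLAB loop above can be sketched in pure Python. This is a simplified sketch, not the project's code; `overlap_add` is my name for the pattern, and the FFT is a minimal recursive implementation rather than MATLAB's built-in.

```python
import cmath

def fft(x):
    """Radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    tw = [cmath.exp(-2j * cmath.pi * k / n) * odd[k] for k in range(n // 2)]
    return [even[k] + tw[k] for k in range(n // 2)] + \
           [even[k] - tw[k] for k in range(n // 2)]

def ifft(x):
    """Inverse FFT via the conjugation trick."""
    n = len(x)
    conj = fft([v.conjugate() for v in x])
    return [v.conjugate() / n for v in conj]

def overlap_add(signal, window, gain, hop):
    """Mirror the MATLAB while-loop: window a chunk, FFT it, apply a
    spectral gain function, inverse FFT, and add the result back into
    the output at the chunk's position."""
    chunk = len(window)
    out = [0.0] * len(signal)
    count = 0
    while count < len(signal) - chunk:
        grain = [s * w for s, w in zip(signal[count:count + chunk], window)]
        spec = [gain(f) for f in fft(grain)]
        grain = [v.real * w for v, w in zip(ifft(spec), window)]
        for i, g in enumerate(grain):
            out[count + i] += g
        count += hop
    return out
```

With the identity gain, samples in the fully overlapped region accumulate contributions from every window covering them, which is why the MATLAB script rescales the output at the end.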
As you can see there are no classes or interfaces. Equivalent code for the MATLAB
script in the C programming language is shown in section 4.7.
function ft = denoise(f,r,coef)
if abs(f) >= 0.001
ft = f.*(r./(r+coef));
else
ft = f.*(r./(r+sqrt(coef)));
end
Shown above is the denoise function. The function compares each signal chunk's absolute
value against a small threshold, then scales it by a gain built from the coefficient or its
square root. This process continues until the higher-frequency clusters are attenuated
toward lower levels.
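The attenuation logic can be sketched per frequency bin as follows. This is a simplified per-bin version for illustration; the MATLAB code applies the test to the whole chunk at once, and `threshold` is my name for the 0.001 constant.

```python
import math

def denoise_gain(f, r, coef, threshold=1e-3):
    """Attenuate one FFT bin f with magnitude r: bins above the threshold
    are scaled by r/(r+coef); quieter bins by r/(r+sqrt(coef)), which is
    a weaker attenuation whenever coef > 1."""
    if abs(f) >= threshold:
        return f * (r / (r + coef))
    return f * (r / (r + math.sqrt(coef)))
```

Because r is small for noise-dominated bins, the gain r/(r+coef) pushes them toward zero while leaving strong speech bins mostly intact.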
4.7 Code to filter noise in C Language
#include <stdio.h>
#include <string.h>
#include "mclmcr.h"
#ifdef __cplusplus
extern "C" {
#endif
extern const unsigned char __MCC_denoise2_public_data[];
extern const char *__MCC_denoise2_name_data;
extern const char *__MCC_denoise2_root_data;
extern const unsigned char __MCC_denoise2_session_data[];
extern const char *__MCC_denoise2_matlabpath_data[];
extern const int __MCC_denoise2_matlabpath_data_count;
extern const char *__MCC_denoise2_mcr_runtime_options[];
extern const int __MCC_denoise2_mcr_runtime_option_count;
extern const char *__MCC_denoise2_mcr_application_options[];
extern const int __MCC_denoise2_mcr_application_option_count;
#ifdef __cplusplus
}
#endif
static HMCRINSTANCE _mcr_inst = NULL;
static int mclDefaultPrintHandler(const char *s)
{
return fwrite(s, sizeof(char), strlen(s), stdout);
}
static int mclDefaultErrorHandler(const char *s)
{
int written = 0, len = 0;
len = strlen(s);
written = fwrite(s, sizeof(char), len, stderr);
if (len > 0 && s[ len-1 ] != '\n')
written += fwrite("\n", sizeof(char), 1, stderr);
return written;
}
bool denoise2InitializeWithHandlers(
mclOutputHandlerFcn error_handler,
mclOutputHandlerFcn print_handler
)
{
if (_mcr_inst != NULL)
return true;
if (!mclmcrInitialize())
return false;
if (!mclInitializeComponentInstance(&_mcr_inst,
__MCC_denoise2_public_data,
__MCC_denoise2_name_data,
__MCC_denoise2_root_data,
__MCC_denoise2_session_data,
__MCC_denoise2_matlabpath_data,
__MCC_denoise2_matlabpath_data_count,
__MCC_denoise2_mcr_runtime_options,
__MCC_denoise2_mcr_runtime_option_count,
true, NoObjectType, ExeTarget, NULL,
error_handler, print_handler))
return false;
return true;
}
bool denoise2Initialize(void)
{
return denoise2InitializeWithHandlers(mclDefaultErrorHandler,
mclDefaultPrintHandler);
}
void denoise2Terminate(void)
{
if (_mcr_inst != NULL)
mclTerminateInstance(&_mcr_inst);
}
int main(int argc, const char **argv)
{
int _retval;
if (!mclInitializeApplication(__MCC_denoise2_mcr_application_options,
__MCC_denoise2_mcr_application_option_count))
return 0;
if (!denoise2Initialize())
return -1;
_retval = mclMain(_mcr_inst, argc, argv, "denoise2", 0);
if (_retval == 0 /* no error */) mclWaitForFiguresToDie(NULL);
denoise2Terminate();
mclTerminateApplication();
return _retval; }
/*
* MATLAB Compiler: 4.0 (R14)
* Date: Sun Oct 04 09:55:11 2009
* Arguments: "-B" "macro_default" "-m" "-W" "main" "-T" "link:exe" "denoise2"
*/
CHAPTER 5
5.0 Implementation
The Agile development process was chosen for the development. The system went through three
iterations. In the first iteration the basic objective was to build a speech recognition engine.
Various methods were tested out, and by the end of the first iteration the speech recognition
engine was built.
Figure 24: SR Engine
The figure below shows the functionalities of the speech recognition engine. It can open a
.wav file to play it or to recognize speech.
Figure 25 Open file
Once a file is selected for recognition, the user can press the start button to begin the
recognition process. The recognized output can be viewed in the text output section.
Figure 26: Text output
The noise filtering process was done in the second iteration, entirely using MATLAB.
It doesn't have a user interface. In the first development the noise filtering engine was not
that efficient: there were many isolated noise packets in the spectrum. In the second
development the system achieved a remarkable improvement.
We have to input a noisy speech file, and when we run the program it produces a noise-filtered
.wav file.
The search engine was built in the third phase. The user has to run the search engine; it
accesses the local database and returns the search results.
Figure 27 Speech Search Engine
CHAPTER 6
6.0 Test Plan
6.1 Background
The system built for the research project comprises three main parts. The
speech recognition section is the key part of the application. The noise filtering section is
another key area taken into account. There is also a text search facility which provides
the ability to search a speech by content. Because this is a technical project, and
considering its nature, the testing criteria do not look the same as those of other projects.
6.2 Introduction
The testing criteria are based on the input speech signals for speech recognition and noise
filtering, and on the search criteria. Due to the nature of this project we cannot make use
of industrial test plans; the project is not a commercial one. For the speech recognition
testing, a speech in digital format is used. Speech recognition projects are still at the
research stage, so it is not advisable to implement a standard heavyweight test plan. Basic
test plans are sufficient to assess the testing criteria mentioned in the project.
6.3 Assumptions
Before declaring any assumptions it is advisable to understand the nature of the project.
Within the project scope we assume that the speech recognition engine only works for
noiseless speech inputs. The speech recognition system works on a pure English accent
only. Noisy speech is not used as input because the speech engine cannot directly identify
the noise factor and filter it.
The system can only identify the most commonly spoken words. It is possible to add a large
vocabulary, but the system has not been designed for high-level language identification and
processing.
Noise filtering can be done on the ".wav" format only. The system cannot eliminate the noise
factor permanently.
It is not possible to use a noise-filtered file for recognition, because the speech
recognition system works on noiseless speech only.
6.4 Features to be tested
For the speech recognition system, a noiseless speech input in .wav format is tested to
verify the continuous speech recognition capability. Continuous speech recognition
is a distinctive feature of modern speech recognition systems.
A noisy speech file is uploaded to the noise filtering system, which results in a noise
filtered [up to a reasonable level] output file. It would be possible to measure the
efficiency of the noise filtering system by measuring its processing time, but that is not
addressed in this project.
For the speech searching part the system uses a file search. The search mechanism includes
an efficient file searching and text matching mechanism. Once the user types a phrase,
the system shows the name of the file that contains it most often.
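The matching step can be sketched as follows. This is an illustrative sketch, not the project's code; the function name, the transcript dictionary and the file names are all hypothetical.

```python
def search_transcripts(transcripts, phrase):
    """Return the file whose transcript contains the search phrase most
    often, along with the match count. `transcripts` maps file names to
    recognized text."""
    phrase = phrase.lower()
    best_name, best_hits = None, 0
    for name, text in transcripts.items():
        hits = text.lower().count(phrase)
        if hits > best_hits:
            best_name, best_hits = name, hits
    return best_name, best_hits
```

A real deployment would keep an index over the recognized text rather than scanning every transcript on each query, but the idea of ranking files by match count is the same.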
The system can play a wav file before uploading it for the recognition process.
6.5 Suspension and resumption criteria
While the system testing process is running, defects may give reasons to suspend the
process. The suspension criteria denote what those reasons are. According to Anon.
(n.d.), Suspension criteria & resumption requirements,
the suspension criteria are as follows:
Unavailability of external dependent systems during execution.
When a defect is introduced that cannot allow any further testing.
Critical path deadline is missed so that the client will not accept delivery even if all
testing is completed.
A specific holiday shuts down both development and testing
The resumption criteria are as follows:
When the external dependent systems become available again.
When a fix is successfully implemented and the Testing Team is notified to continue
testing.
The contract is renegotiated with the client to extend delivery.
The holiday period ends.
According to Anon. (n.d.), Suspension criteria & resumption requirements,
suspension criteria assume that testing cannot go forward and that going backward is also not
possible. A failed build would not suffice, as you could generally continue to use the previous
build. Most major or critical defects would also not constitute suspension criteria, as other
areas of the system could continue to be tested.
6.6 Environmental needs
There are a few environmental needs to be met before testing the system. They can be
classified as software needs, hardware needs and legal needs. There are no legal needs
because the system has no legal implications.
The software needs can be listed as below:
Java run time environment
Matlab development software
NetBeans 6.5 or greater
Sound driver software
Windows XP operating system
The hardware needs are
A computer [hardware requirements were specified in another chapter under system
requirements]
Multimedia devices
6.7 System testing
Speeches and lectures with different accents [English only, USA and British]: in order to
test the speech recognition engine's accuracy, it is tested against different accents. The
results must differ minimally from the expected output, with minimal errors.
Content search: when the user tries to search by content, by typing a word or a phrase,
the appropriate search result is displayed. The speech or lecture containing the
specified word or phrase is displayed.
6.8 Unit testing
The initial test was of the initial user interface. At first glance the system loads only
the basic interactions with the user. The system doesn't load any calculation or extraction
functionality before the user provides a correct input.
Test Case Test Case One
Description The user runs the Speech recognition System for the first time
Expected Output Open, Start and Open Speech buttons are enabled.
Encode To wav and Noise Filter buttons remain disabled.
The area below Open a speech file shows blank.
Text output shows blank.
Actual Output Open, Start and Open Speech buttons are enabled.
Encode To wav and Noise Filter buttons remain disabled.
The area below Open a speech file shows blank.
Text output shows blank.
The expected output was acquired.
Table 9 Test Case 1
On the initial run the speech recognition system is not loaded with any algorithms. After
receiving an input the system loads the necessary components for processing. This mechanism
makes efficient use of system resources.
The second test begins when the user provides an input to the system. This test
case exercises the speech recognition system's input, which must be a .wav file.
Test Case Test Case Two
Description The user opens a file to feed the speech recognition system
The user provides the system with a .wav file.
The first input speech contains digits in the range of one to nine in a
British accent.
The file must be a noise-free file.
Expected Output The identified names of the digits need to be displayed in the text output area.
Actual Output Due to variations in dialect the results were not exactly as expected.
Within the range of one to nine the system identifies the digits and
displays the output.
Table 10 Test Case 2
The identification of digits can be extended beyond ten. Once the name of the digit to be
identified becomes longer, the system identifies it with a higher error rate.
The third test was based on the user inputting a noisy file for identification.
The system does not work for files containing noise.
Test Case Test Case Three
Description The user provides the system with a noisy .wav file.
Expected Output The system throws an error or shows no results.
Actual Output The actual output varies with the noise level. If the density of
the noise lies within a higher range the system raises an error; the
error can be "severe null".
The system returns blank results when the words are barely in an
identifiable state.
Table 11 Test Case 3
The system doesn't have any functionality to measure noise levels; the project scope
doesn't cater for in-depth noise analysis. The noise levels mentioned above were judged
by user experience.
The system assumes that users will not upload noisy files, and this rule is clearly
stated in the assumptions.
The fourth test criterion checks the system's speech recognition capabilities with
words.
Test Case Test Case Four
Description The user provides the system with a .wav file containing basic
words.
The input doesn't contain any noise.
Expected Output The system identifies all the words and shows the output in a
precise manner.
Actual Output The system identified the words with an error rate fluctuating
between 20% and 35%.
Not all the words are identified by the system.
Table 12 Test Case 4
The system doesn't identify all the words. The identification process depends on the speed
of utterance and the intensity of the phonemes. Higher phoneme intensities help any
speech recognition system achieve more precise results.
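An error rate like the 20%-35% quoted above is conventionally computed as word error rate, the word-level edit distance divided by the reference length. A standard sketch (not the project's code; the function name is mine):

```python
def word_error_rate(reference, hypothesis):
    """Word error rate via Levenshtein distance over words:
    (substitutions + insertions + deletions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, recognizing "one too three" against the reference "one two three" gives one substitution out of three words, a rate of about 33%.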
Test case five tests the performance of noise filtering. The noise filtering system was built in
MATLAB.
Test Case Test Case Five
Description The user provides the noise filtering system with a noisy file.
The input file must be in .wav format.
The user has to open the MATLAB scripts, import them to the working
directory and run them.
The input file needs to be in the same directory.
Expected Output An output file should be created in the working folder with the name
"output.wav".
The "output.wav" file contains the noise-filtered version of the input file.
The amplitude of the output file should not differ in a way a human
can identify.
Actual Output The output file is created in the working folder.
The output file has lowered noise relative to the input file.
The output file is not noise free.
The amplitude has a difference which can be identified by a human ear.
Table 13 Test Case 5
There still isn't a mechanism to remove the noise 100%. The system works on
predefined algorithms.
Test case six covers the search functionality, which acts as a
speech search engine.
Test Case Test Case Six
Description The user has to run the search engine.
Port 8080 must be free.
Expected Output When the user types a phrase into the search engine and presses the
search button:
If there's a match in the database it shows true.
If there's no match the result shows false.
Actual Output If a match was found, "true" is displayed in the results.
If no match, "false" is displayed in the results.
Table 14 Test Case 6
The system is not built as a production speech search engine; it only demonstrates how a
speech search engine works. As a future enhancement it is possible to build a full search
engine.
6.9 Performance Testing
The system's performance was tested on different operating systems, including virtual
operating environments. Microsoft Windows XP was taken as the baseline operating system
for the measurements.
Operating System: Microsoft Windows XP
Speech recognition engine configuring time: between 0.5 and 1 second.
Efficiency of speech recognition: an input signal with strong phoneme intensity, free from
noise, shorter than 10 seconds and with low word density takes around 1 to 12 seconds.
Input signals with many words take longer.
Efficiency of noise filtering and MATLAB: the noise filtering system generates the output
in less than 200 milliseconds for .wav clips with a duration of 2 to 10 seconds.
Performance of the speech search engine: startup time averages 8 to 15 seconds.
Table 15: Performance testing Windows XP
The performance of the speech search engine depends heavily on the operating system. For
example, Windows operating systems use much more resources than UNIX-based operating
systems.
The speech search system runs on the Glassfish server, which performs better on UNIX-based
operating systems. In Windows environments the speech search engine hits many deadlocks.
Operating System: Ubuntu 9.04
Speech recognition engine configuring time: between 0.2 and 0.8 seconds.
Efficiency of speech recognition: an input signal with strong phoneme intensity, free from
noise, shorter than 5 seconds and with low word density takes around 1 to 5 seconds.
The system performs noticeably better in Ubuntu environments.
Efficiency of noise filtering and MATLAB: noise filtering was efficient compared with the
Windows environment.
Performance of the speech search engine: the search and startup times were efficient
compared with Windows XP.
Table 16: Performance Testing on Ubuntu
Once the search engine has run many times in a Windows environment it has a higher potential
of crashing and may not provide correct results.
When the system performs speech recognition several times, the efficiency of the
recognition slows down.
Java runs in a virtual machine and the recognition process needs high processing power.
Due to those factors the efficiency of the system degrades as it is used over and over.
6.10 Integration Testing
Integration testing is a logical extension of unit testing. It identifies drawbacks that
appear when the parts of a system are combined. The system comprises different subsystems
with different functionalities.
It is not possible to combine the noise filtering system with the speech recognition system
or the search web interface.
An overall test mechanism was used for integration testing, because the system comprises
subsystems that are only indirectly connected with each other.
Big Bang testing
Big Bang testing is the process of taking all of the unit-tested parts of a system and
tying them together at once. This approach is mostly suitable for small systems and may
leave many errors unidentified at the testing stage. If a developer has done unit testing
correctly, Big Bang testing helps to uncover more errors and saves money and time.
After performing Big Bang testing on the system, the following faults were uncovered:
The continuous functioning of the search engine cannot be guaranteed.
If the input signal is long, there is a system out-of-memory error.
The disadvantages of Big Bang testing are:
Integration testing cannot start until all the modules have been successfully
evaluated.
It is harder to track down the causes of errors.
Incremental Testing
Incremental testing allows you to compare and contrast two functionalities while you are
testing, and to add and test other modules within the testing time.
Incremental testing cannot be performed on this system because there are no parallel
functionalities within the system that interact with each other.
CHAPTER 7
CRITICAL EVALUATION AND FUTURE ENHANCEMENTS
7.1Critical evaluation
The entire project was about speech recognition with a digital signal as input and search by
content. The project is a union of several other research areas. At the initial stage the
research was focused on speech recognition.
The barriers met in the initial stage:
Human speech recognition
At the beginning there wasn't a way to explain the speech recognition process,
the mechanisms behind it and how it is performed.
Speech recognition engine
Study of a speech recognition engine was a crucial part of the design phase,
but there was no speech recognition engine to analyze or study.
To overcome those two factors, understanding the functionality of speech recognition was
essential. After understanding a system and completing a basic sketch of the flow
diagram, there seemed to be a sufficient starting point for the development.
When one talks over a microphone, it is easy to record the human voice. After recording,
the human voice is no longer in analogue format; the obvious digital format was a .wav file.
The system performs speech recognition on files in .wav format.
In the development phase the study of speech recognition systems didn't help much for further
progress, because when it comes to audio formats the digital signal processing part is
hidden. The system has to address the DSP part in a reasonable manner. Digital signal
processing is concerned with the representation of signals by sequences of numbers or
symbols and the processing of those signals.
Within the course content we studied there wasn't a single module that taught us about
interface programming, microcontroller programming or digital signal processing.
Building the functionality to handle the digital signal processing part from scratch was a
tedious job; the knowledge we had for building such functionality wasn't sufficient given
the time available.
At the initial stage the plan was to develop the entire system in Java, but Java didn't have
a proper built-in API or conveniences for handling digital signal processing. However, there
were some reliable third-party components that just about managed to perform the task.
Plugging in the third-party tools was another issue, but finally I managed to find code to
accomplish the task.
A few speech recognition systems had been built using Java, but they were not built for
continuous speech recognition or noise reduction.
There were many issues in the first place. We have to define a grammar format, and there
were two options. One is to go with JVXML: Java Voice XML is a technology which provides
speech synthesis and recognition capabilities, and we can embed voice commands for
web sites using Voice XML.
For the project I chose JSGF, the Java Speech Grammar Format. JSGF supports built-in
dictionaries capable of supporting digits and words, and multi-language capabilities can
be plugged in.
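For illustration, a minimal JSGF grammar for the digit-recognition case might look like the following. The grammar and rule names are mine, not necessarily those used in the project.

```
#JSGF V1.0;

grammar digits;

public <digit> = one | two | three | four | five |
                 six | seven | eight | nine;
```

A recognizer loaded with this grammar constrains its search to the listed words, which is what makes small-vocabulary digit recognition far more reliable than open dictation.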
When developing systems in Java it's always advisable to use components that easily
support the Java platform's capabilities.
The speech recognition system can be split into two parts. The Java virtual machine
allocates a maximum of 128 MB of memory for NetBeans, and we cannot explicitly set the
virtual machine's memory when working within NetBeans.
The digit recognition part can be performed and implemented in the NetBeans development
environment.
But it is not possible to free enough memory to recognize speech that contains words; that
needs more virtual memory from the virtual machine. Because of that, speech recognition
for words had to be run from the command prompt explicitly, saying "java -mx256m -jar".
This command allocates 256 MB of virtual memory for speech recognition.
Noise filtering was another unsolved issue that had to be answered through the system.
There was no proper support for noise filtering in Java. For a technical project it is
essential to develop in fourth-generation languages or in languages like C or assembly.