A REGRESSION APPROACH TO SPEECH ENHANCEMENT BASED
ON DEEP NEURAL NETWORKS
CHAPTER 1
1.1 ABSTRACT:
In contrast to the conventional minimum mean square error (MMSE)-based noise reduction
techniques, we propose a supervised method to enhance speech by means of finding a mapping
function between noisy and clean speech signals based on deep neural networks (DNNs). In
order to be able to handle a wide range of additive noises in real-world situations, a large training
set that encompasses many possible combinations of speech and noise types is first designed. A DNN architecture is then employed as a nonlinear regression function to ensure a powerful
modeling capability. Several techniques have also been proposed to improve the DNN-based
speech enhancement system, including global variance equalization to alleviate the over-
smoothing problem of the regression model, and the dropout and noise-aware training strategies
to further improve the generalization capability of DNNs to unseen noise conditions.
Experimental results demonstrate that the proposed framework can achieve significant
improvements in both objective and subjective measures over the conventional MMSE based
technique. It is also interesting to observe that the proposed DNN approach can well suppress
highly nonstationary noise, which is tough to handle in general. Furthermore, the resulting DNN
model, trained with artificially synthesized data, is also effective in dealing with noisy speech data
recorded in real-world scenarios without the generation of the annoying musical artifact
commonly observed in conventional enhancement methods.
1.2 INTRODUCTION:
SPEECH ENHANCEMENT:
Speech enhancement aims to improve speech quality by using various algorithms. The
objective of enhancement is improvement in intelligibility and/or overall perceptual quality of
degraded speech signals using audio signal processing techniques. The enhancement of speech degraded by noise, or noise reduction, is the most important field of speech enhancement, and it is used in many applications such as mobile phones, VoIP, teleconferencing systems, speech recognition, and hearing aids.
SPEECH RECOGNITION:
Speech recognition (SR) is the inter-disciplinary sub-field of computational linguistics which
incorporates knowledge and research in the linguistics, computer science, and electrical
engineering fields to develop methodologies and technologies that enable the recognition
and translation of spoken language into text by computers and computerized devices such as
those categorized as Smart Technologies and robotics. It is also known as "automatic speech
recognition" (ASR), "computer speech recognition", or just "speech to text" (STT).
Some SR systems use "training" (also called "enrollment") where an individual speaker reads
text or isolated vocabulary into the system. The system analyzes the person's specific voice and
uses it to fine-tune the recognition of that person's speech, resulting in increased accuracy.
Systems that do not use training are called "speaker independent"[1] systems. Systems that use
training are called "speaker dependent".
Speech recognition applications include voice user interfaces such as voice dialing (e.g. "Call
home"), call routing (e.g. "I would like to make a collect call"), domotic appliance control,
search (e.g. find a podcast where particular words were spoken), simple data entry (e.g., entering
a credit card number), preparation of structured documents (e.g. a radiology report), speech-to-
text processing (e.g., word processors or emails), and aircraft (usually termed Direct Voice
Input).
The term voice recognition or speaker identification refers to identifying the speaker, rather than
what they are saying. Recognizing the speaker can simplify the task of translating speech in
systems that have been trained on a specific person's voice or it can be used to authenticate or
verify the identity of a speaker as part of a security process.
From the technology perspective, speech recognition has a long history with several waves of
major innovations. Most recently, the field has benefited from advances in deep learning and big
data. The advances are evidenced not only by the surge of academic papers published in the
field, but more importantly by the world-wide industry adoption of a variety of deep learning
methods in designing and deploying speech recognition systems. These speech industry players
include Microsoft, Google, IBM, Baidu (China), Apple, Amazon, Nuance, and iFlyTek (China), many of which have publicized that the core technology in their speech recognition systems is based on deep learning.
MODELS, METHODS, AND ALGORITHMS:
HIDDEN MARKOV MODELS
Modern general-purpose speech recognition systems are based on Hidden Markov Models.
These are statistical models that output a sequence of symbols or quantities. HMMs are used in
speech recognition because a speech signal can be viewed as a piecewise stationary signal or a
short-time stationary signal. In a short time-scale (e.g., 10 milliseconds), speech can be
approximated as a stationary process. Speech can be thought of as a Markov model for many
stochastic purposes.
Another reason why HMMs are popular is that they can be trained automatically and are
simple and computationally feasible to use. In speech recognition, the hidden Markov model
would output a sequence of n-dimensional real-valued vectors (with n being a small integer, such
as 10), outputting one of these every 10 milliseconds. The vectors would consist
of cepstral coefficients, which are obtained by taking a Fourier transform of a short time window
of speech and decorrelating the spectrum using a cosine transform, then taking the first (most
significant) coefficients. The hidden Markov model will tend to have in each state a statistical
distribution that is a mixture of diagonal covariance Gaussians, which will give a likelihood for
each observed vector. Each word, or (for more general speech recognition systems),
each phoneme, will have a different output distribution; a hidden Markov model for a sequence
of words or phonemes is made by concatenating the individual trained hidden Markov models
for the separate words and phonemes.
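To make the cepstral front end described above concrete, the following is a minimal C# sketch (C# being the front end named in the software requirements) of cepstral coefficient extraction for a single frame. The 256-sample frame, the Hamming window, the 13 retained coefficients, and the use of a naive O(N^2) DFT instead of a fast FFT are illustrative assumptions chosen for readability, not the exact front end of any particular recognizer.

using System;

class CepstralFrontEnd
{
    // First numCoeffs cepstral coefficients of one speech frame:
    // window -> log power spectrum -> cosine transform -> truncate.
    static double[] CepstralCoefficients(double[] frame, int numCoeffs)
    {
        int n = frame.Length;

        // Hamming window to reduce spectral leakage.
        double[] windowed = new double[n];
        for (int i = 0; i < n; i++)
            windowed[i] = frame[i] * (0.54 - 0.46 * Math.Cos(2 * Math.PI * i / (n - 1)));

        // Log power spectrum via a naive DFT over the first n/2 bins.
        int bins = n / 2;
        double[] logPower = new double[bins];
        for (int k = 0; k < bins; k++)
        {
            double re = 0.0, im = 0.0;
            for (int i = 0; i < n; i++)
            {
                double angle = 2 * Math.PI * k * i / n;
                re += windowed[i] * Math.Cos(angle);
                im -= windowed[i] * Math.Sin(angle);
            }
            logPower[k] = Math.Log(re * re + im * im + 1e-10);
        }

        // Decorrelate the spectrum with a type-II DCT and keep the
        // first (most significant) coefficients.
        double[] cepstra = new double[numCoeffs];
        for (int c = 0; c < numCoeffs; c++)
            for (int k = 0; k < bins; k++)
                cepstra[c] += logPower[k] * Math.Cos(Math.PI * c * (k + 0.5) / bins);
        return cepstra;
    }

    static void Main()
    {
        double[] frame = new double[256];          // one short-time frame of samples
        for (int i = 0; i < frame.Length; i++)     // synthetic test tone
            frame[i] = Math.Sin(2 * Math.PI * 8 * i / frame.Length);
        double[] c = CepstralCoefficients(frame, 13);
        Console.WriteLine("c[0] = " + c[0]);
    }
}

In practice an FFT library routine and mel-scaled filter banks would replace the naive DFT.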
Described above are the core elements of the most common, HMM-based approach to speech
recognition. Modern speech recognition systems use various combinations of a number of
standard techniques in order to improve results over the basic approach described above. A
typical large-vocabulary system would need context dependency for the phonemes (so phonemes
with different left and right context have different realizations as HMM states); it would
use cepstral normalization to normalize for different speaker and recording conditions; for
further speaker normalization it might use vocal tract length normalization (VTLN) for male-
female normalization and maximum likelihood linear regression (MLLR) for more general
speaker adaptation.
The features would have so-called delta and delta-delta coefficients to capture speech dynamics
and in addition might use heteroscedastic linear discriminant analysis (HLDA); or might skip the
delta and delta-delta coefficients and use splicing and an LDA-based projection followed perhaps
by heteroscedastic linear discriminant analysis or a global semi-tied covariance transform (also
known as maximum likelihood linear transform, or MLLT). Many systems use so-called
discriminative training techniques that dispense with a purely statistical approach to HMM
parameter estimation and instead optimize some classification-related measure of the training
data. Examples are maximum mutual information (MMI), minimum classification error (MCE)
and minimum phone error (MPE).
Decoding of the speech (the term for what happens when the system is presented with a new
utterance and must compute the most likely source sentence) would probably use the Viterbi
algorithm to find the best path, and here there is a choice between dynamically creating a
combination hidden Markov model, which includes both the acoustic and language model
information, and combining it statically beforehand (the finite state transducer, or FST,
approach).
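The following is a minimal C# sketch of the Viterbi algorithm described above, run over a hypothetical two-state, two-symbol HMM; real decoders search over composed acoustic and language models with pruning, which this sketch omits.

using System;

class ViterbiDemo
{
    // Most likely state sequence through an HMM for an observation sequence.
    // logInit[s], logTrans[s,t], and logEmit[s,o] hold log-probabilities.
    static int[] Viterbi(int[] obs, double[] logInit, double[,] logTrans, double[,] logEmit)
    {
        int S = logInit.Length, T = obs.Length;
        double[,] score = new double[T, S];
        int[,] back = new int[T, S];

        for (int s = 0; s < S; s++)
            score[0, s] = logInit[s] + logEmit[s, obs[0]];

        for (int t = 1; t < T; t++)
            for (int s = 0; s < S; s++)
            {
                score[t, s] = double.NegativeInfinity;
                for (int p = 0; p < S; p++)
                {
                    double cand = score[t - 1, p] + logTrans[p, s] + logEmit[s, obs[t]];
                    if (cand > score[t, s]) { score[t, s] = cand; back[t, s] = p; }
                }
            }

        int[] path = new int[T];                      // trace back the best path
        for (int s = 1; s < S; s++)
            if (score[T - 1, s] > score[T - 1, path[T - 1]]) path[T - 1] = s;
        for (int t = T - 1; t > 0; t--)
            path[t - 1] = back[t, path[t]];
        return path;
    }

    static void Main()
    {
        // Toy two-state HMM with two observation symbols.
        double[] init = { Math.Log(0.6), Math.Log(0.4) };
        double[,] trans = { { Math.Log(0.7), Math.Log(0.3) },
                            { Math.Log(0.4), Math.Log(0.6) } };
        double[,] emit = { { Math.Log(0.9), Math.Log(0.1) },
                           { Math.Log(0.2), Math.Log(0.8) } };
        int[] path = Viterbi(new int[] { 0, 0, 1 }, init, trans, emit);
        Console.WriteLine(string.Join(",", Array.ConvertAll(path, x => x.ToString())));
    }
}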
A possible improvement to decoding is to keep a set of good candidates instead of just keeping
the best candidate, and to use a better scoring function (re-scoring) to rate these good candidates
so that we may pick the best one according to this refined score. The set of candidates can be
kept either as a list (the N-best list approach) or as a subset of the models (a lattice). Re-scoring
is usually done by trying to minimize the Bayes risk (or an approximation thereof): Instead of
taking the source sentence with maximal probability, we try to take the sentence that minimizes
the expectation of a given loss function with regard to all possible transcriptions (i.e., we take
the sentence that minimizes the average distance to other possible sentences weighted by their
estimated probability). The loss function is usually the Levenshtein distance, though it can be
different distances for specific tasks; the set of possible transcriptions is, of course, pruned to
maintain tractability. Efficient algorithms have been devised to re-score lattices represented as
weighted finite state transducers with edit distances represented themselves as a finite state
transducer verifying certain assumptions.
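As a small illustration of the loss function used in such Bayes-risk re-scoring, the sketch below computes the word-level Levenshtein distance between a hypothesis and a reference in C#; the two word sequences are hypothetical examples.

using System;

class EditDistance
{
    // Levenshtein distance: the loss typically minimized when re-scoring
    // N-best lists or lattices under a Bayes-risk criterion.
    static int Levenshtein(string[] a, string[] b)
    {
        int[,] d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;   // deletions only
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;   // insertions only
        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
            {
                int sub = d[i - 1, j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
                d[i, j] = Math.Min(sub, Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1));
            }
        return d[a.Length, b.Length];
    }

    static void Main()
    {
        string[] hyp = { "call", "home", "now" };
        string[] reference = { "call", "home" };
        Console.WriteLine(Levenshtein(hyp, reference)); // 1 (one extra word)
    }
}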
DYNAMIC TIME WARPING (DTW)-BASED SPEECH RECOGNITION
Dynamic time warping is an approach that was historically used for speech recognition but has
now largely been displaced by the more successful HMM-based approach.
Dynamic time warping is an algorithm for measuring similarity between two sequences that may
vary in time or speed. For instance, similarities in walking patterns would be detected, even if in
one video the person was walking slowly and if in another he or she were walking more quickly,
or even if there were accelerations and decelerations during the course of one observation. DTW
has been applied to video, audio, and graphics – indeed, any data that can be turned into a linear
representation can be analyzed with DTW.
A well-known application has been automatic speech recognition, to cope with different
speaking speeds. In general, it is a method that allows a computer to find an optimal match
between two given sequences (e.g., time series) with certain restrictions. That is, the sequences
are "warped" non-linearly to match each other. This sequence alignment method is often used in
the context of hidden Markov models.
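A minimal C# sketch of DTW over two one-dimensional sequences follows; real systems align multi-dimensional feature vectors and usually add slope constraints, which are omitted here for brevity.

using System;

class DtwDemo
{
    // Dynamic time warping distance between two sequences of feature values.
    static double Dtw(double[] x, double[] y)
    {
        int n = x.Length, m = y.Length;
        double[,] d = new double[n + 1, m + 1];
        for (int i = 0; i <= n; i++)
            for (int j = 0; j <= m; j++)
                d[i, j] = double.PositiveInfinity;
        d[0, 0] = 0.0;

        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= m; j++)
            {
                double cost = Math.Abs(x[i - 1] - y[j - 1]);
                // Allow match, insertion, or deletion: non-linear warping.
                d[i, j] = cost + Math.Min(d[i - 1, j - 1],
                                 Math.Min(d[i - 1, j], d[i, j - 1]));
            }
        return d[n, m];
    }

    static void Main()
    {
        double[] slow = { 1, 1, 2, 3, 3, 4 };   // same pattern, spoken slowly
        double[] fast = { 1, 2, 3, 4 };
        Console.WriteLine(Dtw(slow, fast));     // 0: perfectly warpable
    }
}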
NEURAL NETWORKS
Neural networks emerged as an attractive acoustic modeling approach in ASR in the late 1980s.
Since then, neural networks have been used in many aspects of speech recognition such as
phoneme classification, isolated word recognition, and speaker adaptation.
In contrast to HMMs, neural networks make no assumptions about feature statistical properties
and have several qualities making them attractive recognition models for speech recognition.
When used to estimate the probabilities of a speech feature segment, neural networks allow
discriminative training in a natural and efficient manner. Few assumptions on the statistics of
input features are made with neural networks. However, in spite of their effectiveness in
classifying short-time units such as individual phones and isolated words, neural networks are
rarely successful for continuous recognition tasks, largely because of their lack of ability to
model temporal dependencies.
However, recently recurrent neural networks (RNNs) and time delay neural networks (TDNNs)[44] have been used, and these have been shown to be able to identify latent temporal dependencies and use this information to perform the task of speech recognition. This, however, enormously increases the computational cost involved and hence makes the process of speech recognition slower. A lot of research is still going on in this field to ensure that TDNNs and RNNs can be used in a more computationally affordable way to substantially improve speech recognition accuracy.
Deep Neural Networks and Denoising Autoencoders[45] are also being experimented with to
tackle this problem in an effective manner.
Due to the inability of traditional neural networks to model temporal dependencies, an alternative approach is to use neural networks as a pre-processing step, e.g., for feature transformation or dimensionality reduction, before HMM-based recognition.
DEEP NEURAL NETWORKS AND OTHER DEEP LEARNING MODELS
A deep neural network (DNN) is an artificial neural network with multiple hidden layers of units
between the input and output layers. Similar to shallow neural networks, DNNs can model
complex non-linear relationships. DNN architectures generate compositional models, where
extra layers enable composition of features from lower layers, giving a huge learning capacity
and thus the potential of modeling complex patterns of speech data.[47] The DNN is the most popular type of deep learning architecture and has been successfully used as an acoustic model for speech recognition since 2010.
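To illustrate the layer-by-layer composition described above, here is a minimal C# sketch of a DNN forward pass with two sigmoid hidden layers and a linear output layer. The dimensions and random weights are placeholders; in a trained system the weights would come from back-propagation (and, for the enhancement task in this report, the output would be the estimated clean log-power spectra).

using System;

class DnnForward
{
    // One layer: affine transform followed by an optional sigmoid.
    // Each hidden layer composes features computed by the layer below it.
    static double[] Layer(double[] x, double[,] w, double[] b, bool sigmoid)
    {
        int outDim = b.Length, inDim = x.Length;
        double[] y = new double[outDim];
        for (int o = 0; o < outDim; o++)
        {
            double a = b[o];
            for (int i = 0; i < inDim; i++) a += w[o, i] * x[i];
            y[o] = sigmoid ? 1.0 / (1.0 + Math.Exp(-a)) : a; // linear output layer
        }
        return y;
    }

    static double[,] RandomMatrix(Random rnd, int rows, int cols)
    {
        double[,] m = new double[rows, cols];
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++) m[r, c] = rnd.NextDouble() - 0.5;
        return m;
    }

    static void Main()
    {
        Random rnd = new Random(0);
        int dim = 8;
        double[] x = new double[dim];                 // e.g., a noisy feature frame
        for (int i = 0; i < dim; i++) x[i] = rnd.NextDouble();

        // Placeholder weights; a trained system learns these.
        double[,] w1 = RandomMatrix(rnd, dim, dim);
        double[,] w2 = RandomMatrix(rnd, dim, dim);
        double[,] w3 = RandomMatrix(rnd, dim, dim);
        double[] b = new double[dim];

        double[] h1 = Layer(x, w1, b, true);
        double[] h2 = Layer(h1, w2, b, true);
        double[] yOut = Layer(h2, w3, b, false);
        Console.WriteLine("output[0] = " + yOut[0]);
    }
}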
The success of DNNs in large vocabulary speech recognition came in 2010 through industrial researchers working in collaboration with academic researchers, where large output layers of the DNN based on context-dependent HMM states constructed by decision trees were adopted. See
comprehensive reviews of this development and of the state of the art as of October 2014 in the
recent Springer book from Microsoft Research. See also the related background of automatic
speech recognition and the impact of various machine learning paradigms including notably deep
learning in a recent overview article.
One fundamental principle of deep learning is to do away with hand-crafted feature
engineering and to use raw features. This principle was first explored successfully in the
architecture of deep autoencoder on the "raw" spectrogram or linear filter-bank features, showing
its superiority over the Mel-Cepstral features which contain a few stages of fixed transformation
from spectrograms. The true "raw" features of speech, waveforms, have more recently been
shown to produce excellent larger-scale speech recognition results.
Since the initial successful debut of DNNs for speech recognition around 2009-2011, there has been substantial new progress. This progress (as well as future directions) has been summarized into the following eight major areas:
1. Scaling up/out and speedup DNN training and decoding;
2. Sequence discriminative training of DNNs;
3. Feature processing by deep models with solid understanding of the underlying
mechanisms;
4. Adaptation of DNNs and of related deep models;
5. Multi-task and transfer learning by DNNs and related deep models;
6. Convolutional neural networks and how to design them to best exploit domain knowledge
of speech;
7. Recurrent neural networks and their rich LSTM variants;
8. Other types of deep models including tensor-based models and integrated deep
generative/discriminative models.
Large-scale automatic speech recognition is the first and the most convincing successful case of deep learning in recent history, embraced by both industry and academia across the board.
Between 2010 and 2014, the two major conferences on signal processing and speech recognition,
IEEE-ICASSP and Interspeech, have seen nearly exponential growth in the number of accepted papers on the topic of deep learning for speech recognition. More importantly, all major commercial speech recognition systems (e.g., Microsoft
Cortana, Xbox, Skype Translator, Google Now, Apple Siri, Baidu and iFlyTek voice search, and
a range of Nuance speech products, etc.) nowadays are based on deep learning methods.
APPLICATIONS:
IN-CAR SYSTEMS
Typically a manual control input, for example by means of a finger control on the steering-
wheel, enables the speech recognition system and this is signalled to the driver by an audio
prompt. Following the audio prompt, the system has a "listening window" during which it may
accept a speech input for recognition.
Simple voice commands may be used to initiate phone calls, select radio stations or play music
from a compatible smartphone, MP3 player or music-loaded flash drive. Voice recognition
capabilities vary between car make and model. Some of the most recent car models offer natural-language speech recognition in place of a fixed set of commands, allowing the driver to use full sentences and common phrases. With such systems there is, therefore, no need for the user to
memorize a set of fixed command words.
HEALTH CARE
Medical documentation
In the health care sector, speech recognition can be implemented in the front end or the back end of the medical documentation process. Front-end speech recognition is where the provider dictates into
a speech-recognition engine, the recognized words are displayed as they are spoken, and the
dictator is responsible for editing and signing off on the document. Back-end or deferred speech
recognition is where the provider dictates into a digital dictation system, the voice is routed
through a speech-recognition machine and the recognized draft document is routed along with
the original voice file to the editor, where the draft is edited and the report finalized. Deferred speech
recognition is widely used in the industry currently.
One of the major issues relating to the use of speech recognition in healthcare is that the
American Recovery and Reinvestment Act of 2009 (ARRA) provides for substantial financial
benefits to physicians who utilize an EMR according to "Meaningful Use" standards. These
standards require that a substantial amount of data be maintained by the EMR (now more
commonly referred to as an Electronic Health Record or EHR). The use of speech recognition is
more naturally suited to the generation of narrative text, as part of a radiology/pathology
interpretation, progress note or discharge summary: the ergonomic gains of using speech
recognition to enter structured discrete data (e.g., numeric values or codes from a list or
a controlled vocabulary) are relatively minimal for people who are sighted and who can operate a
keyboard and mouse.
A more significant issue is that most EHRs have not been expressly tailored to take advantage of
voice-recognition capabilities. A large part of the clinician's interaction with the EHR involves
navigation through the user interface using menus, and tab/button clicks, and is heavily
dependent on keyboard and mouse: voice-based navigation provides only modest ergonomic
benefits. By contrast, many highly customized systems for radiology or pathology dictation
implement voice "macros", where the use of certain phrases - e.g., "normal report", will
automatically fill in a large number of default values and/or generate boilerplate, which will vary
with the type of the exam - e.g., a chest X-ray vs. a gastrointestinal contrast series for a radiology
system.
As an alternative to this navigation by hand, cascaded use of speech recognition and information
extraction has been studied as a way to fill out a handover form for clinical proofing and sign-
off. The results are encouraging, and the paper also releases the data, together with the related performance benchmarks and some processing software, to the research and development community for studying clinical documentation and language processing.
MILITARY
High-performance fighter aircraft
Substantial efforts have been devoted in the last decade to the test and evaluation of speech
recognition in fighter aircraft. Of particular note is the U.S. program in speech recognition for the
Advanced Fighter Technology Integration (AFTI)/F-16 aircraft (F-16 VISTA), and a program in
France installing speech recognition systems on Mirage aircraft, and also programs in the UK
dealing with a variety of aircraft platforms. In these programs, speech recognizers have been
operated successfully in fighter aircraft, with applications including: setting radio frequencies,
commanding an autopilot system, setting steer-point coordinates and weapons release
parameters, and controlling flight display.
Working with Swedish pilots flying in the JAS-39 Gripen cockpit, Englund (2004) found
recognition deteriorated with increasing G-loads. It was also concluded that adaptation greatly
improved the results in all cases and introducing models for breathing was shown to improve
recognition scores significantly. Contrary to what might be expected, no effects of the broken
English of the speakers were found. It was evident that spontaneous speech caused problems for
the recognizer, as could be expected. A restricted vocabulary, and above all, a proper syntax,
could thus be expected to improve recognition accuracy substantially.
The Eurofighter Typhoon currently in service with the UK RAF employs a speaker-dependent
system, i.e. it requires each pilot to create a template. The system is not used for any safety
critical or weapon critical tasks, such as weapon release or lowering of the undercarriage, but is
used for a wide range of other cockpit functions. Voice commands are confirmed by visual
and/or aural feedback. The system is seen as a major design feature in the reduction of
pilot workload, and even allows the pilot to assign targets to himself with two simple voice
commands or to any of his wingmen with only five commands.
Speaker-independent systems are also being developed and are in testing for the F-35 Lightning II (JSF) and the Alenia Aermacchi M-346 Master lead-in fighter trainer. These systems have
produced word accuracy in excess of 98%.
HELICOPTERS
The problems of achieving high recognition accuracy under stress and noise pertain strongly to
the helicopter environment as well as to the jet fighter environment. The acoustic noise problem
is actually more severe in the helicopter environment, not only because of the high noise levels
but also because the helicopter pilot, in general, does not wear a facemask, which would reduce
acoustic noise in the microphone. Substantial test and evaluation programs have been carried out
in the past decade in speech recognition systems applications in helicopters, notably by the U.S.
Army Avionics Research and Development Activity (AVRADA) and by the Royal Aerospace
Establishment (RAE) in the UK. Work in France has included speech recognition in the Puma
helicopter. There has also been much useful work in Canada. Results have been encouraging,
and voice applications have included: control of communication radios, setting
of navigation systems, and control of an automated target handover system.
As in fighter applications, the overriding issue for voice in helicopters is the impact on pilot
effectiveness. Encouraging results are reported for the AVRADA tests, although these represent
only a feasibility demonstration in a test environment. Much remains to be done both in speech
recognition and in overall speech technology in order to consistently achieve performance
improvements in operational settings.
PROBLEMS:
In recent years, single-channel speech enhancement has attracted a considerable amount of
research attention because of the growing challenges in many important real-world applications,
including mobile speech communication, hearing aids design and robust speech recognition. The
goal of speech enhancement is to improve the intelligibility and quality of a noisy speech signal
degraded in adverse conditions. However, the performance of speech enhancement in real
acoustic environments is not always satisfactory. Numerous speech enhancement methods were
developed over the past several decades. Spectral subtraction subtracts an estimate of the short-term noise spectrum to produce an estimated spectrum of the clean speech. Iterative Wiener filtering based on an all-pole speech model has also been presented.
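For concreteness, a minimal C# sketch of power spectral subtraction follows. The over-subtraction factor and spectral floor are illustrative values; the flooring step is one source of the "musical noise" discussed next.

using System;

class SpectralSubtraction
{
    // Power spectral subtraction: remove a noise power estimate from the
    // noisy power spectrum, flooring the result so power never goes negative
    // (this flooring contributes to the residual "musical noise").
    static double[] Subtract(double[] noisyPower, double[] noisePower,
                             double overSubtraction, double floor)
    {
        double[] cleanPower = new double[noisyPower.Length];
        for (int k = 0; k < noisyPower.Length; k++)
        {
            double p = noisyPower[k] - overSubtraction * noisePower[k];
            cleanPower[k] = Math.Max(p, floor * noisyPower[k]);
        }
        return cleanPower;
    }

    static void Main()
    {
        double[] noisy = { 4.0, 1.0, 9.0 };   // per-bin noisy power
        double[] noise = { 1.0, 1.5, 2.0 };   // noise power estimated in speech pauses
        double[] clean = Subtract(noisy, noise, 1.0, 0.002);
        foreach (double p in clean) Console.WriteLine(p);
    }
}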
A common problem usually encountered in these conventional methods is that the resulting
enhanced speech often suffers from an annoying artifact called “musical noise”. Another notable
work was the minimum mean-square error (MMSE) estimator introduced by Ephraim and Malah
[6]; their MMSE log-spectral amplitude estimator could result in much lower residual noise
without further affecting the speech quality. An optimally modified log-spectral amplitude (OM-LSA) speech estimator and a minima-controlled recursive averaging (MCRA) noise estimation approach have also been presented. Although these traditional MMSE-based methods are able to yield lower musical noise (e.g., [10], [11]), a trade-off between reducing speech distortion and residual noise needs to be made due to the sophisticated statistical properties of the interactions between speech and noise signals. Most of these unsupervised methods are based on either the additive nature of the background noise, or the statistical properties of the speech and noise signals.
However, they often fail to track non-stationary noise in real-world scenarios under unexpected
acoustic conditions. Considering the complex process of noise corruption, a nonlinear model,
like the neural networks, might be suitable for modeling the mapping relationship between the
noisy and clean speech signals. Early work proposed using shallow neural networks (SNNs) as nonlinear filters to predict the clean signal in the time or frequency domain. For example, an SNN with only one hidden layer of 160 neurons was used to estimate the instantaneous signal-to-noise ratios (SNRs) on the amplitude modulation spectrograms (AMS), and the noise could then be suppressed according to the estimated SNRs of the different channels. However, the SNR was estimated at a limited frequency resolution of 15 channels, which was not efficient for suppressing noise types with sharp spectral peaks. Furthermore, the small network size cannot fully learn the relationship between the noisy features and the target SNRs.
1.3 LITERATURE SURVEY:
CHAPTER 2
2.0 SYSTEM ANALYSIS
2.1 EXISTING SYSTEM:
The existing method uses neural networks (NNs) within conventional joint-density Gaussian mixture model (JDGMM) based spectral conversion, which performs stably and effectively. However, the speech generated by these methods suffers severe quality degradation due to the following two factors: 1) the inadequacy of the JDGMM in modeling the distribution of spectral features as well as the non-linear mapping relationship between the source and target speakers, and 2) spectral detail loss caused by the use of high-level spectral features such as mel-cepstra. Previously, we have
proposed to use the mixture of restricted Boltzmann machines (MoRBM) and the mixture of
Gaussian bidirectional associative memories (MoGBAM) to cope with these problems.
Previous methods use an NN to construct a global non-linear mapping relationship between the spectral envelopes of two speakers; the NN is generatively trained by cascading two RBMs, which model the distributions of the spectral envelopes of the source and target speakers respectively, using a Bernoulli BAM (BBAM). Therefore, the proposed training method takes advantage of the
strong modeling ability of RBMs in modeling the distribution of spectral envelopes and the
superiority of BAMs in deriving the conditional distributions for conversion. Careful
comparisons and analysis among the proposed method and some conventional methods are
presented in this paper. The subjective results show that the proposed method can significantly
improve the performance in terms of both similarity and naturalness compared to conventional
methods.
2.1.1 DISADVANTAGES:
Neural networks (NNs) with a single-layer architecture cannot fully exploit a large training set, which is needed to ensure a powerful modeling capability for estimating the complicated nonlinear mapping from observed noisy speech to the desired clean signals. Although acoustic context was found to improve the continuity of the speech separated from the background noises, the annoying musical artifact commonly observed in conventional speech enhancement algorithms remains.
A series of pilot experiments was conducted under multi-condition training with more than 100 hours of simulated speech data, resulting in a poor generalization capability in mismatched testing conditions. When compared with the logarithmic minimum mean square error approach, the NN-based algorithm tends to achieve significant improvements in terms of various objective quality measures. Furthermore, in a subjective preference evaluation with 10 listeners, 66.35% of the subjects were found to prefer NN-based enhanced speech to that obtained with other conventional low-level techniques.
2.2 PROPOSED SYSTEM:
We proposed a regression DNN based speech enhancement framework via training a deep and
wide neural network architecture using a large collection of heterogeneous training data with
four noise types. It was found that the annoying musical noise artifact could be greatly reduced
with the DNN-based algorithm and the enhanced speech also showed an improved speech
quality both in terms of objective and subjective measures DNN-based speech enhancement
framework to handle adverse conditions and non-stationary noise types in real-world situations.
In traditional speech enhancement techniques, the noise estimate is usually updated by averaging the noisy speech power spectrum using time- and frequency-dependent smoothing factors, which are adjusted based on the estimated speech presence probability in individual frequency bins. Such noise tracking capacity is limited for highly non-stationary noise cases, and the estimator tends to distort the speech component in mixed signals if it is tuned for better noise reduction. In this work, the
acoustic context information, including the full frequency band and context frame expanding, is
well utilized to obtain the enhanced speech with reduced discontinuity.
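A minimal C# sketch of this context expansion is given below: each input vector to the DNN stacks the current frame with its neighbouring frames. Here one frame on each side is used for brevity; the actual context width is a tunable parameter.

using System;

class ContextExpansion
{
    // Stack each frame with its +/- tau neighbours so the DNN input carries
    // acoustic context across the full frequency band.
    static double[] Expand(double[][] frames, int t, int tau)
    {
        int dim = frames[0].Length;
        double[] input = new double[(2 * tau + 1) * dim];
        for (int offset = -tau; offset <= tau; offset++)
        {
            // Clamp at utterance boundaries by repeating the edge frame.
            int idx = Math.Min(Math.Max(t + offset, 0), frames.Length - 1);
            Array.Copy(frames[idx], 0, input, (offset + tau) * dim, dim);
        }
        return input;
    }

    static void Main()
    {
        double[][] frames = { new double[] { 1, 1 },
                              new double[] { 2, 2 },
                              new double[] { 3, 3 } };
        double[] x = Expand(frames, 1, 1);    // frames 0, 1 and 2 stacked
        Console.WriteLine(x.Length);          // 6
    }
}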
Furthermore, to improve the generalization capability, we include more than 100 different noise types in designing the training set for the DNN, which proved to be quite effective in handling unseen noise types, especially non-stationary noise components. Three strategies are also
proposed to further improve the quality of enhanced speech and generalization capability of
DNNs. First, equalization between the global variance (GV) of the enhanced features and the
reference clean speech features is proposed to alleviate the over-smoothing issue in DNN-based
speech enhancement system. The second technique, called dropout, is a recently proposed
strategy for training neural networks on data sets where over-fitting may be a concern. While this
method was not designed for noise reduction, it was demonstrated to be useful for noise robust
speech recognition and we successfully apply it to a DNN as a regression model to produce a
network that has good generalization ability to variabilities in the input. Finally, noise aware
training (NAT) is adopted to improve performance.
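As an illustration of the first strategy, the sketch below equalizes the global variance of the enhanced log-power features against clean-speech statistics in C#. The single-dimension treatment and the example statistics are simplifying assumptions; the actual system estimates the clean-speech GV from the training corpus, per feature dimension.

using System;

class GvEqualization
{
    // Global variance (GV) equalization: rescale the enhanced log-power
    // features so their variance over the utterance matches clean-speech
    // statistics, counteracting the over-smoothing of the regression DNN.
    static void Equalize(double[] enhanced, double cleanMean, double cleanVariance)
    {
        double mean = 0.0, variance = 0.0;
        foreach (double v in enhanced) mean += v;
        mean /= enhanced.Length;
        foreach (double v in enhanced) variance += (v - mean) * (v - mean);
        variance /= enhanced.Length;

        double scale = Math.Sqrt(cleanVariance / (variance + 1e-10));
        for (int i = 0; i < enhanced.Length; i++)
            enhanced[i] = scale * (enhanced[i] - mean) + cleanMean; // stretch around the mean
    }

    static void Main()
    {
        double[] enhanced = { -0.2, 0.1, 0.0, 0.3 };   // over-smoothed DNN outputs
        Equalize(enhanced, 0.0, 1.0);                   // clean statistics from training data
        foreach (double v in enhanced) Console.WriteLine(v);
    }
}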
2.2.1 ADVANTAGES:
The baseline DNN training procedure is first adopted, and several techniques are then proposed to improve the baseline DNN system so that the quality of the enhanced speech in matched noise conditions can be maintained while the generalization capability to unseen noise is increased. We compared the proposed normalized clean log-power spectra with mask-based training targets and verified the different initialization schemes. The evaluations of the proposed strategies then demonstrated their effectiveness in improving the generalization capacity to unseen noises.
Strong suppression of highly non-stationary noise was also found in the overall performance comparisons on 15 unseen noises and on real-world noises: the proposed normalized clean log-power spectra target was better than IRM and FFT-MASK under all conditions in our experimental setup, while IRM and FFT-MASK achieved almost the same performance. It should be noted that normalizing the proposed clean log-power spectra to zero mean and unit variance is crucial, which differs from FFT-MAG with log compression followed by percent normalization.
Finally, by using more noise types and the three proposed techniques, the PESQ improvements of the proposed DNN approach over LogMMSE under unseen noise types in Table VIII are comparable to those under matched noise types. The STOI results, representing the intelligibility of the enhanced speech, are presented in Table IX. LogMMSE is slightly better than the noisy speech, with an average STOI improvement from 0.81 to 0.82. The DNN baseline trained with 100 hours achieved a 0.86 STOI score on average, and the proposed strategies could further improve the performance.
2.3 HARDWARE & SOFTWARE REQUIREMENTS:
2.3.1 HARDWARE REQUIREMENT:
• Processor - Pentium IV
• Speed - 1.1 GHz
• RAM - 256 MB (min)
• Hard Disk - 20 GB
• Floppy Drive - 1.44 MB
• Keyboard - Standard Windows Keyboard
• Mouse - Two or Three Button Mouse
• Monitor - SVGA
2.3.2 SOFTWARE REQUIREMENTS:
.NET
• Operating System : Windows XP or Windows 7
• Front End : Microsoft Visual Studio .NET 2008
• Script : C#
• Documentation : MS Office 2007
CHAPTER 3
3.0 SYSTEM DESIGN:
Data Flow Diagram / Use Case Diagram / Flow Diagram:
• The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on these data, and the output data generated by the system.
• The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components: the system processes, the data used by the processes, the external entities that interact with the system, and the information flows in the system.
• The DFD shows how information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.
• A DFD may be used to represent a system at any level of abstraction, and it may be partitioned into levels that represent increasing information flow and functional detail.
NOTATION:
SOURCE OR DESTINATION OF DATA:
External sources or destinations, which may be people or organizations or other entities
DATA SOURCE:
Here the data referenced by a process is stored and retrieved.
PROCESS:
People, procedures, or devices that produce data; the physical component is not identified.
DATA FLOW:
Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of
data.
MODELING RULES:
There are several common modeling rules when creating DFDs:
1. All processes must have at least one data flow in and one data flow out.
2. All processes should modify the incoming data, producing new forms of outgoing data.
3. Each data store must be involved with at least one data flow.
4. Each external entity must be involved with at least one data flow.
5. A data flow must be attached to at least one process.
3.1 ARCHITECTURE DIAGRAM:
3.2 DATAFLOW DIAGRAM:
UML DIAGRAMS:
3.3 USE CASE DIAGRAM:
3.4 CLASS DIAGRAM:
3.5 SEQUENCE DIAGRAM:
3.6 ACTIVITY DIAGRAM:
CHAPTER 4
4.0 IMPLEMENTATION:
MINIMUM MEAN SQUARE ERROR (MMSE):
4.1 ALGORITHM:
SPEECH ENHANCEMENT ALGORITHMS (DNN):
4.2 MODULES:
AUDIO PREPROCESSING:
NOISE AWARE TRAINING:
NOISE REDUCTION DFT:
DEEP NEURAL NETWORKS:
SPEECH ENHANCEMENT:
4.3 MODULE DESCRIPTION:
AUDIO PREPROCESSING:
NOISE AWARE TRAINING:
NOISE REDUCTION DFT:
DEEP NEURAL NETWORKS:
SPEECH ENHANCEMENT:
CHAPTER 5
5.0 SYSTEM STUDY:
5.1 FEASIBILITY STUDY:
The feasibility of the project is analyzed in this phase, and a business proposal is put forth with a very general plan for the project and some cost estimates. During system analysis, the feasibility study of the proposed system is carried out, to ensure that the proposed system is not a burden to the company. For feasibility analysis, some understanding of the major requirements for the system is essential.
The three key considerations involved in the feasibility analysis are:
• ECONOMIC FEASIBILITY
• TECHNICAL FEASIBILITY
• SOCIAL FEASIBILITY
5.1.1 ECONOMIC FEASIBILITY:
This study is carried out to check the economic impact that the system will have on the
organization. The amount of fund that the company can pour into the research and development
of the system is limited. The expenditures must be justified. Thus the developed system is well within the budget, which was achieved because most of the technologies used are freely available. Only the customized products had to be purchased.
5.1.2 TECHNICAL FEASIBILITY:
This study is carried out to check the technical feasibility, that is, the technical requirements of the system. Any system developed must not place a high demand on the available technical resources, as this would lead to high demands being placed on the client. The developed system must have modest requirements, as only minimal or no changes are required for implementing this system.
5.1.3 SOCIAL FEASIBILITY:
This aspect of the study is to check the level of acceptance of the system by the user. This includes
the process of training the user to use the system efficiently. The user must not feel threatened by
the system, instead must accept it as a necessity. The level of acceptance by the users solely
depends on the methods that are employed to educate the user about the system and to make him
familiar with it. His level of confidence must be raised so that he is also able to make some
constructive criticism, which is welcomed, as he is the final user of the system.
5.2 SYSTEM TESTING:
Testing is a process of checking whether the developed system is working according to the
original objectives and requirements. It is a set of activities that can be planned in advance and
conducted systematically. Testing is vital to the success of the system. System testing makes a logical assumption that if all the parts of the system are correct, the goal will be successfully achieved. Inadequate testing, or no testing at all, leads to errors that may not appear until many months later. This creates two problems: the time lag between the cause and the appearance of the problem, and the effect of system errors on the files and records within the system. A small system error can conceivably explode into a much larger problem. Effective testing early in the process translates directly into long-term cost savings from a reduced number of errors. Another reason for system testing is its utility as a user-oriented vehicle before implementation. The best program is worthless if it does not produce the correct outputs.
5.2.1 UNIT TESTING:
A program represents the logical elements of a system. For a program to run satisfactorily, it must compile, process test data correctly, and tie in properly with other programs. Achieving an error-free program is the responsibility of the programmer. Program testing checks for two types of errors: syntax and logical. A syntax error is a program statement that violates one or more rules of the language in which it is written. An improperly defined field dimension or omitted keywords are common syntax errors. These errors are shown through error messages generated by the computer. For logic errors, the programmer must examine the output carefully.
5.2.2 FUNCTIONAL TESTING:
Functional testing of an application is used to prove that the application delivers correct results, using enough inputs to give an adequate level of confidence that it will work correctly for all sets of inputs. The functional testing will need to prove that the application works for each client type and that the personalization functions work correctly. When a program is tested, the actual output is compared with the expected output. When there is a discrepancy, the sequence of instructions must be traced to determine the problem. The process is facilitated by breaking the program into self-contained portions, each of which can be checked at certain key points. The idea is to compare program values against desk-calculated values to isolate the problems.
Description: Test for application window properties.
Expected result: All the properties of the windows are to be properly aligned and displayed.

Description: Test for mouse operations.
Expected result: All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions.
5.2.3 NON-FUNCTIONAL TESTING:
Non-functional software testing encompasses a rich spectrum of testing strategies, describing the expected results for every test case. It uses symbolic analysis techniques. This testing is used to check that an application will work in the operational environment. Non-functional testing includes:
• Load testing
• Performance testing
• Usability testing
• Reliability testing
• Security testing
Description: Test for all modules.
Expected result: All peers should communicate in the group.

Description: Test for various peers in a distributed network framework as it displays all users available in the group.
Expected result: The result after execution should give the accurate result.
5.2.4 LOAD TESTING:
An important tool for implementing system tests is a load generator. A load generator is essential for testing quality requirements such as performance and stress. A load can be a real load; that is, the system can be put under test to real usage by having actual telephone users connected to it, who generate the test input data for the system test.
5.2.5 PERFORMANCE TESTING:
Performance tests are utilized in order to determine the widely defined performance of the
software system such as execution time associated with various parts of the code, response time
and device utilization. The intent of this testing is to identify weak points of the software system
and quantify its shortcomings.
Description: It is necessary to ascertain that the application behaves correctly under load when a 'Server busy' response is received.
Expected result: Should designate another active node as a server.
5.2.6 RELIABILITY TESTING:
The software reliability is the ability of a system or component to perform its required functions
under stated conditions for a specified period of time and it is being ensured in this testing.
Reliability can be expressed as the ability of the software to reveal defects under testing conditions, according to the specified requirements. It is the probability that a software system will operate without failure under given conditions for a given time interval, and it focuses on the behavior of the software element. It forms a part of software quality control.
Description: This is required to assure that an application performs adequately, having the capability to handle many peers, delivering its results in the expected time and using an acceptable level of resources; it is an aspect of operational management.
Expected result: Should handle large input values and produce accurate results in the expected time.
Description: This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in providing the application.
Expected result: In case of failure of the server, an alternate server should take over the job.
5.2.7 SECURITY TESTING:
Security testing evaluates system characteristics that relate to the availability, integrity and
confidentiality of the system data and services. Users/Clients should be encouraged to make sure
their security needs are very clearly known at requirements time, so that the security issues can
be addressed by the designers and testers.
5.2.8 WHITE BOX TESTING:
White box testing, sometimes called glass-box testing, is a test-case design method that uses the control structure of the procedural design to derive test cases. Using the white box testing method, the software engineer can derive test cases. White box testing focuses on the inner structure of the software to be tested.
Description: Checking that the user identification is authenticated.
Expected result: In case of failure it should not be connected in the framework.

Description: Check whether group keys in a tree are shared by all peers.
Expected result: The peers should know the group key in the same group.
5.2.9 BLACK BOX TESTING:
Black box testing, also called behavioral testing, focuses on the functional requirements of the software. That is, black box testing enables the software engineer to derive sets of input conditions that will fully exercise all functional requirements for a program. Black box testing is not an alternative to white box techniques. Rather, it is a complementary approach that is likely to uncover a different class of errors than white box methods. Black box testing attempts to find errors in the inputs, outputs, and principal functions of a software module. The starting point of black box testing is either a specification or code. The contents of the box are hidden, and the stimulated software should produce the desired results.
Description: Exercise all logical decisions on their true and false sides.
Expected result: All the logical decisions must be valid.

Description: Execute all loops at their boundaries and within their operational bounds.
Expected result: All the loops must be finite.

Description: Exercise internal data structures to ensure their validity.
Expected result: All the data structures must be valid.
All the above system testing strategies are carried out, as the development, documentation, and institutionalization of the proposed goals and related policies are essential.
Description: To check for incorrect or missing functions.
Expected result: All the functions must be valid.

Description: To check for interface errors.
Expected result: The entire interface must function normally.

Description: To check for errors in data structures or external database access.
Expected result: The database update and retrieval must be done correctly.

Description: To check for initialization and termination errors.
Expected result: All the functions and data structures must be initialized properly and terminated normally.
CHAPTER 6
6.0 SOFTWARE SPECIFICATION:
6.1 FEATURES OF .NET:
Microsoft .NET is a set of Microsoft software technologies for rapidly building and integrating
XML Web services, Microsoft Windows-based applications, and Web solutions. The .NET
Framework is a language-neutral platform for writing programs that can easily and securely
interoperate. There's no language barrier with .NET: there are numerous languages available to the developer, including Managed C++, C#, Visual Basic, and JScript.
The .NET framework provides the foundation for components to interact seamlessly, whether
locally or remotely on different platforms. It standardizes common data types and
communications protocols so that components created in different languages can easily
interoperate.
“.NET” is also the collective name given to various software components built upon the .NET
platform. These will be both products (Visual Studio.NET and Windows.NET Server, for
instance) and services (like Passport, .NET My Services, and so on).
6.2 THE .NET FRAMEWORK
The .NET Framework has two main parts:
1. The Common Language Runtime (CLR).
2. A hierarchical set of class libraries.
The CLR is described as the "execution engine" of .NET. It provides the environment within which programs run. Its most important features are:
• Conversion from a low-level assembler-style language, called Intermediate Language (IL), into code native to the platform being executed on.
• Memory management, notably including garbage collection.
• Checking and enforcing security restrictions on the running code.
• Loading and executing programs, with version control and other such features.
The following features of the .NET framework are also worth describing:
Managed Code
The code that targets .NET, and which contains certain extra Information - “metadata” - to
describe itself. Whilst both managed and unmanaged code can run in the runtime, only managed
code contains the information that allows the CLR to guarantee, for instance, safe execution and
interoperability.
Managed Data
With Managed Code comes Managed Data. The CLR provides memory allocation and deallocation facilities, and garbage collection. Some .NET languages use Managed Data by default, such as C#, Visual Basic.NET and JScript.NET, whereas others, namely C++, do not. Targeting the CLR
can, depending on the language you’re using, impose certain constraints on the features
available. As with managed and unmanaged code, one can have both managed and unmanaged
data in .NET applications - data that doesn’t get garbage collected but instead is looked after by
unmanaged code.
Common Type System
The CLR uses something called the Common Type System (CTS) to strictly enforce type-safety.
This ensures that all classes are compatible with each other, by describing types in a common
way. The CTS defines how types work within the runtime, which enables types in one language to
interoperate with types in another language, including cross-language exception handling. As
well as ensuring that types are only used in appropriate ways, the runtime also ensures that code
doesn’t attempt to access memory that hasn’t been allocated to it.
Common Language Specification
The CLR provides built-in support for language interoperability. To ensure that you can develop
managed code that can be fully used by developers using any programming language, a set of
language features and rules for using them called the Common Language Specification (CLS)
has been defined. Components that follow these rules and expose only CLS features are
considered CLS-compliant.
6.3 THE CLASS LIBRARY
.NET provides a single-rooted hierarchy of classes, containing over 7000 types. The root of the namespace is called System; this contains basic types like Byte, Double, Boolean, and String, as well as Object. All objects derive from System.Object. In addition to objects, there are value types.
Value types can be allocated on the stack, which can provide useful flexibility. There are also
efficient means of converting value types to object types if and when necessary.
The set of classes is pretty comprehensive, providing collections, file, screen, and network I/O,
threading, and so on, as well as XML and database connectivity.
The class library is subdivided into a number of sets (or namespaces), each providing distinct
areas of functionality, with dependencies between the namespaces kept to a minimum.
6.4 LANGUAGES SUPPORTED BY .NET
The multi-language capability of the .NET Framework and Visual Studio .NET enables
developers to use their existing programming skills to build all types of applications and XML
Web services. The .NET framework supports new versions of Microsoft’s old favorites Visual
Basic and C++ (as VB.NET and Managed C++), but there are also a number of new additions to
the family.
Visual Basic .NET has been updated to include many new and improved language features that
make it a powerful object-oriented programming language. These features include inheritance,
interfaces, and overloading, among others. Visual Basic also now supports structured exception
handling, custom attributes and also supports multi-threading.
Visual Basic .NET is also CLS compliant, which means that any CLS-compliant language can
use the classes, objects, and components you create in Visual Basic .NET.
Managed Extensions for C++ and attributed programming are just some of the enhancements
made to the C++ language. Managed Extensions simplify the task of migrating existing C++
applications to the new .NET Framework.
C# is Microsoft’s new language. It’s a C-style language that is essentially “C++ for Rapid
Application Development”. Unlike other languages, its specification is just the grammar of the
language. It has no standard library of its own, and instead has been designed with the intention
of using the .NET libraries as its own.
Microsoft Visual J# .NET provides the easiest transition for Java-language developers into the
world of XML Web Services and dramatically improves the interoperability of Java-language
programs with existing software written in a variety of other programming languages.
Active State has created Visual Perl and Visual Python, which enable .NET-aware applications
to be built in either Perl or Python. Both products can be integrated into the Visual Studio .NET
environment. Visual Perl includes support for Active State’s Perl Dev Kit.
Other languages for which .NET compilers are available include:
• FORTRAN
• COBOL
• Eiffel
Fig. 1: The .NET Framework stack, comprising ASP.NET and XML Web Services, Windows Forms, the Base Class Libraries, the Common Language Runtime, and the Operating System.
C#.NET is also compliant with the CLS (Common Language Specification) and supports structured exception handling. The CLS is a set of rules and constructs that are supported by the CLR (Common Language Runtime). The CLR is the runtime environment provided by the .NET Framework; it manages the execution of the code and also makes the development process easier by providing services.
C#.NET is a CLS-compliant language. Any objects, classes, or components created in C#.NET can be used in any other CLS-compliant language. In addition, we can use objects, classes, and components created in other CLS-compliant languages in C#.NET. The use of the CLS ensures complete interoperability among applications, regardless of the languages used to create them.
CONSTRUCTORS AND DESTRUCTORS:
Constructors are used to initialize objects, whereas destructors are used to destroy them. In other words, destructors are used to release the resources allocated to the object. In C#.NET the Finalize procedure is available; it is used to complete the tasks that must be performed when an object is destroyed. The Finalize procedure is called automatically when an object is destroyed, and it can be called only from the class it belongs to or from derived classes.
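A short C# example of a constructor and a destructor (the C# destructor syntax, ~ClassName, compiles down to a Finalize override) is given below; the class and its console output are purely illustrative.

using System;

class FileHolder
{
    public string Name;

    // Constructor: initializes the object.
    public FileHolder(string name)
    {
        Name = name;
        Console.WriteLine("Opened " + name);
    }

    // Destructor (finalizer): called by the garbage collector before the
    // memory is reclaimed; used to release resources held by the object.
    ~FileHolder()
    {
        Console.WriteLine("Releasing " + Name);
    }
}

class Program
{
    static void Main()
    {
        new FileHolder("report.doc");  // becomes unreachable immediately
        GC.Collect();                  // demonstration only: force a collection
        GC.WaitForPendingFinalizers(); // so the finalizer output is visible
    }
}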
GARBAGE COLLECTION
Garbage Collection is another new feature in C#.NET. The .NET Framework monitors allocated
resources, such as objects and variables. In addition, the .NET Framework automatically releases
memory for reuse by destroying objects that are no longer in use.
In C#.NET, the garbage collector checks for the objects that are not currently in use by
applications. When the garbage collector comes across an object that is marked for garbage
collection, it releases the memory occupied by the object.
OVERLOADING
Overloading is another feature in C#. Overloading enables us to define multiple procedures with
the same name, where each procedure has a different set of arguments. Besides using
overloading for procedures, we can use it for constructors and properties in a class.
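A short illustrative C# example of overloading follows: three procedures share the name Area, and the compiler selects among them by the number and types of the arguments.

using System;

class OverloadDemo
{
    // Three procedures share one name; each has a different set of arguments.
    static double Area(double radius) { return Math.PI * radius * radius; } // circle
    static double Area(double width, double height) { return width * height; } // rectangle
    static int Area(int side) { return side * side; } // square

    static void Main()
    {
        Console.WriteLine(Area(2.0));       // picks Area(double)
        Console.WriteLine(Area(3.0, 4.0));  // picks Area(double, double)
        Console.WriteLine(Area(5));         // picks Area(int)
    }
}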
MULTITHREADING:
C#.NET also supports multithreading. An application that supports multithreading can handle multiple tasks simultaneously. We can use multithreading to decrease the time taken by an application to respond to user interaction.
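The following short C# example (illustrative only) starts a worker thread for a long-running task while the main thread remains free:

using System;
using System.Threading;

class ThreadDemo
{
    static void Main()
    {
        // A worker thread handles a long-running task so that the main
        // thread stays free to respond to user interaction.
        Thread worker = new Thread(delegate()
        {
            for (int i = 0; i < 3; i++)
            {
                Console.WriteLine("worker step " + i);
                Thread.Sleep(100);
            }
        });
        worker.Start();

        Console.WriteLine("main thread is still responsive");
        worker.Join();   // wait for the worker before exiting
    }
}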
STRUCTURED EXCEPTION HANDLING
C#.NET supports structured exception handling, which enables us to detect and remove errors at runtime.
In C#.NET, we need to use Try…Catch…Finally statements to create exception handlers. Using
Try…Catch…Finally statements, we can create robust and effective exception handlers to
improve the performance of our application.
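A short illustrative example of a Try...Catch...Finally handler in C#:

using System;

class ExceptionDemo
{
    static void Main()
    {
        int[] values = { 1, 2, 3 };
        try
        {
            Console.WriteLine(values[5]);                 // out-of-range access
        }
        catch (IndexOutOfRangeException ex)
        {
            Console.WriteLine("Handled: " + ex.Message);  // detect the error at runtime
        }
        finally
        {
            Console.WriteLine("Cleanup runs whether or not an exception occurred.");
        }
    }
}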
6.5 THE .NET FRAMEWORK
The .NET Framework is a new computing platform that simplifies application development in
the highly distributed environment of the Internet.
OBJECTIVES OF .NET FRAMEWORK
1. To provide a consistent object-oriented programming environment, whether object code is stored and executed locally, executed locally but Internet-distributed, or executed remotely.
2. To provide a code-execution environment that minimizes software deployment conflicts and guarantees safe execution of code.
3. To eliminate performance problems.
There are different types of application, such as Windows-based applications and Web-based
applications.
6.6 FEATURES OF SQL-SERVER
The OLAP Services feature available in SQL Server version 7.0 is now called SQL Server 2000
Analysis Services. The term OLAP Services has been replaced with the term Analysis Services.
Analysis Services also includes a new data mining component. The Repository component
available in SQL Server version 7.0 is now called Microsoft SQL Server 2000 Meta Data
Services. References to the component now use the term Meta Data Services. The term
repository is used only in reference to the repository engine within Meta Data Services.
A SQL Server database consists of several types of objects. They are:
1. TABLE
2. QUERY
3. FORM
4. REPORT
5. MACRO
6.7 TABLE:
A database is a collection of data about a specific topic.
VIEWS OF TABLE:
We can work with a table in two views:
1. Design View
2. Datasheet View
Design View
To build or modify the structure of a table, we work in the table design view, where we can specify what kind of data each field will hold.
Datasheet View
To add, edit, or analyze the data itself, we work in the table's datasheet view mode.
QUERY:
A query is a question asked of the data. Access gathers the data that answers the question from
one or more tables. The data that makes up the answer is either a dynaset (if it can be edited) or a
snapshot (which cannot be edited). Each time we run a query, we get the latest information in the
dynaset. Access either displays the dynaset or snapshot for us to view, or performs an action on
it, such as deleting or updating records.
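In a C#.NET front end, such a question is typically posed to the database through ADO.NET. The sketch below assumes a hypothetical table (NoisySamples) and connection string, both of which would need to be adapted to the actual database:

using System;
using System.Data.SqlClient;

class Program
{
    static void Main()
    {
        // Placeholder connection string; adjust the server and database names.
        string connectionString = "Data Source=.;Initial Catalog=SpeechDb;Integrated Security=True";

        using (SqlConnection connection = new SqlConnection(connectionString))
        {
            connection.Open();

            // The query asks a question of the data: here, all rows of a hypothetical table.
            SqlCommand command = new SqlCommand("SELECT Id, FileName FROM NoisySamples", connection);

            using (SqlDataReader result = command.ExecuteReader())
            {
                // Each run of the query returns the latest data.
                while (result.Read())
                {
                    Console.WriteLine(result["Id"] + ": " + result["FileName"]);
                }
            }
        }
    }
}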
CHAPTER 7
7.0 APPENDIX
7.1 SAMPLE SCREEN SHOTS:
7.2 SAMPLE SOURCE CODE:
CHAPTER 8
8.1 CONCLUSION AND FUTURE WORK:
In this paper, a DNN-based framework for speech enhancement was proposed. Among the various
DNN configurations, a large training set proved crucial for learning the rich structure of the
mapping function between noisy and clean speech features. It was found that using more
acoustic context information improves system performance and makes the enhanced speech
less discontinuous. Moreover, multi-condition training with many kinds of noise types can
achieve good generalization to unseen noise environments, which also makes the proposed DNN
framework powerful enough to cope with non-stationary noises in real-world environments.
An over-smoothing problem in speech quality was found in the MMSE-optimized DNNs, and
the proposed post-processing technique, called GV equalization, was effective in brightening the
formant spectra of the enhanced speech signals. Two improved training techniques were further
adopted to reduce the residual noise and increase performance. Compared with the
LogMMSE method, significant improvements were achieved across different unseen noise
conditions. Another interesting observation was that the proposed DNN-based speech
enhancement system is quite effective at dealing with real-world noisy speech in different
languages and across recording conditions not observed during DNN training.
In future studies, we would increase speech diversity by first incorporating clean speech data
from a rich collection of materials covering more languages and speakers. Second, there are
many factors in designing the training set; we would utilize principles of experimental design
[54], [55] for multi-factor analysis, to reduce the amount of training data required while
maintaining the good generalization capability of the DNN model. Third, other features, such as
Gammatone filterbank power spectra [50] and multi-resolution cochleagram features [56], will
be adopted as in [50] to enrich the input information to the DNNs. Finally, a dynamic noise
adaptation scheme will also be investigated to improve the tracking of non-stationary noises.
CHAPTER 9
9.1 REFERENCES:
[1] L.-H. Chen, Z.-H. Ling, L.-J. Liu, and L.-R. Dai, “Voice conversion using deep neural
networks with layer-wise generative training,” IEEE/ACM Trans. Audio, Speech, Lang.
Process., vol. 22, no. 12, pp. 1859–1872, Dec. 2014.
[2] Z.-H. Ling, L. Deng, and D. Yu, “Modeling spectral envelopes using restricted Boltzmann
machines and deep belief networks for statistical parametric speech synthesis,” IEEE Trans.
Audio, Speech, Lang. Process., vol. 21, no. 10, pp. 2129–2139, Oct. 2013.
[3] B.-Y. Xia and C.-C. Bao, “Speech enhancement with weighted denoising Auto-Encoder,” in
Proc. Interspeech, 2013, pp. 3444–3448.
[23] X.-G. Lu, Y. Tsao, S. Matsuda, and C. Hori, “Speech enhancement based on deep denoising
Auto-Encoder,” in Proc. Interspeech, 2013, pp. 436–440.
[4] A. L. Maas, Q. V. Le, T. M. O’Neil, O. Vinyals, P. Nguyen, and A. Y. Ng, “Recurrent neural
networks for noise reduction in robust ASR,” in Proc. Interspeech, 2012, pp. 22–25.
[5] M. Wollmer, Z. Zhang, F. Weninger, B. Schuller, and G. Rigoll, “Feature enhancement by
bidirectional LSTM networks for conversational speech recognition in highly non-stationary
noise,” in Proc. ICASSP, 2013, pp. 6822–6826.
[6] H. Christensen, J. Barker, N. Ma, and P. D. Green, “The CHiME corpus: A resource and a
challenge for computational hearing in multisource environments,” in Proc. Interspeech, 2010,
pp. 1918–1921.
[7] Y. X. Wang and D. L. Wang, “Towards scaling up classification-based speech separation,”
IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 7, pp. 1381–1390, Jul. 2013.

More Related Content

What's hot

Speech Recognition in Artificail Inteligence
Speech Recognition in Artificail InteligenceSpeech Recognition in Artificail Inteligence
Speech Recognition in Artificail InteligenceIlhaan Marwat
 
Linear Predictive Coding
Linear Predictive CodingLinear Predictive Coding
Linear Predictive CodingSrishti Kakade
 
Speech recognition system seminar
Speech recognition system seminarSpeech recognition system seminar
Speech recognition system seminarDiptimaya Sarangi
 
Mel frequency cepstral coefficient (mfcc)
Mel frequency cepstral coefficient (mfcc)Mel frequency cepstral coefficient (mfcc)
Mel frequency cepstral coefficient (mfcc)BushraShaikh44
 
Speech recognition final presentation
Speech recognition final presentationSpeech recognition final presentation
Speech recognition final presentationhimanshubhatti
 
filters for noise in image processing
filters for noise in image processingfilters for noise in image processing
filters for noise in image processingSardar Alam
 
Speech Recognition
Speech RecognitionSpeech Recognition
Speech Recognitionfathitarek
 
Image Restoration
Image RestorationImage Restoration
Image RestorationPoonam Seth
 
Implementation and comparison of Low pass filters in Frequency domain
Implementation and comparison of Low pass filters in Frequency domainImplementation and comparison of Low pass filters in Frequency domain
Implementation and comparison of Low pass filters in Frequency domainZara Tariq
 
Speaker recognition using MFCC
Speaker recognition using MFCCSpeaker recognition using MFCC
Speaker recognition using MFCCHira Shaukat
 
Speech Recognition System By Matlab
Speech Recognition System By MatlabSpeech Recognition System By Matlab
Speech Recognition System By MatlabAnkit Gujrati
 
Automatic speech recognition system
Automatic speech recognition systemAutomatic speech recognition system
Automatic speech recognition systemAlok Tiwari
 
Speech Recognition Technology
Speech Recognition TechnologySpeech Recognition Technology
Speech Recognition TechnologySrijanKumar18
 
Nyquist criterion for distortion less baseband binary channel
Nyquist criterion for distortion less baseband binary channelNyquist criterion for distortion less baseband binary channel
Nyquist criterion for distortion less baseband binary channelPriyangaKR1
 
Homomorphic speech processing
Homomorphic speech processingHomomorphic speech processing
Homomorphic speech processingsivakumar m
 

What's hot (20)

Speech Synthesis.pptx
Speech Synthesis.pptxSpeech Synthesis.pptx
Speech Synthesis.pptx
 
Speech Recognition in Artificail Inteligence
Speech Recognition in Artificail InteligenceSpeech Recognition in Artificail Inteligence
Speech Recognition in Artificail Inteligence
 
Linear Predictive Coding
Linear Predictive CodingLinear Predictive Coding
Linear Predictive Coding
 
Speech recognition system seminar
Speech recognition system seminarSpeech recognition system seminar
Speech recognition system seminar
 
Speech processing
Speech processingSpeech processing
Speech processing
 
Mel frequency cepstral coefficient (mfcc)
Mel frequency cepstral coefficient (mfcc)Mel frequency cepstral coefficient (mfcc)
Mel frequency cepstral coefficient (mfcc)
 
Speech recognition final presentation
Speech recognition final presentationSpeech recognition final presentation
Speech recognition final presentation
 
filters for noise in image processing
filters for noise in image processingfilters for noise in image processing
filters for noise in image processing
 
Speech Recognition
Speech RecognitionSpeech Recognition
Speech Recognition
 
Image Restoration
Image RestorationImage Restoration
Image Restoration
 
Speech encoding techniques
Speech encoding techniquesSpeech encoding techniques
Speech encoding techniques
 
Implementation and comparison of Low pass filters in Frequency domain
Implementation and comparison of Low pass filters in Frequency domainImplementation and comparison of Low pass filters in Frequency domain
Implementation and comparison of Low pass filters in Frequency domain
 
Speaker recognition using MFCC
Speaker recognition using MFCCSpeaker recognition using MFCC
Speaker recognition using MFCC
 
Speech Recognition System By Matlab
Speech Recognition System By MatlabSpeech Recognition System By Matlab
Speech Recognition System By Matlab
 
Automatic speech recognition system
Automatic speech recognition systemAutomatic speech recognition system
Automatic speech recognition system
 
Subband Coding
Subband CodingSubband Coding
Subband Coding
 
Equalization
EqualizationEqualization
Equalization
 
Speech Recognition Technology
Speech Recognition TechnologySpeech Recognition Technology
Speech Recognition Technology
 
Nyquist criterion for distortion less baseband binary channel
Nyquist criterion for distortion less baseband binary channelNyquist criterion for distortion less baseband binary channel
Nyquist criterion for distortion less baseband binary channel
 
Homomorphic speech processing
Homomorphic speech processingHomomorphic speech processing
Homomorphic speech processing
 

Similar to speech enhancement

AUTOMATIC SPEECH RECOGNITION- A SURVEY
AUTOMATIC SPEECH RECOGNITION- A SURVEYAUTOMATIC SPEECH RECOGNITION- A SURVEY
AUTOMATIC SPEECH RECOGNITION- A SURVEYIJCERT
 
Comparison and Analysis Of LDM and LMS for an Application of a Speech
Comparison and Analysis Of LDM and LMS for an Application of a SpeechComparison and Analysis Of LDM and LMS for an Application of a Speech
Comparison and Analysis Of LDM and LMS for an Application of a SpeechCSCJournals
 
M sc thesis_presentation_
M sc thesis_presentation_M sc thesis_presentation_
M sc thesis_presentation_Dia Abdulkerim
 
CURVELET BASED SPEECH RECOGNITION SYSTEM IN NOISY ENVIRONMENT: A STATISTICAL ...
CURVELET BASED SPEECH RECOGNITION SYSTEM IN NOISY ENVIRONMENT: A STATISTICAL ...CURVELET BASED SPEECH RECOGNITION SYSTEM IN NOISY ENVIRONMENT: A STATISTICAL ...
CURVELET BASED SPEECH RECOGNITION SYSTEM IN NOISY ENVIRONMENT: A STATISTICAL ...ijcsit
 
CURVELET BASED SPEECH RECOGNITION SYSTEM IN NOISY ENVIRONMENT: A STATISTICAL ...
CURVELET BASED SPEECH RECOGNITION SYSTEM IN NOISY ENVIRONMENT: A STATISTICAL ...CURVELET BASED SPEECH RECOGNITION SYSTEM IN NOISY ENVIRONMENT: A STATISTICAL ...
CURVELET BASED SPEECH RECOGNITION SYSTEM IN NOISY ENVIRONMENT: A STATISTICAL ...AIRCC Publishing Corporation
 
Ieee transactions on 2018 TOPICS with Abstract in audio, speech, and language...
Ieee transactions on 2018 TOPICS with Abstract in audio, speech, and language...Ieee transactions on 2018 TOPICS with Abstract in audio, speech, and language...
Ieee transactions on 2018 TOPICS with Abstract in audio, speech, and language...tsysglobalsolutions
 
EFFECT OF DYNAMIC TIME WARPING ON ALIGNMENT OF PHRASES AND PHONEMES
EFFECT OF DYNAMIC TIME WARPING ON ALIGNMENT OF PHRASES AND PHONEMESEFFECT OF DYNAMIC TIME WARPING ON ALIGNMENT OF PHRASES AND PHONEMES
EFFECT OF DYNAMIC TIME WARPING ON ALIGNMENT OF PHRASES AND PHONEMESkevig
 
Effect of Dynamic Time Warping on Alignment of Phrases and Phonemes
Effect of Dynamic Time Warping on Alignment of Phrases and PhonemesEffect of Dynamic Time Warping on Alignment of Phrases and Phonemes
Effect of Dynamic Time Warping on Alignment of Phrases and Phonemeskevig
 
E0502 01 2327
E0502 01 2327E0502 01 2327
E0502 01 2327IJMER
 
Bayesian distance metric learning and its application in automatic speaker re...
Bayesian distance metric learning and its application in automatic speaker re...Bayesian distance metric learning and its application in automatic speaker re...
Bayesian distance metric learning and its application in automatic speaker re...IJECEIAES
 
A Novel, Robust, Hierarchical, Text-Independent Speaker Recognition Technique
A Novel, Robust, Hierarchical, Text-Independent Speaker Recognition TechniqueA Novel, Robust, Hierarchical, Text-Independent Speaker Recognition Technique
A Novel, Robust, Hierarchical, Text-Independent Speaker Recognition TechniqueCSCJournals
 
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITIONQUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITIONijma
 
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITIONQUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITIONijma
 
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITIONQUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITIONijma
 
NLP Techniques for Speech Recognition.docx
NLP Techniques for Speech Recognition.docxNLP Techniques for Speech Recognition.docx
NLP Techniques for Speech Recognition.docxKevinSims18
 
SMATalk: Standard Malay Text to Speech Talk System
SMATalk: Standard Malay Text to Speech Talk SystemSMATalk: Standard Malay Text to Speech Talk System
SMATalk: Standard Malay Text to Speech Talk SystemCSCJournals
 
AN EFFICIENT SPEECH RECOGNITION SYSTEM
AN EFFICIENT SPEECH RECOGNITION SYSTEMAN EFFICIENT SPEECH RECOGNITION SYSTEM
AN EFFICIENT SPEECH RECOGNITION SYSTEMcseij
 
Sentiment analysis by deep learning approaches
Sentiment analysis by deep learning approachesSentiment analysis by deep learning approaches
Sentiment analysis by deep learning approachesTELKOMNIKA JOURNAL
 

Similar to speech enhancement (20)

AUTOMATIC SPEECH RECOGNITION- A SURVEY
AUTOMATIC SPEECH RECOGNITION- A SURVEYAUTOMATIC SPEECH RECOGNITION- A SURVEY
AUTOMATIC SPEECH RECOGNITION- A SURVEY
 
Mjfg now
Mjfg nowMjfg now
Mjfg now
 
Comparison and Analysis Of LDM and LMS for an Application of a Speech
Comparison and Analysis Of LDM and LMS for an Application of a SpeechComparison and Analysis Of LDM and LMS for an Application of a Speech
Comparison and Analysis Of LDM and LMS for an Application of a Speech
 
M sc thesis_presentation_
M sc thesis_presentation_M sc thesis_presentation_
M sc thesis_presentation_
 
CURVELET BASED SPEECH RECOGNITION SYSTEM IN NOISY ENVIRONMENT: A STATISTICAL ...
CURVELET BASED SPEECH RECOGNITION SYSTEM IN NOISY ENVIRONMENT: A STATISTICAL ...CURVELET BASED SPEECH RECOGNITION SYSTEM IN NOISY ENVIRONMENT: A STATISTICAL ...
CURVELET BASED SPEECH RECOGNITION SYSTEM IN NOISY ENVIRONMENT: A STATISTICAL ...
 
CURVELET BASED SPEECH RECOGNITION SYSTEM IN NOISY ENVIRONMENT: A STATISTICAL ...
CURVELET BASED SPEECH RECOGNITION SYSTEM IN NOISY ENVIRONMENT: A STATISTICAL ...CURVELET BASED SPEECH RECOGNITION SYSTEM IN NOISY ENVIRONMENT: A STATISTICAL ...
CURVELET BASED SPEECH RECOGNITION SYSTEM IN NOISY ENVIRONMENT: A STATISTICAL ...
 
Ieee transactions on 2018 TOPICS with Abstract in audio, speech, and language...
Ieee transactions on 2018 TOPICS with Abstract in audio, speech, and language...Ieee transactions on 2018 TOPICS with Abstract in audio, speech, and language...
Ieee transactions on 2018 TOPICS with Abstract in audio, speech, and language...
 
Asr
AsrAsr
Asr
 
EFFECT OF DYNAMIC TIME WARPING ON ALIGNMENT OF PHRASES AND PHONEMES
EFFECT OF DYNAMIC TIME WARPING ON ALIGNMENT OF PHRASES AND PHONEMESEFFECT OF DYNAMIC TIME WARPING ON ALIGNMENT OF PHRASES AND PHONEMES
EFFECT OF DYNAMIC TIME WARPING ON ALIGNMENT OF PHRASES AND PHONEMES
 
Effect of Dynamic Time Warping on Alignment of Phrases and Phonemes
Effect of Dynamic Time Warping on Alignment of Phrases and PhonemesEffect of Dynamic Time Warping on Alignment of Phrases and Phonemes
Effect of Dynamic Time Warping on Alignment of Phrases and Phonemes
 
E0502 01 2327
E0502 01 2327E0502 01 2327
E0502 01 2327
 
Bayesian distance metric learning and its application in automatic speaker re...
Bayesian distance metric learning and its application in automatic speaker re...Bayesian distance metric learning and its application in automatic speaker re...
Bayesian distance metric learning and its application in automatic speaker re...
 
A Novel, Robust, Hierarchical, Text-Independent Speaker Recognition Technique
A Novel, Robust, Hierarchical, Text-Independent Speaker Recognition TechniqueA Novel, Robust, Hierarchical, Text-Independent Speaker Recognition Technique
A Novel, Robust, Hierarchical, Text-Independent Speaker Recognition Technique
 
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITIONQUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION
 
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITIONQUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION
 
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITIONQUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION
 
NLP Techniques for Speech Recognition.docx
NLP Techniques for Speech Recognition.docxNLP Techniques for Speech Recognition.docx
NLP Techniques for Speech Recognition.docx
 
SMATalk: Standard Malay Text to Speech Talk System
SMATalk: Standard Malay Text to Speech Talk SystemSMATalk: Standard Malay Text to Speech Talk System
SMATalk: Standard Malay Text to Speech Talk System
 
AN EFFICIENT SPEECH RECOGNITION SYSTEM
AN EFFICIENT SPEECH RECOGNITION SYSTEMAN EFFICIENT SPEECH RECOGNITION SYSTEM
AN EFFICIENT SPEECH RECOGNITION SYSTEM
 
Sentiment analysis by deep learning approaches
Sentiment analysis by deep learning approachesSentiment analysis by deep learning approaches
Sentiment analysis by deep learning approaches
 

Recently uploaded

Heart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxHeart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxPoojaBan
 
HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2RajaP95
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Dr.Costas Sachpazis
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escortsranjana rawat
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxwendy cai
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionDr.Costas Sachpazis
 
power system scada applications and uses
power system scada applications and usespower system scada applications and uses
power system scada applications and usesDevarapalliHaritha
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSCAESB
 
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...ZTE
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024hassan khalil
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVRajaP95
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxDeepakSakkari2
 
chaitra-1.pptx fake news detection using machine learning
chaitra-1.pptx  fake news detection using machine learningchaitra-1.pptx  fake news detection using machine learning
chaitra-1.pptx fake news detection using machine learningmisbanausheenparvam
 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...srsj9000
 

Recently uploaded (20)

Heart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxHeart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptx
 
HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptx
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
 
power system scada applications and uses
power system scada applications and usespower system scada applications and uses
power system scada applications and uses
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentation
 
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
 
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptx
 
chaitra-1.pptx fake news detection using machine learning
chaitra-1.pptx  fake news detection using machine learningchaitra-1.pptx  fake news detection using machine learning
chaitra-1.pptx fake news detection using machine learning
 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
 

speech enhancement

  • 1. A REGRESSION APPROACH TO SPEECH ENHANCEMENT BASED ON DEEP NEURAL NETWORKS CHAPTER 1 ABSTRACT: In contrast to the conventional minimum mean square error (MMSE)-based noise reduction techniques, we propose a supervised method to enhance speech by means of finding a mapping function between noisy and clean speech signals based on deep neural networks (DNNs). In order to be able to handle a wide range of additive noises in real-world situations, a large training set that encompasses many possible combinations of speech and noise types, is first designed. DNN architecture is then employed as a nonlinear regression function to ensure a powerful modeling capability. Several techniques have also been proposed to improve the DNN-based speech enhancement system, including global variance equalization to alleviate the over- smoothing problem of the regression model, and the dropout and noise-aware training strategies to further improve the generalization capability of DNNs to unseen noise conditions. Experimental results demonstrate that the proposed framework can achieve significant improvements in both objective and subjective measures over the conventional MMSE based technique. It is also interesting to observe that the proposed DNN approach can well suppress highly nonstationary noise, which is tough to handle in general. Furthermore, the resulting DNN model, trained with artificial synthesized data, is also effective in dealing with noisy speech data recorded in real-world scenarios without the generation of the annoying musical artifact commonly observed in conventional enhancement methods.
  • 2. 1.2 INTRODUCTION: SPEECH ENHANCEMENT: Speech enhancement aims to improve speech quality by using various algorithms. The objective of enhancement is improvement in intelligibility and/or overall perceptual quality of degraded speech signal using audio signal processing techniques. Enhancing of speech degraded by noise, or noise reduction, is the most important field of speech enhancement, and used for many applications such as mobile phones, VoIP, teleconferencing systems, speech recognition, and hearing aids SPEECH RECOGNITION: Speech recognition (SR) is the inter-disciplinary sub-field of computational linguistics which incorporates knowledge and research in the linguistics, computer science, and electrical engineering fields to develop methodologies and technologies that enables the recognition and translation of spoken language into text by computers and computerized devices such as those categorized as Smart Technologies and robotics. It is also known as "automatic speech recognition" (ASR), "computer speech recognition", or just "speech to text" (STT). Some SR systems use "training" (also called "enrollment") where an individual speaker reads text or isolated vocabulary into the system. The system analyzes the person's specific voice and uses it to fine-tune the recognition of that person's speech, resulting in increased accuracy. Systems that do not use training are called "speaker independent"[1] systems. Systems that use training are called "speaker dependent". Speech recognition applications include voice user interfaces such as voice dialing (e.g. "Call home"), call routing (e.g. "I would like to make a collect call"), domotic appliance control, search (e.g. find a podcast where particular words were spoken), simple data entry (e.g., entering a credit card number), preparation of structured documents (e.g. a radiology report), speech-to-
  • 3. text processing (e.g., word processors or emails), and aircraft (usually termed Direct Voice Input). The term voice recognitionor speaker identification refers to identifying the speaker, rather than what they are saying. Recognizing the speaker can simplify the task of translating speech in systems that have been trained on a specific person's voice or it can be used to authenticate or verify the identity of a speaker as part of a security process. From the technology perspective, speech recognition has a long history with several waves of major innovations. Most recently, the field has benefited from advances in deep learning and big data. The advances are evidenced not only by the surge of academic papers published in the field, but more importantly by the world-wide industry adoption of a variety of deep learning methods in designing and deploying speech recognition systems. These speech industry players include Microsoft, Google, IBM, Baidu (China), Apple, Amazon, Nuance, IflyTek (China), many of which have publicized the core technology in their speech recognition systems being based on deep learning. MODELS, METHODS, AND ALGORITHMS: HIDDEN MARKOV MODELS Modern general-purpose speech recognition systems are based on Hidden Markov Models. These are statistical models that output a sequence of symbols or quantities. HMMs are used in speech recognition because a speech signal can be viewed as a piecewise stationary signal or a short-time stationary signal. In a short time-scale (e.g., 10 milliseconds), speech can be approximated as a stationary process. Speech can be thought of as a Markov model for many stochastic purposes. Another reason why HMMs are popular is because they can be trained automatically and are simple and computationally feasible to use. In speech recognition, the hidden Markov model
  • 4. would output a sequence of n-dimensional real-valued vectors (with n being a small integer, such as 10), outputting one of these every 10 milliseconds. The vectors would consist of cepstral coefficients, which are obtained by taking a Fourier transform of a short time window of speech and decorrelating the spectrum using a cosine transform, then taking the first (most significant) coefficients. The hidden Markov model will tend to have in each state a statistical distribution that is a mixture of diagonal covariance Gaussians, which will give a likelihood for each observed vector. Each word, or (for more general speech recognition systems), each phoneme, will have a different output distribution; a hidden Markov model for a sequence of words or phonemes is made by concatenating the individual trained hidden Markov models for the separate words and phonemes. Described above are the core elements of the most common, HMM-based approach to speech recognition. Modern speech recognition systems use various combinations of a number of standard techniques in order to improve results over the basic approach described above. A typical large-vocabulary system would need context dependency for the phonemes (so phonemes with different left and right context have different realizations as HMM states); it would use cepstral normalization to normalize for different speaker and recording conditions; for further speaker normalization it might use vocal tract length normalization (VTLN) for male- female normalization and maximum likelihood linear regression (MLLR) for more general speaker adaptation. The features would have so-called delta and delta-delta coefficients to capture speech dynamics and in addition might use heteroscedastic linear discriminant analysis (HLDA); or might skip the delta and delta-delta coefficients and use splicing and an LDA-based projection followed perhaps by heteroscedastic linear discriminant analysis or a global semi-tied co variance transform (also known as maximum likelihood linear transform, or MLLT). Many systems use so-called discriminative training techniques that dispense with a purely statistical approach to HMM parameter estimation and instead optimize some classification-related measure of the training data. Examples are maximum mutual information (MMI), minimum classification error (MCE) and minimum phone error (MPE).
  • 5. Decoding of the speech (the term for what happens when the system is presented with a new utterance and must compute the most likely source sentence) would probably use the Viterbi algorithm to find the best path, and here there is a choice between dynamically creating a combination hidden Markov model, which includes both the acoustic and language model information, and combining it statically beforehand (the finite state transducer, or FST, approach). A possible improvement to decoding is to keep a set of good candidates instead of just keeping the best candidate, and to use a better scoring function (re scoring) to rate these good candidates so that we may pick the best one according to this refined score. The set of candidates can be kept either as a list (the N-best list approach) or as a subset of the models (a lattice). Re scoring is usually done by trying to minimize the Bayes risk (or an approximation thereof): Instead of taking the source sentence with maximal probability, we try to take the sentence that minimizes the expectancy of a given loss function with regards to all possible transcriptions (i.e., we take the sentence that minimizes the average distance to other possible sentences weighted by their estimated probability). The loss function is usually the Levenshtein distance, though it can be different distances for specific tasks; the set of possible transcriptions is, of course, pruned to maintain tractability. Efficient algorithms have been devised to re score lattices represented as weighted finite state transducers with edit distances represented themselves as a finite state transducer verifying certain assumptions. DYNAMIC TIME WARPING (DTW)-BASED SPEECH RECOGNITION Dynamic time warping is an approach that was historically used for speech recognition but has now largely been displaced by the more successful HMM-based approach. Dynamic time warping is an algorithm for measuring similarity between two sequences that may vary in time or speed. For instance, similarities in walking patterns would be detected, even if in one video the person was walking slowly and if in another he or she were walking more quickly,
  • 6. or even if there were accelerations and deceleration during the course of one observation. DTW has been applied to video, audio, and graphics – indeed, any data that can be turned into a linear representation can be analyzed with DTW. A well-known application has been automatic speech recognition, to cope with different speaking speeds. In general, it is a method that allows a computer to find an optimal match between two given sequences (e.g., time series) with certain restrictions. That is, the sequences are "warped" non-linearly to match each other. This sequence alignment method is often used in the context of hidden Markov models. NEURAL NETWORKS Neural networks emerged as an attractive acoustic modeling approach in ASR in the late 1980s. Since then, neural networks have been used in many aspects of speech recognition such as phoneme classification, isolated word recognition, and speaker adaptation. In contrast to HMMs, neural networks make no assumptions about feature statistical properties and have several qualities making them attractive recognition models for speech recognition. When used to estimate the probabilities of a speech feature segment, neural networks allow discriminative training in a natural and efficient manner. Few assumptions on the statistics of input features are made with neural networks. However, in spite of their effectiveness in classifying short-time units such as individual phones and isolated words, neural networks are rarely successful for continuous recognition tasks, largely because of their lack of ability to model temporal dependencies. However, recently Recurrent Neural Networks(RNN's) and Time Delay Neural Networks(TDNN's)[44] have been used which have been shown to be able to identify latent temporal dependencies and use this information to perform the task of speech recognition. This however enormously increases the computational cost involved and hence makes the process of speech recognition slower. A lot of research is still going on in this field to ensure that TDNN's and RNN's can be used in a more computationally affordable way to improve the Speech Recognition Accuracy immensely.
  • 7. Deep Neural Networks and Denoising Autoencoders[45] are also being experimented with to tackle this problem in an effective manner. Due to the inability of traditional Neural Networks to model temporal dependencies, an alternative approach is to use neural networks as a pre-processing e.g. feature transformation,
  • 8. DEEP NEURAL NETWORKS AND OTHER DEEP LEARNING MODELS A deep neural network (DNN) is an artificial neural network with multiple hidden layers of units between the input and output layers. Similar to shallow neural networks, DNNs can model complex non-linear relationships. DNN architectures generate compositional models, where extra layers enable composition of features from lower layers, giving a huge learning capacity and thus the potential of modeling complex patterns of speech data.[47] The DNN is the most popular type of deep learning architectures successfully used as an acoustic model for speech recognition since 2010. The success of DNNs in large vocabulary speech recognition occurred in 2010 by industrial researchers, in collaboration with academic researchers, where large output layers of the DNN based on context dependent HMM states constructed by decision trees were adopted. See comprehensive reviews of this development and of the state of the art as of October 2014 in the recent Springer book from Microsoft Research. See also the related background of automatic speech recognition and the impact of various machine learning paradigms including notably deep learning in a recent overview article. One fundamental principle of deep learning is to do away with hand-crafted feature engineering and to use raw features. This principle was first explored successfully in the architecture of deep autoencoder on the "raw" spectrogram or linear filter-bank features, showing its superiority over the Mel-Cepstral features which contain a few stages of fixed transformation from spectrograms. The true "raw" features of speech, waveforms, have more recently been shown to produce excellent larger-scale speech recognition results. Since the initial successful debut of DNNs for speech recognition around 2009-2011, there have been huge new progresses made. This progress (as well as future directions) has been summarized into the following eight major areas:
  • 9. 1. Scaling up/out and speedup DNN training and decoding; 2. Sequence discriminative training of DNNs; 3. Feature processing by deep models with solid understanding of the underlying mechanisms; 4. Adaptation of DNNs and of related deep models; 5. Multi-task and transfer learning by DNNs and related deep models; 6. Convolution neural networks and how to design them to best exploit domain knowledge of speech; 7. Recurrent neural network and its rich LSTM variants; 8. Other types of deep models including tensor-based models and integrated deep generative/discriminative models. Large-scale automatic speech recognition is the first and the most convincing successful case of deep learning in the recent history, embraced by both industry and academic across the board. Between 2010 and 2014, the two major conferences on signal processing and speech recognition, IEEE-ICASSP and Interspeech, have seen near exponential growth in the numbers of accepted papers in their respective annual conference papers on the topic of deep learning for speech recognition. More importantly, all major commercial speech recognition systems (e.g., Microsoft Cortana, Xbox, Skype Translator, Google Now, Apple Siri, Baidu and iFlyTek voice search, and a range of Nuance speech products, etc.) nowadays are based on deep learning methods. APPLICATIONS: IN-CAR SYSTEMS Typically a manual control input, for example by means of a finger control on the steering- wheel, enables the speech recognition system and this is signalled to the driver by an audio prompt. Following the audio prompt, the system has a "listening window" during which it may accept a speech input for recognition.
  • 10. Simple voice commands may be used to initiate phone calls, select radio stations or play music from a compatible smartphone, MP3 player or music-loaded flash drive. Voice recognition capabilities vary between car make and model. Some of the most recentcar models offer natural- language speech recognition in place of a fixed set of commands. Allowing the driver to use full sentences and common phrases. With such systems there is, therefore, no need for the user to memorize a set of fixed command words. HEALTH CARE Medical documentation In the health care sector, speech recognition can be implemented in front-end or back-end of the medical documentation process. Front-end speech recognition is where the provider dictates into a speech-recognition engine, the recognized words are displayed as they are spoken, and the dictator is responsible for editing and signing off on the document. Back-end or deferred speech recognition is where the provider dictates into a digital dictation system, the voice is routed through a speech-recognition machine and the recognized draft document is routed along with the original voice file to the editor, where the draft is edited and report finalized. Deferred speech recognition is widely used in the industry currently. One of the major issues relating to the use of speech recognition in healthcare is that the American Recovery and Reinvestment Act of 2009 (ARRA) provides for substantial financial benefits to physicians who utilize an EMR according to "Meaningful Use" standards. These standards require that a substantial amount of data be maintained by the EMR (now more commonly referred to as an Electronic Health Record or EHR). The use of speech recognition is more naturally suited to the generation of narrative text, as part of a radiology/pathology interpretation, progress note or discharge summary: the ergonomic gains of using speech recognition to enter structured discrete data (e.g., numeric values or codes from a list or a controlled vocabulary) are relatively minimal for people who are sighted and who can operate a keyboard and mouse.
  • 11. A more significant issue is that most EHRs have not been expressly tailored to take advantage of voice-recognition capabilities. A large part of the clinician's interaction with the EHR involves navigation through the user interface using menus, and tab/button clicks, and is heavily dependent on keyboard and mouse: voice-based navigation provides only modest ergonomic benefits. By contrast, many highly customized systems for radiology or pathology dictation implement voice "macros", where the use of certain phrases - e.g., "normal report", will automatically fill in a large number of default values and/or generate boilerplate, which will vary with the type of the exam - e.g., a chest X-ray vs. a gastrointestinal contrast series for a radiology system. As an alternative to this navigation by hand, cascaded use of speech recognition and information extraction has been studied as a way to fill out a handover form for clinical proofing and sign- off. The results are encouraging, and the paper also opens data, together with the related performance benchmarks and some processing software, to the research and development community for studying clinical documentation and language-processing. MILITARY High-performance fighter aircraft Substantial efforts have been devoted in the last decade to the test and evaluation of speech recognition in fighter aircraft. Of particular note is the U.S. program in speech recognition for the Advanced Fighter Technology Integration (AFTI)/F-16 aircraft (F-16 VISTA), and a program in France installing speech recognition systems on Mirage aircraft, and also programs in the UK dealing with a variety of aircraft platforms. In these programs, speech recognizers have been operated successfully in fighter aircraft, with applications including: setting radio frequencies, commanding an autopilot system, setting steer-point coordinates and weapons release parameters, and controlling flight display.
  • 12. Working with Swedish pilots flying in the JAS-39 Gripen cockpit, Englund (2004) found recognition deteriorated with increasing G-loads. It was also concluded that adaptation greatly improved the results in all cases and introducing models for breathing was shown to improve recognition scores significantly. Contrary to what might be expected, no effects of the broken English of the speakers were found. It was evident that spontaneous speech caused problems for the recognizer, as could be expected. A restricted vocabulary, and above all, a proper syntax, could thus be expected to improve recognition accuracy substantially. The Eurofighter Typhoon currently in service with the UK RAF employs a speaker-dependent system, i.e. it requires each pilot to create a template. The system is not used for any safety critical or weapon critical tasks, such as weapon release or lowering of the undercarriage, but is used for a wide range of other cockpit functions. Voice commands are confirmed by visual and/or aural feedback. The system is seen as a major design feature in the reduction of pilot workload, and even allows the pilot to assign targets to himself with two simple voice commands or to any of his wingmen with only five commands. Speaker-independent systems are also being developed and are in testing for the F35 Lightning II (JSF) and the Alenia Aermacchi M-346 Master lead-in fighter trainer. These systems have produced word accuracy in excess of 98%. HELICOPTERS The problems of achieving high recognition accuracy under stress and noise pertain strongly to the helicopter environment as well as to the jet fighter environment. The acoustic noise problem is actually more severe in the helicopter environment, not only because of the high noise levels but also because the helicopter pilot, in general, does not wear a facemask, which would reduce acoustic noise in the microphone. Substantial test and evaluation programs have been carried out in the past decade in speech recognition systems applications in helicopters, notably by the U.S. Army Avionics Research and Development Activity (AVRADA) and by the Royal Aerospace Establishment (RAE) in the UK. Work in France has included speech recognition in the Puma
  • 13. helicopter. There has also been much useful work in Canada. Results have been encouraging, and voice applications have included: control of communication radios, setting of navigation systems, and control of an automated target handover system. As in fighter applications, the overriding issue for voice in helicopters is the impact on pilot effectiveness. Encouraging results are reported for the AVRADA tests, although these represent only a feasibility demonstration in a test environment. Much remains to be done both in speech recognition and in overall speech technology in order to consistently achieve performance improvements in operational settings. PROBLEMS: In recent years, single-channel speech enhancement has attracted a considerable amount of research attention because of the growing challenges in many important real-world applications, including mobile speech communication, hearing aids design and robust speech recognition. The goal of speech enhancement is to improve the intelligibility and quality of a noisy speech signal degraded in adverse conditions. However, the performance of speech enhancement in real acoustic environments is not always satisfactory. Numerous speech enhancement methods were developed over the past several decades. Spectral subtraction subtracts an estimate of the short- term noise spectrum to produce an estimated spectrum of the clean speech. In the iterative wiener filtering was presented using an all-pole model. A common problem usually encountered in these conventional methods is that the resulting enhanced speech often suffers from an annoying artifact called “musical noise”. Another notable work was the minimum mean-square error (MMSE) estimator introduced by Ephraim and Malah [6]; their MMSE log-spectral amplitude estimator could result in much lower residual noise without further affecting the speech quality. An optimally-modified log-spectral amplitude (OM- LSA) speech estimator and a minima controlled recursive averaging (MCRA) noise estimation approach were also presented in although these traditional MMSE-based methods are able to yield lower musical noise (e.g., [10], [11]), a trade-off in reducing speech distortion and residual
  • 14. noise needs to be made due to the sophisticated statistical properties of the interactions between speech and noise signals. Most of these unsupervised methods are based on either the additive nature of the background noise, or the statistical properties of the speech and noise signals. However they often fail to track non-stationary noise for real-world scenarios in unexpected acoustic conditions. Considering the complex process of noise corruption, a nonlinear model, like the neural networks, might be suitable for modeling the mapping relationship between the noisy and clean speech signals. Early work on using shallow neural networks (SNNs) as nonlinear filters to predict the clean signal in the time or frequency domain has been proposed the SNN with only one hidden layer using 160 neurons was proposed to estimate the instantaneous signal-to-noise ratios (SNRs) on the amplitude modulation spectrograms (AMS), and then the noise could be suppressed according to the estimated SNRs of different channels. However, the SNR was estimated in the limited frequency resolution with 15 channels and it was not efficient to suppress the noise type with sharp spectral peaks. Furthermore, the small network size can not fully learn the relationship between the noisy feature and the target SNRs.
  • 16. CHAPTER 2 2.0 SYSTEM ANALYSIS 2.1 EXISTING SYSTEM: Existing method using neural networks (NNs) in conventional joint density Gaussian mixture model (JDGMM) based spectral conversion methods perform stably and effectively. However, the speech generated by these methods suffer severe quality degradation due to the following two factors: 1) inadequacy of JDGMM in modeling the distribution of spectral features as well as the non-linear mapping relationship between the source and target speakers, 2) spectral detail loss caused by the use of high-level spectral features such as mel-cepstra. Previously, we have proposed to use the mixture of restricted Boltzmann machines (MoRBM) and the mixture of Gaussian bidirectional associative memories (MoGBAM) to cope with these problems. Previous methods to use a NN to construct a global non-linear mapping relationship between the spectral envelopes of two speakers generatively trained by cascading two RBMs, which model the distributions of spectral envelopes of source and target speakers respectively, using a Bernoulli BAM (BBAM). Therefore, the proposed training method takes the advantage of the strong modeling ability of RBMs in modeling the distribution of spectral envelopes and the superiority of BAMs in deriving the conditional distributions for conversion. Careful comparisons and analysis among the proposed method and some conventional methods are presented in this paper. The subjective results show that the proposed method can significantly improve the performance in terms of both similarity and naturalness compared to conventional methods.
  • 17. 2.1.1 DISADVANTAGES: Neural networks (NNs) with a single-layer architecture learning process, a large training set ensures a powerful modeling capability to estimate the complicated nonlinear mapping from observed noisy speech to desired clean signals. Acoustic context was found to improve the continuity of speech to be separated from the background noises with the annoying musical artifact commonly observed in conventional speech enhancement algorithms. A series of pilot experiments were conducted under multi-condition training with more than 100 hours of simulated speech data, resulting in a bad generalization capability even in mismatched testing conditions. When compared with the logarithmic minimum mean square error approach. NN-based algorithm tends to achieve significant improvements in terms of various objective quality measures. Furthermore, in a subjective preference evaluation with 10 listeners, 66.35% of the subjects were found to prefer NN-based enhanced speech to that obtained with other conventional low level technique.
  • 18. 2.2 PROPOSED SYSTEM: We proposed a regression DNN based speech enhancement framework via training a deep and wide neural network architecture using a large collection of heterogeneous training data with four noise types. It was found that the annoying musical noise artifact could be greatly reduced with the DNN-based algorithm and the enhanced speech also showed an improved speech quality both in terms of objective and subjective measures DNN-based speech enhancement framework to handle adverse conditions and non-stationary noise types in real-world situations. In traditional speech enhancement techniques, the noise estimate is usually updated by averaging the noisy speech power spectrum using time and frequency dependent smoothing factors, which are adjusted based on the estimated speech presence probability in individual frequency bins its noise tracking capacity is limited for highly non-stationary noise cases, and it tends to distort the speech component in mixed signals if it is tuned for better noise reduction. In this work, the acoustic context information, including the full frequency band and context frame expanding, is well utilized to obtain the enhanced speech with reduced discontinuity. Furthermore to improve the generalization capability we include more than 100 different noise types in designing the training set for DNN which proved to be quite effective in handling unseen noise types, especially non-stationary noise components. Three strategies are also proposed to further improve the quality of enhanced speech and generalization capability of DNNs. First, equalization between the global variance (GV) of the enhanced features and the reference clean speech features is proposed to alleviate the over-smoothing issue in DNN-based speech enhancement system. The second technique, called dropout, is a recently proposed strategy for training neural networks on data sets where over-fitting may be a concern. While this method was not designed for noise reduction, it was demonstrated to be useful for noise robust speech recognition and we successfully apply it to a DNN as a regression model to produce a network that has good generalization ability to variabilities in the input. Finally, noise aware training (NAT), first proposed in is adopted to improve performance.
2.2.1 ADVANTAGES:
 Starting from the baseline DNN training procedure, several techniques are proposed so that the quality of the enhanced speech in matched noise conditions is maintained while the generalization capability to unseen noise is increased.
 The proposed normalized clean log-power spectra target was compared with mask-based training targets, and different initialization schemes were verified. Evaluations of the proposed strategies demonstrated their effectiveness in improving generalization to unseen noises, and good suppression of highly non-stationary noise was also observed.
 In overall performance comparisons on 15 unseen noise types and on real-world noises, the normalized clean log-power spectra target was better than the ideal ratio mask (IRM) and FFT-MASK targets under all conditions in our experimental setup, with IRM and FFT-MASK obtaining almost the same performance. It should be noted that normalizing the clean log-power spectra to zero mean and unit variance is crucial, and differs from FFT-MAG with log compression followed by percent normalization.
 By using more noise types and the three proposed techniques, the PESQ improvements of the proposed DNN approach over LogMMSE under unseen noise types are comparable to those reported under matched noise types. STOI results, representing the intelligibility of the enhanced speech, show that LogMMSE is only slightly better than the noisy input, with an average STOI improvement from 0.81 to 0.82, while the DNN baseline trained with 100 hours reaches an average STOI of 0.86; the proposed strategies could further improve this performance.
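Since the zero-mean, unit-variance normalization of the target log-power spectra is called out above as crucial, the following is a minimal sketch of per-dimension mean-variance normalization. The jagged array layout and class name are assumptions; in practice the statistics would be estimated on the training set and inverted when reconstructing the enhanced speech.

```csharp
using System;

static class FeatureNormalizer
{
    // Estimate per-dimension mean and standard deviation on training data.
    public static void Fit(double[][] data, out double[] mean, out double[] std)
    {
        int n = data.Length, dims = data[0].Length;
        mean = new double[dims];
        std = new double[dims];
        foreach (var frame in data)
            for (int d = 0; d < dims; d++) mean[d] += frame[d];
        for (int d = 0; d < dims; d++) mean[d] /= n;
        foreach (var frame in data)
            for (int d = 0; d < dims; d++)
            {
                double diff = frame[d] - mean[d];
                std[d] += diff * diff;
            }
        for (int d = 0; d < dims; d++)
            std[d] = Math.Sqrt(std[d] / n) + 1e-10; // guard against zero variance
    }

    // Normalize in place: x <- (x - mean) / std, giving zero mean, unit variance.
    public static void Apply(double[][] data, double[] mean, double[] std)
    {
        foreach (var frame in data)
            for (int d = 0; d < frame.Length; d++)
                frame[d] = (frame[d] - mean[d]) / std[d];
    }
}
```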
2.3 HARDWARE & SOFTWARE REQUIREMENTS:
2.3.1 HARDWARE REQUIREMENTS:
 Processor - Pentium IV
 Speed - 1.1 GHz
 RAM - 256 MB (min)
 Hard Disk - 20 GB
 Floppy Drive - 1.44 MB
 Keyboard - Standard Windows keyboard
 Mouse - Two- or three-button mouse
 Monitor - SVGA
2.3.2 SOFTWARE REQUIREMENTS (.NET):
 Operating System : Windows XP or Windows 7
 Front End : Microsoft Visual Studio .NET 2008
 Coding Language : C#
 Documentation : MS-Office 2007
CHAPTER 3
3.0 SYSTEM DESIGN:
Data Flow Diagram / Use Case Diagram / Flow Diagram:
 The data flow diagram (DFD), also called a bubble chart, is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on these data, and the output data generated by the system.
 The DFD is one of the most important modeling tools. It is used to model the system components: the system processes, the data used by the processes, the external entities that interact with the system, and the information flows in the system.
 The DFD shows how information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations applied as data moves from input to output.
 A DFD may be used to represent a system at any level of abstraction and may be partitioned into levels that represent increasing information flow and functional detail.
NOTATION:
SOURCE OR DESTINATION OF DATA: External sources or destinations, which may be people, organizations, or other entities.
DATA STORE: Here the data referenced by a process is stored and retrieved.
PROCESS: People, procedures, or devices that transform data; the physical component is not identified.
DATA FLOW: Data moves in a specific direction from an origin to a destination. The data flow is a "packet" of data.
MODELING RULES: There are several common modeling rules when creating DFDs:
1. All processes must have at least one data flow in and one data flow out.
2. All processes should modify the incoming data, producing new forms of outgoing data.
3. Each data store must be involved with at least one data flow.
4. Each external entity must be involved with at least one data flow.
5. A data flow must be attached to at least one process.
3.1 DATA FLOW DIAGRAM:
UML DIAGRAMS:
3.2 USE CASE DIAGRAM:
3.3 CLASS DIAGRAM:
3.4 SEQUENCE DIAGRAM:
3.5 ACTIVITY DIAGRAM:
(The diagrams appear as figures in the original slides.)
CHAPTER 4
4.0 IMPLEMENTATION:
MINIMUM MEAN SQUARE ERROR (MMSE):
In the conventional approach, the clean speech spectrum is estimated by minimizing the mean square error between the estimated and the true (log-)spectral amplitudes; this LogMMSE estimator serves as the baseline against which the proposed DNN system is compared.
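The full MMSE derivation is lengthy; as a hedged stand-in, the sketch below implements a simplified Wiener-type spectral gain that captures the same idea of per-bin suppression driven by an estimated SNR. It is not the exact LogMMSE estimator used as the baseline; noisyPower and noisePower are assumed per-bin power estimates for one frame, with the noise estimate taken, for example, from leading silence frames.

```csharp
using System;

static class SpectralGain
{
    // Simplified Wiener-type gain per frequency bin (illustrative, not the
    // full LogMMSE estimator): xi = max(|Y|^2 / lambda_noise - 1, floor),
    // G = xi / (1 + xi). The gain multiplies the noisy magnitude spectrum.
    public static double[] SuppressFrame(double[] noisyPower, double[] noisePower)
    {
        var gain = new double[noisyPower.Length];
        for (int k = 0; k < noisyPower.Length; k++)
        {
            double snr = Math.Max(
                noisyPower[k] / Math.Max(noisePower[k], 1e-10) - 1.0, 0.01);
            gain[k] = snr / (1.0 + snr);
        }
        return gain;
    }
}
```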
4.2 MODULES:
 AUDIO PREPROCESSING
 NOISE-AWARE TRAINING
 NOISE REDUCTION DFT
 DEEP NEURAL NETWORKS
 SPEECH ENHANCEMENT
4.3 MODULE DESCRIPTION:
AUDIO PREPROCESSING: The noisy waveform is split into overlapping frames, windowed, and transformed with the DFT to obtain the log-power spectral features used by the rest of the system.
NOISE-AWARE TRAINING: An estimate of the noise in each utterance is supplied to the DNN together with the noisy features, so that the network is informed about the noise condition it must remove.
NOISE REDUCTION DFT: Noise suppression is performed in the DFT domain, where a clean log-power spectrum is estimated for each noisy frame.
DEEP NEURAL NETWORKS: A deep architecture is trained as a nonlinear regression function mapping noisy log-power spectra, with acoustic context, to the corresponding clean features.
SPEECH ENHANCEMENT: The estimated clean spectra are combined with the phase of the noisy signal and inverse-transformed to synthesize the enhanced waveform.
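To make the AUDIO PREPROCESSING and NOISE REDUCTION DFT steps concrete, here is a self-contained sketch that computes the log-power spectrum of one frame with a Hamming window and a direct DFT. A real implementation would use an FFT; the class name and frame handling are illustrative assumptions.

```csharp
using System;

static class FrontEnd
{
    // Log-power spectrum of one analysis frame: Hamming window followed by a
    // direct DFT (an FFT would be used in practice) and log compression.
    public static double[] LogPowerSpectrum(double[] frame)
    {
        int n = frame.Length;
        var windowed = new double[n];
        for (int t = 0; t < n; t++)
            windowed[t] = frame[t] * (0.54 - 0.46 * Math.Cos(2.0 * Math.PI * t / (n - 1)));

        var logPow = new double[n / 2 + 1];
        for (int k = 0; k <= n / 2; k++)
        {
            double re = 0.0, im = 0.0;
            for (int t = 0; t < n; t++)
            {
                double angle = -2.0 * Math.PI * k * t / n;
                re += windowed[t] * Math.Cos(angle);
                im += windowed[t] * Math.Sin(angle);
            }
            // Small constant avoids log(0) for silent bins.
            logPow[k] = Math.Log(re * re + im * im + 1e-10);
        }
        return logPow;
    }
}
```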
CHAPTER 5
5.0 SYSTEM STUDY:
5.1 FEASIBILITY STUDY:
The feasibility of the project is analyzed in this phase, and a business proposal is put forth with a very general plan for the project and some cost estimates. During system analysis, the feasibility study of the proposed system is carried out to ensure that the proposed system is not a burden to the company. For feasibility analysis, some understanding of the major requirements for the system is essential. Three key considerations involved in the feasibility analysis are:
 ECONOMICAL FEASIBILITY
 TECHNICAL FEASIBILITY
 SOCIAL FEASIBILITY
5.1.1 ECONOMICAL FEASIBILITY:
This study is carried out to check the economic impact the system will have on the organization. The amount of funds that the company can pour into the research and development of the system is limited, and the expenditures must be justified. The developed system is well within the budget, which was achieved because most of the technologies used are freely available; only the customized products had to be purchased.
5.1.2 TECHNICAL FEASIBILITY:
This study is carried out to check the technical feasibility, that is, the technical requirements of the system. Any system developed must not place a high demand on the available technical
resources, as this would lead to high demands being placed on the client. The developed system must have modest requirements, as only minimal or no changes are required for implementing it.
5.1.3 SOCIAL FEASIBILITY:
This aspect of the study checks the level of acceptance of the system by the user. This includes the process of training the user to use the system efficiently. The user must not feel threatened by the system, but must instead accept it as a necessity. The level of acceptance by the users solely depends on the methods employed to educate the user about the system and to make him familiar with it. His level of confidence must be raised so that he is also able to make constructive criticism, which is welcomed, as he is the final user of the system.
5.2 SYSTEM TESTING:
Testing is the process of checking whether the developed system works according to the original objectives and requirements. It is a set of activities that can be planned in advance and conducted systematically. Testing is vital to the success of the system. System testing makes the logical assumption that if all parts of the system are correct, the overall goal will be successfully achieved. Inadequate testing, or no testing at all, leads to errors that may not appear until many months later. This creates two problems: the time lag between the cause and the appearance of the problem, and the effect of system errors on the files and records within the system. A small system error can conceivably explode into a much larger problem. Effective testing early in the process translates directly into long-term cost savings from a reduced number of errors. Another reason for system testing is its utility as a user-oriented vehicle before implementation. The best program is worthless if it does not produce correct outputs.
5.2.1 UNIT TESTING:
A program represents the logical elements of a system. For a program to run satisfactorily, it must compile, process test data correctly, and tie in properly with other programs. Achieving an error-free program is the responsibility of the programmer. Program testing checks for two types of errors: syntax and logic. A syntax error is a program statement that violates one or more rules of the language in which it is written; an improperly defined field dimension or omitted keywords are common syntax errors, and these are flagged through error messages generated by the computer. For logic errors, the programmer must examine the output carefully.
UNIT TESTING:
Description: Test for application window properties.
Expected result: All the properties of the windows are to be properly aligned and displayed.
Description: Test for mouse operations.
Expected result: All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions.
5.2.2 FUNCTIONAL TESTING:
Functional testing of an application is used to prove that the application delivers correct results, using enough inputs to give an adequate level of confidence that it will work correctly for all sets of inputs. The functional testing will need to prove that the application works for each client type and that the personalization functions work correctly. When a program is tested, the actual output is compared with the expected output. When there is a discrepancy, the sequence of instructions must be traced to determine the problem. The process is facilitated by breaking the program into self-contained portions, each of which can be checked at certain key points. The idea is to compare program values against desk-calculated values to isolate the problems.
FUNCTIONAL TESTING:
Description: Test for all modules.
Expected result: All peers should communicate in the group.
Description: Test for the various peers in a distributed network framework, displaying all users available in the group.
Expected result: The result after execution should give the accurate result.
5.2.3 NON-FUNCTIONAL TESTING:
Non-functional software testing encompasses a rich spectrum of testing strategies, describing the expected results for every test case, and uses symbolic analysis techniques. This testing is used to check that an application will work in the operational environment. Non-functional testing includes:
 Load testing
 Performance testing
 Usability testing
 Reliability testing
 Security testing
5.2.4 LOAD TESTING:
An important tool for implementing system tests is a load generator, which is essential for testing quality requirements such as performance and stress. A load can be a real load; that is, the system can be put under test to real usage by having actual telephone users connected to it, who will generate test input data for the system test.
LOAD TESTING:
Description: It is necessary to ascertain that the application behaves correctly under loads when a 'Server busy' response is received.
Expected result: Should designate another active node as a server.
5.2.5 PERFORMANCE TESTING:
Performance tests are utilized to determine the widely defined performance of the software system, such as the execution time associated with various parts of the code, response time, and device utilization. The intent of this testing is to identify weak points of the software system and quantify its shortcomings.
PERFORMANCE TESTING:
Description: This is required to assure that the application performs adequately, having the capability to handle many peers, delivering its results in the expected time, and using an acceptable level of resources; it is an aspect of operational management.
Expected result: Should handle large input values and produce accurate results in the expected time.
5.2.6 RELIABILITY TESTING:
Software reliability is the ability of a system or component to perform its required functions under stated conditions for a specified period of time, and it is ensured in this testing. Reliability can be expressed as the ability of the software to reveal defects under testing conditions, according to the specified requirements. It is the probability that a software system will operate without failure under given conditions for a given time interval, and it focuses on the behavior of the software element. It forms part of the software quality control effort.
RELIABILITY TESTING:
Description: This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in providing the application.
Expected result: In case of failure of the server, an alternate server should take over the job.
5.2.7 SECURITY TESTING:
Security testing evaluates system characteristics that relate to the availability, integrity, and confidentiality of the system data and services. Users/clients should be encouraged to make sure their security needs are clearly known at requirements time, so that the security issues can be addressed by the designers and testers.
SECURITY TESTING:
Description: Checking that the user identification is authenticated.
Expected result: In case of failure, it should not be connected in the framework.
Description: Check whether group keys in a tree are shared by all peers.
Expected result: The peers should know the group key in the same group.
5.2.8 WHITE BOX TESTING:
White box testing, sometimes called glass-box testing, is a test case design method that uses the control structure of the procedural design to derive test cases. Using the white box testing method, the software engineer can derive test cases. White box testing focuses on the inner structure of the software to be tested.
WHITE BOX TESTING:
Description: Exercise all logical decisions on their true and false sides.
Expected result: All the logical decisions must be valid.
Description: Execute all loops at their boundaries and within their operational bounds.
Expected result: All the loops must be finite.
Description: Exercise internal data structures to ensure their validity.
Expected result: All the data structures must be valid.
5.2.9 BLACK BOX TESTING:
Black box testing, also called behavioral testing, focuses on the functional requirements of the software. That is, black box testing enables the software engineer to derive sets of input conditions that will fully exercise all functional requirements for a program. Black box testing is not an alternative to white box techniques; rather, it is a complementary approach that is likely to uncover a different class of errors than white box methods. Black box testing attempts to find errors with a focus on the inputs, outputs, and principal functions of a software module. The starting point of black box testing is either a specification or code; the contents of the box are hidden, and the stimulated software should produce the desired results.
BLACK BOX TESTING:
Description: To check for incorrect or missing functions.
Expected result: All the functions must be valid.
Description: To check for interface errors.
Expected result: The entire interface must function normally.
Description: To check for errors in data structures or external database access.
Expected result: The database update and retrieval must be done correctly.
Description: To check for initialization and termination errors.
Expected result: All the functions and data structures must be initialized properly and terminated normally.
All the above system testing strategies are carried out, as the development, documentation, and institutionalization of the proposed goals and related policies are essential.
CHAPTER 6
6.0 SOFTWARE SPECIFICATION:
6.1 FEATURES OF .NET:
Microsoft .NET is a set of Microsoft software technologies for rapidly building and integrating XML Web services, Microsoft Windows-based applications, and Web solutions. The .NET Framework is a language-neutral platform for writing programs that can easily and securely interoperate. There is no language barrier with .NET: numerous languages are available to the developer, including Managed C++, C#, Visual Basic, and JScript. The .NET Framework provides the foundation for components to interact seamlessly, whether locally or remotely on different platforms. It standardizes common data types and communication protocols so that components created in different languages can easily interoperate.
".NET" is also the collective name given to various software components built upon the .NET platform. These are both products (Visual Studio .NET and Windows .NET Server, for instance) and services (like Passport, .NET My Services, and so on).
6.2 THE .NET FRAMEWORK:
The .NET Framework has two main parts:
1. The Common Language Runtime (CLR).
2. A hierarchical set of class libraries.
The CLR is described as the "execution engine" of .NET. It provides the environment within which programs run. Its most important features are:
 Conversion from a low-level, assembler-style language, called Intermediate Language (IL), into code native to the platform being executed on.
 Memory management, notably including garbage collection.
 Checking and enforcing security restrictions on the running code.
 Loading and executing programs, with version control and other such features.
The following features of the .NET Framework are also worth describing:
Managed Code
Managed code is code that targets .NET and contains certain extra information ("metadata") to describe itself. While both managed and unmanaged code can run in the runtime, only managed code contains the information that allows the CLR to guarantee, for instance, safe execution and interoperability.
Managed Data
With managed code comes managed data. The CLR provides memory allocation and deallocation facilities, and garbage collection. Some .NET languages use managed data by default, such as C#, Visual Basic .NET, and JScript .NET, whereas others, namely C++, do not. Targeting the CLR can, depending on the language you are using, impose certain constraints on the features available. As with managed and unmanaged code, one can have both managed and unmanaged data in .NET applications: data that does not get garbage collected but is instead looked after by unmanaged code.
Common Type System
The CLR uses the Common Type System (CTS) to strictly enforce type safety. This ensures that all classes are compatible with each other by describing types in a common way. The CTS defines how types work within the runtime, which enables types in one language to interoperate with types in another language, including cross-language exception handling. As well as ensuring that types are only used in appropriate ways, the runtime also ensures that code does not attempt to access memory that has not been allocated to it.
Common Language Specification
The CLR provides built-in support for language interoperability. To ensure that managed code can be fully used by developers using any programming language, a set of language features, and rules for using them, called the Common Language Specification (CLS) has been defined. Components that follow these rules and expose only CLS features are considered CLS-compliant.
6.3 THE CLASS LIBRARY:
.NET provides a single-rooted hierarchy of classes containing over 7000 types. The root of the namespace is called System; it contains basic types like Byte, Double, Boolean, and String, as well as Object. All objects derive from System.Object. Besides objects, there are value types, which can be allocated on the stack and thus provide useful flexibility; there are also efficient means of converting value types to object types if and when necessary.
The set of classes is comprehensive, providing collections, file, screen, and network I/O, threading, and so on, as well as XML and database connectivity. The class library is subdivided into a number of sets (or namespaces), each providing distinct areas of functionality, with dependencies between the namespaces kept to a minimum.
6.4 LANGUAGES SUPPORTED BY .NET:
The multi-language capability of the .NET Framework and Visual Studio .NET enables developers to use their existing programming skills to build all types of applications and XML Web services. The .NET Framework supports new versions of Microsoft's old favorites Visual Basic and C++ (as VB.NET and Managed C++), but there are also a number of new additions to the family.
Visual Basic .NET has been updated to include many new and improved language features that make it a powerful object-oriented programming language. These features include inheritance, interfaces, and overloading, among others. Visual Basic also now supports structured exception handling and custom attributes, and supports multi-threading.
Visual Basic .NET is also CLS-compliant, which means that any CLS-compliant language can use the classes, objects, and components you create in Visual Basic .NET.
Managed Extensions for C++ and attributed programming are just some of the enhancements made to the C++ language. Managed Extensions simplify the task of migrating existing C++ applications to the new .NET Framework.
C# is Microsoft's new language. It is a C-style language that is essentially "C++ for Rapid Application Development". Unlike other languages, its specification is just the grammar of the language; it has no standard library of its own, and has instead been designed with the intention of using the .NET libraries as its own.
Microsoft Visual J# .NET provides the easiest transition for Java-language developers into the world of XML Web services and dramatically improves the interoperability of Java-language programs with existing software written in a variety of other programming languages.
ActiveState has created Visual Perl and Visual Python, which enable .NET-aware applications to be built in either Perl or Python. Both products can be integrated into the Visual Studio .NET environment. Visual Perl includes support for ActiveState's Perl Dev Kit.
Other languages for which .NET compilers are available include:
 FORTRAN
 COBOL
 Eiffel
Fig. 1: The .NET Framework (layered, top to bottom: ASP.NET and XML Web Services; Windows Forms; Base Class Libraries; Common Language Runtime; Operating System).
C#.NET is also compliant with the CLS (Common Language Specification) and supports structured exception handling. The CLS is a set of rules and constructs that are supported by the CLR (Common Language Runtime). The CLR is the runtime environment provided by the .NET Framework; it manages the execution of the code and also makes the development process easier by providing services. C#.NET is a CLS-compliant language: any objects, classes, or components created in C#.NET can be used in any other CLS-compliant language, and we can use objects, classes, and components created in other CLS-compliant languages in C#.NET. The use of the CLS ensures complete interoperability among applications, regardless of the languages used to create them.
CONSTRUCTORS AND DESTRUCTORS:
Constructors are used to initialize objects, whereas destructors are used to destroy them. In other words, destructors are used to release the resources allocated to the object. In C#.NET, the Finalize method serves this purpose: it completes the tasks that must be performed when an object is destroyed, and it is called automatically when an object is destroyed. In addition, the Finalize method can be called only from the class it belongs to or from derived classes.
GARBAGE COLLECTION:
Garbage collection is another feature of C#.NET. The .NET Framework monitors allocated resources, such as objects and variables, and automatically releases memory for reuse by destroying objects that are no longer in use. The garbage collector checks for objects that are not currently in use by applications; when it comes across an object that is marked for garbage collection, it releases the memory occupied by the object.
OVERLOADING:
Overloading is another feature of C#. Overloading enables us to define multiple procedures with the same name, where each procedure has a different set of arguments. Besides using overloading for procedures, we can use it for constructors and properties in a class.
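The short C# example below ties these three features together; the FrameBuffer class and its members are purely illustrative.

```csharp
using System;

// Illustrative class showing constructors, overloading, and a finalizer.
class FrameBuffer
{
    private readonly double[] samples;

    // Constructor: initializes the object when it is created.
    public FrameBuffer(int size)
    {
        samples = new double[size];
    }

    // Overloaded constructor: same name, different argument list.
    public FrameBuffer(double[] data)
    {
        samples = (double[])data.Clone();
    }

    // Overloaded methods: energy of the whole frame, or of a sub-range.
    public double Energy()
    {
        return Energy(0, samples.Length);
    }

    public double Energy(int from, int to)
    {
        double sum = 0.0;
        for (int i = from; i < to; i++) sum += samples[i] * samples[i];
        return sum;
    }

    // Finalizer (destructor syntax in C#): invoked by the garbage collector
    // before the memory occupied by the object is reclaimed.
    ~FrameBuffer()
    {
        // Release unmanaged resources here, if any were held.
    }
}
```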
MULTITHREADING:
C#.NET also supports multithreading. An application that supports multithreading can handle multiple tasks simultaneously; we can use multithreading to decrease the time taken by an application to respond to user interaction.
STRUCTURED EXCEPTION HANDLING:
C#.NET supports structured exception handling, which enables us to detect and handle errors at run time. In C#.NET, we use try…catch…finally statements to create exception handlers; with them, we can create robust and effective exception handlers to improve the reliability of our applications.
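A small illustrative program combining both features: a worker thread runs the processing, and a try…catch…finally block handles a run-time error. All names are hypothetical.

```csharp
using System;
using System.Threading;

class ExceptionDemo
{
    static void Main()
    {
        // Multithreading: run the processing on a separate thread so the
        // main thread could stay responsive to the user.
        Thread worker = new Thread(Process);
        worker.Start();
        worker.Join();
    }

    static void Process()
    {
        try
        {
            // Code that may fail at run time goes in the try block.
            int[] data = new int[4];
            Console.WriteLine(data[10]); // throws IndexOutOfRangeException
        }
        catch (IndexOutOfRangeException ex)
        {
            // The catch block handles the specific error type.
            Console.WriteLine("Handled: " + ex.Message);
        }
        finally
        {
            // The finally block always executes, for cleanup.
            Console.WriteLine("Cleanup complete.");
        }
    }
}
```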
6.5 THE .NET FRAMEWORK:
The .NET Framework is a new computing platform that simplifies application development in the highly distributed environment of the Internet.
OBJECTIVES OF THE .NET FRAMEWORK:
1. To provide a consistent object-oriented programming environment whether object code is stored and executed locally, executed locally but Internet-distributed, or executed remotely.
2. To provide a code-execution environment that minimizes software deployment conflicts and guarantees safe execution of code.
3. To eliminate the performance problems of scripted or interpreted environments.
There are different types of applications, such as Windows-based applications and Web-based applications.
6.6 FEATURES OF SQL SERVER:
The OLAP Services feature available in SQL Server version 7.0 is now called SQL Server 2000 Analysis Services. The term OLAP Services has been replaced with the term Analysis Services. Analysis Services also includes a new data mining component. The Repository component available in SQL Server version 7.0 is now called Microsoft SQL Server 2000 Meta Data Services. References to the component now use the term Meta Data Services; the term repository is used only in reference to the repository engine within Meta Data Services.
A database consists of the following types of objects:
1. TABLE
2. QUERY
3. FORM
4. REPORT
5. MACRO
6.7 TABLE:
A database is a collection of data about a specific topic.
VIEWS OF A TABLE:
We can work with a table in two views:
1. Design View
2. Datasheet View
Design View: To build or modify the structure of a table, we work in the table design view, where we can specify what kind of data the table will hold.
Datasheet View: To add, edit, or analyze the data itself, we work in the table's datasheet view.
QUERY:
A query is a question asked of the data. The system gathers the data that answers the question from one or more tables. The data that makes up the answer is either a dynaset (if you can edit it) or a snapshot (which cannot be edited). Each time we run a query, we get the latest information in the dynaset. The system either displays the dynaset or snapshot for us to view, or performs an action on it, such as deleting or updating.
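For completeness, the following is a minimal ADO.NET sketch of running such a query from C#. The connection string, table, and column names are placeholders, not part of the actual project.

```csharp
using System;
using System.Data.SqlClient;

// Illustrative ADO.NET query against a SQL Server table.
class QueryDemo
{
    static void Main()
    {
        string connStr = "Data Source=.;Initial Catalog=SpeechDb;Integrated Security=True";
        using (var conn = new SqlConnection(connStr))
        using (var cmd = new SqlCommand(
            "SELECT FileName, Snr FROM Recordings WHERE Snr > @minSnr", conn))
        {
            // Parameterized queries avoid SQL injection and re-parse costs.
            cmd.Parameters.AddWithValue("@minSnr", 5.0);
            conn.Open();
            using (SqlDataReader reader = cmd.ExecuteReader())
            {
                // Each row of the result set is read much like a query "dynaset".
                while (reader.Read())
                    Console.WriteLine("{0}: {1} dB", reader.GetString(0), reader.GetDouble(1));
            }
        }
    }
}
```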
CHAPTER 7
7.0 APPENDIX:
7.1 SAMPLE SCREENSHOTS:
(Screenshots appear as figures in the original document.)
7.2 SAMPLE SOURCE CODE:
(Source code listing omitted in the original document.)
CHAPTER 8
8.1 CONCLUSION AND FUTURE WORK:
In this paper, a DNN-based framework for speech enhancement is proposed. Among the various DNN configurations, a large training set is crucial to learning the rich structure of the mapping function between noisy and clean speech features. It was found that the application of more acoustic context information improves system performance and makes the enhanced speech less discontinuous. Moreover, multi-condition training with many kinds of noise can achieve good generalization to unseen noise environments; the proposed DNN framework is therefore also able to cope with non-stationary noises in real-world environments. An over-smoothing problem in speech quality was found in the MMSE-optimized DNNs, and the proposed post-processing technique, GV equalization, was effective in brightening the formant spectra of the enhanced speech signals. Two improved training techniques were further adopted to reduce the residual noise and increase performance. Compared with the LogMMSE method, significant improvements were achieved across different unseen noise conditions. Another interesting observation was that the proposed DNN-based speech enhancement system is quite effective for dealing with real-world noisy speech in different languages and across different recording conditions not observed during DNN training.
In future studies, we would increase speech diversity by first incorporating clean speech data from a rich collection of materials covering more languages and speakers. Second, there are many factors in designing the training set; we would utilize principles of experimental design [54], [55] for multi-factor analysis to alleviate the requirement of a huge amount of training data while maintaining good generalization capability of the DNN model. Third, other features, such as Gammatone filterbank power spectra [50] and multi-resolution cochleagram features [56], will be adopted as in [50] to enrich the input information to the DNNs. Finally, a dynamic noise adaptation scheme will also be investigated for the purpose of improving the tracking of non-stationary noises.
CHAPTER 9
9.1 REFERENCES:
[1] L.-H. Chen, Z.-H. Ling, L.-J. Liu, and L.-R. Dai, "Voice conversion using deep neural networks with layer-wise generative training," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 22, no. 12, pp. 1859–1872, Dec. 2014.
[2] Z.-H. Ling, L. Deng, and D. Yu, "Modeling spectral envelopes using restricted Boltzmann machines and deep belief networks for statistical parametric speech synthesis," IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 10, pp. 2129–2139, Oct. 2013.
[3] B.-Y. Xia and C.-C. Bao, "Speech enhancement with weighted denoising auto-encoder," in Proc. Interspeech, 2013, pp. 3444–3448.
[4] X.-G. Lu, Y. Tsao, S. Matsuda, and C. Hori, "Speech enhancement based on deep denoising auto-encoder," in Proc. Interspeech, 2013, pp. 436–440.
[5] A. L. Maas, Q. V. Le, T. M. O'Neil, O. Vinyals, P. Nguyen, and A. Y. Ng, "Recurrent neural networks for noise reduction in robust ASR," in Proc. Interspeech, 2012, pp. 22–25.
[6] M. Wollmer, Z. Zhang, F. Weninger, B. Schuller, and G. Rigoll, "Feature enhancement by bidirectional LSTM networks for conversational speech recognition in highly non-stationary noise," in Proc. ICASSP, 2013, pp. 6822–6826.
[7] H. Christensen, J. Barker, N. Ma, and P. D. Green, "The CHiME corpus: A resource and a challenge for computational hearing in multisource environments," in Proc. Interspeech, 2010, pp. 1918–1921.
[8] Y. X. Wang and D. L. Wang, "Towards scaling up classification-based speech separation," IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 7, pp. 1381–1390, Jul. 2013.