This document discusses voice analysis and its applications in forensic science. It covers several key points:
1) Voices convey a great deal of information beyond language, such as the speaker's identity, emotions, and health. This makes voice analysis potentially useful for forensic purposes.
2) Common forensic tasks involving voice analysis include speaker identification, where an unknown voice is compared to voice samples from known individuals, and voice comparison to determine if two recordings come from the same speaker.
3) Automatic speaker recognition techniques use algorithms like Gaussian mixture models and mel-frequency cepstral coefficients to analyze and compare voice recordings without human interpretation. These techniques aim to provide more objective and universal analysis compared to older auditory-based methods.
2. A voice is more than just a string of sounds. Voices are
inherently complex.
They signal a great deal of information in addition to
the intended linguistic message: the speaker’s sex, for
example, or their emotional state or state of health.
Some of this information is clearly of potential forensic
importance.
However, the different types of information conveyed
by a voice are not signalled in separate channels, but
are convolved together with the linguistic message.
Knowledge of how this occurs is necessary to interpret
the ubiquitous variation in speech, and to assess the
comparability of speech samples.
3. Speaker identification is the process of
determining whether two or more recordings of
speech are from the same speaker.
Speaker identification can be very effective,
contributing both to the conviction and to the
elimination of suspects. In this task, a voice
print of an unknown speaker is analysed and
then compared with speech samples of known
speakers.
The unknown speaker is identified as the speaker
whose model best matches the input model; it is
the recognition of a person from the
characteristics of his or her voice.
4. It is the process of automatically recognising
who is speaking by using the speaker specific
information included in the speech waves to
verify identities claimed by people accessing
systems, i.e., it enables access control of
various services by voice.
Applicable services include voice dialling,
banking over a telephone network, telephone
shopping, database access network,
information and reservation services, voice
mail, security control for confidential
information and remote access to computers.
Another important application of speaker
recognition technology is as a forensic tool.
5. Speaker identification in the forensic context is
usually about comparing voices.
Probably the most common task involves the
comparison of one or more samples of an
offender’s voice with one or more samples of a
suspect’s voice.
Voices are important things for humans. They
are the medium through which we do a lot of
communicating with the outside world: our
ideas, of course, but also our emotions and our
personality.
6. Voices are also one of the media through which we
(successfully, most of the time) recognise other
humans who are important to us – members of our
family, media personalities, our friends and
enemies.
Although evidence from DNA analysis is
potentially vastly more eloquent in its power than
evidence from voices, DNA can’t talk.
It can’t be recorded planning, carrying out or
confessing to a crime. It can’t be so apparently
directly incriminating.
Perhaps it is these features that contribute to the
interest and importance of FSI.
7. Voices are extremely complex things, and some
of the inherent limitations of the forensic-
phonetic method are in part a consequence of
the interaction between their complexity and
the real world in which they are used.
It is one of the aims of this paper to explain
how this comes about.
8. The basic ideas we will focus on here are:
what speech sounds are like, what a voice is,
forensic speaker identification, voice
comparison, forensic-phonetic speaker
identification, etc.
9. The most common task in forensic speaker
identification involves the comparison of one
or more samples of an unknown voice
(sometimes called the questioned sample) with
one or more samples of a known voice.
Often the unknown voice is that of the
individual alleged to have committed an
offence (hereafter called the offender) and the
known voice belongs to the suspect.
10. Both prosecution and defence are then
concerned with being able to say whether the
two samples have come from the same person,
and thus being able either to identify the
suspect as the offender or to eliminate them
from suspicion.
Sometimes it is important to be able to attach a
voice to an individual, or not, irrespective of
questions of guilt.
11. In order to tell whether the same voice is
present in two or more speech samples, it must
be possible to tell the difference between, or
discriminate between voices.
Put more accurately, it must be possible to
discriminate between samples from the voice of
the same speaker and samples from the voices
of different speakers.
So identification in this sense is the secondary
result of a process of discrimination.
12. The suspect may be identified as the offender
to the extent that the evidence supports the
hypothesis that questioned and suspect
samples are from the same voice.
If not, no identification results.
In this regard, therefore, the identification in
forensic speaker identification is somewhat
imprecise.
13. In criminalistics, the identification process
seeks individualisation.
Identifying a person or an object means that it
is possible to distinguish this person or object
from all others on the surface of the Earth.
The forensic individualisation process can be
seen as a reduction process beginning from an
initial population to a single person.
14. Recently, an investigation concerning the
inference of identity in forensic speaker
recognition has shown the inadequacy of the
main solutions proposed to assess the evidence
in this field.
The concept of identity underlying the
verification and the identification tasks does
not correspond to the concept of identity
accepted in forensic science (Champod et
al., 2000).
15. Speaker verification is the other common task
in speaker recognition.
This is where ‘an identity claim from an
individual is accepted or rejected by comparing
a sample of his speech against a stored
reference sample by the individual whose
identity he is claiming’
16. The aim of speaker identification is, not
surprisingly, identification: ‘to identify an
unknown voice as one or none of a set of known
voices’.
One has a speech sample from an unknown
speaker, and a set of speech samples from different
speakers the identity of whom is known.
The task is to compare the sample from the
unknown speaker with the known set of samples,
and determine whether it was produced by any of
the known speakers.
18. In speaker identification, the reference set of
known speakers can be of two types: closed or
open.
This distinction refers to whether the set is known
to contain a sample of the unknown voice or not.
A closed reference set means that it is known that
the owner of the unknown voice is one of the
known speakers.
An open set means that it is not known whether the
owner of the unknown voice is present in the
reference set or not.
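The closed-set versus open-set distinction can be sketched in a few lines of Python. The speaker labels, scores, and threshold below are hypothetical illustrations, assuming that comparing the unknown sample against each known model yields one log-likelihood score per speaker:

```python
# Hypothetical log-likelihood scores of an unknown sample against each
# known speaker's model (higher = better match); values are illustrative.
scores = {"spk1": -48.2, "spk2": -61.7, "spk3": -59.9}

# Closed set: the unknown voice is known to belong to one of the
# reference speakers, so the best-scoring model wins outright.
closed_set_decision = max(scores, key=scores.get)

# Open set: the unknown voice may belong to none of them, so the best
# score must also clear an acceptance threshold (chosen here so the
# sample is rejected, for illustration).
THRESHOLD = -45.0
best = max(scores, key=scores.get)
open_set_decision = best if scores[best] > THRESHOLD else "none of the known speakers"

print(closed_set_decision)  # best match within the closed set
print(open_set_decision)    # rejected: no model clears the threshold
```

The open-set case is the forensically realistic one: the true offender may simply not be among the known speakers.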
19. MFCC (Mel-Frequency Cepstral Coefficients)
The easiest and most prevalent method of
extracting spectral features is calculating the
Mel-Frequency Cepstral Coefficients (MFCC) of
the human voice.
It is one of the most popular methods of feature
extraction used in speech recognition systems.
It operates in the frequency domain, using the
Mel scale, which models the frequency scale of
the human ear.
20. Time-domain features are less accurate than
frequency-domain features. The main aim of
feature extraction is to reduce the size of the
speech signal before recognition of the
signal.
The steps involved in feature extraction are pre-
emphasis, framing, windowing, Fast Fourier
Transform, Mel-frequency filtering, a logarithmic
function, and the Discrete Cosine Transform
(Douglas A. et al., 1995).
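The pipeline listed above can be sketched in NumPy. The frame length of 256 samples and the 128-sample overlap follow the values given on a later slide; the number of Mel filters (26) and cepstral coefficients (13) are common illustrative choices, not values from this text:

```python
import numpy as np

def mfcc_frames(signal, sr=16000, frame_len=256, hop=128,
                n_filters=26, n_ceps=13):
    """MFCC sketch: pre-emphasis, framing, windowing, FFT,
    Mel filtering, log, and DCT."""
    # 1. Pre-emphasis: boost high frequencies lost in speech production.
    x = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # 2. Framing: 256-sample frames overlapping by 128 samples.
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx]
    # 3. Windowing: a Hamming window smooths the frame edges.
    frames *= np.hamming(frame_len)
    # 4. FFT: power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, frame_len)) ** 2 / frame_len
    # 5. Mel-frequency filtering: triangular filters on the Mel scale.
    hz2mel = lambda f: 2595 * np.log10(1 + f / 700)
    mel2hz = lambda m: 700 * (10 ** (m / 2595) - 1)
    mel_pts = np.linspace(hz2mel(0), hz2mel(sr / 2), n_filters + 2)
    bins = np.floor((frame_len + 1) * mel2hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, frame_len // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # 6. Logarithm of the filterbank energies.
    log_e = np.log(np.maximum(power @ fbank.T, 1e-10))
    # 7. DCT decorrelates and keeps the first n_ceps coefficients.
    k = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * k + 1) / (2 * n_filters)))
    return log_e @ dct.T  # shape: (n_frames, n_ceps)
```

Each row of the returned matrix is one feature vector describing roughly 16 ms of speech (at a 16 kHz sampling rate), which is what the classifiers described later operate on.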
22. The first step in MFCC is pre-emphasis, which
is used to boost the high frequencies of a speech
signal that are lost during speech production.
Pre-emphasis is needed because the high-frequency
components of the speech signal have small
amplitude with respect to the low-frequency
components; the higher frequencies are therefore
artificially boosted in order to increase the
signal-to-noise ratio.
Next is framing, which divides the digitized
signal obtained by analog-to-digital conversion
(ADC) of the speech signal into blocks (frames).
23. The number of samples in each frame is chosen
as 256 and the number of samples overlapping
between adjacent frames is 128.
Overlapping frames are used to acquire the
information from the boundaries of the frame.
Discontinuities at the start and the end of a
frame cause undesirable effects in the
frequency response, so windowing is used to
eliminate the discontinuities at the edges.
24. In the discipline of speaker recognition a wide
range of methods and procedures are adopted
by the experts for identification.
25. Such type of analysis involves a group of trained
phoneticians giving their judgement regarding the
similarity and dissimilarity between the two
speech events, after hearing the samples again and
again to find out some similarities in their
linguistic, phonetic and acoustic features.
Human listeners are robust speaker recognizers
when presented with degraded speech.
Listener performance is nevertheless limited by
factors such as the signal-to-noise ratio, the
speech bandwidth, the amount of speech material,
and distortions introduced into the speech signal
by speech coding, transmission systems, etc.
26. In this technique, different utterances of the
speakers are segregated in respect of each speaker
by way of repeated listening of recorded
conversation.
The segregated conversations of each speaker are
repeatedly heard to identify linguistic and
phonetic features like articulation rate, flow of
speech, degree of vowel and consonant formation,
rhythm, striking time, pauses, etc.
There are cues in voice and speech behaviour,
which are individual and thus make it possible to
recognize the familiar voices.
27. This involves the semi-automatic
measurements of particular acoustic speech
parameters such as vowel formants,
articulation rate, which is sometimes combined
with the results of auditory phonetic analysis
by a human expert.
In 1941, an electromechanical acoustic
spectrograph was developed by Dr. Ralph
Potter of Bell Telephone
28. Laboratories, with the idea of converting
sounds into pictures (Kent RD, Read C 2001). A
sound
spectrograph is an instrument which is able to
give a permanent record of changing energy-
frequency distribution throughout the time of a
speech wave.
The spectrograms are the graphic displays of
the amplitude as a function of both frequency
and time.
29. Examiners visually inspect and compare
similarities or differences of patterns of the energy
distribution in the spectrograms.
It is generally believed that formant structures and
other spectral characteristics which are evident
from a spectrogram are unique for each individual.
The most widely used features are fundamental
frequencies, formant bandwidths, formant
frequencies, spectral composition of fricatives and
plosives for individual segments, and transitions.
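The energy-frequency-time display described above can be computed with SciPy's `spectrogram` routine; the rising tone used here is a synthetic stand-in for a speech recording, and the frame parameters match those mentioned earlier for MFCC:

```python
import numpy as np
from scipy.signal import spectrogram

sr = 16000
t = np.arange(sr) / sr
# Synthetic stand-in for a speech wave: a tone rising from 200 Hz.
sig = np.sin(2 * np.pi * (200 + 300 * t) * t)

# Sxx[f, n] is the energy at frequency bin f in time frame n --
# the changing energy-frequency distribution a spectrogram records.
freqs, times, Sxx = spectrogram(sig, fs=sr, nperseg=256, noverlap=128)
```

An examiner's visual comparison of formant patterns corresponds to comparing ridges of high energy in `Sxx` across two such displays.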
30. However, the main drawback of this voiceprint analysis is that
spectrograms of the speech signal from the same individual show large
intra-speaker variations, because no speaker is actually capable of
producing two identical speech utterances (Gfroerer S 2003).
This method is obviously neither objective nor superior to aural-
perceptual methods; it is basically a shifting of subjective judgement to
the visual domain.
The objectivity, reliability and validity of the method have been
discussed controversially.
The method was widely used in the US, parts of Europe and other
countries until the 1980s, but in the present scenario it has been losing
ground.
Although the FBI still uses it for investigative purposes, most U.S.
courts do not accept voiceprint evidence.
Today voiceprint identification is not used in forensic labs in Europe,
but it is still practised in developing countries such as China, Vietnam,
etc.
31. This approach differs greatly from the earlier
identification methods in that it is both
universal and automatic.
It is considered universal because it does not
focus on specific acoustic parameters but treats
the speech as a continuously varying complex
wave or signal.
Its automatic nature reduces the subjective
evaluation of any speech material to a minimum.
Most such automatic identification systems
today involve techniques like:
32. The Gaussian Mixture Model (GMM) is a
parametric probability density function which
is represented as a weighted sum of Gaussian
component densities.
It is used as a parametric model of the
probability distribution of measured features
in biometric systems.
The GMM is used as a classifier to compare the
features extracted by MFCC with the stored
templates.
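A minimal sketch of GMM-based closed-set identification using scikit-learn's `GaussianMixture`. The speaker names and the Gaussian "MFCC-like" training clouds are synthetic illustrations, not real enrollment data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical enrollment data: MFCC-like feature vectors per known
# speaker (synthetic Gaussian clouds, 13 coefficients per frame).
train = {
    "alice": rng.normal(0.0, 1.0, size=(500, 13)),
    "bob":   rng.normal(3.0, 1.0, size=(500, 13)),
}

# One GMM per speaker: a weighted sum of Gaussian component densities
# modelling that speaker's feature distribution.
models = {
    name: GaussianMixture(n_components=4, covariance_type="diag",
                          random_state=0).fit(feats)
    for name, feats in train.items()
}

def identify(features):
    """Closed-set identification: the speaker whose GMM assigns the
    highest average log-likelihood to the test frames wins."""
    scores = {name: gmm.score(features) for name, gmm in models.items()}
    return max(scores, key=scores.get)

test_utterance = rng.normal(3.0, 1.0, size=(200, 13))
print(identify(test_utterance))
```

In a forensic setting the raw best-match decision would normally be replaced by a likelihood ratio, but the scoring machinery is the same.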
34. The long- term speech spectrum is used as an important cue
of determining the voice quality . In this technique, large
number of feature vectors is collected for each known
speaker.
The average and variance of each component of the feature
vector are calculated, and vector of mean value, and vector
of the variances, is used to model each speaker.
A similar model is made for the unknown speaker.
This technique is most useful for text-independent
recognition, where a large amount of data is
required for construction of the speaker's model.
The method is not beneficial if the utterances are
too short or contain an insufficient amount of data.
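Long-term averaging can be sketched as follows: each speaker is reduced to one mean vector and one variance vector over all their feature frames, and the unknown speaker's model is compared against each known model. The variance-weighted distance used here is an illustrative choice, not a standard from the source:

```python
import numpy as np

def long_term_model(features):
    # Each speaker's model: the mean and the variance of every
    # component of the feature vector, over all collected frames.
    return features.mean(axis=0), features.var(axis=0)

def model_distance(a, b):
    # Illustrative variance-weighted distance between two models.
    (mu_a, var_a), (mu_b, var_b) = a, b
    return float(np.sum((mu_a - mu_b) ** 2 / (var_a + var_b)))

rng = np.random.default_rng(1)
known = {
    "spk1": long_term_model(rng.normal(0.0, 1.0, (1000, 13))),
    "spk2": long_term_model(rng.normal(2.0, 1.0, (1000, 13))),
}
# Build the same kind of model for the unknown speaker, then pick
# the known model it lies closest to.
unknown = long_term_model(rng.normal(2.0, 1.0, (400, 13)))
best = min(known, key=lambda k: model_distance(known[k], unknown))
print(best)
```

The single mean/variance pair is exactly the "single cluster" limitation the next slide criticizes.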
35. The major disadvantage of long-term
averaging is that each speaker’s model consists
of a single cluster of data represented by an
average and variance vector.
If the data contain multiple clusters of vectors,
the variance will be very high. Since human
speech is composed primarily of vowels, it is
natural to expect feature vectors to form
clusters, each one based on the pronunciation
of a specific vowel
36. In this technique, each speaker’s model
consists of several clusters of data, along
with their centroids.
Vector quantization (VQ) reduces these sets of vectors to a codebook, which
provides an efficient way of building and comparing speaker
models. VQ is used in several ways in speaker
recognition.
In some systems it is used simply to compress data; in other
systems, VQ is a preprocessing step for other methods such
as HMMs.
For text-dependent identification and verification, several
codebooks are created or “trained” for each speaker, who
speaks a prescribed text several times.
These codebooks are considered the speaker’s template.
During the operational phase the same prescribed text is
spoken by the unknown person.
37. The comparison is based on the
observed differences or similarities between the
unknown person’s template and each trained
template, after removing the variations in
speaking rate.
For text-independent speaker recognition, a
single codebook is created for each speaker.
38. The codebook is considered an accurate model
of the speaker because it is formed from a much
larger amount of speech than in the text-dependent
case.
This method introduces a new factor affecting the
performance of the system: codebook
size. A larger codebook does a better job of
characterizing a speaker’s voice, but this results
in increased computational expense and the
danger of not producing results in real time, which
is a significant factor for verification.
The advantage of this method is that it requires
only a small amount of data to create a speaker’s
model without any loss of accuracy.
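A text-independent VQ comparison of the kind described above can be sketched as follows; the hand-rolled k-means, the codebook size of four, and the synthetic "vowel-like" feature clusters are illustrative assumptions, not the presentation's own implementation:

```python
import numpy as np

def train_codebook(vectors, k=4, iters=20, seed=0):
    """Build a VQ codebook with simple k-means (Lloyd's algorithm)."""
    rng = np.random.default_rng(seed)
    codebook = vectors[rng.choice(len(vectors), size=k, replace=False)]
    for _ in range(iters):
        # Assign each vector to its nearest codeword.
        d = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each codeword to the centroid of its assigned vectors.
        for j in range(k):
            if np.any(labels == j):
                codebook[j] = vectors[labels == j].mean(axis=0)
    return codebook

def distortion(vectors, codebook):
    """Average distance from each vector to its nearest codeword."""
    d = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
    return float(d.min(axis=1).mean())

rng = np.random.default_rng(2)
# Synthetic feature vectors clustered around vowel-like centroids.
spk_a = np.concatenate([rng.normal(c, 0.2, size=(100, 2))
                        for c in ([0, 0], [1, 0], [0, 1], [1, 1])])
spk_b = np.concatenate([rng.normal(c, 0.2, size=(100, 2))
                        for c in ([4, 4], [5, 4], [4, 5], [5, 5])])
codebooks = {"A": train_codebook(spk_a), "B": train_codebook(spk_b)}

# The unknown speaker is assigned to the codebook with the
# lowest average quantization distortion.
unknown = rng.normal([4.5, 4.5], 0.5, size=(200, 2))
best = min(codebooks, key=lambda n: distortion(unknown, codebooks[n]))
print(best)
```

The codebook-size trade-off from the slide shows up directly here: raising `k` lowers distortion for the true speaker but multiplies the cost of every distance computation.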
39. Tendering tape-recorded conversations
before law courts as evidence,
particularly in cases arising under the Prevention of
Corruption Act, where such conversations are recorded
by sending the complainant with a recording device to
the person demanding or offering a bribe, has now
become common practice.
In civil cases, too, parties may rely upon tape recordings
of relevant conversations to support their version of events.
In such cases the court has to face various questions
regarding the admissibility, nature and evidentiary value
of such tape-recorded conversations.
40. The Indian Evidence Act, prior to its
amendment by the Information Technology Act,
2000, mainly dealt with evidence in
oral or documentary form.
It said nothing about the
admissibility, nature and evidentiary value of a
conversation or statement recorded on an
electromagnetic device.
Confronted with questions of this nature
and called upon to decide them, the law courts
in India, as well as in England, devised and
developed principles so that such evidence could
be received in law courts and acted upon. (Adv. K.C.
Suresh, 2011)
41. In India, the Forensic Science Laboratory at
Chandigarh regularly conducts voice identification
examinations, and the Supreme Court has
held that voice identification evidence is admissible in
court.
In Bangalore, the SRC Institute of Speech and
Hearing has facilities for voice analysis.
The All India Institute of Speech and Hearing,
Mysore, which has been working in the field for
many years now, even wants to start a one-year
PG Diploma course in forensic voice analysis.
42. The Michigan State Police set up a voice
identification unit in 1966. Sound spectrograph
evidence was first admitted into a court in 1967
during a military trial (court-martial), United
States v. Wright.
Judge Ferguson wrote a lengthy dissent, saying
that voice identification by sound spectrograph
did not meet the Frye standard of general
acceptance by the scientific community. (Lisa
Yount, 2007)
The first reported application of the voiceprint
technique in a criminal proceeding occurred in the
1966 case of People v. Straehle.
43. The defendant, a police officer, had telephoned
the operator of an illicit gambling enterprise to
warn him of an impending police raid.
Later, during a grand jury inquiry, the police
officer denied making the call.
At the ensuing perjury trial, the prosecution
introduced voiceprints of the telephone calls
and sample voiceprints of the defendant's
voice, supported by the expert opinion of
Lawrence Kersta that all the recordings were of the
defendant's voice. (John F. Decker et al., 1977)
44. In 1976 the New York Supreme Court pointed out, in
the case of People v. Rogers, that fifty different trial
courts had admitted spectrographic voice identification
evidence, as had fourteen out of fifteen U.S. District
Court judges, and only two out of thirty-seven states
considering the issue had rejected admission.
The Rogers court stated that this technique, when
accompanied by aural examination and conducted by a
qualified examiner, had reached the level of
general scientific acceptance among those who would be
expected to be familiar with its use, and as such had
attained the scientific acceptance and reliability
necessary for admission. (Adv. K.C. Suresh, 2011)
45. The lead story in the Washington Post this
morning concerned a recording that was thought
to be of Donald Trump.
Trump denied the recording was his voice.
Primeau Forensics was asked by the media to
perform a forensic voice identification test to
determine whether the unknown voice in the Washington
Post story was that of Donald Trump.
Primeau Forensics located a C-SPAN
interview from 1991 titled ‘Donald Trump on
Economic Recovery’.
We chose this recording as the ‘known’ Donald
Trump voice for forensic comparison.
46. We chose this older voice sample because it
was closer in time to the ‘unknown’ recording.
The biometric software program we used
is a Speech Pro product titled ‘SIS 2’.
We formatted each speech sample based on
training received from Owen Forensic Services
and loaded them into the biometric software.
The result was a 98% mismatch, meaning the
‘unknown’ voice recording that surfaced in the
Washington Post today is NOT the voice of
Donald Trump.
47. As Cain explained in an article he wrote for the
Criminal Division of the U.S. Department of Justice,
in collaboration with Lonnie Smrkovski, chief of the
voiceprint unit of the Michigan State Police, and Mindy
Wilson, a psychologist and private examiner practising
in Lansing, Michigan, the fundamental principle of
voice identification rests on the fact that, like a
fingerprint, every voice is unique and "individually
characteristic enough to distinguish it from others
through...analysis”.
Fingerprints are identified through literal analysis;
voices are identified through comparative voiceprints.
Cain points out that uniqueness in human speech is the
product of two general factors.
48. "The first," he says, "lies in the sizes of the vocal cavities such as the
throat, nasal and oral cavities and the shape, length and tension in an
individual's vocal cords located in the larynx. The vocal cavities are
resonators, much like organ pipes, which reinforce some of the overtones
produced by the vocal cords, which produce formants or voiceprint bars.
The likelihood that two people would have exactly the same size and
configuration (is) very remote."
The second factor in determining voice uniqueness is the manner in
which the "articulators" or muscles of speech are manipulated when an
individual is talking.
The articulators include the lips, teeth, tongue, soft palate and jaw
muscles, "whose controlled interplay", Cain explains, "produces
intelligible speech...The likelihood that two persons could develop
identical use patterns of their articulators also appears to be very remote."
49. While Cain agrees that "there is disagreement
in the so-called 'scientific community' on the
degree of accuracy with which examiners can
identify speakers under all conditions, there is
agreement that voices can, in fact, be
identified."
50. GMM
To obtain the results, the speech signal is
recorded. The system is trained for multiple
words such as Samosa, Dosa and Tea.
The results for the word Samosa are shown.
The speech signal was recorded for the
word Samosa.
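The word-level arrangement just described, one GMM per vocabulary word, can be sketched as follows. The synthetic feature vectors and the use of scikit-learn's GaussianMixture are assumptions for illustration, since the presentation does not include its code:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)

# Synthetic stand-ins for per-word feature vectors (a real system
# would extract MFCCs from the recorded speech signal).
word_centres = {"Samosa": [0, 0], "Dosa": [4, 0], "Tea": [0, 4]}
train = {w: rng.normal(c, 0.4, size=(150, 2))
         for w, c in word_centres.items()}

# One GMM per vocabulary word, trained on that word's features.
word_models = {w: GaussianMixture(n_components=2, random_state=0).fit(f)
               for w, f in train.items()}

def recognize(features):
    """Return the word whose GMM gives the highest average log-likelihood."""
    return max(word_models, key=lambda w: word_models[w].score(features))

# Classify a new utterance drawn near the "Dosa" region.
test_utterance = rng.normal(word_centres["Dosa"], 0.4, size=(60, 2))
print(recognize(test_utterance))
```
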
55. Short-duration samples are more demanding and should be carefully
analysed.
Dissimilarity in the language of the questioned and specimen voice samples.
Emotional variability between the questioned and specimen samples.
Misspoken or misread prompted phrases.
Poorly recorded/noisy samples are difficult to analyse.
An insufficient number of comparable words.
Disguise in speech samples poses a problem in speaker identification.
Extreme emotional state.
A change in the physical state of the speaker (e.g. the effect of alcohol).
The attitude with which the speech is delivered by the speaker.
Channel mismatch, or a mismatch in recording conditions.
A different pronunciation speed in the test data compared with the training
data.
The speaker’s health.
Ageing (the vocal tract can drift away from its models with age).
56. Thus multiple words such as Samosa, Dosa and Tea
can be recognized and converted into text using this system.
This system is suitable for environments with little ambient noise.
The system provides good performance with respect to other systems.
It can be concluded that GMM provides greater accuracy.
In light of the above discussion, it can be inferred that the comparison of
voice samples is quite complicated but entirely possible.
The skill of the examiner, along with the chosen parameters and the selection
of an appropriate identification technique, is largely decisive and can
facilitate accurate and conclusive results.
There have been many advancements and successes in this field;
however, much remains to be done to overcome the daunting
limitations which still prevail and constrain the process.
If all such limitations are successfully overcome, this technique, with its
promising features, will have an obvious advantage over the pre-existing
ones for establishing individual identity.
57. 1. Champod C, Meuwly D (2000) The inference of identity in forensic
speaker recognition. Speech Communication 31: 193-203.
2. Reynolds DA, Rose RC (1995) Robust text-independent speaker
identification using Gaussian mixture speaker models. IEEE
Transactions on Speech and Audio Processing 3(1): 72-83.
3. Zetterholm E (2007) Detection of speaker characteristics using voice
imitation. Springer Berlin Heidelberg 4441: 192-205.
4. Braun A, Künzel HJ (1998) Is forensic speaker identification
unethical - or can it be unethical not to do it? Forensic Linguistics 5:
10-21.
5. Kent RD, Read C (2001) The acoustic analysis of speech. University
of Wisconsin-Madison; A.I.T.B.S. Publishers and Distributors, Delhi.
6. Samudravijaya K (2003) Speech and speaker recognition: a tutorial.
Tata Institute of Fundamental Research, Mumbai.
58. 7. YA (2000) A research paper in forensic science. The University of Auckland,
New Zealand.
8. Gfroerer S (2003) Auditory-instrumental forensic speaker recognition.
Eurospeech, Geneva.
9. Harmegnies B, Landercy A (1988) Intra-speaker variability of the long-term
speech pattern. Speech Communication 7: 81-86.
10. Kekre HB, Sarode TK (2008) Speech data compression using vector
quantization. International Journal of Computer and Information Science
and Engineering 2: 8.
11. Yamato J, Ohya J, Ishii K (1992) Recognizing human action in time-sequential
images using hidden Markov model. IEEE: 379-385.
12. Abdulla WH, Kasabov NK (1999) The concepts of hidden Markov model
in speech recognition. Information Science Discussion Papers 99/09,
University of Otago, New Zealand: 1-40.
13. Bennani Y, Gallinari P (1995) Neural networks for discrimination and
modelization of speakers. Speech Communication 17: 159-175.
14. Nakasone H, Beck SD (2001) Forensic automatic speaker identification.
Paper presented at A Speaker Odyssey, Crete, Greece.