Speaker recognition is the computing task of validating a user's claimed identity using characteristics extracted from their voice. Speaker recognition determines who is speaking, whereas speech recognition determines what is being said. Voice recognition is a combination of the two: it uses learned aspects of a speaker's voice to determine what is being said.
Speaker verification has co-evolved with the technologies of speech recognition and speech synthesis (TTS) because of the similar characteristics and challenges associated with each.
1960 – Gunnar Fant, a Swedish professor, published a model describing the physiological components of acoustic speech production, based on the analysis of x-rays of individuals making specified phonic sounds.
1970 – Dr. Joseph Perkell used motion x-rays and included the tongue and jaw to expand on Fant's model. The original speaker recognition systems used the average output of several analog filters to perform matching, often aided by humans.
1976 – Texas Instruments built a prototype system that was tested by the U.S. Air Force and The MITRE Corporation.
Mid-1980s – The National Institute of Standards and Technology (NIST) developed the NIST Speech Group to study and promote the use of speech processing techniques.
Since 1996 – Under funding from the NSA, the NIST Speech Group has hosted yearly evaluations, the NIST Speaker Recognition Workshop, to foster the continued advancement of the speaker recognition community.
The physiological component of voice recognition is related to the physical shape of an individual's vocal tract, which consists of an airway and the soft-tissue cavities from which vocal sounds originate. The acoustic patterns of speech come from the physical characteristics of the airways. Motion of the mouth and pronunciation are the behavioral components of this biometric. The source sound is altered as it travels through the vocal tract, whose configuration differs based on the position of the tongue, lips, mouth, and pharynx.
Speech samples are waveforms with time on the horizontal axis and loudness on the vertical axis. The speaker recognition system analyzes the frequency content of the speech and compares characteristics such as the quality, duration, intensity, dynamics, and pitch of the signal.
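As a minimal sketch of how such characteristics can be pulled out of a waveform, the toy function below (a hypothetical helper, not from any particular system) splits a mono PCM signal into short frames and computes two frame-level features: intensity as RMS energy, and a crude pitch estimate from the zero-crossing rate.

```python
import math

def frame_features(samples, sample_rate, frame_len=400, hop=160):
    """Split a mono PCM signal into overlapping frames and compute
    simple per-frame features: intensity (RMS energy) and a crude
    pitch estimate based on the zero-crossing rate."""
    features = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        # Intensity: root-mean-square energy of the frame.
        rms = math.sqrt(sum(s * s for s in frame) / frame_len)
        # A periodic signal crosses zero roughly twice per cycle,
        # so crossings per second / 2 approximates the pitch.
        crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0)
        )
        pitch_est = crossings * sample_rate / (2.0 * frame_len)
        features.append((rms, pitch_est))
    return features

# One second of a 100 Hz sine tone at 8 kHz: the pitch estimates
# should come out near 100 Hz and the RMS near 1/sqrt(2).
sr = 8000
tone = [math.sin(2 * math.pi * 100 * t / sr) for t in range(sr)]
feats = frame_features(tone, sr)
```

Real systems use far richer spectral features (e.g. cepstral coefficients), but the framing-and-measuring structure is the same.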
r eh k ao g n ay z s p iy ch   "recognize speech"
r eh k ay n ay s b iy ch       "wreck a nice beach"
Two major applications of speaker recognition technologies and methodologies exist.
Speaker authentication or verification is the task of validating the identity the speaker claims to be. Verification is a 1:1 match, where one speaker's voice is matched against one template (called a "voice print" or "voice model").
Speaker identification is the task of determining an unknown speaker's identity. Identification is a 1:N match, where the voice is compared against N templates.
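The 1:1 versus 1:N distinction can be sketched with two toy functions; the scoring function, threshold, and one-number "voice prints" here are stand-in assumptions, not a real matcher.

```python
def verify(score_fn, sample, claimed_template, threshold):
    """1:1 speaker verification: accept the claimed identity iff the
    sample scores above a preset threshold against one template."""
    return score_fn(sample, claimed_template) >= threshold

def identify(score_fn, sample, templates):
    """1:N speaker identification: return the enrolled speaker whose
    template scores highest against the unknown sample."""
    return max(templates, key=lambda name: score_fn(sample, templates[name]))

# Toy score: negative absolute distance between scalar "voice prints".
score = lambda a, b: -abs(a - b)
enrolled = {"alice": 1.0, "bob": 5.0}

print(verify(score, 1.2, enrolled["alice"], threshold=-0.5))  # True
print(identify(score, 4.8, enrolled))                          # bob
```

Note the structural difference: verification needs a threshold and answers yes/no, while identification only ranks the N enrolled templates.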
Text-dependent methods require the speaker to provide utterances of key words or sentences, with the same text used for both training and recognition.
Text-independent methods are used when predetermined key words cannot be used; human beings recognize speakers irrespective of the content of the utterance.
Text-prompted methods prompt each user with a new key sentence every time the system is used.
How can speaker recognition systems normalize the variation of likelihood values in speaker verification? To compensate for these variations, two types of normalization techniques have been tried: parameter domain and likelihood domain. Adaptation of the reference model, as well as of the verification threshold for each speaker, is indispensable to maintaining high recognition accuracy over a long period.
Parameter domain
Spectral equalization ("blind equalization") has been confirmed to be effective in reducing linear channel effects and long-term spectral variation. This method is especially effective for text-dependent speaker recognition applications using sufficiently long utterances.
Likelihood domain
The likelihood ratio is the ratio of the conditional probability of the observed measurements of the utterance given that the claimed identity is correct, to the conditional probability of the observed measurements given that the speaker is an impostor.
The posteriori probability method is calculated using a set of speakers that includes the claimed speaker.
1) The quality/duration/loudness/pitch features are extracted from the submitted sample.
2) The extracted sample is compared to the claimed identity's model and to other-speaker models. The other-speaker models contain the "states" of a variety of individuals, not including the claimed identity.
3) The input voice sample and the enrolled models are compared to produce a "likelihood ratio" indicating how likely it is that the input sample came from the claimed speaker.
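The scoring in steps 2–3 can be sketched as follows. This is a minimal illustration assuming the models already produce log-likelihoods for an utterance; the function names and the toy cohort values are hypothetical. The first function is the plain likelihood ratio (in log form), the second is a posteriori-probability-style normalization over a cohort of speakers that includes the claimed one.

```python
import math

def log_likelihood_ratio(ll_claim, ll_impostor):
    """Log of the likelihood ratio: log p(x | claimed speaker)
    minus log p(x | impostor/background model)."""
    return ll_claim - ll_impostor

def posterior_score(ll_claim, cohort_lls):
    """Normalize the claimed speaker's log-likelihood against a
    cohort of speaker models that includes the claimed speaker."""
    # log-sum-exp over the cohort, computed stably.
    m = max(cohort_lls)
    denom = m + math.log(sum(math.exp(ll - m) for ll in cohort_lls))
    return ll_claim - denom

# Toy numbers: the claimed model fits the utterance far better than
# the rest of the cohort, so the normalized score is close to 0
# (posterior close to 1) and the claim is accepted.
score = posterior_score(-100.0, [-100.0, -120.0, -130.0])
decision = score >= math.log(0.5)
```

The normalization is what makes a fixed decision threshold workable across utterances of varying quality.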
How can speaker models be updated to cope with gradual changes in people's voices? It is necessary to build each speaker model from a small amount of data collected in a few sessions, and then update the model using speech data collected as the system is used. The reference template for each speaker is updated by averaging new utterances with the present template after time registration. These methods have been extended and applied to text-independent and text-prompted speaker verification using HMMs.
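The averaging update can be sketched as a weighted blend of the stored template with features from a newly accepted (and time-aligned) utterance; the function name and the blend weight `alpha` are illustrative assumptions.

```python
def update_template(template, new_utterance_features, alpha=0.9):
    """Blend the stored reference template with features from a newly
    accepted utterance (after time registration), so the model tracks
    gradual voice change. alpha controls how much of the old template
    is kept; both inputs are assumed already aligned, same length."""
    return [alpha * t + (1.0 - alpha) * n
            for t, n in zip(template, new_utterance_features)]

old = [1.0, 2.0, 3.0]
new = [2.0, 2.0, 2.0]
updated = update_template(old, new)   # roughly [1.1, 2.0, 2.9]
```

Only verified utterances should feed the update, otherwise an impostor's accepted attempt would slowly pull the template toward their voice.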
Hidden Markov Models (HMMs) are probabilistic models that provide a statistical representation of the sounds produced by the individual. The HMM represents the underlying variations and temporal changes found in the speech states, using quality/duration/intensity-dynamics/pitch characteristics.
The Gaussian Mixture Model (GMM) is a state-mapping model closely related to the HMM, often used for text-independent recognition. It uses the speaker's voice to create a number of vector "states" representing the various sound forms. These methods all compare the similarities and differences between the input voice and the stored voice "states" to produce a recognition decision.
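A minimal sketch of GMM-based scoring, assuming (for brevity) a one-dimensional feature per frame and two hypothetical two-component speaker models; real systems use multi-dimensional cepstral vectors and many more mixture components.

```python
import math

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of a scalar sample under a 1-D Gaussian
    mixture: log sum_k w_k * N(x; mu_k, var_k), computed stably."""
    terms = []
    for w, mu, var in zip(weights, means, variances):
        log_n = -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
        terms.append(math.log(w) + log_n)
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))

def score_utterance(frames, model):
    """Average per-frame log-likelihood of an utterance under one
    speaker's GMM of voice 'states'."""
    return sum(gmm_log_likelihood(f, *model) for f in frames) / len(frames)

# Two toy speaker models over a scalar feature (weights, means, variances).
alice = ([0.5, 0.5], [100.0, 200.0], [25.0, 25.0])
bob   = ([0.5, 0.5], [140.0, 260.0], [25.0, 25.0])

# Frames clustered near 100 and 200 should score higher under alice.
frames = [98.0, 102.0, 199.0, 201.0]
```

The recognition decision then reduces to comparing these scores across models, as in the verification and identification flows above.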
Some companies use voiceprint recognition so people can gain access to information or give authorization without being physically present. Instead of stepping up to an iris scanner or hand-geometry reader, someone can give authorization by making a phone call. Unfortunately, people can bypass some systems, particularly those that work by phone, with a simple recording of an authorized person's password. That's why some systems use several randomly chosen voice passwords or use general voiceprints instead of prints for specific words.
Except for text-prompted systems, speaker recognition systems are susceptible to spoofing attacks through the use of recorded voice.
Text-dependent systems are less suitable for public use.
Background noise can be disruptive, although equalizers may be used to mitigate this problem.
Text-independent recognition is still under research, although methods have been proposed that analyze rhythm, speed, modulation, and intonation, which are shaped by personality type and parental influence.
Authentication is based on a likelihood ratio and probability.
Frequent re-enrollment is needed to deal with voice changes.
Someone who is deaf or mute cannot use this type of biometric.
All you need is software and a microphone.
Many methods have been proposed:
Text-Dependent
- DTW-Based Methods
- HMM-Based Methods
Text-Independent
- Long-Term-Statistics-Based Methods
- VQ-Based Methods
- Ergodic-HMM-Based Methods
- Speech-Recognition-Based Methods
Fast authentication.
Authorization can be given without being physically present.
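Of the text-dependent methods listed, dynamic time warping (DTW) is the easiest to sketch: it aligns a test utterance against a stored template even when one is spoken faster or slower than the other. Below is a minimal, standard DTW distance in pure Python; the example sequences are toy scalar features, not real speech.

```python
def dtw_distance(a, b, dist=lambda x, y: abs(x - y)):
    """Dynamic time warping distance between two feature sequences,
    allowing non-linear stretching of the time axis (the classic
    text-dependent template-matching method)."""
    INF = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best accumulated cost aligning a[:i] with b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch a
                                 cost[i][j - 1],      # stretch b
                                 cost[i - 1][j - 1])  # step both
    return cost[n][m]

# A time-stretched copy of a sequence stays close under DTW
# even though it differs sample-by-sample.
template  = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0]
stretched = [0.0, 1.0, 1.0, 2.0, 3.0, 3.0, 2.0, 1.0, 0.0]
print(dtw_distance(template, stretched))  # 0.0
```

Verification then accepts the speaker when the DTW distance to their enrolled template falls below a threshold.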