YELLAPU MADHURI

AUTOMATIC LANGUAGE TRANSLATION SOFTWARE FOR AIDING COMMUNICATION BETWEEN INDIAN SIGN LANGUAGE AND SPOKEN ENGLISH USING LABVIEW

A PROJECT REPORT

Submitted in partial fulfillment for the award of the degree of
MASTER OF TECHNOLOGY
in
BIOMEDICAL ENGINEERING

by
YELLAPU MADHURI (1651110002)

Under the Guidance of
Ms. G. ANITHA
(Assistant Professor)

DEPARTMENT OF BIOMEDICAL ENGINEERING
SCHOOL OF BIOENGINEERING
FACULTY OF ENGINEERING & TECHNOLOGY
SRM UNIVERSITY
(Under Section 3 of UGC Act, 1956)
SRM Nagar, Kattankulathur - 603 203
Tamil Nadu, India

MAY 2013
ACKNOWLEDGEMENT

First and foremost, I express my heartfelt and deep sense of gratitude to our Chancellor Shri. T. R. Pachamuthu; Shri. P. Ravi, Chairman of the SRM Group of Educational Institutions; Prof. P. Sathyanarayanan, President, SRM University; Dr. R. Shivakumar, Vice President, SRM University; and Dr. M. Ponnavaikko, Vice Chancellor, for providing me the necessary facilities for the completion of my project. I also acknowledge the Registrar, Dr. N. Sethuraman, for his constant support and endorsement.

I wish to express my sincere gratitude to our Director (Engineering & Technology), Dr. C. Muthamizhchelvan, for his constant support and encouragement.

I am extremely grateful to the Head of the Department, Dr. M. Anburajan, for his invaluable guidance, motivation, and timely and insightful technical discussions. I am immensely grateful for his constant encouragement and smooth approach throughout our project period, which made this work possible.

I am indebted to my Project Coordinators, Mrs. U. Snekhalatha and Mrs. Varshini Karthik, for their valuable suggestions and motivation. I am deeply indebted to my Internal Guide, Ms. G. Anitha, and the faculty of the Department of Biomedical Engineering for extending their warm support, constant encouragement and the ideas they shared with us.

I would be failing in my part if I did not acknowledge my family members and my friends for their constant encouragement and support.
BONAFIDE CERTIFICATE

This is to certify that the project entitled "AUTOMATIC LANGUAGE TRANSLATION SOFTWARE FOR AIDING COMMUNICATION BETWEEN INDIAN SIGN LANGUAGE AND SPOKEN ENGLISH USING LABVIEW" has been carried out by YELLAPU MADHURI (1651110002) under the supervision of Ms. G. Anitha, in partial fulfillment of the degree of MASTER OF TECHNOLOGY in Biomedical Engineering, School of Bioengineering, SRM University, during the academic year 2012-2013 (Project Work Phase II, Semester IV). The contents of this report, in full or in parts, have not been submitted to any institute or university for the award of any degree or diploma.

Signature                                 Signature
HEAD OF THE DEPARTMENT                    INTERNAL GUIDE
(Dr. M. Anburajan)                        (Ms. G. Anitha)
Department of Biomedical Engineering,     Department of Biomedical Engineering,
SRM University,                           SRM University,
Kattankulathur - 603 203.                 Kattankulathur - 603 203.

INTERNAL EXAMINER                         EXTERNAL EXAMINER
ABSTRACT

This report presents SIGN LANGUAGE TRANSLATION software for the automatic translation of Indian sign language into spoken English, and vice versa, to assist communication between speech- and/or hearing-impaired people and hearing people. It can be used by the Deaf community as a translator when interacting with people who do not understand sign language, avoiding the intervention of an intermediate person for interpretation and allowing communication in their natural way of speaking. The proposed software is a standalone executable interactive application program developed using LabVIEW that can be implemented on any standard Windows laptop or desktop, or an iOS mobile phone, operating with the camera, processor and audio device.

For sign-to-speech translation, the one-handed sign gestures of the user are captured using a camera; vision analysis functions are performed in the operating system and provide the corresponding speech output through the audio device. For speech-to-sign translation, the speech input of the user is acquired by a microphone; speech analysis functions are performed and provide a sign-gesture picture display corresponding to the speech input. The lag time experienced during translation is small because of parallel processing, which allows near-instantaneous translation from finger and hand movements to speech, and from speech inputs to sign language gestures. The system is trained to translate one-handed sign representations of alphabets (A-Z) and numbers (1-9) to speech, and 165 word phrases to sign gestures. The training database of inputs can be easily extended to expand the system's applications. The software does not require the user to wear any special hand gloves. The results are found to be highly consistent and reproducible, with fairly high precision and accuracy.
TABLE OF CONTENTS

CHAPTER NO.  TITLE                                        PAGE NO.

             ABSTRACT                                     I
             LIST OF FIGURES                              IV
             LIST OF ABBREVIATIONS                        VI
1            INTRODUCTION                                 1
             1.1 HEARING IMPAIRMENT                       2
             1.2 NEED FOR THE SYSTEM                      7
             1.3 AVAILABLE MODELS                         8
             1.4 PROBLEM DEFINITION                       8
             1.5 SCOPE OF THE PROJECT                     9
             1.6 FUTURE PROSPECTS                         10
             1.7 ORGANISATION OF REPORT                   10
2            AIM AND OBJECTIVES OF THE PROJECT            11
             2.1 AIM                                      11
             2.2 OBJECTIVES                               11
3            MATERIALS AND METHODOLOGY                    12
             3.1 SIGN LANGUAGE TO SPOKEN ENGLISH          13
                 TRANSLATION
             3.2 SPEECH TO SIGN LANGUAGE TRANSLATOR       21
LIST OF FIGURES

FIGURE                                                                 PAGE NO.
1.1  Anatomy of human ear                                              3
1.2  Events involved in hearing                                        3
1.3  Speech chain                                                      4
1.4  Block diagram of speech chain                                     4
3.1  Graphical abstract                                                12
3.2  Flow diagram of template preparation                              16
3.3  Flow diagram of pattern matching                                  19
3.4  Block diagram of sign to speech translation                       21
3.5  Flow diagram of speech to sign translation                        25
3.6  Block diagram of speech to sign translation                       24
3.7  Speech recognizer tutorial window                                 25
4.1  Application Installer                                             32
4.2  Application window                                                33
4.3  GUI of speech to sign translation                                 34
4.4  Speech recognizer in sleep mode                                   36
4.5  Speech recognizer in active mode                                  36
4.6  Speech recognizer when input speech is not clear for recognition  36
4.7  GUI of working window of speech to sign translation               37
4.8  Block diagram of speech to sign translation                       37
4.9  GUI of template preparation                                       38
4.10 Block diagram of sign to speech translation                       38
4.11 GUI of working window of template preparation                     39
4.12 GUI of sign to speech translation                                 40
4.13 GUI of working window of sign to speech translation               41
4.14 Block diagram of sign to speech translation                       42
4.15 Block diagram of pattern matching                                 42
4.16 Database of sign templates                                        46
4.17 Database of sign number templates                                 47
LIST OF ABBREVIATIONS

Sr. No.  ABBREVIATION  EXPANSION
1        SL            Sign language
2        BII           Bahasa Isyarat India
3        SLT           Sign language translator
4        ASLR          Automatic sign language recognition
5        ASLT          Automatic sign language translation
6        GSL           Greek Sign Language
7        SDK           Software development kit
8        RGB           Red green blue
9        USB           Universal serial bus
10       CCD           Charge-coupled device
11       ASL           American Sign Language
12       ASR           Automatic speech recognition
13       HMM           Hidden Markov model
14       LM            Language model
15       OOV           Out of vocabulary
1. INTRODUCTION

In India there are around 60 million people with hearing deficiencies. Deafness brings about significant communication problems: most deaf people have serious difficulty expressing themselves in spoken or written language or understanding written texts. This can cause deaf people to have problems in accessing information, education, employment, social relationships, culture, etc. It is necessary to distinguish between "deaf" and "Deaf": the first refers to non-hearing people in general, and the second refers to non-hearing people who use a sign language to communicate among themselves (their mother tongue), making them part of the "Deaf community". Sign language is a language through which communication is possible without acoustic sounds. Instead, sign language relies on sign patterns, i.e., body language, and the orientation and movements of the arm, to facilitate understanding between people. It exploits unique features of the visual medium through spatial grammar. Contrary to what most people think, sign languages are fully-fledged languages that have a grammar and lexicon just like any spoken language. The use of sign languages defines the Deaf as a linguistic minority, with learning skills, cultural and group rights similar to other minority language communities.

Hand gestures can be used for natural and intuitive human-computer interaction, translating sign language to spoken language to assist communication between the Deaf community and non-signers. To achieve this goal, computers should be able to recognize hand gestures from visual input. Vision-based gesture recognition can achieve a more intuitive and flexible interaction for the user. However, vision-based hand tracking and gesture recognition is an extremely challenging problem due to the complexity of hand gestures, which are rich in diversity owing to the high degrees of freedom of the human hand. Moreover, computer vision algorithms are notoriously brittle and computation intensive, which makes most current gesture recognition systems fragile and inefficient. This report proposes a new architecture to solve the problem of real-time vision-based hand tracking and gesture recognition. To recognize different hand postures, a parallel cascade structure is implemented. This structure achieves real-time performance and high translation accuracy. The 2D position of the hand is recovered according to the camera's perspective projection. To make the system robust against cluttered backgrounds,
background subtraction and noise removal are applied. The overall goal of this project is to develop a new vision-based technology for recognizing and translating continuous sign language to spoken English and vice versa.

1.1 HEARING IMPAIRMENT

Hearing is one of the major senses; it is important for distant warning and communication. It can be used to alert, and to communicate pleasure and fear. It is a conscious appreciation of vibration perceived as sound. In order to do this, the appropriate signal must reach the higher parts of the brain. The function of the ear is to convert physical vibration into an encoded nervous impulse. It can be thought of as a biological microphone. Like a microphone, the ear is stimulated by vibration: in the microphone the vibration is transduced into an electrical signal; in the ear, into a nervous impulse which in turn is processed by the central auditory pathways of the brain. The mechanism to achieve this is complex.

The ears are paired organs, one on each side of the head, with the sense organ itself, technically known as the cochlea, deeply buried within the temporal bones. Part of the ear is concerned with conducting sound to the cochlea; the cochlea is concerned with transducing vibration. The transduction is performed by delicate hair cells which, when stimulated, initiate a nervous impulse. Because they are living, they are bathed in body fluid which provides them with energy, nutrients and oxygen. Most sound is transmitted by a vibration of air. Vibration is poorly transmitted at the interface between two media which differ greatly in characteristic impedance. The ear has evolved a complex mechanism to overcome this impedance mismatch, known as the sound conducting mechanism. The sound conducting mechanism is divided into two parts: an outer part, which catches sound, and the middle ear, which acts as an impedance matching device. Sound waves can be distinguished from each other by the differences in their frequencies and amplitudes. For people suffering from any type of deafness, these differences cease to exist. The anatomy of the ear and the events involved in the hearing process are shown in Figure 1.1 and Figure 1.2 respectively.
Figure 1.1 Anatomy of human ear

Figure 1.2 Events involved in hearing
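The preceding section notes that sound waves are told apart by their frequencies and amplitudes. A short NumPy sketch (purely illustrative, not part of the project software) makes this concrete:

```python
import numpy as np

def tone(freq_hz, amplitude, duration_s=0.01, fs=8000):
    """Generate a pure tone: frequency sets the pitch, amplitude the loudness."""
    t = np.arange(0.0, duration_s, 1.0 / fs)
    return amplitude * np.sin(2.0 * np.pi * freq_hz * t)

# Two tones a listener can tell apart: a quiet low-pitched one and a loud
# high-pitched one (350 Hz is the upper end of the fundamental-frequency
# range quoted in Section 1.1.1).
quiet_low = tone(100.0, 0.2)
loud_high = tone(350.0, 0.9)
```

A deaf listener loses exactly these frequency and amplitude distinctions, which is what the translation software described in this report compensates for.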
1.1.1 THE SPEECH SIGNAL

While you are producing speech sounds, the air flow from your lungs first passes the glottis and then your throat and mouth. Depending on which speech sound you articulate, the speech signal can be excited in three possible ways:

• VOICED EXCITATION
The glottis is closed. The air pressure forces the glottis to open and close periodically, thus generating a periodic pulse train (triangle-shaped). This "fundamental frequency" usually lies in the range from 80 Hz to 350 Hz.

• UNVOICED EXCITATION
The glottis is open and the air passes a narrow passage in the throat or mouth. This results in a turbulence which generates a noise signal. The spectral shape of the noise is determined by the location of the narrowness.

• TRANSIENT EXCITATION
A closure in the throat or mouth raises the air pressure. By suddenly opening the closure, the air pressure drops immediately ("plosive burst").

With some speech sounds, these three kinds of excitation occur in combination. The spectral shape of the speech signal is determined by the shape of the vocal tract (the pipe formed by your throat, tongue, teeth and lips). By changing the shape of the pipe (and, in addition, opening and closing the air flow through your nose) you change the spectral shape of the speech signal, thus articulating different speech sounds.

An engineer looking at (or listening to) a speech signal might characterize it as follows:
• The bandwidth of the signal is 4 kHz.
• The signal is periodic with a fundamental frequency between 80 Hz and 350 Hz.
• There are peaks in the spectral distribution of energy at (2n − 1) · 500 Hz; n = 1, 2, 3, ... (1.1)
• The envelope of the power spectrum of the signal shows a decrease with increasing frequency (−6 dB per octave).

1.1.2 CAUSES OF DEAFNESS IN HUMANS

Many speech and sound disorders occur without a known cause. Some speech-sound errors can result from physical problems such as:
• Developmental disorders
• Genetic syndromes
• Hearing loss
• Illness
• Neurological disorders

Some of the major types are listed below.
• Genetic Hearing Loss
• Conductive Hearing Loss
• Perceptive Hearing Loss
• Pre-Lingual Deafness
• Post-Lingual Deafness
• Unilateral Hearing Loss

1. In some cases, hearing loss or deafness is due to hereditary factors. Genetics is considered to play a major role in the occurrence of sensorineural hearing loss. Congenital deafness can happen due to heredity or birth defects.
2. Causes of human deafness include continuous exposure to loud noise. This is commonly observed in people working at construction sites, airports and nightclubs. It is also experienced by people working with firearms and heavy equipment, and by those who use music headphones frequently. The longer the exposure, the greater the chance of being affected by hearing loss and deafness.

3. Some diseases and disorders can also be contributory factors for deafness in humans. These include measles, meningitis, some autoimmune diseases like Wegener's granulomatosis, mumps, presbycusis, AIDS and chlamydia. Fetal alcohol syndrome, developed in babies born to alcoholic mothers, can cause hearing loss in infants. Growing adenoids can also cause hearing loss by obstructing the Eustachian tube. Otosclerosis, which is a disorder of the middle ear bone, is another cause of hearing loss and deafness. Likewise, there are many other medical conditions which can cause deafness in humans.

4. Some medications are also considered to cause permanent hearing loss in humans, while others can lead to deafness which can be reversed. The former category includes medicines like gentamicin, and the latter includes NSAIDs, diuretics, aspirin and macrolide antibiotics. Narcotic painkiller addiction and heavy hydrocodone abuse can also cause deafness.

5. Causes of human deafness also include exposure to some industrial chemicals. These ototoxic chemicals can contribute to hearing loss if combined with continuous exposure to loud noise. They can damage the cochlea and some parts of the auditory system.

6. Sometimes, loud explosions can cause deafness in humans. Head injury is another cause of deafness in humans.

The above are some of the common causes of deafness in humans. There can be many other reasons which can lead to deafness or hearing loss. It is always advisable to protect the ears from trauma and other injuries, and to wear protective gear in workplaces where there is continuous heavy noise.

1.2 NEED FOR THE SYSTEM
Deaf communities revolve around sign languages, as these are their natural means of communication. Although deaf, hard-of-hearing and hearing signers can communicate without problems amongst themselves, there is a serious challenge for the deaf community in trying to integrate into educational, social and work environments. An important problem is that there are not enough sign-language interpreters. In India, there are 60 million Deaf people (who use a sign language), although there are more people with hearing deficiencies, but only 7,000 sign-language interpreters, i.e. a ratio of roughly 8,500 deaf people to 1 interpreter. This shows the need to develop automatic translation systems with new technologies to help hearing and Deaf people communicate between themselves.

1.3 AVAILABLE MODELS

Previous approaches have focused on recognizing mainly the hand alphabet, which is used to finger-spell words, and complete signs, which are formed by dynamic hand movements. So far, body language and facial expressions have been left out. Hand gesture recognition can be achieved in two ways: video-based and instrumented.

Video-based systems allow the signer to move freely without any instrumentation attached to the body. The hand shape, location and movement are recognized by cameras. But the signer is constrained to sign in a controlled environment, and the amount of data to be processed in the image imposes restrictions on the memory, speed and complexity of the computer equipment.

Instrumented approaches require sensors to be placed on the signer's hands. They are restrictive and cumbersome, but more successful in recognizing hand gestures than video-based approaches.

1.4 PROBLEM DEFINITION

Sign language is very complex, with many actions taking place both sequentially and simultaneously. Existing translators are bulky, slow and not precise due to the heavy parallel processing required.
The cost of these translators is usually very high due to the hardware required to meet the processing demands. There is an urgent requirement for a simple, precise and inexpensive system that helps to bridge the gap between normal people who do not know sign language and deaf persons who communicate through sign language, who are unfortunately present in significantly large numbers in a country such as India.

In this project, the aim is to detect single-hand gestures in two-dimensional space using a vision-based system, and speech input through a microphone. The selected features should be as small as possible in number; invariant to input errors like a vibrating hand, small rotation, scale, pitch and voice, which may vary from person to person or with different input devices; and should provide audio output through a speaker and visual output on a display device. The acceptable delay of the system is up to the end of each gesture, meaning that the pre-processing should be in real time. One of our goals in the design and development of this system is scalability in detecting a reasonable number of gestures and words, and the ability to add new gestures and words in the future.

1.5 SCOPE OF THE PROJECT

The developed software is a standalone application. It can be installed and implemented on any standard PC or iOS phone. It can be used in a large variety of environments like shops and governmental offices, and also for communication between a deaf user and information systems like vending machines or PCs. Below are the scopes proposed for this project.

For sign to speech translation:
i. To develop an image acquisition system that automatically acquires images when triggered, for a fixed interval of time, or when gestures are present.
ii. To develop a set of definitions of gestures and the processes of filtration, effects and functions available.
iii. To develop a pre-defined gesture algorithm that commands the computer to perform the playback function of the audio model.
iv. To develop a testing system that proceeds to the command if the condition is true for the processed images.
v. To develop a simple Graphical User Interface for input and indication purposes.
For speech to sign translation:
i. To develop a speech acquisition system that automatically acquires speech input when triggered, for a fixed interval of time, or when speech is present.
ii. To develop a set of definitions of phonemes and the processes of filtration, effects and functions available.
iii. To develop a pre-defined phonetics algorithm that commands the computer to perform the display function of the sign model.
iv. To develop a testing system that proceeds to the command if the condition is true for the processed phonemes.
v. To develop a simple Graphical User Interface for input and indication purposes.

1.6 FUTURE PROSPECTS

For sign language to spoken English translation, the software is currently able to translate only static signs. It can be extended to translate dynamic signs. Facial expressions and body language can also be tracked and considered, which would improve the performance of the sign language to spoken English translation. For spoken English to sign language translation, the system can be made user-voice specific to eliminate the system's response to non-users.

1.7 ORGANISATION OF REPORT

This report is composed of 6 chapters, each giving details of every aspect of this project. The beginning of the report explains the foundation on which the system is built. Chapter 1 is the introduction to the whole report. The following chapter 2 contains a discussion of related work. Next, chapter 3 provides the aim and objectives of the project. Chapter 4 explains the materials and methodology followed to achieve the aim and objectives of the project and arrive at a complete application. This chapter starts with an overview of the key components of software and hardware and how the two cooperate; it is followed by a further look at the overall system built. These topics detail everything of interest in the system. Chapter 5 presents the results and discussion of the system and its performance, with the results of the various stages of implementation. The report is concluded in the final chapter 6. The conclusion discusses briefly what the proposed system has accomplished, and provides an outlook on future work recommended for the extension of this project and future prospects for development and improvement.

2. AIM AND OBJECTIVES OF THE PROJECT

2.1 AIM

To develop a mobile interactive application program for automatic translation of Indian sign language into spoken English and vice versa, to assist communication between Deaf people and hearing people. The sign language translator should be able to translate one-handed Indian Sign Language finger-spelling input of alphabets (A-Z) and numbers (1-9) to spoken English audio output, and 165 spoken English word inputs to Indian Sign Language picture display output.

2.2 OBJECTIVES

• To acquire one-hand finger spelling of alphabets (A to Z) and numbers (1 to 9) to produce spoken English audio output.
• To acquire spoken English word input to produce Indian Sign Language picture display output.
• To create an executable file to make the software a standalone application.
• To implement the software and optimize the parameters to improve the accuracy of translation.
• To minimize hardware requirements, and thus expense, while achieving high precision of translation.
3. MATERIALS AND METHODOLOGY

This chapter is dedicated to explaining the system in detail, from setup, to the system components, to the output. The software is developed on the Virtual Instrumentation LabVIEW platform. It consists of two main parts, namely sign language to speech translation and speech to sign language translation. The software can be implemented using a standard laptop, desktop or an iOS mobile phone, operating with the camera, processor and audio device.
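The two-way structure described above can be sketched in Python (the function names and file mappings below are hypothetical stand-ins; the actual system is built as LabVIEW VIs):

```python
# Hypothetical lookup tables standing in for the trained template/audio
# database and the sign-picture database described in this chapter.
SIGN_TO_AUDIO = {"A": "a.wav", "B": "b.wav", "1": "one.wav"}
WORD_TO_PICTURE = {"hello": "hello.png", "thanks": "thanks.png"}

def sign_to_speech(sign_label):
    """Map a recognized one-handed sign to its spoken-English audio clip."""
    return SIGN_TO_AUDIO[sign_label]

def speech_to_sign(word):
    """Map a recognized spoken word to its sign-gesture picture."""
    return WORD_TO_PICTURE[word.lower()]

def translate(mode, data):
    # Both translation directions are reachable from the single
    # application window, as in the report.
    if mode == "sign_to_speech":
        return sign_to_speech(data)
    if mode == "speech_to_sign":
        return speech_to_sign(data)
    raise ValueError(f"unknown mode: {mode!r}")
```

The recognition front ends (pattern matching for signs, speech recognition for words) produce the labels that feed these lookups; they are detailed in the sections that follow.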
Figure 3.1 Graphical abstract

The software consists of four modules that can be implemented from a single window. The necessary steps to implement these modules from a single window are explained in detail below.

3.1 SIGN LANGUAGE TO SPOKEN ENGLISH TRANSLATION

Sign language to spoken English translation is achieved using a pattern matching technique. The complete interactive section can be considered to comprise two layers: detection and recognition. The detection layer is responsible for defining and extracting visual features that can be attributed to the presence of hands in the field of view of the camera. The recognition layer is responsible for grouping the spatiotemporal data extracted in the previous layer and assigning the resulting groups labels associated with particular classes of gestures.

3.1.1 DETECTION

The primary step in gesture recognition systems is the detection of hands and the corresponding image regions. This step is crucial because it isolates the task-relevant data from the image background before passing them to the subsequent tracking and recognition stages. A large number of methods have been proposed in the literature that utilize several types of visual features and, in many cases, their combination. Such features include skin color, shape, motion and anatomical models of hands. Several color spaces have been proposed, including RGB, normalized RGB, HSV, YCrCb, YUV, etc. Color spaces that efficiently separate the chromaticity from the luminance components of color are typically considered preferable, because by employing only the chromaticity-dependent components of color, some degree of robustness to illumination changes can be achieved.

Template-based detection is used here. Methods of this class invoke the hand detector in the spatial vicinity where the hand was detected in the previous frame, so as to drastically restrict the image search space. The implicit assumption for this method to succeed is that images are acquired frequently enough. The proposed technique is explained in the following intermediate steps, namely: image acquisition, image processing, template preparation and pattern recognition.

3.1.2 IMAGE ACQUISITION

The software is installed on any supporting operating system with access to a camera and microphone. After installing the executable file, the user follows the instructions that appear on the Graphical User Interface (GUI) and executes the program. The program allows the user to choose the camera.
All the cameras which are allowed access through the operating system, whether built in or connected externally, appear in the selection list. After choosing the camera, the software sends commands to the camera to capture the sign language gestures performed by the user. The image acquisition process is subject to many environmental concerns, such as the position of the camera, lighting sensitivity and background conditions. The camera is placed to focus on an area that can capture the maximum possible movement of the hand, taking into account the difference in height of individual signers. Sufficient lighting is required to ensure that the acquired image is bright enough to be seen and analyzed. Capturing thirty frames per second (fps) is found to be sufficient; a higher fps would only increase the computation time, as there would be more input data to process. As the acquisition process runs in real time, this part of the process has to be efficient. The acquired images are then processed. Each frame that has been processed is automatically deleted to free the limited memory space in the buffer.

3.1.3 IMAGE PROCESSING

The captured images are processed to identify the unique features of each sign. Image processing enhances the features of interest for recognition of the sign. The camera captures images at 30 frames per second. At this rate, the difference between subsequent images is too small, so the images are sampled at 5 frames per second. In the program, one frame is saved and numbered sequentially every 200 milliseconds so that image classification and processing can be done systematically. The position of the hand is monitored. The image acquisition runs continuously until the acquisition is stopped. The image processing involves performing morphological operations on the input images to enhance the unique features of each sign. As the frames from acquisition are read one by one, they are subjected to extraction of a single color plane of luminance.

3.1.4 TEMPLATE PREPARATION

The images to be used as templates for pattern matching are prepared using the following procedure and saved in a folder, to be used later in the pattern matching.

1. Open Camera
Open the camera, query the camera for its capabilities, load the camera configuration file, and create a unique reference to the camera.

2. Configure Acquisition
Configure a low-level acquisition previously opened with the IMAQdx Open Camera VI. Specify the acquisition type with the Continuous and Number of Buffers parameters.
Snap: Continuous = 0; Buffer Count = 1
Sequence: Continuous = 0; Buffer Count > 1
Grab: Continuous = 1; Buffer Count > 1

3. Start Acquisition
Start an acquisition that was previously configured with the IMAQdx Configure Acquisition VI.

4. Create
Create a temporary memory location for an image.

5. Get Image
Acquire the specified frame into Image Out. If the image type does not match the video format of the camera, this VI changes the image type to a suitable format.

6. Extract Single Color Plane
Extract a single plane from the color image.

7. Setup Learn Pattern
Set the parameters used during the learning phase of pattern matching.

8. Learn Pattern
Create a description of the template image for which you want to search during the matching phase of pattern matching. This description data is appended to the input template image. During the matching phase, the template descriptor is extracted from the template image and used to search for the template in the inspection image.
9. Write File
Write the image to a file in the selected format.

10. Close Camera
Stop the acquisition in progress, release the resources associated with the acquisition, and close the specified camera session.

11. Merge Errors Function
Merge error I/O clusters from different functions. This function looks for errors beginning with the error in 0 parameter and reports the first error found. If the function finds no errors, it looks for warnings and returns the first warning found. If the function finds no warnings, it returns no error.

3.1.5 IMAGE RECOGNITION

The last stage of sign language to spoken English translation is the recognition stage and the provision of the audio output. The techniques used for feature extraction should find shapes reliably and robustly, irrespective of changes in illumination level, position, orientation and size of the object in a video. Objects in an image are represented as collections of pixels. For object recognition we need to describe the properties of these groups of pixels. The description of an object is a set of numbers called the object's descriptors. Recognition is simply matching a set of shape descriptors against a set of known descriptors. A usable descriptor should possess four valuable properties: the descriptors should form a complete set, be congruent, be rotation invariant, and form a compact set. Objects in an image are characterized by two forms of descriptors: region and shape descriptors. Region descriptors describe the arrangement of pixels within the object area, whereas shape descriptors describe the arrangement of pixels on the boundary of the object.

Template matching, a fundamental pattern recognition technique, has been utilized for gesture recognition. Template matching is performed by the pixel-by-pixel comparison of a prototype and a candidate image. The similarity of the candidate to the prototype is proportional to the total score on a preselected similarity measure.
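The pixel-by-pixel comparison just described can be sketched with NumPy (an illustrative stand-in for the LabVIEW pattern-matching VIs; the sum-of-absolute-differences score used here is an assumption, one of several possible similarity measures):

```python
import numpy as np

def match_score(candidate, template):
    """Sum of absolute pixel differences: 0 means a perfect match,
    larger values mean the images disagree more."""
    c = candidate.astype(np.int32)
    t = template.astype(np.int32)
    return int(np.abs(c - t).sum())

def best_match(candidate, templates, max_difference):
    """Return the label of the closest template, or None if every
    template differs from the candidate by more than the threshold."""
    label, score = min(
        ((name, match_score(candidate, t)) for name, t in templates.items()),
        key=lambda pair: pair[1],
    )
    return label if score <= max_difference else None
```

The threshold plays the same role as the maximum-difference setting on the LabVIEW matcher described in the next paragraphs: too tight and valid signs are rejected, too loose and distinct signs are confused.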
For the recognition of hand postures, the image of a detected hand forms the candidate image, which is directly compared with prototype images of hand postures. The best matching prototype (if any) is considered the matching posture.
The final stage of the system is classification of different signs and generating voice
messages corresponding to the correctly classified sign. The preprocessed acquired images are read one by one and compared with the template images saved in the database for pattern matching. The key pattern matching parameter is a threshold on the maximum difference between the input sign and the database entry: if the difference is below this limit, a match is found and the sign is recognized. Here it is set at 800. When an input image is matched with a template image, the pattern matching loop stops, and the audio corresponding to the loop iteration value is played through the inbuilt audio device. The necessary steps to achieve sign language to speech translation are given below.
1. Invoke Node
Invoke a method or action on a reference.
2. Default Values: Reinitialize All To Default Method
Change the current values of all controls on the front panel to their defaults.
3. File Dialog Express
Displays a dialog box with which you can specify the path to a file or directory from existing files or directories, or select a location and name for a new file or directory.
4. Open Camera
Opens a camera, queries the camera for its capabilities, loads a camera configuration file, and creates a unique reference to the camera.
5. Configure Grab
Configures and starts a grab acquisition. A grab performs an acquisition that loops continually on a ring of buffers.
6. Create
Create a temporary memory location for an image.
7. Grab
Acquire the most current frame into Image Out.
8. Extract Single Color Plane
Extract a single plane from a color image.
9. Setup Learn Pattern
Sets parameters used during the learning phase of pattern matching.
Figure 3.3 Flow diagram of pattern matching
10. Learn Pattern
Create a description of the template image for which you want to search during the matching phase of pattern matching. This description data is appended to the input template image. During the matching phase, the template descriptor is extracted from the template image and used to search for the template in the inspection image.
11. Recursive File List
List the contents of a folder or LLB.
12. Unbundle By Name Function
Returns the cluster elements whose names you specify.
13. Read File
Read an image file. The file format can be a standard format (BMP, TIFF, JPEG, JPEG2000, PNG, or AIPD) or a nonstandard format known to the user. In all cases, the read pixels are converted automatically into the image type passed by Image.
14. Call Chain
Return the chain of callers from the current VI to the top-level VI. Element 0 of the call chain array contains the name of the lowest VI in the call chain. Subsequent elements are callers of the lower VIs in the call chain. The last element of the call chain array is the name of the top-level VI.
15. Index Array
Return the element or sub-array of an n-dimension array at index.
16. Format Into String
Formats string, path, enumerated type, time stamp, Boolean, or numeric data as text.
17. Read Image And Vision Info
Read an image file, including any extra vision information saved with the image. This includes overlay information, pattern matching template information, calibration information, and custom data, as written by the IMAQ Write Image and Vision Info File 2 instance of the IMAQ Write File 2 VI.
18. Pattern Match Algorithm
Check for the presence of the template image in the given input image.
19. Speak Text
Call the .NET speech synthesizer to speak a string of text.
20. Dispose
Destroys an image and frees the space it occupied in memory. This VI is required for each image created in an application, to free the memory allocated by the IMAQ Create VI.
21. Simple Error Handler
Indicate whether an error occurred. If an error occurred, this VI returns a description of the error and optionally displays a dialog box.
Figure 3.4 Block diagram of sign to speech translation
3.2 SPEECH TO SIGN LANGUAGE TRANSLATION
The speech input is acquired through the inbuilt microphone using the Windows Speech Recognition software. The system recognizes the speech input phrases that are listed in the database. Each phrase in the database is associated with a picture of a sign language gesture. If the input speech matches the database, a command is sent to display the corresponding gesture. The necessary steps to achieve speech to sign language translation are given below.
1. Current VI's Path
Return the path to the file of the current VI.
2. Strip Path
Return the name of the last component of a path and the stripped path that leads to that component.
3. Build Path
Create a new path by appending a name (or relative path) to an existing path.
4. VI Server Reference
Return a reference to the current VI or application, to a control or indicator in the VI, or to a pane. You can use this reference to access the properties and methods for the associated VI, application, control, indicator, or pane.
5. Property Node
Gets (reads) and/or sets (writes) properties of a reference. Use the Property Node to get or set properties and methods on local or remote application instances, VIs, and objects.
6. Speech Recognizer Initialize
The event is raised when the current grammar has been used by the recognition engine to detect speech and find one or more phrases with sufficient confidence levels.
7. Event Structure
Has one or more subdiagrams, or event cases, exactly one of which executes when the structure executes. The Event Structure waits until an event happens, then executes the appropriate case to handle that event.
8. Read JPEG File VI
Read the JPEG file and create the data necessary to display the file in a picture control.
9. Draw Flattened Pixmap VI
Draw a 1-, 4-, or 8-bit pixmap or a 24-bit RGB pixmap into a picture.
10. 2D Picture Control
Include a set of drawing instructions for displaying pictures that can contain lines, circles, text, and other types of graphic shapes.
11. Simple Error Handler VI
Indicate whether an error occurred. If an error occurred, this VI returns a description of the error and optionally displays a dialog box.
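The speech-to-sign steps listed above can be sketched outside LabVIEW as a small event loop: a recognition event carries a phrase and a confidence level, and the matching event case looks up the sign image to display. This is a Python stand-in for the LabVIEW Event Structure and 2D Picture Control, and all names here (the phrase table, file paths, the threshold) are illustrative assumptions, not part of the actual implementation.

```python
import queue

# Hypothetical phrase database: recognized phrase -> sign-image file name.
SIGN_IMAGES = {
    "hello": "Sign Images/hello.jpg",
    "thank you": "Sign Images/thank_you.jpg",
}

CONFIDENCE_THRESHOLD = 0.7  # assumed minimum confidence level


def handle_recognition(event, display):
    """Event case: look up and 'display' the sign image for a phrase."""
    phrase, confidence = event
    if confidence < CONFIDENCE_THRESHOLD or phrase not in SIGN_IMAGES:
        return None
    display.append(SIGN_IMAGES[phrase])  # stands in for the picture control
    return SIGN_IMAGES[phrase]


def event_loop(events):
    """Analog of the Event Structure: wait for an event, run its case."""
    display = []
    pending = queue.Queue()
    for event in events:
        pending.put(event)
    while not pending.empty():
        handle_recognition(pending.get(), display)
    return display


shown = event_loop([("hello", 0.93), ("mumble", 0.41), ("thank you", 0.88)])
```

Only the two phrases that are in the database and pass the confidence threshold reach the display; the real program delegates recognition to the Windows Speech Recognizer and drawing to the picture control.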
Figure 3.5 Flow diagram of speech to sign translation
Figure 3.6 Block diagram of speech to sign translation
3.2.1 SPEECH RECOGNITION
A speech recognition system consists of the following:
• A microphone for the person to speak into.
• Speech recognition software.
• A computer to take and interpret the speech.
• A good quality sound card for input and/or output.
Voice-recognition software works by analyzing sounds and converting them to text. It also uses knowledge of how English is usually spoken to decide what the speaker most probably said. Once correctly set up, such a system should recognize around 95% of what is said if you speak clearly. Several programs are available that provide voice recognition. These systems have mostly been designed for Windows operating systems; however, programs are also available for Mac OS X. In addition to third-party software, there are also voice-recognition programs built into the operating systems of Windows Vista and Windows 7. Most specialist voice applications include the software, a microphone headset, a manual and a quick reference card. You connect the microphone to the computer, either into the sound card (sockets on the back of a computer) or via a USB or similar connection. The latest versions of Microsoft Windows have a built-in voice-recognition program called Speech Recognition. It does not have as many features as Dragon
NaturallySpeaking but does have good recognition rates and is easy to use. As it is part of the Windows operating system, it does not require any additional cost apart from a microphone.
The input voice recognition is achieved using the Windows 7 inbuilt Speech Recognition software. When the program is started, instructions appear to set up the microphone, and a tutorial begins which gives the steps to proceed for user voice recognition.
A computer doesn't speak your language, so it must transform your words into something it can understand. A microphone converts your voice into an analog signal and feeds it to your PC's sound card. An analog-to-digital converter takes the signal and converts it to a stream of digital data (ones and zeros). Then the software goes to work. While each of the leading speech recognition companies has its own proprietary methods, the two primary components of speech recognition are common across products. The first piece, called the acoustic model, analyzes the sounds of your voice and converts them to phonemes, the basic elements of speech. The English language contains approximately 50 phonemes.
Here's how it breaks down your voice: first, the acoustic model removes noise and unneeded information such as changes in volume. Then, using mathematical calculations, it reduces the data to a spectrum of frequencies (the pitches of the sounds), analyzes the data, and converts the words into digital representations of phonemes. The software operation is as explained below.
Figure 3.7 Speech recognizer tutorial window
3.2.2 STEPS TO VOICE RECOGNITION
• ENROLMENT
Everybody's voice sounds slightly different, so the first step in using a voice-recognition system involves reading an article displayed on the screen. This process, called enrolment, takes less than 10 minutes and results in a set of files being created which tell the software how you speak. Many of the newer voice-recognition programs say this is not required; however, it is still worth doing to get the best results. The enrolment only has to be done once, after which the software can be started as needed.
• DICTATING AND CORRECTING
When talking, people often hesitate, mumble or slur their words. One of the key skills in using voice-recognition software is learning how to talk clearly so that the computer can recognize what you are saying. This means planning what to say and then speaking in complete phrases or sentences. The voice-recognition software will misunderstand some of the words spoken, so it is necessary to proofread and then correct any mistakes. Corrections can be made by using the mouse and keyboard or by using your voice. When you make corrections, the voice-recognition software will adapt and learn, so that (hopefully) the same mistake will not occur again. Accuracy should improve with careful dictation and correction.
• INPUT
The first step in voice recognition (VR) is the input and digitization of the voice into VR-capable software. This generally happens via an active microphone plugged into the computer. The user speaks into the microphone, and an analog-to-digital converter (ADC) creates digital sound files for the VR program to work with.
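The digitization step can be illustrated with Python's standard wave module; here a synthetic 440 Hz tone at 16 kHz, 16-bit mono stands in for the ADC output of a real microphone (the tone and buffer sizes are toy values, not part of the project).

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 16000  # 16 kHz, 16-bit mono: one of the formats mentioned later
samples = [int(20000 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE))
           for n in range(SAMPLE_RATE // 10)]  # 0.1 s of a 440 Hz tone

# Write the PCM stream as a WAV "recording" (kept in memory here).
buf = io.BytesIO()
with wave.open(buf, "wb") as out:
    out.setnchannels(1)            # mono
    out.setsampwidth(2)            # 16-bit samples
    out.setframerate(SAMPLE_RATE)
    out.writeframes(struct.pack("<%dh" % len(samples), *samples))

# Read it back as the stream of digital values the VR software works with.
buf.seek(0)
with wave.open(buf, "rb") as rec:
    pcm = struct.unpack("<%dh" % rec.getnframes(),
                        rec.readframes(rec.getnframes()))
```

The `pcm` tuple is the "stream of ones and zeros" the rest of the pipeline consumes.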
• ANALYSIS
The key to VR is in the speech analysis. VR programs take the digital recording and parse it into small, recognizable speech bits called "phonemes," via high-level audio analysis software. (There are approximately 50 of these in the English language.)
• SPEECH-TO-TEXT
Once the program has identified the phonemes, it begins a complex process of identification and contextual analysis, comparing each string of recorded phonemes against text equivalents in its memory. It then accesses its internal language database and pairs up the recorded phonemes with the most probable text equivalents.
• OUTPUT
Finally, the VR software provides a word output to the screen, mere moments after speaking. It continues this process, at high speed, for each word spoken into its program. Speech recognition fundamentally functions as a pipeline that converts PCM (Pulse Code Modulation) digital audio from a sound card into recognized speech. The elements of the pipeline are:
1. Transform the PCM digital audio into a better acoustic representation.
2. Apply a "grammar" so the speech recognizer knows what phonemes to expect. A grammar could be anything from a context-free grammar to a full-blown language model.
3. Figure out which phonemes are spoken.
4. Convert the phonemes into words.
• TRANSFORM THE PCM DIGITAL AUDIO
The first element of the pipeline converts digital audio coming from the sound card into a format that's more representative of what a person hears. The wave format can vary; it may be 16 kHz 8-bit mono/stereo or 8 kHz 16-bit mono, and so forth. It's a wavy line
that periodically repeats while the user is speaking. When in this form, the data isn't useful to speech recognition because it's too difficult to identify any patterns that correlate with what was actually said. To make pattern recognition easier, the PCM digital audio is transformed into the "frequency domain." Transformations are done using a windowed Fast Fourier Transform (FFT). The output is similar to what a spectrograph produces. In the frequency domain, you can identify the frequency components of a sound, and from the frequency components it's possible to approximate how the human ear perceives the sound.
The FFT analyzes every 1/100th of a second and converts the audio data into the frequency domain. Each 1/100th of a second results in a graph of the amplitudes of the frequency components, describing the sound heard for that 1/100th of a second. The speech recognizer has a database of several thousand such graphs (called a codebook) that identify different types of sounds the human voice can make. The sound is "identified" by matching it to its closest entry in the codebook, producing a number that describes the sound. This number is called the "feature number." (Actually, there are several feature numbers generated for every 1/100th of a second, but the process is easier to explain assuming only one.) The input to the speech recognizer began as a stream of 16,000 PCM values per second. By using Fast Fourier Transforms and the codebook, it is boiled down into essential information, producing 100 feature numbers per second.
• FIGURE OUT WHICH PHONEMES ARE SPOKEN
To figure out which phonemes are spoken, the following procedure is used.
• Start by grouping. To make the recognition process easier to understand, you first should know how the recognizer determines what phonemes were spoken, and then understand the grammars.
• Every time a user speaks a word, it sounds different.
Users do not produce exactly the same sound for the same phoneme.
• The background noise from the microphone and the user's office sometimes causes the recognizer to hear a different vector than it would have if the user were in a quiet room with a high-quality microphone.
• The sound of a phoneme changes depending on what phonemes surround it. The "t" in "talk" sounds different than the "t" in "attack" and "mist."
The sound produced by a phoneme changes from the beginning to the end of the phoneme, and is not constant. The beginning of a "t" will produce different feature numbers than the end of a "t."
The background noise and variability problems are solved by allowing a feature number to be used by more than just one phoneme, and using statistical models to figure out which phoneme is spoken. This can be done because a phoneme lasts for a relatively long time, 50 to 100 feature numbers, and it's likely that one or more sounds are predominant during that time. Hence, it's possible to predict what phoneme was spoken.
The speech recognizer needs to know when one phoneme ends and the next begins. Speech recognition engines use a mathematical technique called "Hidden Markov Models" (HMMs) to figure this out. The speech recognizer figures out when speech starts and stops because it has a "silence" phoneme, and each feature number has a probability of appearing in silence, just like any other phoneme. Now, the recognizer can recognize what phoneme was spoken even if there's background noise or the user's voice has some variation. However, there's another problem: the sound of phonemes changes depending upon what phoneme came before and after. You can hear this with words such as "he" and "how." You don't speak a "h" followed by an "ee" or "ow"; rather, the vowels intrude into the "h," so the "h" in "he" has a bit of "ee" in it, and the "h" in "how" has a bit of "ow" in it.
Speech recognition engines solve the problem by creating "tri-phones," which are phonemes in the context of surrounding phonemes. Thus, there's a tri-phone for "silence-h-ee" and one for "silence-h-ow." Because there are roughly 50 phonemes in English, you can calculate that there are 50 × 50 × 50 = 125,000 tri-phones.
That's just too many for current PCs to deal with, so similar-sounding tri-phones are grouped together.
The sound of a phoneme is not constant. A "t" sound is silent at first, then produces a sudden burst of high-frequency noise, which then fades to silence. Speech recognizers solve this by splitting each phoneme into several segments and generating a different model for each segment. The recognizer figures out where each segment begins and ends in the same way it figures out where a phoneme begins and ends.
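The acoustic front end described above (a spectral analysis every 1/100th of a second, then the nearest codebook entry giving the "feature number") can be sketched in Python. A naive DFT and a two-entry codebook replace the real windowed FFT and the several-thousand-entry codebook; the sampling rate, frame length and tones are toy values.

```python
import cmath
import math

def spectrum(frame):
    """Magnitudes of a naive DFT - stands in for the windowed FFT."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame))) / n
            for k in range(n // 2)]

def feature_number(frame, codebook):
    """Index of the closest codebook spectrum (squared Euclidean distance)."""
    s = spectrum(frame)
    dists = [sum((a - b) ** 2 for a, b in zip(s, entry)) for entry in codebook]
    return dists.index(min(dists))

# 1/100th of a second at a toy 1.6 kHz sampling rate -> 16-sample frames.
RATE, FRAME_LEN = 1600, 16

def tone(freq):
    return [math.sin(2 * math.pi * freq * n / RATE) for n in range(FRAME_LEN)]

codebook = [spectrum(tone(100)), spectrum(tone(300))]  # two known "sounds"
features = [feature_number(tone(100), codebook),
            feature_number(tone(300), codebook)]
```

Each incoming frame is thus reduced to a single index into the codebook, which is the per-1/100th-second feature number the rest of the recognizer consumes.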
A speech recognizer works by hypothesizing a number of different "states" at once. Each state contains a phoneme with a history of previous phonemes. The hypothesized state with the highest score is used as the final recognition result.
When the speech recognizer starts listening, it has one hypothesized state. It assumes the user isn't speaking and that the recognizer is hearing the "silence" phoneme. Every 1/100th of a second, it hypothesizes that the user has started speaking, and adds a new state per phoneme, creating 50 new states, each with a score associated with it. After the first 1/100th of a second, the recognizer has 51 hypothesized states.
In the next 1/100th of a second, another feature number comes in. The scores of the existing states are recalculated with the new feature. Then, each phoneme has a chance of transitioning to yet another phoneme, so 51 × 50 = 2550 new states are created. The score of each state is the score of the first 1/100th of a second times the score of the second 1/100th of a second. After 2/100ths of a second, the recognizer has 2601 hypothesized states.
This same process is repeated every 1/100th of a second. The score of each new hypothesis is the score of its parent hypothesis times the score derived from the new 1/100th of a second. In the end, the hypothesis with the best score is what's used as the recognition result.
• ADAPTATION
Speech recognition systems "adapt" to the user's voice, vocabulary, and speaking style to improve accuracy. A system that has had enough time to adapt to an individual can have one fourth the error rate of a speaker-independent system. Adaptation works because the speech recognizer is often informed (directly or indirectly) by the user whether its recognition was correct and, if not, what the correct recognition is.
The recognizer can adapt to the speaker's voice and variations of phoneme pronunciations in a number of ways. First, it can gradually adapt the codebook vectors used to calculate the acoustic feature number.
Second, it can adapt the probability that a feature number will appear in a phoneme. Both of these are done by weighted averaging.
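The hypothesis bookkeeping described above reduces to simple counting: one initial "silence" state, then at each tick every surviving state can transition into any of the 50 phonemes. This toy sketch reproduces the 1 → 51 → 2601 arithmetic from the text (a real engine also scores and prunes hypotheses, which is omitted here).

```python
NUM_PHONEMES = 50

def tick(count):
    """One 1/100th-of-a-second step: every state spawns one new
    hypothesis per phoneme, and the existing states are kept."""
    return count + count * NUM_PHONEMES

counts = [1]              # start with the single "silence" hypothesis
for _ in range(2):        # two ticks of 1/100th of a second
    counts.append(tick(counts[-1]))
```

Without pruning the count multiplies by 51 each tick, which is why practical recognizers discard low-scoring hypotheses as they go.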
The language model can also be adapted in a number of ways. The recognizer can learn new words, and slowly increase the probabilities of word sequences so that commonly used word sequences are expected. Both of these techniques are useful for learning names.
Although not common, the speech recognizer can also adapt the word pronunciations in its lexicon. Each word in a lexicon typically has one pronunciation. The word "orange" might be pronounced like "or-anj." However, users will sometimes say "ornj" or "or-enj." The recognizer can algorithmically generate hypothetical alternative pronunciations for a word. It then listens for all of these pronunciations during standard recognition: "or-anj," "or-enj," "or-inj," and "ornj." During the process of recognition, one of these pronunciations will be heard, although there's a fair chance that the recognizer heard a different pronunciation than what the user spoke. However, after the user has spoken the word a number of times, the recognizer will have enough examples that it can determine which pronunciation the user used.
However, speech recognition (by a machine) is a very complex problem. Vocalizations vary in terms of accent, pronunciation, articulation, roughness, nasality, pitch, volume, and speed. Speech is also distorted by background noise, echoes, and electrical characteristics. The accuracy of speech recognition varies with the following:
• Vocabulary size and confusability
• Speaker dependence vs. independence
• Isolated, discontinuous, or continuous speech
• Task and language constraints
• Read vs. spontaneous speech
• Adverse conditions
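The weighted averaging used for adaptation can be sketched as an exponential moving average: each time a codebook vector is matched, it is nudged a small step toward the observed spectrum. The weight ALPHA and the two-element vectors are assumed toy values, not parameters from the report.

```python
ALPHA = 0.1  # assumed adaptation weight

def adapt(stored, observed):
    """Weighted average: move the stored codebook vector a fraction
    ALPHA toward the newly observed spectrum."""
    return [(1 - ALPHA) * s + ALPHA * o for s, o in zip(stored, observed)]

vector = [1.0, 0.0]                  # initial speaker-independent entry
for _ in range(30):                  # repeated observations of this user
    vector = adapt(vector, [0.0, 1.0])
# vector has drifted from [1, 0] toward the user's actual sound [0, 1]
```

The same update, applied to transition probabilities instead of codebook vectors, covers the second adaptation mechanism mentioned above.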
4 RESULTS AND DISCUSSIONS
4.1 RESULTS
In this section we analyze the performance of the system by its capability to recognize gestures from images. We also discuss the difficulties faced while designing the system.
4.1.1 APPLICATION
The software is a standalone application. To install it, follow the instructions that appear in the executable installer file.
Figure 4.1 Application Installer
After installing the application, a graphical user interface (GUI) window opens, from which the full application can be used. The GUI has been created to run the entire application from a single window. It has four pages, namely page 1, page 2, page 3 and page 4; each page corresponds to a specific application.
• Page 1 gives a detailed demo of the total software usage.
• Page 2 is for speech to sign language translation.
• Page 3 is for template preparation for sign to speech translation.
• Finally, page 4 is for sign to speech translation.
Figure 4.2 Application window
The functions of the various buttons that appear on the window are as explained below:
• To run the application
• To stop the application
• To go to the previous page
• To go to the next page
4.1.1.1 PAGE 1
This page consists of detailed instructions to execute the entire application. To continue to a specific application, use the previous and next buttons.
4.1.1.2 PAGE 2
This page consists of the Speech to Sign language translator. The window appearance is as shown in figure 4.3. The working of this module is explained below.
Figure 4.3 GUI of Speech to Sign translation
Building a speech recognition program is not in the scope of this project. Instead, an existing speech recognition engine is integrated into the program. When the "start" button is
pressed, a command is sent to the Windows 7 inbuilt Speech Recognizer and it opens a mini window at the top. The first time it is started, a tutorial session begins which gives instructions to set up the microphone and recognize the user's voice input. In order for the application to take full advantage of speech recognition, the speech recognition program must be correctly configured: the microphone and language settings must be set appropriately to take optimal advantage of the speech recognition program's capabilities.
Voice recognition training teaches the software to recognize your voice and speech patterns. Training involves reading the given paragraphs or single words into the software using a microphone. The more you repeat the process, the more accurately the program should transcribe your speech.
According to a Landmark College article, most people get frustrated with the training process and feel it's too time consuming. Before you decide to skip training, you should think about the consequences: the software will incorrectly transcribe your speech more often than not, which will make it less efficient.
Speaking clearly and succinctly during training makes it easier for the software to recognize your voice. As a result, you'll spend less time training, repeating yourself and correcting the program. It also helps to use a good-quality microphone that easily registers your voice. The speech recognition package also tunes itself to the individual user. The software customizes itself based on your voice, your unique speech patterns, and your accent. To improve dictation accuracy, it creates a supplementary dictionary of the words you use.
After the initial training, from the next time the program is executed, it starts speech recognition automatically.
To train the system for a different user or change the microphone settings, right click on the Speech Recognizer window and select "Start Speech Tutorial".
To stop the speech recognition software, select the icon or say "Stop listening". The Speech Recognizer will go to sleep mode.
Figure 4.4 Speech recognizer in sleep mode
To start speech recognition again, select the icon or say "Start Listening". The Speech Recognizer will go to active mode.
Figure 4.5 Speech recognizer in active mode
If the user's speech input is not clear, the recognizer asks the user to repeat the input.
Figure 4.6 Speech recognizer when input speech is not clear for recognition
The appearance of the Speech to Sign language translator module window in active working mode is as shown in figure 4.7. When the user utters any of the words listed in "Phrases" near the microphone, the input sound is processed for recognition. If the input sound matches a word in the database, it is displayed in the "Command" alphanumeric indicator. A sign language gesture picture corresponding to the speech input is displayed in the "Sign" picture indicator. The score of the speech input's correlation with the trained word is also displayed in the "Score" numeric indicator. Use the exit button to exit the speech to sign language translation application. To extend the application to translate more spoken English words into sign language pictures, simply include the sign language images in the "Sign Images" folder and add the words to the list in "Phrases".
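The extension mechanism described above (drop an image into the "Sign Images" folder, add the word to "Phrases") can be mirrored by deriving the phrase list directly from the folder contents. This Python sketch uses a temporary folder and hypothetical file names in place of the real "Sign Images" directory:

```python
import tempfile
from pathlib import Path

def build_phrase_table(folder):
    """Map each recognizable phrase to its sign-image file, taking the
    phrase from the image file name (underscores become spaces)."""
    return {p.stem.replace("_", " "): p.name
            for p in sorted(Path(folder).glob("*.jpg"))}

# Stand-in for the "Sign Images" folder with two gesture pictures.
demo = Path(tempfile.mkdtemp())
for name in ["hello.jpg", "thank_you.jpg", "notes.txt"]:
    (demo / name).touch()

phrases = build_phrase_table(demo)   # only the .jpg files become phrases
```

Deriving the table from the folder keeps the phrase list and the images from drifting apart when new words are added.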
Figure 4.7 GUI of working window of speech to sign translation
Figure 4.8 Block diagram of speech to sign translation
4.1.1.3 PAGE 3
Figure 4.9 GUI of template preparation
Figure 4.10 Block diagram of sign to speech translation
This page consists of the template preparation setup for the Sign language to Speech translator. The window appearance is as shown in figure 4.11. The working of this module is explained below.
To execute the template preparation module, press the "Start" button. Choose the camera to acquire the images to be used as templates from the "Camera Name" list. The acquired image is displayed on the "Image" picture indicator. If the displayed image is good enough to be used for preparing a template, press "Snap frame". The snapped image is displayed on the "Snap Image" picture display. Draw a region of interest to prepare the template and press "Learn". The image region in the selected portion of the snapped frame is saved to the folder specified for templates. The saved template image is displayed on the "Template Image" picture display. Press the "Stop" button to stop execution of the template preparation module.
Figure 4.11 GUI of working window of template preparation
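The learn step above amounts to cropping the drawn region of interest out of the snapped frame and keeping it as the template. With the frame as a toy grayscale image (a list of pixel rows, standing in for the IMAQ image buffer), the crop is:

```python
def extract_template(frame, left, top, width, height):
    """Crop the region-of-interest rectangle out of the snapped frame."""
    return [row[left:left + width] for row in frame[top:top + height]]

# An 8x6 stand-in for the snapped frame, with predictable pixel values.
frame = [[(x + y) % 256 for x in range(8)] for y in range(6)]
template = extract_template(frame, left=2, top=1, width=3, height=2)
```

In the actual application this cropped region is what gets written to the template folder and later searched for by the pattern matcher.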
4.1.1.4 PAGE 4
This page consists of the Sign to Speech translator. When started, it captures the signs performed by the deaf user in real time, compares them with the created template images and gives an audio output when a match is found. The window appearance is as shown in figure 4.12. The working of this module is explained below.
Figure 4.12 GUI of sign to speech translation
Press the "Start" button to start the program. The "Camera Name" indicator displays the list of all the cameras that are connected to the computer. Choose the camera from the list. Adjust the selected camera position to capture the sign gestures performed by the user. For the performed test, the camera is fixed at a distance of one meter from the user's hand. The captured images are displayed on the "Input Image" picture display. Press the "Match" button to start comparing the acquired input image with the template images in the database. In every iteration, the input image is checked for a pattern match with one template. When the input image matches the template image, the loop
halts. The "Match" LED glows and the matched template is displayed on the "Template Image" indicator. If the input image does not match any of the images from the database of templates, then the audio output says "NONE" and the "Match" LED does not glow.
The loop iteration count is used for triggering a case structure. Depending on the iteration count value, a specific case is selected and gives a string output. Otherwise, the loop continues to the next iteration, where the input image is checked for a pattern match with a new template. The information in the string output from the case structure is displayed on the "Matched Pattern" alphanumeric indicator. It also initiates the .NET speech synthesizer to give an audio output through the speaker.
Figure 4.13 GUI of working window of sign to speech translation
To pause the pattern matching while the program is still running, press the "Match" button. This puts the pattern matching step into inactive mode: the acquired image is displayed on the Input Image indicator but is not checked for a pattern match. To resume pattern matching, press the "Match" button again. It is highlighted to indicate that it is in active mode.
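The matching loop described above can be sketched in Python: each iteration compares the input image with one template, a summed pixel difference below the threshold (800 in this report) halts the loop, and the iteration index selects the word for the synthesizer. The tiny images, the templates and the simple difference measure are stand-ins for the IMAQ pattern matcher, not the actual algorithm.

```python
MATCH_THRESHOLD = 800  # maximum allowed difference, as set in the report

def difference(img_a, img_b):
    """Total absolute pixel difference between two equal-sized images."""
    return sum(abs(a - b)
               for row_a, row_b in zip(img_a, img_b)
               for a, b in zip(row_a, row_b))

def match_sign(input_image, templates, words):
    """Loop over the templates; halt and return the word on a match."""
    for i, template in enumerate(templates):   # the pattern-matching loop
        if difference(input_image, template) < MATCH_THRESHOLD:
            return words[i]                    # iteration count selects the word
    return "NONE"                              # no template matched

img = [[10, 10], [10, 10]]
templates = [[[250, 250], [250, 250]],         # far from the input
             [[12, 9], [11, 10]]]              # close to the input
result = match_sign(img, templates, ["A", "B"])
```

In the LabVIEW program the returned word drives both the "Matched Pattern" indicator and the .NET speech synthesizer.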
Figure 4.14 Block diagram of sign to speech translation
Figure 4.15 Block diagram of pattern matching
For sign language to spoken English translation, the classification of different gestures is done using the pattern matching technique for 36 different gestures (alphabets A to Z and numbers 1 to 9) of Indian sign language. The performance of the system is evaluated based on its ability to correctly recognize signs to their corresponding speech class. The recognition rate is defined as the ratio of the number of correctly classified signs to the total number of signs:
Recognition Rate (%) = (Number of Correctly Classified Signs / Total Number of Signs) × 100
The proposed approach has been assessed using input sequences containing a user performing various gestures in an indoor environment for alphabets A to Z and numbers 1 to 9. This section presents results obtained from a sequence depicting a person performing a variety of hand gestures in a setup that is typical for deaf and normal person interaction applications, i.e., the subject is sitting at a typical distance of about 1 m from the camera. The resolution of the sequence is 640 × 480, and it was obtained with a standard, low-end web camera at 30 frames per second.
The total number of signs used for testing is 36, and the system recognition rate is 100% for inputs similar to the database. The system was implemented with LabVIEW 2012.
4.2 DISCUSSIONS
For sign language to speech translation, the gesture recognition problem consists of pattern representation and recognition. In previous related work, the hidden Markov model (HMM) has been used widely in speech recognition, and a number of researchers have applied HMMs to temporal gesture recognition. Yang and Xu (1994) proposed gesture-based interaction using a multi-dimensional HMM. They used a Fast Fourier Transform (FFT) to convert input gestures to a sequence of symbols to train the HMM. They reported 99.78% accuracy for detecting 9 gestures.
Watnabe and Yachida (1998) proposed a method of gesture recognition from image sequences.
The input image is segmented using maskable templates, and then the gesture space is constituted by Karhunen-Loeve (KL) expansion using the segments. They applied eigenvector-based matching for gesture detection.
Oka, Satio and Kioke (2002) developed gesture recognition based on measured finger trajectories for an augmented desk interface system. They used a Kalman filter for predicting the location of multiple fingertips and an HMM for gesture detection. They reported an average accuracy of 99.2% for single-finger gestures produced by one person. Ogawara et al. (2001) proposed a method of constructing a human task model by attention point (AP) analysis. Their target application was gesture recognition for human-robot interaction.
New et al. (2003) proposed a gesture recognition system for hand tracking and detecting the number of fingers being held up to control an external device, based on hand-shape template matching. Perrin et al. (2004) described a finger tracking gesture recognition system based on a laser tracking mechanism which can be used in hand-held devices. They used an HMM for their gesture recognition system, with an accuracy of 95% for 5 gesture symbols at a distance of 30 cm from their device.
Lementec and Bajcsy (2004) proposed an arm gesture recognition algorithm from Euler angles acquired from multiple orientation sensors, for controlling unmanned aerial vehicles in the presence of manned aircrew. Dias et al. (2004) described their vision-based open gesture recognition engine called OGRE, reporting detection and tracking of hand contours using template matching with an accuracy of 80% to 90%.
Because of the difficulty of data collection for training an HMM for temporal gesture recognition, the vocabularies are very limited, and to reach an acceptable accuracy the process is excessively data and time intensive. Some researchers have suggested that a better approach is needed for use with more complex systems (Perrin et al., 2004).
This work presents a novel approach for gesture detection. This approach has two main steps: i) gesture template preparation, and ii) gesture detection.
The gesture template preparation technique presented here has several features that are important for gesture recognition: robustness against slight rotation, a small number of required features, invariance to the start position, and device independence. For gesture detection, a pattern matching technique is used. The results of our first experiment show 99.72% average accuracy in single-gesture detection. Given the high accuracy of the gesture classification, the number of templates appears sufficient for detecting a limited number of gestures; however, a more reliable judgment requires a larger gesture space to further validate this assertion.
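The two steps above can be sketched in code. This is a minimal illustration under stated assumptions, not the report's exact algorithm: the template preparation step normalizes a 2-D trajectory so it is invariant to start position (translate to origin) and to the input device's scale, and the detection step is nearest-template pattern matching. The function names and the normalization scheme are illustrative.

```python
# Sketch: (i) gesture template preparation, (ii) detection by pattern matching.
import numpy as np

def prepare_template(points, n=16):
    """Normalize a 2-D point trajectory: resample to n points, translate so it
    starts at the origin (start-position invariance), and scale to unit size
    (device independence)."""
    pts = np.asarray(points, dtype=float)
    idx = np.linspace(0, len(pts) - 1, n)  # uniform resampling by index
    resampled = np.stack([np.interp(idx, np.arange(len(pts)), pts[:, d])
                          for d in (0, 1)], axis=1)
    resampled -= resampled[0]              # start at the origin
    scale = np.abs(resampled).max() or 1.0
    return resampled / scale               # unit scale

def detect(trajectory, template_db):
    """Pattern matching: nearest template by mean point-wise distance."""
    q = prepare_template(trajectory)
    dists = [np.linalg.norm(q - t, axis=1).mean() for t in template_db]
    return int(np.argmin(dists))

# Toy usage: a rightward stroke vs. an upward stroke
db = [prepare_template([(0, 0), (10, 0)]),   # "right" gesture template
      prepare_template([(0, 0), (0, 10)])]   # "up" gesture template
print(detect([(5, 5), (9, 5.2)], db))        # a shifted rightward stroke -> 0
```

Because every trajectory is re-expressed relative to its own start point and overall size, the same comparison works regardless of where on the input surface, or with which device, the gesture was drawn.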
The gesture recognition technique introduced here can be used with a variety of front-end input systems, such as vision-based input, hand and eye tracking, digital tablets, mice, and digital gloves. Much previous work has focused on isolated sign language recognition with clear pauses after each sign, although the research focus is slowly shifting to continuous recognition. These pauses make isolated recognition a much easier problem than continuous recognition without pauses between individual signs, because explicit segmentation of a continuous input stream into individual signs is very difficult. For this reason, and because of co-articulation effects, work on isolated recognition often does not generalize easily to continuous recognition.

The proposed software, however, captures the input as an AVI sequence of continuous images. This allows continuous image acquisition without pauses, while each image frame is processed individually and checked for a pattern match. This technique handles a pause-free input stream while still processing the images one at a time.

For speech to sign language translation, words of similar pronunciation are sometimes misinterpreted. This problem can be reduced by pronouncing words clearly, and it diminishes with extended training and increased usage.

Figure 4.16 Database of sign alphabet templates (A-Z) [template images omitted]
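The frame-by-frame scheme described above can be sketched as follows. This is an assumption-laden illustration, not the report's LabVIEW implementation: frames arrive as a pause-free stream, and each frame is independently scored against the template database by normalized cross-correlation, with a threshold to reject frames that match no sign.

```python
# Sketch: continuous capture, per-frame pattern matching against templates.
import numpy as np

def ncc(frame, template):
    """Normalized cross-correlation between two equal-size grayscale images."""
    f = frame - frame.mean()
    t = template - template.mean()
    denom = np.linalg.norm(f) * np.linalg.norm(t)
    return float((f * t).sum() / denom) if denom else 0.0

def classify_stream(frames, templates, labels, threshold=0.8):
    """Label each frame independently; None when no template matches well."""
    out = []
    for frame in frames:
        scores = [ncc(frame, t) for t in templates]
        best = int(np.argmax(scores))
        out.append(labels[best] if scores[best] >= threshold else None)
    return out

# Toy usage: two 4x4 "sign" templates and a three-frame stream
rng = np.random.default_rng(1)
temp_a, temp_b = rng.random((4, 4)), rng.random((4, 4))
stream = [temp_a, rng.random((4, 4)), temp_b]  # sign A, clutter, sign B
print(classify_stream(stream, [temp_a, temp_b], ["A", "B"]))
```

Because every frame is scored on its own, no explicit segmentation of the stream into signs is required; a frame identical to a template scores exactly 1.0, and in-between frames simply fall below the threshold.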
Figure 4.17 Database of sign number templates (1-9) [template images omitted]
5 CONCLUSIONS AND FUTURE ENHANCEMENT

5.1 CONCLUSIONS

This sign language translator is able to translate alphabets (A-Z) and numbers (1-9). All the signs can be translated in real time, but signs that are similar in posture and gesture to another sign can be misinterpreted, decreasing the accuracy of the system. The current system has been trained on only a very small database. Since there will always be variation in the signer's hand posture or motion trajectory, the quality of the training database should be enhanced to ensure that the system picks up the correct and significant characteristics of each individual sign and to further improve performance. A larger dataset will also allow further experiments on performance in different environments; such a comparison will make it possible to tangibly measure the robustness of the system in changing environments and to provide training examples for a wider variety of situations. Adaptive color models and improved tracking could also boost the performance of the vision system.

Collaboration with assistive-technology researchers and members of the Deaf community on continued design work is in progress. The gesture recognition technology is only one component of a larger system that we hope will one day be an active tool for the Deaf community.

This project did not address facial expressions, although it is well known that facial expressions convey an important part of sign languages. Facial expressions can, for example, be extracted by tracking the signer's face.
The most discriminative features can then be selected by employing a dimensionality reduction method, and this cue could also be fused into the recognition system.

This system can be applied in many areas; examples include accessing government websites where no video clip for deaf and mute users is available, or filling out forms where no interpreter is present to help.

For future work, there are many possible improvements that can extend this work. First of all, more diversified hand samples from different people can be used in the training process so that
the system will be more user-independent. A second improvement could be context awareness for the gesture recognition system: the same gesture performed in different contexts and environments can have different semantic meanings. Another possible improvement is to track and recognize multiple objects, such as human faces, eye gaze and hand gestures, at the same time. With this multi-modal tracking and recognition strategy, the relationships and interactions among the tracked objects can be defined and assigned different semantic meanings, so that a richer command set can be covered. By integrating this richer command set with other communication modalities such as speech recognition and haptic feedback, the Deaf user's communication experience can be greatly enriched.

The system developed in this work can be extended to many other research topics in the fields of computer vision and sign language translation. We hope this project will trigger further investigations that make translation systems see and think better.

5.2 FUTURE ENHANCEMENT

5.2.1 APPLICATIONS OF SIGN RECOGNITION

Sign language recognition can be used to help Deaf persons interact efficiently with non-signers without the intervention of an interpreter. It can be installed at government organizations and other public services, and it can be integrated with the Internet for live video conferences between deaf and hearing people.

5.2.2 APPLICATIONS OF SPEECH RECOGNITION

There are a number of scenarios in which speech recognition is being delivered, developed, researched or seriously discussed. As with many contemporary technologies, such as the Internet, online payment systems and mobile phone functionality, development is at least partially driven by a trio of often-perceived evils.
• COMPUTER AND VIDEO GAMES

Speech input has been used in a limited number of computer and video games, on a variety of PC and console-based platforms, over the past decade. For example, the game Seaman involved growing and controlling strange half-man, half-fish characters in a virtual aquarium. A microphone, sold with the game, allowed the player to issue one of a pre-determined list of command words and questions to the fish. The accuracy of interpretation, in use, seemed variable; during gaming sessions, colleagues with strong accents had to speak in an exaggerated and slower manner for the game to understand their commands.

Microphone-based games are available for two of the three main video game consoles (PlayStation 2 and Xbox). However, these games primarily use speech in an online player-to-player manner, rather than having spoken words interpreted electronically. For example, MotoGP for the Xbox allows online players to ride against each other in a motorbike racing simulation and to speak (via microphone headset) to the nearest players (bikers) in the race. There is currently interest in, but less development of, video games that interpret speech.

• PRECISION SURGERY

Developments in keyhole and micro surgery have clearly shown that an approach of as little invasive or non-essential surgery as possible increases success rates and shortens patient recovery times. There is occasional speculation in various medical fora regarding the use of speech recognition in precision surgery, where a procedure is partially or totally carried out by automated means.

For example, in removing a tumour or blockage without damaging surrounding tissue, a command could be given to make an incision of a precise and small length, e.g. 2 millimeters. However, the legal implications of such technology are a formidable barrier to significant developments in this area. If speech were incorrectly interpreted and, e.g.,
a limb was accidentally sliced off, who would be liable: the surgeon, the surgery system developers, or the speech recognition software developers?
• DOMESTIC APPLICATIONS

There is, inevitably, interest in the use of speech recognition in domestic appliances such as ovens, refrigerators, dishwashers and washing machines. One school of thought is that, as with the use of speech recognition in cars, this can reduce the number of parts and therefore the cost of production of the machine. However, removal of the normal buttons and controls would present problems for people who, for physical or learning reasons, cannot use speech recognition systems.

• WEARABLE COMPUTERS

Perhaps the most futuristic application is in the use and functionality of wearable computers, i.e. unobtrusive devices that you can wear like a watch, or that are even embedded in your clothes. These would allow people to go about their everyday lives while still storing information (thoughts, notes, to-do lists) verbally, or communicating via email, phone or videophone, through wearable devices. Crucially, this would be done without having to interact with the device, or even remember that it is there; the user would just speak, and the device would know what to do with the speech and carry out the appropriate task.

The rapid miniaturization of computing devices, the rapid rise in processing power, and advances in mobile wireless technologies are making these devices more feasible. There are still significant problems to overcome, such as background noise and the idiosyncrasies of an individual's language. However, it is speculated that reliable versions of such devices will become commercially available during this decade.