Skip Trie Matching for Real Time OCR Output Error Correction on Android Smartphones
Vladimir Kulyukin
Department of Computer Science
Utah State University
Logan, UT, USA
vladimir.firstname.lastname@example.org

Aditya Vanka
Department of Computer Science
Utah State University
Logan, UT, USA
aditya.email@example.com

ABSTRACT: Proactive nutrition management is considered by many dieticians to be a key factor in reducing cancer, diabetes, and other illnesses caused by mismanaged diets. As more individuals manage their daily activities with smartphones, they start using their smartphones as diet management tools. Unfortunately, while there are many vision-based mobile applications to process barcodes, there is a relative dearth of vision-based applications for extracting useful nutrition information items such as nutrition facts, caloric contents, and ingredients. In this paper, we present a greedy algorithm, called Skip Trie Matching (STM), for real time OCR output error correction on smartphones. The STM algorithm uses a dictionary of strings stored in a trie data structure to correct OCR errors by skipping misrecognized characters. The number of skipped characters is referred to as the skip distance. The algorithm’s worst-case performance is $O(\sigma^{d} n \log \sigma)$, where $\sigma$ is the constant size of the character alphabet over which the trie is constructed (e.g., 26 characters in the ISO basic Latin alphabet), $d$ is the skip distance, and $n$ is the length of the input string to be spellchecked. The algorithm’s performance is compared with Apache Lucene’s spell checker, a state of the art spell checker where spell checking can be done with n-gram matching or the Levenshtein edit distance. The input data for comparison tests are text strings produced by the Tesseract OCR engine on text image segments of nutrition data automatically extracted by an Android 2.3.6 smartphone application from real-time video streams of U.S. grocery product packages.
Preliminary evaluation results indicate that, while the STM algorithm is greedy in that it does not find all possible corrections of a misspelled word, it gives higher recall than Lucene’s n-gram matching or Levenshtein edit distance. The average run time of the STM algorithm is also lower than Lucene’s.

KEYWORDS: mobile computing; image processing; vision-based nutrition information extraction; nutrition management; OCR; spellchecking

1 INTRODUCTION

According to the U.S. Department of Agriculture, U.S. residents have increased their caloric intake by 523 calories per day since 1970. Mismanaged diets are estimated to account for 30-35 percent of cancer cases. Approximately 47,000,000 U.S. residents have metabolic syndrome and diabetes. Diabetes in children appears to be closely related to increasing obesity levels. Many nutritionists and dieticians consider proactive nutrition management to be a key factor in reducing and controlling cancer, diabetes, and other illnesses related to or caused by mismanaged or inadequate diets.

Numerous web sites have been developed to track caloric intake (e.g., http://nutritiondata.self.com), to determine caloric contents and quantities in consumed food (e.g., http://www.calorieking.com), and to track food intake and exercise (e.g., http://www.fitday.com). Unfortunately, many such sites either lack mobile access or, if they provide it, require manual input of nutrition data.

One smartphone sensor that may alleviate the problem of manual input is the camera. Currently, smartphone cameras are used in many mobile applications to process barcodes. There are free public online barcode databases (e.g., http://www.upcdatabase.com/) that provide some product descriptions and issuing countries’ names. Unfortunately, since product information is
provided by volunteers who are assumed to periodically upload product details and associate them with product IDs, almost no nutritional information is available and some of it may not be reliable. Some applications (e.g., http://redlaser.com) provide some nutritional information for a few popular products.

While there are many vision-based applications to process barcodes, there continues to be a relative dearth of vision-based applications for extracting other types of useful nutrition information from product packages such as nutrition facts, caloric contents, and ingredients. If successfully extracted, such information can be converted into text or SQL via scalable optical character recognition (OCR) methods and submitted as queries to cloud-based sites and services.

In general, there are two broad approaches to improving OCR quality: improved image processing and OCR engine error correction. The first approach strives to achieve better OCR results via improved image processing techniques. Unfortunately, this approach may not always be feasible, especially on mobile off-the-shelf platforms, due to processing and networking constraints on the amount of real time computation or the impracticality of changing a specific OCR engine. The second approach treats the OCR engine as a black box and attempts to improve its quality via automated error correction methods applied to its output. This approach can work with multiple OCR engines and does not add to the cost of image processing, because it does not modify the underlying image processing methods.

This paper contributes to the body of research on the second approach. We present an algorithm, called Skip Trie Matching (STM), for real time OCR output error correction on smartphones. The algorithm uses a dictionary of strings stored in a trie data structure to correct OCR errors by skipping misrecognized characters.
The number of skipped characters, called the skip distance, is the only variable input parameter of the algorithm.

The remainder of our paper is organized as follows. Section 2 presents related work. Section 3 discusses the components of the vision-based nutrition information extraction (NIE) module of the ShopMobile system [10, 11] that run prior to the STM algorithm. The material in this section is not the main focus of this paper and is presented to give the reader the broader context in which the STM algorithm is applied. Section 4 details the STM algorithm and gives its asymptotic analysis. In Section 5, the STM algorithm’s performance is compared with Apache Lucene’s n-gram matching and Levenshtein edit distance (LED). Section 6 analyzes the results and outlines several research avenues that we plan to pursue in the future.

2 RELATED WORK

Many current R&D efforts aim to utilize the power of mobile computing to improve proactive nutrition management. One study shows how to design mobile applications for supporting lifestyle changes among individuals with Type 2 diabetes and how these changes were perceived by a group of 12 patients during a 6-month period. Another presents an application that contains a picture-based diabetes diary that records physical activity and photos, taken with the phone camera, of eaten foods. The smartphone is connected to a glucometer via Bluetooth to capture blood glucose values. A web-based, password-secured, and encrypted SMS service is provided to users to send messages to their care providers to resolve daily problems and to send educational messages to users.

The nutrition fact table (NFT) localization algorithm outlined in Section 3 is based on vertical and horizontal projections used in many OCR applications. For example, projections have been used to successfully detect and recognize Arabic characters.
The text chunking algorithm, also outlined in Section 3, builds on and complements numerous mobile OCR projects that capitalize on the ever increasing processing capabilities of smartphones and their cameras. Examples include a system for OCR on mobile phones and an interactive system for text recognition and translation.

The STM algorithm detailed in Section 4 contributes to a large body of research on spell checking. Spell checking has been an active research area since the early 1960s. Spelling errors can be broadly classified as non-word errors and real-word
errors. Non-word errors are character sequences returned by the OCR engine that are not in the spell checker’s dictionary. For example, ‘polassium’ is a non-word error if the spell checker’s dictionary contains only ‘potassium.’ Real-word errors occur when input words are misrecognized as correctly spelled words. For example, if the OCR engine recognizes ‘nutrition facts’ as ‘nutrition fats,’ the string ‘fats’ is a real-word error. Real-word error correction is beyond the scope of this paper. It is considered by many researchers to be a much harder problem than non-word error correction, one that may require natural language processing techniques.

Two well-known approaches that handle non-word errors are n-grams and edit distances [2, 3]. The n-gram approach breaks dictionary words into sub-sequences of characters of length n, i.e., n-grams, where n is typically set to 1, 2, or 3. A table is computed with the statistics of n-gram occurrences. When a word is checked for spelling, its n-grams are computed and, if an n-gram is not found, spelling correction is applied. Spelling correction methods use various similarity criteria between the n-grams of a dictionary word and those of the misspelled word.

The term edit distance denotes the minimum number of edit operations such as insertions, deletions, and substitutions required to transform one string into another. Two popular edit distances are the Levenshtein distance and the Damerau-Levenshtein distance. The Levenshtein distance (LED) is the cost of a minimum cost sequence of single-character replacement, deletion, and insertion operations required to transform a source string into a target string. The Damerau-Levenshtein distance (DLED) extends the LED’s set of operations with the operation of swapping adjacent characters, called transposition.
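The LED defined above is usually computed with dynamic programming. The following is a minimal Python sketch of that standard technique, not the Lucene implementation evaluated later; the function name `levenshtein` and the unit cost assigned to every operation are assumptions made for illustration.

```python
def levenshtein(source: str, target: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions that turn source into target (unit costs assumed)."""
    # prev[j] is the distance between the already processed prefix of
    # source and target[:j]; only two rows of the table are kept.
    prev = list(range(len(target) + 1))
    for i, s_ch in enumerate(source, start=1):
        curr = [i]  # turning source[:i] into the empty string: i deletions
        for j, t_ch in enumerate(target, start=1):
            cost = 0 if s_ch == t_ch else 1
            curr.append(min(prev[j] + 1,           # delete s_ch
                            curr[j - 1] + 1,       # insert t_ch
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


# OCR-style non-word errors taken from the text.
print(levenshtein("polassium", "potassium"))   # 't' misrecognized as 'l'
print(levenshtein("potassillm", "potassium"))  # 'u' misrecognized as 'll'
```

The two nested loops make the cost proportional to the product of the two string lengths, which is the quadratic behavior the later sections contrast with trie-based matching.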
The DLED is not as widely used in OCR as the LED, because transposition errors are mostly typing errors, whereas most OCR errors are caused by misrecognizing characters with similar graphic features (e.g., ‘t’ misrecognized as ‘l’ or ‘u’ as ‘ll’). Another frequent OCR error type, which does not yield to transposition correction, is the omission of a character from a word due to the OCR engine’s failure to recognize it.

3 NFT LOCALIZATION AND TEXT SEGMENTATION

The objective of this section is to give the reader a better appreciation of the broader context in which the proposed STM algorithm is applied by outlining the nutrition information extraction (NIE) module that runs prior to the STM algorithm. The NIE module is a module of ShopMobile, a mobile vision-based nutrition management system for smartphone users currently under development at the Utah State University (USU) Computer Science Assistive Technology Laboratory (CSATL) [10, 11, 19]. The system will enable smartphone users, both sighted and visually impaired, to specify their dietary profiles securely on the web or in the cloud. When they go shopping, they will use their smartphones’ cameras to extract nutrition information from product packages. The extracted information is not limited to barcodes but also includes nutrition facts such as calories, saturated fat, sugar content, cholesterol, sodium, potassium, carbohydrates, protein, and ingredients.

3.1 Vertical and Horizontal Projections

Images captured from the smartphone’s video stream are divided into foreground and background pixels. Foreground pixels are content-bearing units where content is defined in a domain-dependent manner, e.g., black pixels, white pixels, pixels with specific luminosity levels, specific neighborhood connectivity patterns, etc. Background pixels are those that are not foreground. The horizontal projection of an image (HP) is a sequence of foreground pixel counts for each row in the image.
The vertical projection of an image (VP) is a sequence of foreground pixel counts for each column in the image. Figure 1 shows the horizontal and vertical projections of a black and white image of three characters ‘ABC’.
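The two projections can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the image is a list of rows of grayscale values, foreground pixels are black (value 0), and the helper name `projections` is hypothetical rather than part of the ShopMobile code.

```python
def projections(image, foreground=0):
    """Return (hp, vp): foreground pixel counts per row and per column.

    image is a list of equally long rows of pixel values; a pixel counts
    as foreground when it equals `foreground` (black by assumption).
    """
    rows, cols = len(image), len(image[0])
    hp = [sum(1 for x in range(cols) if image[y][x] == foreground)
          for y in range(rows)]
    vp = [sum(1 for y in range(rows) if image[y][x] == foreground)
          for x in range(cols)]
    return hp, vp


# A 4 x 5 toy image with a black 'L' shape on a white background.
W, B = 255, 0
img = [
    [W, B, W, W, W],
    [W, B, W, W, W],
    [W, B, B, B, W],
    [W, W, W, W, W],
]
hp, vp = projections(img)
print(hp)  # [1, 1, 3, 0]
print(vp)  # [0, 3, 1, 1, 0]
```

Rows and columns that contain no foreground pixels produce zeros, which is what makes projections useful for finding the extent of a table or a text line.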
Figure 1. Horizontal & Vertical Projections.

Suppose there is an $m \times n$ image $I$ where the foreground pixels are black, i.e., $I(x, y) = 0$, and the background pixels are white, i.e., $I(x, y) = 255$. Then the horizontal projection of row $y$ and the vertical projection of column $x$ can be defined as $f(y)$ and $g(x)$, respectively:

$$f(y) = \sum_{x=0}^{n-1} \bigl(255 - I(x, y)\bigr); \qquad g(x) = \sum_{y=0}^{m-1} \bigl(255 - I(x, y)\bigr). \qquad (1)$$

For the discussion that follows it is important to keep in mind that the vertical projections are used for detecting the vertical boundaries of NFTs while the horizontal projections are used in computing the NFTs’ horizontal boundaries.

3.2 Horizontal Line Filtering

In detecting NFT boundaries, three assumptions are made: 1) an NFT is present in the image; 2) the NFT is not cropped; and 3) the NFT is horizontally or vertically aligned. Figure 2 shows horizontally and vertically aligned NFTs. The detection of NFT boundaries proceeds in three stages. Firstly, the first approximation of the vertical boundaries is computed. Secondly, the vertical boundaries are extended left and right. Thirdly, the upper and lower horizontal boundaries are computed.

The objective of the first stage is to detect the approximate location of the NFT along the horizontal axis, $(x_s, x_e)$. This approximation starts with the detection of horizontal lines in the image, which is accomplished with a horizontal line detection kernel (HLDK) described in our previous publications. It should be noted that other line detection techniques (e.g., the Hough transform) can be used for this purpose. Our HLDK is designed to detect large horizontal lines in images to maximize computational efficiency. On rotated images, the kernel is used to detect vertical lines. The left image of Figure 3 gives the output of running the HLDK filter on the left image of Figure 2.

Figure 2. Vertically & Horizontally Aligned Tables.

Figure 3. HLFI of Fig. 2 (left); its VP (right).

3.3 Detection of Vertical Boundaries

Let HLFI be a horizontally line filtered image, i.e., the image put through the HLDK filter or some other line detection filter.
Let $VP_{HLFI}$ be the vertical projection of the white pixels in each column of the HLFI. The right image in Figure 3 shows the vertical projection of the HLFI on the left. Let $\theta_{VP}$ be a threshold, which in our application is set to the mean count of the white foreground pixels in columns. In Figure 3 (right), $\theta_{VP}$ is shown by a red line. The foreground pixel counts in the columns of the image region with the NFT are greater than the threshold. The NFT vertical boundaries are then computed as:
$$x_l = \min\{c \mid g(c) > \theta_{VP}\}; \qquad x_r = \max\{c \mid g(c) > \theta_{VP} \;\&\; x_l < c\}. \qquad (2)$$

The left and right boundaries detected by (2) may be too close to each other, where ‘too close’ is defined in terms of the percentage of the image width covered by the distance between the right and left boundaries. If the boundaries are found to be too close to each other, the left boundary is extended to the first column left of the current left boundary for which the vertical projection is at or above the threshold, whereas the right boundary is extended to the first column right of the current right boundary for which the vertical projection is at or above the threshold. Figure 4 (left) shows the initial vertical boundaries extended left and right.

Figure 4. VB Extension (left); HP of Left HLFI in Fig. 2 (right).

3.4 Detection of Horizontal Boundaries

The NFT horizontal boundary computation is confined to the image region vertically bounded by $(x_l, x_r)$. Let $HP_{HLFI}$ be the horizontal projection of the HLFI in Figure 3 (left) and let $\theta_{HP}$ be a threshold, which in our application is set to the mean count of the foreground pixels in rows, i.e., $\theta_{HP} = \mathrm{mean}\{f(y) \mid f(y) > 0\}$. Figure 4 (right) shows the horizontal projection of the HLFI in Figure 3 (left). The red line shows $\theta_{HP}$.

The NFT’s horizontal boundaries are computed in a manner similar to the computation of its vertical boundaries, with one exception: they are not extended after the first approximation is computed, because the horizontal boundaries do not have as much impact on the subsequent OCR of segmented text chunks as the vertical boundaries do. The horizontal boundaries are computed as:

$$r_u = \min\{r \mid f(r) > \theta_{HP}\}; \qquad r_l = \max\{r \mid f(r) > \theta_{HP} \;\&\; r_u < r\}. \qquad (3)$$

Figure 5 (left) shows the nutrition table localized via vertical and horizontal projections and segmented from the image in Figure 2 (left).

3.5 Text Chunking

A typical NFT includes text chunks with various caloric and ingredient information, e.g., “Total Fat 2g 3%.” As can be seen in Figure 5 (left), text chunks are separated by black separator lines. These text chunks are segmented from localized NFTs.
This segmentation is referred to as text chunking.

Figure 5. Localized NFT (left); Text Chunks (right).

Text chunking starts with the detection of the separator lines. Let $N$ be a binarized image with a segmented NFT and let $p_i$ denote the probability of image row $i$ containing a black separator. If such probabilities are reliably computed, text chunks can be localized. Toward that end, let $l_j$ be the length of the $j$-th consecutive run of black pixels in row $i$
above a length threshold $\theta_l$. If $m$ is the total number of such runs, then $p_i$ is computed as the geometric mean of $l_0, l_1, \ldots, l_m$. The geometric mean is used because it is less dominated by a few very long runs than the arithmetic mean and is therefore more indicative of the central tendency of the run lengths. If $\mu$ is the mean value of all positive values of $p_i$ normalized by the maximum value of $p_i$ for the entire image, the start and end coordinates, $y_s$ and $y_e$, respectively, of every separator along the y axis can be computed by detecting consecutive rows for which the normalized values are above the threshold:

$$y_s = \{i \mid p_i > \mu \;\&\; p_{i-1} \le \mu\}; \qquad y_e = \{j \mid p_j > \mu \;\&\; p_{j+1} \le \mu\}. \qquad (4)$$

Once these coordinates are identified, the text chunks can be segmented from the image. As can be seen from Figure 5 (right), some text chunks contain single text lines while others have multiple text lines. The actual OCR in the ShopMobile system takes place on text chunk images such as those shown in Figure 5 (right).

4 SKIP TRIE MATCHING

The trie data structure has gained popularity on mobile platforms due to its space efficiency compared to the standard hash table and its efficient worst-case lookup time, $O(n)$, where $n$ is the length of the input string, not the number of entries in the data structure. This performance compares favorably to the hash table, which spends the same amount of time on computing the hash code but requires significantly more storage space.

On most mobile platforms the trie data structure has been used for word completion. The STM algorithm is based on the observation that the trie’s efficient storage of strings can be used in finding the closest dictionary matches to misspelled words. The STM algorithm uses the trie data structure to represent the target dictionary. The only parameter that controls the algorithm’s behavior is the skip distance, a non-negative integer that defines the maximum number of misrecognized (misspelled) characters allowed in a misspelled word. In the current implementation, misspelled words are produced by the Tesseract OCR engine.
However, the algorithm generalizes to other domains where spelling errors must be corrected.

Let us begin with a step-by-step example of how the STM algorithm works. Consider the trie dictionary in Figure 6 that (moving left to right) consists of ‘ABOUT,’ ‘ACID,’ ‘ACORN,’ ‘BAA,’ ‘BAB,’ ‘BAG,’ ‘BE,’ ‘OIL,’ and ‘ZINC.’ The small balloons at character nodes are Boolean flags that signal word ends. When a node’s word end flag is true, the path from the root to the node is a word. The children of each node are lexicographically sorted so that finding a child character of a node is $O(\log n)$, where $n$ is the number of the node’s children.

Suppose that the skip distance is set to 1 and the OCR engine misrecognizes ‘ACID’ as ‘ACIR.’ The STM starts at the root node, as shown in Figure 7. The algorithm checks if the first character of the input string matches any of the root’s children. If no match is found and the skip distance is greater than 0, the skip distance is decremented by 1 and recursive calls are made for each of the root’s children. In this case, ‘A’ in the input matches the root’s ‘A’ child. Since the match is successful, a recursive call is made on the remainder of the input, ‘CIR,’ with the root node’s ‘A’ child at Level 1 as the current node, as shown in Figure 8.

Figure 6. Simple Trie Dictionary.
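The dictionary described above can be sketched as follows. This is an illustrative Python layout, not the authors’ Android implementation: the class name `TrieNode`, its field names, and the helpers `build_trie` and `contains` are assumptions. Children are kept in sorted order so that a child is located by binary search (Python’s `bisect`), as the text describes.

```python
from bisect import bisect_left


class TrieNode:
    def __init__(self, char=""):
        self.char = char
        self.word_end = False   # True if the path from the root is a word
        self.keys = []          # child characters, kept sorted
        self.children = []      # child nodes, parallel to keys

    def find_child(self, ch):
        """Binary search for the child labeled ch; None if absent."""
        i = bisect_left(self.keys, ch)
        if i < len(self.keys) and self.keys[i] == ch:
            return self.children[i]
        return None

    def add_child(self, ch):
        """Return the child labeled ch, creating it in sorted position."""
        i = bisect_left(self.keys, ch)
        if i < len(self.keys) and self.keys[i] == ch:
            return self.children[i]
        node = TrieNode(ch)
        self.keys.insert(i, ch)
        self.children.insert(i, node)
        return node


def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.add_child(ch)
        node.word_end = True
    return root


def contains(root, word):
    """Exact lookup: one binary search per input character."""
    node = root
    for ch in word:
        node = node.find_child(ch)
        if node is None:
            return False
    return node.word_end


WORDS = ["ABOUT", "ACID", "ACORN", "BAA", "BAB", "BAG", "BE", "OIL", "ZINC"]
root = build_trie(WORDS)
print(root.keys)                 # ['A', 'B', 'O', 'Z']
print(contains(root, "ACID"))    # True
print(contains(root, "ACIR"))    # False: 'R' is not a child of 'I'
print(contains(root, "BA"))      # False: 'BA' is a prefix, not a word
```

Keeping a parallel list of child characters avoids rebuilding the search keys on every lookup; a per-node hash map would also work but gives up the sorted order used by the binary search.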
The algorithm next successfully matches ‘C’ of the truncated input ‘CIR’ with the right child of the current node ‘A’ at Level 1, truncates the input to ‘IR,’ and recurses at the node ‘C’ at Level 2, i.e., the right child of the node ‘A’ at Level 1. The skip distance is still 1, because no mismatched characters have been skipped so far.

Figure 7. STM of ACIR with Skip Distance of 1.

At the node ‘C’ at Level 2, the character ‘I’ of the truncated input ‘IR’ is matched with the left child of the node ‘C,’ the input is then truncated to ‘R,’ and the algorithm recurses at the node ‘I’ at Level 3, as shown in Figure 8.

Figure 8. Recursive Call from Node A at Level 1.

The last character of the input string, ‘R,’ is next matched with the children of the current node ‘I’ at Level 3 (see Figure 9). The binary search on the node’s children fails. However, since the skip distance is 1, i.e., one more character can be skipped, the skip distance is decremented by 1 and the mismatched ‘R’ is skipped by moving to the child ‘D’ of the current node. Since there are no more characters in the input string after the mismatched ‘R,’ the algorithm checks if the node reached has its word end flag set to true. In this case, it is true, because the word end flag at the node ‘D’ at Level 4 is set to true. Thus, the matched word, ‘ACID,’ is added to the returned list of suggestions. When the skip distance is 0 and the current node is not a word end, the algorithm fails.

Figure 9. Recursive call at node I at Level 3.

Figure 11 gives the pseudocode of the STM algorithm. The dot operator is used with the Java/C++ semantics to access member variables of an object. The parameter inStr holds a possibly misspelled string object. The parameter d is the skip distance. The parameter cn (stands for “current node”) is a trie node object initially referencing the trie’s root. The parameter stn (stands for “suggestion”) is a string that holds the sequence of characters from the trie’s root to cn. Found suggestions are placed into the array stnList (stands for “suggestion list”).
When the algorithm finishes, stnList contains the corrections of inStr that the algorithm has found.
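The walkthrough above can be turned into a short runnable sketch. The Python below follows the behavior described in the ‘ACIR’ example rather than transcribing the pseudocode line by line; the names `TrieNode`, `build_trie`, and `stm`, and the choice to return the suggestion list instead of filling a global array, are assumptions made for illustration.

```python
from bisect import bisect_left


class TrieNode:
    def __init__(self, char=""):
        self.char = char
        self.word_end = False
        self.keys = []       # sorted child characters
        self.children = []   # child nodes, parallel to keys

    def child(self, ch, create=False):
        """Binary search for child ch; optionally create it in order."""
        i = bisect_left(self.keys, ch)
        if i < len(self.keys) and self.keys[i] == ch:
            return self.children[i]
        if not create:
            return None
        node = TrieNode(ch)
        self.keys.insert(i, ch)
        self.children.insert(i, node)
        return node


def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.child(ch, create=True)
        node.word_end = True
    return root


def stm(in_str, d, cn, stn="", stn_list=None):
    """Skip trie matching sketch: d is the remaining skip distance,
    cn the current node, stn the suggestion built so far."""
    if stn_list is None:
        stn_list = []
    if not in_str:
        # Input consumed: stn has the same length as the original input.
        if cn.word_end:
            stn_list.append(stn)
        return stn_list
    nn = cn.child(in_str[0])
    if nn is not None:
        # Greedy step: on a match, no other child is explored.
        stm(in_str[1:], d, nn, stn + nn.char, stn_list)
    elif d > 0:
        # Mismatch: skip the input character and try every child.
        for c in cn.children:
            stm(in_str[1:], d - 1, c, stn + c.char, stn_list)
    return stn_list


root = build_trie(["ABOUT", "ACID", "ACORN", "BAA", "BAB", "BAG",
                   "BE", "OIL", "ZINC"])
print(stm("ACIR", 1, root))  # ['ACID']
print(stm("ACIR", 0, root))  # []
print(stm("BAZ", 1, root))   # ['BAA', 'BAB', 'BAG']
print(stm("XINC", 1, root))  # ['ZINC']
```

Because each recursive call consumes exactly one input character, every suggestion has the same length as the input, and because a successful match suppresses the exploration of sibling nodes, the search is greedy.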
Figure 10. End Position Errors along X-axis.

Let len(inStr) = $n$ and let $d$ be the skip distance. The largest branching factor of a trie node is $\sigma$, i.e., the size of the alphabet over which the trie is built. If inStr is in the trie, the binary search on line 14 runs exactly once for each of the $n$ characters, which gives us $O(n \log \sigma)$. If inStr is not in the trie, it is allowed to contain at most $d$ character mismatches. Thus, there are $n - d$ matches and $d$ mismatches. All matches run in $O((n - d) \log \sigma)$. In the worst case, for each mismatch, lines 18-19 ensure that $\sigma^{d}$ nodes are inspected, which gives us the run time of $O(\sigma^{d} (n - d) \log \sigma) = O(\sigma^{d} n \log \sigma)$. The worst case rarely occurs in practice because in a trie built for a natural language most nodes have branching factors equal to a small fraction of $\sigma$.

To analyze the best case, which occurs when the first $d$ characters are misspelled, let $b \geq 1$ be the branching factor. Since line 20 in Figure 11 ensures that $d$ is decremented by 1 after every mismatch, the total number of nodes examined by the algorithm is $b^{d}$. If $d = k \in \mathbb{N}$, as is the case in practice since the skip distance is set to a small positive integer, then $O(\sigma^{d} n \log \sigma) = O(n \log \sigma)$. Since $\sigma^{d}$ and $\log \sigma$ are small constants for most natural languages with written scripts, this asymptotic analysis indicates that the STM algorithm is likely to run faster than the quadratic edit distance algorithms and to be on par with n-gram algorithms.

Two current limitations of the STM algorithm, as is evident from Figure 11, are: 1) it finds only spelling suggestions that are of the same length as inStr; and 2) it is incomplete, because it does not find all possible corrections of a misspelled word due to its greediness. In Section 6, we will discuss how these limitations can be addressed.

5 EXPERIMENTS

5.1 Tesseract vs. GOCR

Tesseract was chosen after a preliminary comparison with GOCR, another open source OCR engine developed under the GNU public license. In Tesseract, OCR consists of segmentation and recognition.
During segmentation, a connected component analysis first identifies blobs and segments them into text lines, which are analyzed for proportional or fixed pitch text. The lines are then broken into individual words via the spacing between individual characters. Finally, individual character cells are identified. During recognition, an adaptive classifier recognizes both word blobs and character cells. The classifier is adaptive in that it can be trained on various text corpora. GOCR preprocesses images via box detection, zoning, and line detection. OCR is done on boxes, zones, and lines via pixel pattern analysis.

The experimental comparison of the two OCR engines was guided by speed and accuracy. An Android 2.3.6 application was developed and tested on two hundred images of NFT text chunks, some of which are shown in Figure 5 (right). Each image was processed with both Tesseract and GOCR, and the processing times were logged. The images were read from the sdcard one by one. The image read time was not included in the run time total.

The application was designed and developed to operate in two modes: device and server. In the device mode, everything was computed on the smartphone. In the server mode, the HTTP protocol was used over Wi-Fi to send the images from the device to an Apache web server running on Ubuntu
12.04. Images sent from the Android application were handled by a PHP script that initiated OCR on the server, captured the extracted text, and sent it back to the application. Returned text messages were saved on the smartphone’s sdcard and categorized by a human judge.

     1. wordToMatch = copy(inStr)
     2. stnList = []
     3. STM(inStr, d, cn, stn):
     4.   IF len(inStr) == 0 || cn == NULL: fail
     5.   IF len(inStr) == 1:
     6.     IF inStr[0] == cn.char || d > 0:
     7.       add cn.char to stn
     8.       IF len(stn) == len(wordToMatch) &&
     9.          cn.wordEnd == True:
    10.         add stn to stnList
    11.   ELSE IF len(inStr) > 1:
    12.     IF inStr[0] == cn.char:
    13.       add cn.char to stn
    14.       nn = binSearch(inStr[1], cn.children)
    15.       IF nn != NULL:
    16.         STM(rest(inStr), d, nn, stn)
    17.     ELSE IF d > 0:
    18.       FOR each node C in cn.children:
    19.         add C.char to stn
    20.         STM(rest(inStr), d-1, C, stn)
    21.     ELSE fail

Figure 11. STM Algorithm.

Three categories were used to describe the degree of recognition: complete, partial, and garbled. Complete recognition denoted images where the text recognized with OCR was identical to the text in the image. For partial recognition, at least one character in the returned text had to be missing or inaccurately substituted. For garbled recognition, either empty text was returned or all characters in the returned text were misrecognized. Table 1 gives the results of the OCR engine text recognition comparison. The abbreviations TD, GD, TS, and GS stand for ‘Tesseract Device,’ ‘GOCR Device,’ ‘Tesseract Server,’ and ‘GOCR Server,’ respectively. Each cell contains the exact number of images out of 200 in a specific category.

Table 1. Tesseract vs. GOCR.

          Complete   Partial   Garbled
    TD    146        36        18
    GD    42         23        135
    TS    158        23        19
    GS    58         56        90

The numbers in Table 1 indicate that the recognition rates of Tesseract are higher than those of GOCR. Tesseract also compares favorably with GOCR in the garbled category, where its numbers are lower than those of GOCR.

To evaluate the runtime performance of both engines, we ran the application in both modes on the same sample of 200 images five times.
Table 2 tabulates the rounded processing times and averages in seconds. The numbers 1 through 5 indicate the run number. The AVG column contains the average time of the five runs. The AVG/Img column contains the average time per individual image.

Table 2. Run Times (in secs) of Tesseract & GOCR.

          1     2     3     4     5     AVG   AVG/Img
    TD    128   101   101   110   103   110   0.5
    GD    50    47    49    52    48    49    0.3
    TS    40    38    38    10    39    38    0.2
    GS    21    21    20    21    21    21    0.1

Table 2 shows no significant variation among the processing times of individual runs. The AVG column indicates that Tesseract is slower than GOCR. The difference in run times can be attributed to the amount of text recognized by each engine. Since GOCR extracts less information than Tesseract, as indicated in Table 1, it runs faster. Tesseract, on the other hand, extracts much more accurate information from most images, which causes it to take more time. While the combined run times of Tesseract were slower than those of GOCR, the average run time per frame, shown in the AVG/Img column, was still under one second.

Tesseract’s higher recognition rates swayed our final decision in its favor. Additionally, as Table 2 shows, when the OCR was done on the server, Tesseract’s run times, while still slower than GOCR’s, were acceptable to us. This performance can be further optimized through server-side servlets and faster image transmission channels (e.g., 4G instead of Wi-Fi).

5.2 STM vs. Apache Lucene’s N-Grams & LED

After Tesseract was chosen as the base OCR engine, the performance of the STM algorithm was
compared with the Apache Lucene implementation of n-gram matching and the LED. Each of the three algorithms worked on the text strings produced by the Tesseract OCR engine running on Android 2.3.6 and Android 4.2 smartphones on a collection of 600 text segment images produced by the algorithm described in Section 3.

The Lucene n-gram matching and edit distance algorithms find possible correct spellings of misspelled words. The n-gram distance algorithm slices a misspelled word into chunks of size n (n defaults to 2) and compares them to a dictionary of n-grams, each of which points to the words in which it is found. A word with the largest number of matched n-grams is a possible spelling of the misspelled word. If m is the number of n-grams found in a misspelled word and K is the number of n-grams in the dictionary, the time complexity of the n-gram matching is $O(Km)$. Since the standard LED uses dynamic programming, its time complexity is $O(n^2)$, where n is the maximum of the lengths of the two strings being matched. If there are m entries in a dictionary, the run time of the LED algorithm is $O(mn^2)$.

In comparing the performance of the STM algorithm with the n-gram matching and the LED, time and accuracy were used as performance metrics. The average run times taken by the STM algorithm, the n-gram matching, and the LED are 20 ms, 50 ms, and 51 ms, respectively. The performance of each algorithm was evaluated on non-word error correction with the recall measure computed as the ratio of the number of corrected words to the number of misspelled words. The recall coefficients of the STM, the n-gram matching, and the LED were 0.15, 0.085, and 0.078, respectively. Table 3 gives the run times and recalls.

Table 3. Performance Comparison (run time in milliseconds).

                STM    N-Gram   LED
    Run Time    25     51       51
    Recall      15%    9%       8%

6 DISCUSSION

In this paper, we presented an algorithm, called Skip Trie Matching (STM), for real time OCR output error correction on smartphones and compared its performance with the n-gram matching and the LED.
The experiments on the sample of over 600 texts extracted by the Tesseract OCR engine from text images of NFTs show that the STM algorithm ran faster, which is predicted by the asymptotic analysis, and corrected more words than the n-gram matching and the LED.

One limitation of the STM algorithm is that it can correct a misspelled word only if there is a target word of the same length in the trie dictionary. One way to address this limitation is to remove lines 8-10 in the STM algorithm given in Figure 11. A primary reason for these lines is to keep the number of suggestions smaller, which improves run-time performance.

Another approach, which may be more promising in the long run, is example-driven human-assisted learning. Each OCR engine working in a given domain is likely to misrecognize the same characters consistently. For example, we have noticed that Tesseract consistently misrecognized ‘u’ as ‘ll’ or ‘b’ as ‘8.’ Such examples can be classified by a human user as misspellings that the system can add to a dictionary of regular misspellings. In a subsequent run, when the Tesseract OCR engine misrecognizes ‘potassium’ as ‘potassillm,’ the STM algorithm, when failing to find a possible suggestion for the original input, will replace all regular misspellings in the input with their corrections. Thus, ‘ll’ in ‘potassillm’ will be replaced with ‘u’ to obtain ‘potassium’ and get a successful match.

Another limitation of the STM algorithm is that it is incomplete due to its greediness. The algorithm does not find all possible corrected spellings of a misspelled word.
To find all possible corrections, the algorithm can be modified to examine all nodes even when it finds a successful match on line 12. Although it will likely take a toll on the algorithm’s run time, the asymptotic worst-case run time bound remains the same.

An experimental contribution of the research presented in this paper is a comparison of Tesseract and GOCR, two popular open source OCR engines, for vision-based nutrition information extraction. GOCR appears to extract less information than Tesseract but has faster run times. While the run times of Tesseract were slower, its higher
recognition rates swayed our final decision in its favor. The run-time performance, as indicated in Tables 1 and 2, can be further optimized through server-side servlets and faster image transmission channels (e.g., 4G instead of Wi-Fi).

Our ultimate objective is to build a mobile system that can not only extract nutrition information from product packages but also match the extracted information to the users’ dietary profiles and make dietary recommendations to effect behavior changes. For example, if a user is pre-diabetic, the system will estimate the amount of sugar from the extracted ingredients and will make specific recommendations to the user. The system, if the users so choose, will keep track of their long-term buying patterns and make recommendations on a daily, weekly or monthly basis. For example, if a user exceeds his or her total amount of saturated fat permissible for the specified profile, the system will notify the user and, if the user’s profile has appropriate permissions, the user’s dietician.

7 ACKNOWLEDGMENT

This project has been supported, in part, by the MDSC Corporation (www.mdsc.com). We would like to thank Dr. Stephen Clyde, MDSC President, for supporting our research and championing our cause. We are grateful to Dr. Haitao Wang for his comments on several drafts of this paper.

8 REFERENCES

Apache Lucene, http://lucene.apache.org/core/, Retrieved 04/15/2013.

Jurafsky, D., Martin, J.: Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, Upper Saddle River, New Jersey, ISBN: 0130950696 (2000).

Levenshtein, V.: Binary Codes Capable of Correcting Deletions, Insertions, and Reversals. Doklady Akademii Nauk SSSR, 163(4):845-848, 1965 (Russian). English translation in Soviet Physics Doklady, 10(8):707-710 (1966).

Tesseract Optical Character Recognition Engine, http://code.google.com/p/tesseract-ocr/, Retrieved 04/15/2013.

Anding, R.: Nutrition Made Clear. The Great Courses, Chantilly, VA (2009).

Kane, S., Bigham, J., Wobbrock, J.: Slide Rule: Making Mobile Touch Screens Accessible to Blind People using Multi-Touch Interaction Techniques. In Proceedings of the 10-th Conference on Computers and Accessibility (ASSETS 2008), October, Halifax, Nova Scotia, Canada, 73-80 (2008).

Kulyukin, V., Crandall, W., and Coster, D.: Efficiency or Quality of Experience: A Laboratory Study of Three Eyes-Free Touchscreen Menu Browsing User Interfaces for Mobile Phones. The Open Rehabilitation Journal, Vol. 4, 13-22, DOI: 10.2174/1874943701104010013 (2011).

K-NFB Reader for Nokia Mobile Phones, www.knfbreader.com, Retrieved 03/10/2013.

Al-Yousefi, H., and Udpa, S.: Recognition of Arabic Characters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14, 8, 853-857 (1992).

Kulyukin, V., Zaman, T., Andhavarapu, A., and Kutiyanawala, A.: Eyesight Sharing in Blind Grocery Shopping: Remote P2P Caregiving through Cloud Computing. In Proceedings of the 13-th International Conference on Computers Helping People with Special Needs (ICCHP 2012), K. Miesenberger et al. (Eds.), Part II, Springer Lecture Notes on Computer Science (LNCS) 7383, pp. 75-82, July 11-13, 2012, Linz, Austria; DOI: 10.1007/978-3-642-31522-0; ISBN 978-3-642-31521-3; ISSN 0302-9743 (2012).

Kulyukin, V. and Kutiyanawala, A.: Accessible Shopping Systems for Blind and Visually Impaired Individuals: Design Requirements and the State of the Art. The Open Rehabilitation Journal, ISSN: 1874-9437, Volume 2, 158-168, DOI: 10.2174/1874943701003010158 (2010).

Årsand, E., Tatara, N., Østengen, G., and Hartvigsen, G.: Mobile Phone-Based Self-Management Tools for Type 2 Diabetes: The Few Touch Application. Journal of Diabetes Science and Technology, 4, 2, 328-336 (2010).

Frøisland, D.H., Arsand, E., and Skårderud, F.: Improving Diabetes Care for Young People with Type 1 Diabetes through Visual Learning on Mobile Phones: Mixed-methods Study. J. Med. Internet Res. 6, 14(4), published online; DOI: 10.2196/jmir.2155 (2012).

Bae, K. S., Kim, K. K., Chung, Y. G., and Yu, W. P.: Character Recognition System for Cellular Phone with Camera. In Proceedings of the 29th Annual International Computer Software and Applications Conference, Volume 01, COMPSAC 05, IEEE Computer Society, Washington, DC, USA, 539-544 (2005).

Hsueh, M.: Interactive Text Recognition and Translation on a Mobile Device. Master’s thesis, EECS Department, University of California, Berkeley (2011).

Kukich, K.: Techniques for Automatically Correcting Words in Text. ACM Computing Surveys (CSUR), v. 24, n. 4, 377-439 (1992).

Kai, N.: Unsupervised Post-Correction of OCR Errors. Master’s Thesis, Leibniz Universität Hannover, Germany (2010).

Damerau, F.J.: A Technique for Computer Detection and Correction of Spelling Errors. Communications of the ACM, 7(3):171-176 (1964).

Kulyukin, V., Kutiyanawala, A., and Zaman, T.: Eyes-Free Barcode Detection on Smartphones with Niblack’s Binarization and Support Vector Machines. In Proceedings of the 16-th International Conference on Image Processing, Computer Vision, and Pattern Recognition, Vol. 1, CSREA Press, 284-290, ISBN: 1-60132-223-2, 1-60132-224-0 (2012).

Duda, R. O. and Hart, P. E.: Use of the Hough Transformation to Detect Lines and Curves in Pictures. Comm. ACM, Vol. 15, 11-15 (1972).

GOCR - A Free Optical Character Recognition Program, http://jocr.sourceforge.net/, Retrieved 04/15/2013.