The 2014 International Conference on Image Processing, Computer Vision, & Pattern Recognition (IPCV 2014)

An Algorithm for Mobile Vision-Based Localization of Skewed Nutrition Labels that Maximizes Specificity

Vladimir Kulyukin, Department of Computer Science, Utah State University, Logan, UT, USA, vladimir.kulyukin@usu.edu
Christopher Blay, Department of Computer Science, Utah State University, Logan, UT, USA, chris.b.blay@gmail.com

Abstract—An algorithm is presented for mobile vision-based localization of skewed nutrition labels on grocery packages that maximizes specificity, i.e., the percentage of true negative matches out of all possible negative matches. The algorithm works on frames captured from the smartphone camera's video stream and localizes nutrition labels skewed up to 35-40 degrees in either direction from the vertical axis of the captured frame. The algorithm uses three image processing methods: edge detection, line detection, and corner detection. The algorithm targets medium- to high-end mobile devices with single or quad-core ARM systems. Since cameras on these devices capture several frames per second, the algorithm is designed to minimize false positives rather than maximize true ones, because, at such frequent frame capture rates, it is far more important for the overall performance to minimize the processing time per frame. The algorithm is implemented on the Google Nexus 7 Android 4.3 smartphone. Evaluation was done on 378 frames, of which 266 contained NLs and 112 did not. The algorithm's performance, current limitations, and possible improvements are analyzed and discussed.

Keywords—computer vision; nutrition label localization; mobile computing; text spotting; nutrition management

I. Introduction

Many nutritionists and dieticians consider proactive nutrition management to be a key factor in reducing and controlling cancer, diabetes, and other illnesses related to or caused by mismanaged or inadequate diets. According to the U.S. Department of Agriculture, U.S. residents have increased their caloric intake by 523 calories per day since 1970. Mismanaged diets are estimated to account for 30-35 percent of cancer cases [1]. A leading cause of mortality in men is prostate cancer; a leading cause of mortality in women is breast cancer. Approximately 47,000,000 U.S. residents have metabolic syndrome and diabetes. Diabetes in children appears to be closely related to increasing obesity levels. The current prevalence of diabetes in the world is estimated to be 2.8 percent [2], and it is expected to reach 4.4 percent by 2030. Some long-term complications of diabetes are blindness, kidney failure, and amputations.

Nutrition labels (NLs) remain the main source of nutritional information on product packages [3, 4]. Therefore, enabling customers to use computer vision on their smartphones will likely result in greater consumer awareness of the caloric and nutritional content of purchased grocery products.

Figure 1. Skewed NL with vertical axis

In our previous research, we developed a vision-based localization algorithm for horizontally or vertically aligned NLs on smartphones [5]. The new algorithm, presented in this paper, improves on its predecessor in that it handles not only aligned NLs but also those that are skewed up to 35-40 degrees from the vertical axis of the captured frame. Figure 1 shows an example of such a skewed NL with the vertical axis of the captured frame denoted by a white line.
Another improvement designed and implemented in the new algorithm is the rapid detection of the presence of an NL in each frame, which improves the run time, because the new algorithm fails fast and proceeds to the next frame from the video stream. The new algorithm targets medium- to high-end mobile devices with single or quad-core ARM systems. Since cameras on these devices capture several frames per second, the algorithm is designed to minimize false positives rather than maximize true ones, because, at such frequent frame capture rates, it is far more important to minimize the processing time per frame.
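The fail-fast control flow can be summarized in a short sketch. This is an illustrative reconstruction, not the authors' code: the helper names detect_lines, looks_like_nl, and localize_nl are hypothetical placeholders for the pipeline stages described in Section III.

```python
# A minimal sketch of the fail-fast frame loop described above.
# detect_lines, looks_like_nl, and localize_nl are hypothetical
# placeholders, not the authors' actual API.
def process_stream(frames):
    for frame in frames:
        lines = detect_lines(frame)       # edge + line detection
        if not looks_like_nl(lines):      # cheap presence test
            continue                      # fail fast: next frame
        rect = localize_nl(frame, lines)  # full localization pipeline
        if rect is not None:
            yield frame, rect
```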
The remainder of our paper is organized as follows. In Section II, we present our previous work on accessible shopping and nutrition management to give the reader a broader context for the research and development presented in this paper. In Section III, we outline the details of our algorithm. In Section IV, we present the experiments with our algorithm and discuss our results. In Section V, we present our conclusions and outline several directions for future work.

II. Previous Work

In 2006, our laboratory began to work on ShopTalk, a wearable system for independent blind supermarket shopping [6]. In 2008-2009, ShopTalk was ported to the Nokia E70 smartphone connected to a Bluetooth barcode pencil scanner [7]. In 2010, we began our work on computer vision techniques for eyes-free barcode scanning [8]. In 2013, we published several algorithms for localizing skewed barcodes as well as horizontally or vertically aligned NLs [5, 9]. The algorithm presented in this paper improves the previous NL localization algorithm by relaxing the NL alignment constraint by up to 35-40 degrees in either direction from the vertical orientation axis of the captured frame.

Modern nutrition management system designers and developers assume that users understand how to collect nutritional data and can be triggered into data collection with digital prompts (e.g., email or SMS). Such systems often underperform, because many users find it difficult to integrate nutrition data collection into their daily activities due to lack of time, motivation, or training. Eventually they turn off or ignore digital stimuli [10].

To overcome these challenges, in 2012 we began to develop a Persuasive NUTrition Management System (PNUTS) [5]. PNUTS seeks to shift current research and clinical practices in nutrition management toward persuasion, automated nutritional information extraction and processing, and context-sensitive nutrition decision support. PNUTS is based on a nutrition management approach inspired by the Fogg Behavior Model (FBM) [10], which states that motivation alone is insufficient to stimulate target behaviors. Even a motivated user must have both the ability to execute a behavior and a trigger to engage in that behavior at an appropriate place or time.

Another frequent assumption, which is not always accurate, is that consumers and patients are either more skilled than they actually are or that they can be quickly trained to obtain the required skills. Since training is difficult and time consuming, a more promising path is to make target behaviors easier and more intuitive to execute for the average smartphone user. Vision-based extraction of nutritional information from NLs on product packages is a fundamental step in making proactive nutrition management easier and more intuitive, because it improves the user's ability to engage in the target behavior of collecting and processing nutritional data.
III. Skewed NL Localization Algorithm

A. Detection of Edges, Lines, and Corners

Our NL detection algorithm uses three image processing methods: edge detection, line detection, and corner detection. Edge detection transforms images into bitmaps where every pixel is classified as belonging or not belonging to an edge. The algorithm uses the Canny edge detector (CED) [11]. After the edges are detected (see Fig. 2), the image is processed with the Hough Transform (HT) [12] to detect lines (see Fig. 3). The HT algorithm finds paths in images that follow generalized polynomials in the polar coordinate space.

Figure 2. Original NL (left); NL with edges (right)

Figure 3. NL with edges (left); NL with lines (right)
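To make these two stages concrete, here is a minimal OpenCV sketch of Canny edge detection followed by the standard Hough transform. The Canny thresholds and the Hough accumulator threshold are illustrative assumptions, not the parameter values used in the paper.

```python
import cv2
import numpy as np

# Edge detection: every pixel is classified as edge / non-edge.
gray = cv2.imread("frame.jpg", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(gray, 100, 200)  # thresholds are illustrative

# Line detection with the standard Hough transform; each result is a
# (rho, theta) pair describing one line in polar coordinate space.
lines = cv2.HoughLines(edges, 1, np.pi / 180, 120)
if lines is not None:
    for rho, theta in lines[:, 0, :]:
        # Convert the polar parameters to two endpoints for drawing.
        a, b = np.cos(theta), np.sin(theta)
        x0, y0 = a * rho, b * rho
        pt1 = (int(x0 - 1000 * b), int(y0 + 1000 * a))
        pt2 = (int(x0 + 1000 * b), int(y0 - 1000 * a))
        cv2.line(gray, pt1, pt2, 255, 1)
```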
Corner detection is done primarily for text spotting, because text segments tend to contain many distinct corners; thus, image segments with higher concentrations of corners are likely to contain text. Corners are detected with the dilate-erode method [13] (see Fig. 4). Two stages of the dilate-erode method with different 5x5 kernels are applied. The first stage uses a 5x5 cross dilate kernel for horizontal and vertical expansion, followed by a 5x5 diamond erode kernel for diagonal shrinking. The resulting image is compared with the original, and those pixels that lie on the corner of an aligned rectangle are found.

The second stage uses a 5x5 X-shape dilate kernel to expand in the two diagonal directions. A 5x5 square kernel is then used to erode the image and to shrink it horizontally and vertically. The resulting image is compared with the original, and those pixels that lie in a 45-degree corner are identified. The resulting corners from both stages are combined into a final set of detected corners.

In Fig. 4, the top sequence of images corresponds to stage one, when the cross and diamond kernels are used to detect aligned corners. The bottom sequence of images corresponds to stage two, when the X-shape and square kernels are used to detect 45-degree corners. Step one shows the original input of each stage, step two is the image after dilation, step three is the image after erosion, and step four is the difference between the original and eroded versions. The resulting corners are outlined in red in each step to illustrate how the dilate-erode operations modify the input.

Figure 4. Corner detection steps

Fig. 5 demonstrates the dilate-erode algorithm on an image segment that contains text. The dilate steps are substantially whiter than their inputs, because the appropriate kernel is used to expand white pixels. The erode steps then partially reverse this whitening effect by expanding darker pixels. The result consists of the pixels with the largest differences between the original image and the result image.

Figure 5. Corner detection for text spotting
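The two-stage dilate-erode corner detector can be sketched as follows (after Laganiere [13]). The diamond and X-shape kernels are written out by hand, since OpenCV's getStructuringElement provides only cross and rectangular shapes, and the final binarization threshold is our assumption, not a value from the paper.

```python
import cv2
import numpy as np

cross = cv2.getStructuringElement(cv2.MORPH_CROSS, (5, 5))
square = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))
diamond = np.array([[0, 0, 1, 0, 0],
                    [0, 1, 1, 1, 0],
                    [1, 1, 1, 1, 1],
                    [0, 1, 1, 1, 0],
                    [0, 0, 1, 0, 0]], dtype=np.uint8)
xshape = np.array([[1, 0, 0, 0, 1],
                   [0, 1, 0, 1, 0],
                   [0, 0, 1, 0, 0],
                   [0, 1, 0, 1, 0],
                   [1, 0, 0, 0, 1]], dtype=np.uint8)

def corner_map(gray):
    """Return a binary map whose white pixels are detected corners."""
    # Stage 1: cross dilation (horizontal/vertical expansion) followed
    # by diamond erosion (diagonal shrinking) keeps aligned corners.
    s1 = cv2.erode(cv2.dilate(gray, cross), diamond)
    # Stage 2: X-shape dilation (diagonal expansion) followed by square
    # erosion (horizontal/vertical shrinking) keeps 45-degree corners.
    s2 = cv2.erode(cv2.dilate(gray, xshape), square)
    # Pixels that differ most from the original mark the corners found
    # by each stage; both stages are combined into one corner map.
    corners = cv2.max(cv2.absdiff(s1, gray), cv2.absdiff(s2, gray))
    _, corners = cv2.threshold(corners, 40, 255, cv2.THRESH_BINARY)
    return corners
```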
Our previous NL localization algorithm [5] was based on the assumption that the NL exists in the image and is horizontally or vertically aligned with the smartphone's camera. Unfortunately, these conditions sometimes do not hold in the real world due to shaking hands or failing eyesight. The exact problem that the new algorithm addresses is twofold: does a given input image contain a skewed NL, and, if so, within which aligned rectangular area can the NL be localized? In this investigation, a skewed NL is one that has been rotated away from the vertical alignment axis by up to 35-40 degrees in either direction, i.e., left or right. An additional objective is to decrease the processing time for each frame to about one second.

B. Corner Detection and Analysis

Before the proper NL localization begins, a rotation correction step is performed to align inputs that may be only nearly aligned. This correction takes advantage of the high number of horizontal lines found within NLs. All detected lines that are within 35-40 degrees of horizontal in either direction (i.e., up or down) are used to compute an average horizontal rotation, which is then used to perform the appropriate correcting rotation.

Corner detection is executed after the rotation. The dilate-erode corner detector is applied to retrieve a two-dimensional bitmap where true (white) pixels correspond to detected corners and all other (false) pixels are black. Fig. 5 (right) shows the corners detected in the frame shown in Fig. 5 (left). The dilate-erode corner detector is used specifically because of its high sensitivity to contrasted text, which is why we assume that the region bounded by these edges contains a large amount of text. Areas of the input image that are not in focus do not produce many corner detections and tend not to lie within the needed projection boundaries.

Two projections are computed after the corners are detected. The projections are sums of the true pixels for each row and column: the row projection has an entry for each row in the image, and the column projection has an entry for each column. The purpose of these projections is to determine the top, bottom, left, and right boundaries of the region in which most corners lie. The values of each projection are averaged, and a projection threshold is set to twice the average. Once a projection threshold is selected, the first and last indexes of each projection greater than the threshold are selected as the boundaries of that projection.
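A minimal sketch of the rotation correction and projection analysis just described, assuming `lines` comes from cv2.HoughLines and `corners` is the binary corner bitmap. The sign conventions and fallbacks are our reconstruction, not the authors' code.

```python
import numpy as np

def skew_angle(lines, max_skew=40.0):
    """Average skew (degrees) of near-horizontal Hough lines.
    In (rho, theta) form, a horizontal line has theta = 90 degrees;
    rotating the image by the negated average corrects the skew."""
    devs = [np.degrees(theta) - 90.0
            for rho, theta in lines[:, 0, :]
            if abs(np.degrees(theta) - 90.0) <= max_skew]
    return float(np.mean(devs)) if devs else 0.0

def projection_bounds(corners):
    """corners: binary corner bitmap. Returns (top, bottom, left, right)."""
    binary = (corners > 0).astype(np.int32)
    row_proj = binary.sum(axis=1)  # one entry per image row
    col_proj = binary.sum(axis=0)  # one entry per image column

    def first_last_above(proj):
        # Threshold is twice the average projection value; boundaries
        # are the first and last indexes above the threshold.
        above = np.flatnonzero(proj > 2 * proj.mean())
        if above.size == 0:
            return 0, len(proj) - 1
        return int(above[0]), int(above[-1])

    top, bottom = first_last_above(row_proj)
    left, right = first_last_above(col_proj)
    return top, bottom, left, right
```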
Figure 5. NL (left); detected corners (right)

C. Selection of Boundary Lines

After the four corner projections have been computed, the next step is to select the Hough lines that are closest to the boundaries chosen on the basis of the four corner projections. In the two images of Fig. 6, the four light blue lines are drawn on the basis of the four corner projection counts, while the dark blue lines are the lines detected by the Hough transform. In Fig. 6 (left), the bottom light blue line is initially chosen conservatively where the row corner projections drop below the threshold. If there is evidence that some corners are present below the initially selected bottom line, the bottom line is moved as far down as possible, as shown in Fig. 6 (right).

Figure 6. Initial boundaries (left); Final boundaries (right)

The bounded area is not always perfectly rectangular, which makes integration with later analysis stages, where a rectangular area is expected, less straightforward. To overcome this problem, a rectangle is placed around the selected Hough boundary lines. After the four intersection coordinates are computed, their components are compared and combined to find the smallest rectangle that fits around the bounded area. This rectangle is the final result of the NL localization algorithm. As was stated before, the four corners found by the algorithm can be passed to other algorithms such as row dividing, word splitting, and OCR; these are beyond the scope of this paper. Fig. 7 shows a skewed NL localized by our algorithm.

Figure 7. Localized Skewed NL
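The rectangle-fitting step reduces to taking component-wise minima and maxima over the four intersection points. A minimal sketch, assuming the intersections of the selected Hough boundary lines have already been computed:

```python
import numpy as np

def enclosing_rectangle(intersections):
    """intersections: four (x, y) points where the selected Hough
    boundary lines cross. Returns the smallest axis-aligned rectangle
    around them as (x_min, y_min, x_max, y_max); this rectangle is the
    final result of the localization step."""
    pts = np.asarray(intersections, dtype=np.float64)
    return (pts[:, 0].min(), pts[:, 1].min(),
            pts[:, 0].max(), pts[:, 1].max())
```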
IV. Experiments & Results

A. Experiment Design

We assembled 378 images captured from a Google Nexus 7 Android 4.3 smartphone during a typical shopping session at a local supermarket. Of these images, 266 contained an NL and 112 did not. Our skewed NL localization algorithm was implemented and tested on the same platform with these images.

Figure 8. Complete (left) and partial (right) true positives

We manually categorized the results into five categories: complete true positives, partial true positives, true negatives, false positives, and false negatives. A complete true positive is an image where a complete NL was localized. A partial true positive is an image where only a part of the NL was localized by the algorithm. Fig. 8 shows examples of complete and partial true positives.

Fig. 9 shows another example of complete and partial true positives. The image on the left was classified as a complete true positive, because the part of the NL that was not detected is insignificant and will likely be recovered through simple padding in subsequent processing. The image on the right, on the other hand, was classified as a partial true positive: while the localized area does contain most of the NL, some essential text in the left part of the NL is excluded, which will likely cause failure in subsequent processing. In Fig. 10, the left image technically does not include the entire NL, because the list of ingredients is only partially included; however, we classified it as a complete true positive, since it includes the entire nutrition facts table. The right image of Fig. 10, on the other hand, is classified as a partial true positive, because some parts of the nutrition facts table are not included in the detected area.

Figure 9. Complete (left) and partial (right) true positives

Figure 10. Complete (left) and partial (right) true positives

B. Results

Of the 266 images that contained NLs, 83 were classified as complete true positives and 27 were classified as partial true positives, which gives a total true positive rate of 42% and a false negative rate of 58%. All test images with no NLs were classified as true negatives. The remainder of our analysis was done via precision, recall, specificity, and accuracy. Precision is the percentage of complete true positive matches out of all true positive matches. Recall is the percentage of true positive matches out of all possible positive matches. Specificity is the percentage of true negative matches out of all possible negative matches. Accuracy is the percentage of true matches out of all possible matches.

Table I. NL Localization Results

Precision | Total Recall | Complete Recall | Partial Recall | Specificity | Accuracy
0.7632    | 0.4220       | 0.3580          | 0.1475         | 1.0         | 0.5916

Table I gives the NL localization results. While the total and complete recall numbers are somewhat low, this is a necessary trade-off of maximizing specificity. Recall from Section I that we designed our algorithm to maximize specificity; in other words, the algorithm is less likely to detect NLs in images where no NLs are present than in images where they are present. As we argued above, lower recall and precision may not matter much because of the fast rate at which input images are processed on target devices, but there is definitely room for improvement.
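For concreteness, the four metrics can be computed directly from the confusion counts. The sketch below plugs in the counts reported above (83 complete and 27 partial true positives out of 266 positive images, 112 true negatives, no false positives); it reproduces Table I only approximately, since the paper does not spell out the exact counts behind every table entry.

```python
def nl_metrics(complete_tp, partial_tp, tn, fp, positives, negatives):
    tp = complete_tp + partial_tp
    precision = complete_tp / tp                    # complete TPs / all TPs
    total_recall = tp / positives                   # TPs / all positives
    complete_recall = complete_tp / positives
    partial_recall = partial_tp / positives
    specificity = tn / negatives                    # TNs / all negatives
    accuracy = (tp + tn) / (positives + negatives)  # true / all matches
    return (precision, total_recall, complete_recall,
            partial_recall, specificity, accuracy)

# Counts reported in the text; results approximate Table I.
print(nl_metrics(83, 27, 112, 0, 266, 112))
```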
C. Limitations

The majority of false negative matches were caused by blurry images, which result from poor camera focus and instability. Both the Canny edge detector and the dilate-erode corner detector require rapid, contrasting changes to identify key points and lines of interest, which are meant to correspond directly to text and NL borders. These useful data cannot be retrieved from blurry images, which results in run-time detection failures. The only recourse for blurry inputs is improved camera focus and stability, both of which are hardware problems outside the scope of this algorithm; later smartphone models are likely to fare better. The current implementation on the Android platform attempts to force focus at the image center, but the ability to request camera focus is not present in older Android versions. Over time, as device cameras improve and more devices run newer versions of Android, this limitation will have less impact on recall, but it will never be eliminated entirely.

Bottles, bags, cans, and jars (see Fig. 11) account for a large share of the false negative category due to Hough line detection difficulties. One possibility for getting around this limitation is a more rigorous line detection step in which a segmented Hough transform is performed and regions that contain connecting detected lines are grouped together. These grouped regions could be used to warp a curved image into a rectangular area for further analysis.

Smaller grocery packages (see Fig. 12) tend to have irregular NLs that place a large amount of information into tiny spaces. NLs with irregular layouts present an extremely difficult problem for analysis. Our algorithm better handles more traditional NL layouts with generally empty surrounding areas. As better analysis of corner projections and Hough lines is integrated into our algorithm, it will become possible to classify inputs as definitely traditional or more irregular. If this classification can work reliably, the method could switch to a much slower, generalized localization to produce better results for irregular layouts while still quickly returning results for more common layouts.
Figure 11. NL with curved lines

Figure 12. Irregular NLs

V. Conclusions

We have made several interesting observations during our experiments. The row and column projections have two distinct patterns: the row projection tends to create evenly spaced short spikes for each line of text within the NL, while the column projection tends to contain one very large spike where the NL begins at the left, due to the sudden influx of detected text. We have not performed any in-depth analysis of these patterns; however, the projection data were collected for each processed image. We plan to investigate these patterns further, which will likely allow for run-time detection and corresponding correction of inputs at various rotations. For example, the column projections could be used for greater accuracy in determining the left and right bounds of the NL, while row projections could be used by later analysis steps such as row division. Certain projection profiles could eventually be used to select customized localization approaches at run time.

During our experiments with and iterative development of this algorithm, we noted several possible improvements that could positively affect the algorithm's performance. First, since input images are generally not square, the HT returns more results for lines in the longer dimension, because they are more likely to pass the threshold. Consequently, specifying different thresholds for the two dimensions and combining them for various rotations may produce more consistent results.

Second, since only those Hough lines that are nearly vertical or horizontal are of use to this method, improvements can be made by allocating bins only for those Θ and ρ combinations that are considered important. Fewer bins means less memory to track all of them and fewer tests to determine which bins need to be incremented for a given input; a sketch of this idea follows after this list.

Third, both row and column corner projections tend to produce distinct patterns that could be used to determine better boundaries. After collecting a large number of typical projections, further analysis can be performed to find generalizations resulting in a faster method that improves boundary selection.

Fourth, in principle, a much more intensive HT method could be developed that divides the image into a grid of smaller segments and performs a separate HT within each segment. One advantage of this approach is the ability to look for skewed, curved, or even zigzagging lines across segments that could be connected into a longer line. While the performance penalty of this method could be quite high, it could allow for the detection and de-warping of oddly shaped NLs.

Finally, a more careful analysis of the found Hough lines during the early rotation correction could allow us to detect and localize NLs at all possible rotations, not just skewed ones.
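As a sketch of the second proposed improvement, a restricted Hough accumulator might allocate theta bins only within the supported skew range of the horizontal and vertical orientations. This is our illustration of the idea under stated assumptions, not an implementation from the paper.

```python
import numpy as np

def restricted_hough(edges, skew_deg=40):
    """Hough accumulator restricted to near-vertical (theta near 0/180)
    and near-horizontal (theta near 90) orientations, one-degree bins."""
    h, w = edges.shape
    diag = int(np.ceil(np.hypot(h, w)))
    deg = np.arange(180)
    # Keep only orientations within skew_deg of vertical or horizontal
    # lines; fewer bins means less memory and fewer accumulator updates.
    keep = (np.minimum(deg, 180 - deg) <= skew_deg) | (np.abs(deg - 90) <= skew_deg)
    thetas = np.radians(deg[keep])
    cos_t, sin_t = np.cos(thetas), np.sin(thetas)
    acc = np.zeros((2 * diag + 1, len(thetas)), dtype=np.int32)
    ys, xs = np.nonzero(edges)
    for x, y in zip(xs, ys):
        # rho = x*cos(theta) + y*sin(theta), shifted so indexes are >= 0.
        rhos = np.round(x * cos_t + y * sin_t).astype(int) + diag
        acc[rhos, np.arange(len(thetas))] += 1
    return acc, thetas
```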
The U.S. Food and Drug Administration recently proposed some changes to the design of NLs on product packages [14]. The new design is expected to change how serving sizes are calculated and displayed. Percent daily values are expected to shift to the left side of the NL, which allegedly will make them easier to read. The new design will also require information about added sugars as well as the counts for Vitamin D and potassium. We would like to emphasize that this redesign, which is expected to take at least two years, will not impact the proposed algorithm, because the main tabular components of the new NL design will remain the same. The nutritional information in the new NLs will still be presented textually in rows and columns; therefore, the corner and line detection and their projections will work as they do on the current NL design.

References

[1] Anding, R. Nutrition Made Clear. The Great Courses, Chantilly, VA, 2009.
[2] Rubin, A. L. Diabetes for Dummies. 3rd Edition, Wiley Publishing, Inc., Hoboken, New Jersey, 2008.
[3] Nutrition Labeling and Education Act of 1990. http://en.wikipedia.org/wiki/Nutrition_Labeling_and_Education_Act_of_1990.
[4] Food Labelling to Advance Better Education for Life. Avail. at www.flabel.org/en.
[5] Kulyukin, V., Kutiyanawala, A., Zaman, T., and Clyde, S. "Vision-based localization and text chunking of nutrition fact tables on Android smartphones." In Proc. International Conference on Image Processing, Computer Vision, and Pattern Recognition (IPCV 2013), pp. 314-320, ISBN 1-60132-252-6, CSREA Press, Las Vegas, NV, USA, 2013.
[6] Nicholson, J. and Kulyukin, V. "ShopTalk: Independent Blind Shopping = Verbal Route Directions + Barcode Scans." In Proceedings of the 30th Annual Conference of the Rehabilitation Engineering and Assistive Technology Society of North America (RESNA 2007), June 2007, Phoenix, Arizona. Avail. on CD-ROM.
[7] Kulyukin, V. and Kutiyanawala, A. "Accessible Shopping Systems for Blind and Visually Impaired Individuals: Design Requirements and the State of the Art." The Open Rehabilitation Journal, ISSN 1874-9437, Volume 2, 2010, pp. 158-168, DOI: 10.2174/1874943701003010158.
[8] Kulyukin, V., Kutiyanawala, A., and Zaman, T. "Eyes-Free Barcode Detection on Smartphones with Niblack's Binarization and Support Vector Machines." In Proceedings of the 16th International Conference on Image Processing, Computer Vision, and Pattern Recognition (IPCV 2012), Vol. I, pp. 284-290, CSREA Press, July 16-19, 2012, Las Vegas, Nevada, USA. ISBN: 1-60132-223-2, 1-60132-224-0.
[9] Kulyukin, V. and Zaman, T. "Vision-Based Localization of Skewed UPC Barcodes on Smartphones." In Proceedings of the International Conference on Image Processing, Computer Vision, & Pattern Recognition (IPCV 2013), pp. 344-350, ISBN 1-60132-252-6, CSREA Press, Las Vegas, NV, USA.
[10] Fogg, B. J. "A behavior model for persuasive design." In Proc. 4th International Conference on Persuasive Technology, Article 40, ACM, New York, USA, 2009.
[11] Canny, J. F. "A computational approach to edge detection." IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 8, 1986, pp. 679-698.
[12] Duda, R. O. and Hart, P. E. "Use of the Hough transformation to detect lines and curves in pictures." Comm. ACM, Vol. 15, pp. 11-15, January 1972.
[13] Laganiere, R. OpenCV 2 Computer Vision Application Programming Cookbook. Packt Publishing Ltd, 2011.
[14] Tavernise, S. "New F.D.A. nutrition labels would make 'serving sizes' reflect actual servings." New York Times, Feb. 27, 2014.
