
Mobile Visual Search




  1. Mobile Visual Search. Oge Marques, Florida Atlantic University, Boca Raton, FL, USA. TEWI Kolloquium, 24 Jan 2012.
  2. Take-home message: Mobile Visual Search (MVS) is a fascinating research field with many open challenges and opportunities, which have the potential to impact the way we organize, annotate, and retrieve visual data (images and videos) using mobile devices.
  3. Outline •  This talk is structured in four parts: 1. Opportunities 2. Basic concepts 3. Technical details 4. Examples and applications
  4. Part I: Opportunities
  5. Mobile visual search: driving factors •  Age of mobile computing ("more mobile phones than toothbrushes")
  6. Mobile visual search: driving factors •  Smartphone market
  7. Mobile visual search: driving factors •  Smartphone market (cont'd)
  8. Mobile visual search: driving factors •  Why do I need a camera? I have a smartphone…
  9. Mobile visual search: driving factors •  Why do I need a camera? I have a smartphone… (cont'd)
  10. Mobile visual search: driving factors •  Powerful devices: e.g., Apple A5 chipset with 1 GHz ARM Cortex-A9 processor and PowerVR SGX543MP2 GPU (iPad 2)
  11. Mobile visual search: driving factors •  Social networks and mobile devices (May 2011)
  12. Mobile visual search: driving factors •  Social networks and mobile devices –  Motivated users: image taking and image sharing are huge!
  13. Mobile visual search: driving factors •  Instagram: –  13 million registered (although not necessarily active) users in 13 months –  7 employees –  Several apps based on it!
  14. Mobile visual search: driving factors •  Food photo sharing!
  15. Mobile visual search: driving factors •  Legitimate (or not quite…) needs and use cases, e.g., search by sight with Google Goggles
  16. Mobile visual search: driving factors •  A natural use case for CBIR with QBE (at last!) –  The example is right in front of the user! [Figure: snapshot of an outdoor mobile visual search system that augments the camera-phone viewfinder with information about the objects it recognizes in the image; Girod et al., IEEE Multimedia 2011]
  17. Part II: Basic concepts
  18. MVS: technical challenges •  How to ensure low latency (and interactive queries) under constraints such as: –  Network bandwidth –  Computational power –  Battery consumption •  How to achieve robust visual recognition in spite of low-resolution cameras, varying lighting conditions, etc. •  How to handle broad and narrow domains
  19. MVS: pipeline for image retrieval [Figure: Girod et al., IEEE Multimedia 2011]
  20. Three scenarios [Figure: Girod et al., IEEE Multimedia 2011]
  21. Part III: Technical details
  22. Part III outline •  The MVS pipeline in greater detail •  Datasets for MVS research •  MPEG Compact Descriptors for Visual Search (CDVS)
  23. MVS: descriptor extraction •  Interest point detection •  Feature descriptor computation (Girod et al., IEEE Multimedia 2011)
  24. Interest point detection •  Numerous interest-point detectors have been proposed in the literature: –  Harris corners (Harris and Stephens 1988) –  Scale-Invariant Feature Transform (SIFT) Difference-of-Gaussian (DoG) (Lowe 2004) –  Maximally Stable Extremal Regions (MSERs) (Matas et al. 2002) –  Hessian affine (Mikolajczyk et al. 2005) –  Features from Accelerated Segment Test (FAST) (Rosten and Drummond 2006) –  Hessian blobs (Bay, Tuytelaars, and Van Gool 2006) –  etc. (Girod et al., IEEE Signal Processing Magazine 2011)
  25. Interest point detection •  Different tradeoffs in repeatability and complexity: –  SIFT DoG and other affine interest-point detectors are slow to compute but highly repeatable. –  The SURF interest-point detector provides a significant speedup over DoG interest-point detectors by using box filters and integral images for fast computation. •  However, the box-filter approximation causes significant anisotropy, i.e., matching performance varies with the relative orientation of query and database images. –  The FAST corner detector is extremely fast but offers very low repeatability. •  See (Mikolajczyk and Schmid 2005) for a comparative performance evaluation of local descriptors in a common framework. (Girod et al., IEEE Signal Processing Magazine 2011)
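These detectors all quantify how strongly the local image structure varies in every direction. As a rough, self-contained sketch (not any of the cited implementations), here is the classic Harris corner response in NumPy, with a 3×3 box window standing in for the usual Gaussian weighting; the function name and the conventional k = 0.04 constant are my own choices:

```python
import numpy as np

def harris_response(img, k=0.04):
    """Harris corner response R = det(S) - k * trace(S)^2, where S is the
    box-smoothed structure tensor of the image gradients."""
    # Image gradients via central differences
    Ix = np.gradient(img, axis=1)
    Iy = np.gradient(img, axis=0)

    def box(a):
        # 3x3 box filter (mean over each pixel's neighborhood, edge-padded)
        p = np.pad(a, 1, mode='edge')
        h, w = a.shape
        return sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0

    Sxx, Syy, Sxy = box(Ix * Ix), box(Iy * Iy), box(Ix * Iy)
    det = Sxx * Syy - Sxy * Sxy
    trace = Sxx + Syy
    return det - k * trace ** 2
```

On a synthetic bright square, R is positive at the corners, negative along the edges, and near zero in flat regions, which is exactly the behavior the detectors above exploit.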
  26. Feature descriptor computation •  After interest-point detection, we compute a visual word descriptor on a normalized patch. •  Ideally, descriptors should be: –  robust to small distortions in scale, orientation, and lighting conditions; –  discriminative, i.e., characteristic of an image or a small set of images; –  compact, due to typical mobile computing constraints. (Girod et al., IEEE Signal Processing Magazine 2011)
  27. Feature descriptor computation •  Examples of feature descriptors in the literature: –  SIFT (Lowe 1999) –  Speeded Up Robust Features (SURF) (Bay et al. 2008) –  Gradient Location and Orientation Histogram (GLOH) (Mikolajczyk and Schmid 2005) –  Compressed Histogram of Gradients (CHoG) (Chandrasekhar et al. 2009, 2010) •  See (Winder, Hua, and Brown CVPR 2007, 2009) and (Mikolajczyk and Schmid PAMI 2005) for comparative performance evaluations of different descriptors. (Girod et al., IEEE Signal Processing Magazine 2011)
  28. Feature descriptor computation •  What about compactness? –  Several attempts in the literature to compress off-the-shelf descriptors did not lead to the best rate-constrained image-retrieval performance. –  Alternative: design a descriptor with compression in mind. (Girod et al., IEEE Signal Processing Magazine 2011)
  29. Feature descriptor computation •  CHoG (Compressed Histogram of Gradients) (Chandrasekhar et al. 2009, 2010) –  Based on the distribution of gradients within a patch of pixels –  Histogram-of-gradient (HoG)-based descriptors [e.g., (Lowe 2004), (Bay et al. 2008), (Dalal and Triggs 2005), (Freeman and Roth 1994), and (Winder et al. 2009)] have been shown to be highly discriminative at low bit rates. (Girod et al., IEEE Signal Processing Magazine 2011)
  30. CHoG: Compressed Histogram of Gradients [Figure: a patch around an interest point is divided into spatial bins; the gradient distribution over (dx, dy) within each bin is captured as a histogram; histogram compression turns these distributions into short bit strings that form the CHoG descriptor; Chandrasekhar et al., CVPR 2009, 2010]
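The figure's pipeline (patch → gradients → spatial binning → per-bin gradient histograms) can be sketched in a few lines. The toy descriptor below is only loosely CHoG-inspired and entirely hypothetical: it uses 2×2 spatial cells and nine ad hoc gradient-space bin centers, whereas real CHoG uses DAISY-style spatial bins, carefully designed gradient bins, and compresses the resulting distributions:

```python
import numpy as np

# Hypothetical gradient-space bin centers (a coarse stand-in for CHoG's
# gradient binning): the origin plus eight surrounding directions.
GRAD_BINS = np.array([(0, 0),
                      (1, 0), (-1, 0), (0, 1), (0, -1),
                      (1, 1), (1, -1), (-1, 1), (-1, -1)], dtype=float)

def chog_like(patch, n_cells=2):
    """Toy gradient-histogram descriptor: for each spatial cell, a normalized
    histogram of (dx, dy) gradients over the GRAD_BINS centers."""
    dy, dx = np.gradient(patch.astype(float))
    h, w = patch.shape
    cell_h, cell_w = h // n_cells, w // n_cells
    desc = np.zeros((n_cells * n_cells, len(GRAD_BINS)))
    for r in range(h):
        for c in range(w):
            g = np.array([dx[r, c], dy[r, c]])
            b = int(np.argmin(np.linalg.norm(GRAD_BINS - g, axis=1)))
            cell = (min(r // cell_h, n_cells - 1) * n_cells
                    + min(c // cell_w, n_cells - 1))
            desc[cell, b] += 1
    # Normalize each cell's histogram to a probability distribution
    desc /= desc.sum(axis=1, keepdims=True)
    return desc.ravel()
```

In CHoG proper, it is these per-cell gradient distributions that are quantized and entropy-coded to reach very low bit rates.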
  31. Encoding descriptor's location information •  Location Histogram Coding (LHC) –  Rationale: interest-point locations in images tend to cluster spatially. –  Encoding the (x, y) locations of a set of N features as a histogram reduces the bit rate by log(N!) compared to encoding each feature location in sequence, because the features can be sent in any order: there are N! codes that represent the same feature set, so fixing the ordering saves the corresponding bits. (Girod et al., IEEE Signal Processing Magazine 2011)
  32. Encoding descriptor's location information •  Method: 1. Generate a 2D histogram from the locations of the descriptors: divide the image into evenly spaced spatial bins and count the number of features within each bin. 2. Compress the binary map indicating which spatial bins contain features, and a sequence of feature counts representing the number of features in the occupied bins. 3. Encode the binary map using a trained context-based arithmetic coder, with the neighboring bins used as the context for each spatial bin. •  In experiments, quantizing the (x, y) location to four-pixel blocks is sufficient for GV. With a simple fixed-length coding scheme the rate is log(640/4) + log(640/4) ≈ 14 b/feature for a VGA-size image; LHC transmits the same location data with ≈ 5 b/descriptor, a ≈ 12.5× reduction compared to a 64-b floating-point representation and ≈ 2.8× rate reduction compared to fixed-length coding. •  Coding gains vary by detector: Hessian-Laplace has the highest gain, followed by SIFT and SURF; even if the feature points are uniformly scattered, LHC still exploits the ordering gain of log(N!) bits. (Girod et al., IEEE Signal Processing Magazine 2011)
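The rate arithmetic on this slide is easy to reproduce. The sketch below (helper name is my own) builds the 2-D location histogram for an image quantized to four-pixel blocks and computes both the fixed-length rate of about 14 b/feature for VGA and the log(N!) ordering gain that LHC exploits:

```python
import math
import numpy as np

def lhc_rates(locations, img_w=640, img_h=480, block=4):
    """Build a location histogram and compare coding rates (in bits).

    Returns (histogram, fixed_length_bits, ordering_gain_bits):
    - fixed_length_bits: cost of coding each (x, y) independently,
      n * (log2(bins_x) + log2(bins_y)) ~ 14 b/feature for VGA.
    - ordering_gain_bits: log2(N!) bits saved because feature order
      is irrelevant, which is what LHC exploits.
    """
    n = len(locations)
    bins_x, bins_y = img_w // block, img_h // block
    hist = np.zeros((bins_y, bins_x), dtype=int)
    for x, y in locations:
        hist[y // block, x // block] += 1  # features in the same block cluster
    fixed = n * (math.log2(bins_x) + math.log2(bins_y))
    ordering_gain = math.log2(math.factorial(n))
    return hist, fixed, ordering_gain
```

The binary occupancy map and the per-bin counts that LHC actually entropy-codes are exactly `hist > 0` and `hist[hist > 0]`.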
  33. MVS: feature indexing and matching •  Goal: produce a data structure that can quickly return a short list of the database candidates most likely to match the query image. –  The short list may contain false positives as long as the correct match is included. –  Slower pairwise comparisons can be subsequently performed on just the short list of candidates rather than the entire database. (Girod et al., IEEE Multimedia 2011)
  34. MVS: feature indexing and matching •  Vocabulary Tree (VT)-Based Retrieval –  Hierarchical k-means clustering recursively divides the descriptor space: k-means is applied to the training descriptors assigned to each cluster to generate k smaller clusters, until there are enough bins to ensure good classification performance. In practice a VT can be large, e.g., height 6 and branching factor k = 10, containing 10^6 = 1 million nodes. –  The associated inverted index maintains, for each VT leaf node, a list of database images and frequency counts. –  During a query, the VT is traversed for each feature in the query image, finishing at one of the leaf nodes; the corresponding lists of images and frequency counts are used to compute similarity scores between those images and the query image. Sorting images from all these lists by score yields a subset of database images likely to contain a true match. (Girod et al., IEEE Multimedia 2011)
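The inverted-index idea above can be illustrated with a deliberately flat "tree" (a single level of k leaves) so the code stays short. Class and method names are hypothetical, and real systems use hierarchical quantization (cost k × depth per lookup instead of k^depth) plus TF-IDF weighting rather than raw vote counts:

```python
import numpy as np
from collections import defaultdict

class TinyVocabTree:
    """Flat stand-in for a vocabulary tree: quantize each descriptor to the
    nearest of k 'leaf' centers and score images through an inverted index."""

    def __init__(self, leaf_centers):
        self.centers = np.asarray(leaf_centers, dtype=float)
        # leaf id -> {image id -> frequency count}
        self.inverted = defaultdict(lambda: defaultdict(int))

    def _leaf(self, d):
        return int(np.argmin(np.linalg.norm(self.centers - np.asarray(d), axis=1)))

    def add_image(self, img_id, descriptors):
        for d in descriptors:
            self.inverted[self._leaf(d)][img_id] += 1

    def query(self, descriptors):
        # Unweighted voting: each query feature pulls the image list of its leaf
        scores = defaultdict(int)
        for d in descriptors:
            for img_id, cnt in self.inverted[self._leaf(d)].items():
                scores[img_id] += cnt
        return sorted(scores.items(), key=lambda kv: -kv[1])  # best match first
```

Only images sharing at least one visual word with the query are touched, which is what makes the shortlist step fast on large databases.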
  35. MVS: geometric verification •  Goal: use location information of features in query and database images to confirm that the feature matches are consistent with a change in viewpoint between the two images. (Girod et al., IEEE Multimedia 2011)
  36. MVS: geometric verification •  Method: perform pairwise matching of feature descriptors and evaluate the geometric consistency of the correspondences. [Figure: in the GV step, feature correspondences consistent with a geometric model are found; true feature matches shown in red, false feature matches in green; Girod et al., IEEE Multimedia 2011]
  37. MVS: geometric verification •  Techniques: –  The geometric transform between the query and database image is usually estimated using robust regression techniques such as: •  Random sample consensus (RANSAC) (Fischler and Bolles 1981) •  Hough transform (Lowe 2004) –  The transformation is often represented by an affine mapping or a homography. •  GV is computationally expensive, which is why it is only applied to a subset of images selected during the feature-matching stage. (Girod et al., IEEE Multimedia 2011)
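As an illustration of the RANSAC idea applied to an affine model (a from-scratch sketch, not the cited systems' code), the function below repeatedly fits an affine map to three random correspondences, keeps the hypothesis with the most inliers, and refits on them:

```python
import numpy as np

def ransac_affine(src, dst, iters=200, tol=3.0, seed=0):
    """Estimate a 2-D affine map dst ~ src @ A.T + t with RANSAC.
    src/dst are (N, 2) matched feature locations; minimal sample is 3 pairs.
    Returns (A, t, inlier_mask)."""
    rng = np.random.default_rng(seed)
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    ones = np.ones((len(src), 1))
    best = np.zeros(len(src), bool)
    for _ in range(iters):
        idx = rng.choice(len(src), 3, replace=False)
        P = np.column_stack([src[idx], np.ones(3)])  # [x y 1] rows
        try:
            M = np.linalg.solve(P, dst[idx])         # 3x2 affine parameters
        except np.linalg.LinAlgError:
            continue                                 # degenerate (collinear) sample
        pred = np.hstack([src, ones]) @ M
        inliers = np.linalg.norm(pred - dst, axis=1) < tol
        if inliers.sum() > best.sum():
            best = inliers
    # Least-squares refit on all inliers of the best hypothesis
    P = np.column_stack([src[best], np.ones(best.sum())])
    M, *_ = np.linalg.lstsq(P, dst[best], rcond=None)
    return M[:2].T, M[2], best
```

The inlier count of the best model is what GV thresholds to accept or reject a candidate image; homography estimation works the same way with 4-point samples.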
  38. MVS: geometric reranking •  Speed-up step inserted between Vocabulary Tree matching and Geometric Verification. [Figure: pipeline Query → VT → Geometric Reranking (using x, y feature location information) → GV → Identify; an image retrieval pipeline can be greatly sped up by incorporating a geometric reranking stage; Girod et al., IEEE Signal Processing Magazine 2011]
  39. Fast geometric reranking •  The location geometric score is computed as follows: a) features of two images are matched based on VT quantization; b) distances between pairs of features within an image are calculated; c) log-distance ratios of the corresponding pairs are calculated; d) the histogram of log-distance ratios is computed. •  The maximum value of the histogram is the geometric similarity score. –  A peak in the histogram indicates a similarity transform between the query and database image. •  The time required to calculate a geometric similarity score is one to two orders of magnitude less than using RANSAC. Typically, fast geometric reranking is performed on the top 500 images and RANSAC on the top 50 ranked images. (Girod et al., IEEE Signal Processing Magazine 2011)
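Steps (b)–(d) translate almost directly into code. A naive O(n²) sketch, assuming the correspondences are already given (i.e., step (a) has been done by VT matching); the function name, histogram range, and bin count are arbitrary choices of mine:

```python
import numpy as np

def location_geometric_score(q_pts, d_pts, n_bins=20):
    """Peak of the log-distance-ratio histogram over matched feature pairs.
    q_pts[i] in the query corresponds to d_pts[i] in the database image;
    under a similarity transform all ratios agree, giving one tall bin."""
    q_pts, d_pts = np.asarray(q_pts, float), np.asarray(d_pts, float)
    n = len(q_pts)
    ratios = []
    for i in range(n):
        for j in range(i + 1, n):
            dq = np.linalg.norm(q_pts[i] - q_pts[j])  # intra-query distance
            dd = np.linalg.norm(d_pts[i] - d_pts[j])  # intra-database distance
            if dq > 0 and dd > 0:
                ratios.append(np.log(dq / dd))
    hist, _ = np.histogram(ratios, bins=n_bins, range=(-3, 3))
    return int(hist.max())  # geometric similarity score
```

Because only pairwise distances and one histogram are needed, this score is far cheaper than RANSAC model fitting, which is why it can be applied to hundreds of shortlist images.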
  40. Datasets for MVS research •  Stanford Mobile Visual Search (SMVS) Data Set –  Key characteristics: •  rigid objects •  widely varying lighting conditions •  perspective distortion •  foreground and background clutter •  realistic ground-truth reference data •  query data collected from heterogeneous low- and high-end camera phones (Chandrasekhar et al., ACM MMSys 2011)
  41. Stanford Mobile Visual Search (SMVS) Data Set •  Limitations of popular datasets (Chandrasekhar et al., ACM MMSys 2011):

      Data set    Database (#)   Query (#)   Classes (#)
      ZuBuD       1,005          115         200
      Oxford      5,062          55          17
      INRIA       1,491          500         500
      UKY         10,200         2,550       2,550
      ImageNet    11M            15K         15K
      SMVS        1,200          3,300       1,200

  –  "Classes" refers to the number of distinct objects in the data set; the comparison also considers whether the objects are rigid, whether the query images capture widely varying lighting conditions, foreground/background clutter, and typical perspective distortions, and whether the images were captured with camera phones. SMVS is a good data set for mobile visual search applications.
  –  Some popular sets (e.g., ZuBuD, INRIA, UKY) consist of query images taken at the same time and location as the database images; ImageNet is not suitable for image retrieval applications (e.g., different façades of the same building are labelled with the same name).
  –  SMVS collection: for product categories like CDs, DVDs, books, text documents, and business cards, images were captured indoors under widely varying lighting conditions over several days, with the foreground and background clutter typical of a real application (e.g., a picture of a CD with other CDs in the background); for landmarks, images of buildings in San Francisco were captured, with query images collected several months after the reference data; for video clips, query images were taken from laptop, computer, and TV screens to include typical specular distortions; paintings were captured at the Cantor Arts Center at Stanford University under controlled lighting conditions typical of museums.
  –  Matching evaluation of three state-of-the-art schemes: (1) Difference-of-Gaussian (DoG) interest-point detector with SIFT descriptors, (2) Hessian-affine interest-point detector with SIFT descriptors, and (3) Fast Hessian blob interest-point detector (sped up with integral images) with the recently proposed Compressed Histogram of Gradients (CHoG) descriptor. Image pairs are matched with affine models, with a minimum threshold of 10 post-RANSAC matches for declaring a valid match; the percentage of matching images, the average number of features, and the average number of post-RANSAC feature matches are reported per category. Indoor categories are easier than outdoor ones, and some categories (like CDs and DVDs) achieve over 95% accuracy.
  42. SMVS Data Set: categories and examples •  Number of query and database images per category (Chandrasekhar et al., ACM MMSys 2011)
  43. SMVS Data Set: categories and examples •  DVD covers (mvs_images/dvd_covers.html)
  44. SMVS Data Set: categories and examples •  CD covers (mvs_images/cd_covers.html)
  45. SMVS Data Set: categories and examples •  Museum paintings (mvs_images/museum_paintings.html)
  46. Other MVS data sets (ISO/IEC JTC1/SC29/WG11/N12202, July 2011, Torino, IT)
  47. Other MVS data sets (cont'd) (ISO/IEC JTC1/SC29/WG11/N12202, July 2011, Torino, IT)
  48. Other MVS data sets •  Distractor set –  1 million images of various resolution and content collected from Flickr. (ISO/IEC JTC1/SC29/WG11/N12202, July 2011, Torino, IT)
  49. MPEG Compact Descriptors for Visual Search (CDVS) •  Objectives –  Define a standard that: •  enables design of visual search applications •  minimizes lengths of query requests •  ensures high matching performance (in terms of reliability and complexity) •  enables interoperability between search applications and visual databases •  enables efficient implementation of visual search functionality on mobile devices •  Scope –  It is envisioned that (as a minimum) the standard will specify: •  the bitstream of descriptors •  parts of the descriptor extraction process (e.g., key-point detection) needed to ensure interoperability (Bober, Cordara, and Reznik 2010)
  50. MPEG CDVS •  Requirements –  Robustness •  High matching accuracy shall be achieved at least for images of textured rigid objects, landmarks, and printed documents. Matching accuracy shall be robust to changes in vantage point, camera parameters, and lighting conditions, as well as in the presence of partial occlusions. –  Sufficiency •  Descriptors shall be self-contained, in the sense that no other data are necessary for matching. –  Compactness •  Shall minimize the length/size of image descriptors. –  Scalability •  Shall allow adaptation of descriptor lengths to support the required performance level and database size. •  Shall enable design of web-scale visual search applications and databases. (Bober, Cordara, and Reznik 2010)
  51. MPEG CDVS •  Requirements (cont'd) –  Image format independence •  Descriptors shall be independent of the image format. –  Extraction complexity •  Shall allow descriptor extraction with low complexity (in terms of memory and computation) to facilitate video-rate implementations. –  Matching complexity •  Shall allow matching of descriptors with low complexity (in terms of memory and computation). •  If decoding of descriptors is required for matching, such decoding shall also be possible with low complexity. –  Localization •  Shall support visual search algorithms that identify and localize matching regions of the query image and the database image. •  Shall support visual search algorithms that provide an estimate of a geometric transformation between matching regions of the query image and the database image. (Bober, Cordara, and Reznik 2010)
  52. MPEG CDVS •  Summarized timeline for development of the MPEG standard for visual search (Girod et al., IEEE Multimedia 2011):

      When            Milestone                             Comments
      March 2011      Call for Proposals published          Registration deadline: 11 July 2011; proposals due: 21 November 2011
      December 2011   Evaluation of proposals               None
      February 2012   1st Working Draft                     First specification and test software model that can be used for subsequent improvements
      July 2012       Committee Draft                       Essentially complete and stabilized specification
      January 2013    Draft International Standard          Complete specification; only minor editorial changes allowed after DIS
      July 2013       Final Draft International Standard    Finalized specification, submitted for approval and publication as International Standard

  –  Among the several component technologies for image retrieval, the standard should focus primarily on defining the format of descriptors and the parts of their extraction process (such as interest-point detectors) needed to ensure interoperability, building on existing standards such as MPEG Query Format, HTTP, XML, JPEG, and JPSearch.
  53. MPEG CDVS •  CDVS evaluation framework –  Experimental setup •  Retrieval experiment: intended to assess the performance of proposals in the context of an image retrieval system (ISO/IEC JTC1/SC29/WG11/N12202, July 2011, Torino, IT)
  54. MPEG CDVS •  CDVS evaluation framework –  Experimental setup •  Pair-wise matching experiments: intended to assess the performance of proposals in the context of an application that uses descriptors for the purpose of image matching. [Diagram: extract descriptors from image A and image B → match → check accuracy of search results against annotations → report] (ISO/IEC JTC1/SC29/WG11/N12202, July 2011, Torino, IT)
  55. MPEG CDVS •  For more info: – – (ad hoc groups)
  56. Part IV: Examples and applications
  57. Examples •  Academic –  Stanford Product Search System •  Commercial –  Google Goggles –  Kooaba: Déjà Vu and Paperboy –  SnapTell –  oMoby (and the IQ Engines API) –  pixlinQ –  Moodstocks
  58. Stanford Product Search (SPS) System •  Local-feature-based visual search system •  Client-server architecture [Figure: on the client, feature extraction and compression are performed on the phone and the compressed query features are sent over the network; the server performs VT matching and GV against the database and returns identification data. In most systems, the query image is sent to the server, where feature extraction is performed; by performing feature extraction on the phone, SPS significantly reduces the transmission delay and provides an interactive experience. Table 2 in the source summarizes references for each module of the matching pipeline: feature extraction, feature indexing and matching, and GV.] (Girod et al., IEEE Multimedia 2011; Tsai et al., ACM MM 2010)
  59. Stanford Product Search (SPS) System •  Key contributions: –  Optimized feature extraction implementation –  CHoG: a low-bit-rate compact descriptor (provides up to 20× bit-rate savings over SIFT with comparable image retrieval performance) –  Inverted index compression to enable large-scale image retrieval on the server –  Fast geometric reranking (Girod et al., IEEE Multimedia 2011; Tsai et al., ACM MM 2010)
  60. Stanford Product Search (SPS) System •  Two modes: –  Send Image mode: the mobile phone transmits a JPEG-compressed query image over the wireless network; the server decodes the image, extracts descriptors, and performs descriptor matching against the database. –  Send Feature mode: descriptors are extracted and encoded on the phone and transmitted over the network; the server decodes the descriptors and performs the matching. –  A third variant maintains a cache of the database on the phone, matches locally first, and issues a search request to the remote server only when the object of interest is not found in the cache, further reducing the amount of data sent over the network. (Girod et al., IEEE Multimedia 2011; Tsai et al., ACM MM 2010)
  61. 61. Stanford Product Search System • Performance evaluation – Dataset: 1 million CD, DVD, and book cover images + 1,000 query images (500×500) with challenging photometric and geometric distortions
[Fig. 7: Example image pairs from the data set. (a) A clean database picture is matched against (b) a real-world picture with various distortions.]
Girod et al. IEEE Multimedia 2011 Oge Marques
  62. 62. Stanford Product Search System • Performance evaluation – Recall vs. bit rate
[Figure 7: Classification accuracy (%) vs. query size (Kbytes) for three schemes: Send feature (CHoG), Send image (JPEG), Send feature (SIFT). CHoG descriptor data is an order of magnitude smaller compared to JPEG images or uncompressed SIFT descriptors.]
Girod et al. IEEE Multimedia 2011 Oge Marques
  63. 63. Stanford Product Search System • Performance evaluation – Processing times
• The system achieves <1 s server processing latency while maintaining high recall.
[Table 3: Processing times.]
– Client-side operations (time in s): image capture, 1–2; feature extraction and compression (for Send Feature mode), 1–1.5
– Server-side operations (time in ms): feature extraction (for Send Image mode), 100; VT matching, 100; fast geometric reranking (per image), 0.46; GV (per image), 30
• Transmission delay depends on the type of network used; data transmission time is insignificant for a WLAN network because of its high bandwidth. [Fig. 9: Measured communication time-out percentage over a 3G network, indoors and outdoors.]
Girod et al. IEEE Multimedia 2011; Tsai et al. ACM MM 2010. Oge Marques
  64. 64. Stanford Product Search System • Performance evaluation – End-to-end latency
[Figure 8: End-to-end response time (feature extraction + network transmission + retrieval, in seconds) for different schemes: JPEG (3G), Feature (3G), Feature progressive (3G), JPEG (WLAN), Feature (WLAN). Compared to a system transmitting a JPEG query image, a scheme employing progressive transmission of CHoG features achieves approximately four times the reduction in system latency over a 3G network.]
Girod et al. IEEE Multimedia 2011 Oge Marques
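The latency gap in Figure 8 is largely a transmission-time story, and can be reproduced with back-of-envelope arithmetic. The query sizes and uplink rates below are illustrative assumptions (roughly the right order of magnitude for a JPEG query vs. a CHoG query on 2011-era networks), not measurements from the Stanford system:

```python
KB = 1024  # bytes

# Assumed query payloads: the deck reports CHoG descriptor data about an
# order of magnitude smaller than a JPEG query image.
query_bytes = {"Send Image (JPEG)": 50 * KB, "Send Feature (CHoG)": 4 * KB}

# Assumed sustained uplink rates in bytes/s (purely illustrative).
uplink = {"3G": 40 * KB, "WLAN": 500 * KB}

for net, rate in uplink.items():
    for mode, size in query_bytes.items():
        print(f"{mode:>20} over {net:>4}: {size / rate:5.2f} s to transmit")
```

Under these assumptions, the compact descriptors cut 3G transmission time from about 1.25 s to about 0.10 s, while on WLAN both payloads transmit quickly. This is consistent with the shape of Figure 8, where Send Feature pays off mainly on the slower network.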
  65. 65. Examples of commercial MVS apps • Google Goggles – Android and iPhone – Narrow-domain search and retrieval http:// Oge Marques
  66. 66. SnapTell • One of the earliest (ca. 2008) MVS apps for iPhone – Eventually acquired by Amazon (A9) • Proprietary technique (“highly accurate and robust algorithm for image matching: Accumulated Signed Gradient (ASG)”). http:// Oge Marques
  67. 67. oMoby (and the IQ Engines API) – iPhone app http:// Oge Marques
  68. 68. oMoby (and the IQ Engines API) • The IQ Engines API: “vision as a service” http:// Oge Marques
  69. 69. The IQEngines API demo app •  Screenshots Oge  Marques  
  70. 70. The IQEngines API demo app •  XML-formatted response Oge  Marques  
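The slide's screenshot of the XML-formatted response is not reproduced in this transcript. As a sketch of how a client might consume such a response, the snippet below parses a hypothetical payload; the element names, labels, and confidence values are invented for illustration and do not reflect the actual IQ Engines schema:

```python
import xml.etree.ElementTree as ET

# Hypothetical response payload; the real IQ Engines schema may differ.
xml_response = """\
<results>
  <result><label>DVD: The Matrix</label><confidence>0.91</confidence></result>
  <result><label>movie poster</label><confidence>0.58</confidence></result>
</results>"""

root = ET.fromstring(xml_response)
matches = [(r.findtext("label"), float(r.findtext("confidence")))
           for r in root.findall("result")]

# Pick the highest-confidence label for display in the app's result view.
best_label, best_conf = max(matches, key=lambda m: m[1])
print(best_label, best_conf)   # DVD: The Matrix 0.91
```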
  71. 71. Kooaba: Déjà Vu and Paperboy • “Image recognition in the cloud” platform http:// Oge Marques
  72. 72. Kooaba: Déjà Vu and Paperboy • Déjà Vu – Enhanced digital memories / notes / journal • Paperboy – News sharing from printed media http:// http:// Oge Marques
  73. 73. pixlinQ • A “mobile visual search solution that enables you to link users to digital content whenever they take a mobile picture of your printed materials.” – Powered by image recognition from LTU technologies http:// Oge Marques
  74. 74. pixlinQ • Example app (La Redoute) http:// Oge Marques
  75. 75. Moodstocks • Real-time mobile image recognition that works offline (!) • API and SDK available http:// Oge Marques
  76. 76. Moodstocks • Many successful apps for different platforms http:// Oge Marques
  77. 77. Concluding thoughts
  78. 78. Concluding thoughts •  Mobile Visual Search (MVS) is coming of age. •  This is not a fad and it can only grow. •  Still a good research topic –  Many relevant technical challenges –  MPEG efforts have just started •  Infinite creative commercial possibilities Oge  Marques  
  79. 79. Side note •  The power of Twitter… Oge  Marques  
  80. 80. Thanks! •  Questions? •  For additional information: Oge  Marques