Nokia Augmented Reality

Radek.Grzeszczuk's presentation from the Aug. 24, 2009 SDForum Virtual World SIG in Palo Alto and online at http://www.virtualworldsig.com

Notes for slides
  • Only a limited number of different Huffman trees exist. The Catalan number gives the number of rooted binary trees (ordered leaves, no cross-overs); counting unique permutations enumerates them.
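The counting claim in this note can be sanity-checked with a short Python sketch of the Catalan numbers (`math.comb` needs Python 3.8+):

```python
from math import comb

def catalan(n: int) -> int:
    """Number of distinct rooted binary trees with n internal nodes
    (equivalently, n + 1 ordered leaves with no cross-overs)."""
    return comb(2 * n, n) // (n + 1)

# A Huffman tree over 5 symbols is a rooted binary tree with
# 5 ordered leaves, i.e. 4 internal nodes:
print([catalan(n) for n in range(1, 6)])  # [1, 2, 5, 14, 42]
```

So a 5-bin gradient distribution admits only 14 distinct tree shapes, which is what makes enumerating and indexing them practical.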
  • Winder and Brown (Microsoft Research), “Learning Local Image Descriptors”: 64x64 patches from tourist photographs of the Trevi Fountain and of Yosemite Valley (920 images), and a test set consisting of images of Notre Dame (500 images). BoostSSC is Boosting Similarity Sensitive Coding: G. Shakhnarovich, P. Viola, and T. Darrell, “Fast pose estimation with parameter sensitive hashing,” Proc. ICCV, 2003. Torralba et al., “Small Codes and Large Image Databases for Recognition,” CVPR 2009. Random projections: C. Yeo, P. Ahammad, and K. Ramchandran, “Rate-Efficient Visual Correspondences Using Random Projections,” 2008.
  • Most retrieval applications require NN search in some form. The descriptors for both SIFT and CHoG were computed from the same set of patches. The VQ-5 bin configuration, GLOH-9 cell configuration, and Huffman tree coding are used for CHoG, resulting in a 45-dimensional descriptor. We observe that exact nearest-neighbor search is 10X faster for CHoG. Furthermore, CHoG is still 2X faster than SIFT with ANN at eps = 1, which incurs a small error rate of 0.30%. The speed-up results from the lower dimensionality of the CHoG descriptor and the use of look-up tables for fast distance computation.
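The look-up-table idea can be illustrated with a minimal sketch; the quantizer size Q, dimensionality D, and reconstruction levels below are assumptions for illustration, not the actual CHoG parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: each descriptor dimension is quantized to one of
# Q reconstruction levels, so a descriptor is a vector of small indices.
Q, D = 32, 45                      # 45-dim, CHoG-like (assumed values)
levels = np.sort(rng.random(Q))    # reconstruction value per index

# Precompute a Q x Q table of squared differences between levels;
# a full descriptor distance is then just D table look-ups and a sum,
# with no per-query floating-point subtractions.
lut = (levels[:, None] - levels[None, :]) ** 2

def lut_distance(a: np.ndarray, b: np.ndarray) -> float:
    return float(lut[a, b].sum())

a = rng.integers(0, Q, D)
b = rng.integers(0, Q, D)
exact = float(((levels[a] - levels[b]) ** 2).sum())
assert abs(lut_distance(a, b) - exact) < 1e-9
```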
  • The scalable vocabulary tree is the data structure at the center of our recognition system. To construct an SVT, first we take every database CD cover and extract robust local features. These features can be SIFT, SURF, or your own favorite type. Then, all the feature descriptors from all the images are represented as vectors in a high-dimensional space. Here, they are shown as 2-dimensional vectors, but in reality, they can be 64-dimensional or 128-dimensional vectors.
  • To impose some structure on this space, we perform hierarchical k-means clustering, the first step of which is dividing the space into k clusters using regular k-means.
  • And then again, recursively splitting each large cluster into k smaller clusters. We repeat this process until the clusters become sufficiently small. What results from the hierarchical k-means algorithm is a tree structure, where tree nodes are the cluster centroids and their children are the sub-cluster centroids.
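The recursive clustering described above can be sketched in a few lines; the toy pure-NumPy k-means and the tiny 2-D data set are stand-ins, not the system's actual implementation:

```python
import numpy as np

def kmeans(X, k, iters=10, rng=None):
    """Plain Lloyd's k-means on rows of X; returns (centers, labels)."""
    rng = rng or np.random.default_rng(0)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

def build_svt(X, k=3, depth=2):
    """Hierarchical k-means: split into k clusters, then recursively
    split each cluster; returns a list of (centroid, children) nodes."""
    if depth == 0 or len(X) < k:
        return []
    centers, labels = kmeans(X, k)
    return [(centers[j], build_svt(X[labels == j], k, depth - 1))
            for j in range(k)]

X = np.random.default_rng(1).random((300, 2))   # toy 2-D "descriptors"
tree = build_svt(X, k=3, depth=2)
print(len(tree))  # 3 top-level clusters
```

In the real system the vectors are 64- or 128-dimensional and the tree is much deeper and wider, but the construction is the same.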
  • Here is the same tree as on the previous slide, except the tree structure is more apparent. Once we have constructed an SVT on a server, processing an incoming query is straightforward. For every query descriptor, we classify it by traversing the SVT greedily from top to bottom. Suppose the first descriptor follows this nearest neighbor path. The SVT knows which database images have features associated with every node, so it votes for the two images found on this path. Both the blue nodes and the green nodes vote, but since the blue nodes are more discriminative, their vote counts for more. Then, another query descriptor goes down a different path and votes for other images. And so on, until all the query descriptors are classified. The final vote tally is a histogram indicating how likely each database image is a match. We notice that when both the query and database images are fronto-parallel, the voting scheme works well and will select the correct database match. This is because similar features are extracted from the query image and the matching database image, leading to their descriptors visiting many of the same nodes in the SVT.
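The greedy traversal with weighted voting can be sketched on a toy two-leaf tree; the node weights and inverted files below are illustrative, not the deck's actual scoring:

```python
import numpy as np

# Toy SVT node: a centroid, child nodes, an inverted file of database
# image ids, and a weight (deeper nodes are more discriminative).
class Node:
    def __init__(self, centroid, children, images, weight):
        self.centroid, self.children = np.asarray(centroid), children
        self.images, self.weight = images, weight

leaf_a = Node([0.0, 0.0], [], {0}, weight=2.0)
leaf_b = Node([1.0, 1.0], [], {1}, weight=2.0)
root = Node([0.5, 0.5], [leaf_a, leaf_b], {0, 1}, weight=1.0)

def vote(query_desc, node, scores):
    """Greedy top-down traversal: every node on the path votes for the
    images in its inverted file, weighted by its discriminativeness."""
    for img in node.images:
        scores[img] = scores.get(img, 0.0) + node.weight
    if node.children:
        nearest = min(node.children,
                      key=lambda c: np.sum((c.centroid - query_desc) ** 2))
        vote(query_desc, nearest, scores)

scores = {}
for d in np.array([[0.1, 0.1], [0.2, 0.0]]):   # toy query descriptors
    vote(d, root, scores)
print(max(scores, key=scores.get))  # image 0 tallies the most votes
```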
  • Performance drops with a single tree, since nodes become less discriminative – fewer features are unique to a particular database image.
  • Feature extraction is robust against rotation and scale change, but NOT robust against foreshortening. This is overcome by putting multiple examples into the database that show the object from different angles.
  • One could put all these views into one vocabulary tree. Distributing views across parallel trees instead prevents competition among the features belonging to different views of the same object; views compete only once all the features are considered. Select the 25 top matches for each SVT based on bin-count similarity, then find the match with the best geometric consistency. The multiview SVT approach is attractive for multi-core servers, since the search through the different trees can run in parallel.
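A rough sketch of this parallel, per-view search; the scorers below are stand-ins returning fixed score tables, and the geometric consistency check over the merged candidates is omitted:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-view scorers: each SVT maps a query to {image: score}.
def make_scorer(view_scores):
    return lambda query: view_scores

views = {
    "front": {"img0": 9.0, "img1": 2.0},
    "left":  {"img1": 5.0, "img2": 4.0},
}
scorers = [make_scorer(s) for s in views.values()]

def query_forest(query, scorers, top_n=25):
    """Score the query against every per-view SVT in parallel, keep the
    top candidates from each tree, and merge; a geometric consistency
    check (not shown) would pick the final match from the union."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda s: s(query), scorers))
    candidates = set()
    for scores in results:
        ranked = sorted(scores, key=scores.get, reverse=True)[:top_n]
        candidates.update(ranked)
    return candidates

print(sorted(query_forest(None, scorers)))  # ['img0', 'img1', 'img2']
```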
  • ICCV: Sept/Oct Kyoto
  • Reduce database size. Increase robustness.
  • Transcript

    • 1. Stanford-Nokia Collaboration: Mobile Augmented Reality. August 2009 Review. Bernd Girod (Stanford University), Radek Grzeszczuk (Nokia Research Center).
    • 2. Mobile Augmented Reality Team: Radek Grzeszczuk, Bernd Girod, Vijay Chandrasekhar, Gabriel Takacs, Wei-Chao Chen, Natasha Gelfand, Yingen Xiong, Kari Pulli, Sam Tsai, David Chen, Jana Kosecka, Ramakrishna Vedantham, Mina Makar.
    • 3. Outline: (1) Review: landmark recognition system – architecture (location-based pre-fetching and matching on the phone); computer vision (“Bag of Words” matching). (2) Feature compression for server-side matching – approaches explored (transform coding of features, patch compression); a compressible descriptor, CHoG (Compressed Histogram of Gradients). (3) Scalability for large databases – from “Bags of Words” to “Vocabulary Trees” to “Vocabulary Forests”; accuracy vs. database size. (4) Towards 3D – multi-view vocabulary trees; matching against 3-D models. (5) Summary and future directions.
    • 4. Outline (section divider; repeats slide 3).
    • 5. Mobile Visual Search: user takes picture … chooses action … confirms POI.
    • 6. Mobile Visual Search Applications: museum guide, tourist guide, landmarks, wine labels, comparison shopping, ads/catalogs, CDs/DVDs/books, movie posters.
    • 7. Landmark Recognition with Feature Matching on the Phone. [Diagram: the phone uses GPS to pre-fetch candidate data from the server; example query recognizes Memorial Church.]
    • 8. “Bag of Words” Matching. [Diagram: feature descriptors from the query image are matched against prefetched database images; feature correspondences pass a geometric consistency check.]
    • 9. Computing Visual Words. [Diagram: SIFT descriptor from an oriented patch and its gradient field; SURF descriptor from filter responses Dxx, Dxy, Dyy with blob response DxxDyy - (0.9 Dxy)^2 and maxima over x, y, scale; patches are oriented along the dominant gradient, and per-cell sums Σdx, Σdy, Σ|dx|, Σ|dy| are accumulated.]
    • 10. Matching Performance. [Plot: distributions of true and false matches; ~90 images/kernel vs. ~1000 images/kernel.]
    • 11. Timing Analysis (Q2 2008). Nokia N95: 332 MHz ARM, 64 MB RAM; 100 KByte JPEG; uplink 60 Kbps. [Chart comparing three configurations – all on phone (extract features, feature matching, geometric consistency), all on server (upload image), extract features on phone (upload features, downloads).]
    • 12. Outline (section divider; repeats slide 3).
    • 13. Advanced Feature Compression: transform coding of SIFT/SURF descriptors [Chandrasekhar et al., VCIP 09]; direct compression of oriented image patches [Makar et al., ICASSP 09]; a descriptor designed for compressibility: CHoG [Chandrasekhar et al., CVPR 09]; tree-structured vector quantization and tree histogram coding [Chen et al., DCC 09]; compression of location information [Tsai et al., MobiMedia 09].
    • 14. CHoG: Compressed Histogram of Gradients. [Diagram: gradients (dx, dy) from an image patch are spatially binned; the gradient distribution in each bin is compressed into a short bit string, forming the CHoG descriptor.]
    • 15. CHoG: Histogram Compression. [Diagram: gradient binning produces a distribution (0.46, 0.21, 0.16, 0.09, 0.08) that a Huffman tree approximates with dyadic probabilities (1/2, 1/4, 1/8, 1/16, 1/16).]
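The dyadic approximation on this slide can be reproduced by building the Huffman tree over the bin probabilities and reading off 2^-codelength for each bin; this is a sketch of the principle, not the deck's actual encoder:

```python
import heapq

def huffman_code_lengths(probs):
    """Code length per symbol from a Huffman tree; 2**-length is the
    dyadic probability the tree implicitly assigns to that symbol."""
    heap = [(p, [i]) for i, p in enumerate(probs)]
    lengths = [0] * len(probs)
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least probable subtrees; every symbol inside
        # them moves one level deeper, so its code grows by one bit.
        p1, s1 = heapq.heappop(heap)
        p2, s2 = heapq.heappop(heap)
        for i in s1 + s2:
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, s1 + s2))
    return lengths

# Gradient-bin distribution from the slide:
probs = [0.46, 0.21, 0.16, 0.09, 0.08]
lengths = huffman_code_lengths(probs)
print([2.0 ** -l for l in lengths])  # [0.5, 0.25, 0.125, 0.0625, 0.0625]
```

The implied dyadic probabilities match the slide's 1/2, 1/4, 1/8, 1/16, 1/16.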
    • 16. Enumerating Huffman Trees: rooted binary trees with n leaf nodes.
    • 17. Feature Matching Performance. [Plot: error rate vs. descriptor size in bits for CHoG, tree-structured vector quantizer, SURF transform, SIFT transform, random projections, BoostSSC, and patch + SIFT, on the ground-truth data set of matching patches of Winder & Brown, CVPR ’07.]
    • 18. Compressed Domain Matching. [Diagram: gradient distributions are quantized by gradient binning to tree indices; descriptor distances are computed from a precomputed distance look-up table indexed by pairs of tree indices.]
    • 19. Nearest Neighbor Search. [Chart: query time for 10^3 query descriptors against 10^6 database descriptors – SIFT exact: 372 s; SIFT ANN (0.3% errors): 47 s; CHoG exact: 28 s.]
    • 20. Location Histogram Coding. [Diagram: feature locations (x, y) are quantized by spatial binning; bin counts are compressed with context-based arithmetic coding, and quantization residuals are sent as refinement bits.] [Tsai et al., MobiMedia 2009]
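A minimal sketch of the spatial-binning step; the grid size and image dimensions are assumed, and the arithmetic-coding and refinement-bit stages are omitted:

```python
import numpy as np

def location_histogram(locations, img_w, img_h, bins=8):
    """Quantize (x, y) feature locations onto a coarse spatial grid and
    count features per cell. The resulting count map is mostly zeros,
    which is what makes context-based arithmetic coding effective;
    quantization residuals can be sent separately as refinement bits."""
    xs = np.clip((locations[:, 0] * bins / img_w).astype(int), 0, bins - 1)
    ys = np.clip((locations[:, 1] * bins / img_h).astype(int), 0, bins - 1)
    hist = np.zeros((bins, bins), dtype=int)
    np.add.at(hist, (ys, xs), 1)   # unbuffered add: handles repeated cells
    return hist

locs = np.array([[10.0, 20.0], [12.0, 22.0], [500.0, 300.0]])
hist = location_histogram(locs, img_w=640, img_h=480)
print(hist.sum(), hist.max())  # 3 2
```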
    • 21. Compressed Feature Vector. [Chart: descriptor + location sizes in bits – SIFT + location (x, y): 1088 bits; CHoG + location (x, y): ~84 bits; CHoG + compressed (x, y): ~59 bits.] [Tsai et al., MobiMedia 2009]
    • 22. Outline (section divider; repeats slide 3).
    • 23. Pairwise Comparison: “Bag of Words” matching and affine consistency check.
    • 24-28. Growing Vocabulary Tree. [Animation over five slides: descriptors are clustered by hierarchical k-means with branching factor k = 3, growing the vocabulary tree level by level.] [Nistér and Stewenius, 2006]
    • 29. Querying Vocabulary Tree. [Diagram: a query descriptor traverses the tree greedily from root to leaf.]
    • 30. Recognition Accuracy. [Plot: recall (percent) vs. number of database images, for a forest of 6 trees and for a single vocabulary tree.]
    • 31. Vocabulary Forest. [Diagram: image features are scored against multiple SVTs with inverted file system (IFS) counts, early termination, and a geometric consistency check (GCC) before matches are combined.]
    • 32. Real-time System: Send Image. [Diagram: the client (camera) sends the image over the wireless network; the server extracts features and performs VocTree image matching, returning information.]
    • 33. Real-time System: Send Features. [Diagram: the client (camera) extracts and codes features, which are sent over the wireless network; the server performs VocTree image matching and returns information.]
    • 34-36. Timing Analysis. Nokia N95: 332 MHz ARM, 64 MB RAM. [Animated chart over three slides: execution time for “send features” (extract features, upload 2.2 kByte, server delay) vs. “send image” (upload 40 kByte image, server delay).]
    • 37. Streaming MAR. [Diagram: during low motion the client sends a query frame; the server extracts features, searches a k-d tree, checks geometry, and returns the ID and geometry (example: John Mayer, “Inside Wants Out”); the client displays the ID, draws the boundary, and compensates and tracks camera pose during high motion.]
    • 38. Outline (section divider; repeats slide 3, except “Matching against 3-d models” becomes “City-scale landmark recognition using view invariant matching”).
    • 39. Multiview Database: front, top, bottom, right, and left view images.
    • 40. Multiview Vocabulary Trees. [Diagram: the query image is scored against separate left, front, top, bottom, and right trees; the top matches from each tree feed a geometric consistency check that selects the top match.]
    • 41. Multiview Matching Performance. [Plots: image recall and match rate by query view (front, left, top, bottom, right) for a front-only SVT vs. multiview SVTs.]
    • 42. Compact Architectural Models from Geo-Registered Image Collections. [Diagram: GPS-tagged images and building outlines feed camera pose estimation, robust map alignment, and efficient view selection to produce a 3D model of the landmark; sources include unstructured image collections (Panoramio) and structured image collections (Navteq Street View data).] [Grzeszczuk, 3DIM 2009]
    • 43. View-Invariant Matching Pipeline. [Diagram: database images are rectified using the 3D model, their features extracted into a feature store; an oblique query image is rectified using vanishing points, its features extracted and matched against the store to produce matching results.]
    • 44. Outline (section divider; repeats slide 3).
    • 45. Research Directions.
      Image features: keypoint detection optimized for CHoG, prioritization; comprehensive performance analysis of compressed feature matching; next-generation CHoG (soft kernels vs. hard binning; embedded, refinable bitstream); beyond RANSAC (advanced geometry matching and coding, incorporating scale and orientation).
      Image database/vocabulary trees: optimum tree/forest growing, CHoG trees, incremental database update; fast query, early termination, distance metrics, scoring, nearest-neighbor algorithms; trees for phone implementation, inverted file caching, tree histogram coding.
      Streaming mobile augmented reality: camera pose estimation, feature tracking, temporally coherent feature extraction; continuous recognition strategies, scheduling, latency minimization; superposition of graphics information, motion compensation, occlusion handling.
      3D modeling: image matching pipeline using 3D models; automatic image rectification, features from texture maps; methods for integrating heterogeneous image sources; demonstrate improved landmark recognition for large-scale urban scenes. Collaboration with Marc Pollefeys, ETH Zurich.
