Lec-07: Feature Aggregation and Image Retrieval System [notes]
Image retrieval system performance metrics, precision, recall, true positive rate, false positive rate; Bag of Words (BoW) and VLAD aggregation.
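The metrics listed above can be sketched for a single query with plain Python; the helper name and the example sets are illustrative, assuming binary relevance labels:

```python
def retrieval_metrics(retrieved, relevant, num_items):
    """Precision, recall (true positive rate), and false positive rate for one query.

    retrieved: set of item ids returned by the system
    relevant:  set of ground-truth relevant item ids
    num_items: total size of the database
    """
    tp = len(retrieved & relevant)          # relevant items that were returned
    fp = len(retrieved - relevant)          # irrelevant items that were returned
    fn = len(relevant - retrieved)          # relevant items that were missed
    tn = num_items - tp - fp - fn           # irrelevant items correctly left out

    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0     # = true positive rate
    fpr = fp / (fp + tn) if (fp + tn) else 0.0           # false positive rate
    return precision, recall, fpr

# 10-item database, 4 relevant items, system returns 5 items (3 of them relevant)
p, r, f = retrieval_metrics({1, 2, 3, 8, 9}, {1, 2, 3, 4}, 10)
```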
Lec-17: Sparse Signal Processing & Applications [notes]
Sparse signal processing, recovery of sparse signal via L1 minimization. Applications including face recognition, coupled dictionary learning for image super-resolution.
The document discusses subspace indexing on Grassmannian manifolds for large-scale visual identification. It proposes using local subspace models built on neighborhoods defined by queries, but notes issues with computational complexity and lack of optimality. It then introduces Grassmannian and Stiefel manifolds to characterize subspace similarity and define distances. A hierarchical model tree is proposed to index subspaces through iterative merging based on distances on the Grassmannian manifold.
Image Retrieval with Fisher Vectors of Binary Features (MIRU'14) - Yusuke Uchida
Recently, the Fisher vector representation of local features has attracted much attention because of its effectiveness in both image classification and image retrieval. Another trend in image retrieval is the use of binary features such as ORB, FREAK, and BRISK. Given the significant accuracy gains that the Fisher vector of continuous feature descriptors brings to both classification and retrieval, applying the Fisher vector to binary features should yield the same benefits for binary-feature-based retrieval and classification. In this paper, we derive a closed-form approximation of the Fisher vector of binary features modeled by a Bernoulli mixture model. Experiments show that the Fisher vector representation improves image retrieval accuracy by 25% compared with a bag-of-binary-words approach.
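In the standard Fisher kernel framework the abstract builds on, the Fisher vector of a feature set X = {x_1, ..., x_T} under a parametric model p(x|λ) is the Fisher-information-normalized gradient of the log-likelihood; for the binary case the model is a mixture of multivariate Bernoullis (notation here is the generic textbook form, not necessarily the paper's):

```latex
G_\lambda^X = F_\lambda^{-1/2} \, \nabla_\lambda \log p(X \mid \lambda),
\qquad
p(x \mid \lambda) = \sum_{k=1}^{K} w_k \prod_{d=1}^{D} \mu_{kd}^{\,x_d} \, (1 - \mu_{kd})^{1 - x_d},
```

where F_λ is the Fisher information matrix, w_k the mixture weights, and μ_kd the Bernoulli parameters; the paper's contribution is a closed-form approximation of this gradient for the Bernoulli mixture.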
http://imatge-upc.github.io/retrieval-2017-cam/
Image retrieval in realistic scenarios targets large dynamic datasets of unlabeled images. In these cases, training or fine-tuning a model every time new images are added to the database is neither efficient nor scalable.
Convolutional neural networks trained for image classification over large datasets have proven to be effective feature extractors when transferred to the task of image retrieval. The most successful approaches encode the activations of convolutional layers, as these convey the image's spatial information. Our proposal goes further, aiming at a local-aware encoding of these features depending on the predicted image semantics, with the advantage of using only the knowledge contained inside the network.
In particular, we employ Class Activation Maps (CAMs) to obtain the most discriminative regions from a semantic perspective. Additionally, CAMs are also used to generate object proposals during an unsupervised re-ranking stage after a first fast search.
Our experiments on two publicly available datasets for instance retrieval, Oxford5k and Paris6k, demonstrate that our system is competitive and even outperforms the current state-of-the-art when using off-the-shelf models trained on the object classes of ImageNet.
Lec-16: Subspace/Transform Optimization
Address the non-linearity issues in appearance manifolds by having a piece-wise linear solution. Query driven local model learning, subspace indexing on Grassmann manifold, direct Newtonian method of subspace optimization on Grassmann manifold.
Object Detection Beyond Mask R-CNN and RetinaNet III - Wanjin Yu
This document provides an overview of fine-grained image analysis. It begins with background on computer vision, deep learning, and traditional image recognition/retrieval. It then introduces fine-grained image analysis, distinguishing it from generic image recognition through examples. Challenges of fine-grained analysis are discussed, including small inter-class variance and large intra-class variance. Real-world applications of fine-grained analysis are presented across domains like species identification.
Slides by Albert Jimenez about the following paper:
Gordo, Albert, Jon Almazan, Jerome Revaud, and Diane Larlus. "Deep Image Retrieval: Learning global representations for image search." arXiv preprint arXiv:1604.01325 (2016).
We propose a novel approach for instance-level image retrieval. It produces a global and compact fixed-length representation for each image by aggregating many region-wise descriptors. In contrast to previous works employing pre-trained deep networks as a black box to produce features, our method leverages a deep architecture trained for the specific task of image retrieval. Our contribution is twofold: (i) we introduce a ranking framework to learn convolution and projection weights that are used to build the region features; and (ii) we employ a region proposal network to learn which regions should be pooled to form the final global descriptor. We show that using clean training data is key to the success of our approach. To that aim, we leverage a large scale but noisy landmark dataset and develop an automatic cleaning approach. The proposed architecture produces a global image representation in a single forward pass. Our approach significantly outperforms previous approaches based on global descriptors on standard datasets. It even surpasses most prior works based on costly local descriptor indexing and spatial verification. We intend to release our pre-trained model.
This document discusses techniques for instance search using convolutional neural network features. It presents two papers by the author on this topic. The first paper uses bags-of-visual-words to encode convolutional features for scalable instance search. The second paper explores using region-level features from Faster R-CNN models for instance search and compares different fine-tuning strategies. The document outlines the methodology, experiments on standard datasets, and conclusions from both papers.
http://imatge-upc.github.io/telecombcn-2016-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of big annotated data and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which had been addressed until now with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
Mask R-CNN extends Faster R-CNN by adding a branch for predicting segmentation masks in parallel with bounding box recognition and classification. It introduces a new layer called RoIAlign to address misalignment issues in the RoIPool layer of Faster R-CNN. RoIAlign improves mask accuracy by 10-50% by removing quantization and properly aligning extracted features. Mask R-CNN runs at 5fps with only a small overhead compared to Faster R-CNN.
An image histogram represents the distribution of pixel intensities in a digital image. It plots the number of pixels for each tonal value. Histograms can reveal if an image is under-exposed or over-exposed based on where most pixel values are concentrated. Histogram equalization improves contrast by spreading out pixel values across intensity levels. Local histogram equalization applies this within neighborhoods to enhance detail while preserving edges.
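The global equalization described above can be sketched in a few lines of numpy (this is the global variant only, not the local neighborhood version; the helper name is mine):

```python
import numpy as np

def equalize_hist(img):
    """Global histogram equalization for a uint8 grayscale image."""
    hist = np.bincount(img.ravel(), minlength=256)   # per-intensity pixel counts
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]                        # first nonzero CDF value
    # map each intensity so the output CDF is approximately uniform
    lut = np.round((cdf - cdf_min) / (cdf[-1] - cdf_min) * 255).astype(np.uint8)
    return lut[img]
```

The lookup table stretches the occupied intensity range to the full [0, 255] interval, which is exactly the contrast-spreading effect described above.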
Deep image retrieval - learning global representations for image search - ub ... (Universitat de Barcelona)
This document summarizes a research paper on deep image retrieval using global image representations. It presents three key ideas: 1) A siamese network trained with a triplet loss to learn image representations optimized for retrieval. 2) Replacing rigid region grids with a region proposal network to localize regions of interest. 3) Experiments showing their method outperforms classification features and achieves state-of-the-art results on standard retrieval datasets. Their work demonstrates an effective and scalable approach to image retrieval based on learning compact global image signatures.
This document summarizes an improved fixed point method for image restoration. It begins by introducing the problem of image restoration and describing common image degradation models. It then presents the traditional fixed point method using Tikhonov regularization. The improved method proposes changing the regularization parameter from larger to smaller values across iterations. Experimental results show the improved method performs better than other popular algorithms at solving motion and Gaussian degradation, producing reconstructed images with less noise and more detail.
Retrieval algorithms in remote sensing generally involve complex physical forward models that are nonlinear and computationally expensive to evaluate. Statistical emulation provides a computationally cheap alternative that can be used to calibrate model parameters and to improve the computational efficiency of the retrieval algorithms. We introduce a framework combining dimension reduction of the input and output spaces with Gaussian process emulation. Functional principal component analysis (FPCA) is chosen to reduce the output space of thousands of dimensions by orders of magnitude. In addition, instead of making restrictive assumptions regarding the correlation structure of the high-dimensional input space, we identify and exploit the most important directions of this space and thus construct a Gaussian process emulator with feasible computation. We will present preliminary results obtained from applying our method to OCO-2 data, and discuss how our framework can be generalized in distributed systems.
https://telecombcn-dl.github.io/dlmm-2017-dcu/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of big annotated data and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which had been addressed until now with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
Optimized linear spatial filters implemented in FPGA - IOSRJVSP
Linear spatial filters (LSF) are used to filter digital images for blurring, noise reduction, detail enhancement, etc. Realizing LSF faces the central problem of the large number of operations needed for their computation. This paper describes an approach to optimizing LSF through parallel algorithms and their hardware implementation on FPGA. A model and an algorithm based on partial sums for calculating the filtered pixels are presented, and criteria for comparing the different types of linear filters are defined. A schematic diagram of an FPGA-based DSP operational block is shown; VHDL is used for the hardware design. Studies comparing the partial-sums and non-partial-sums methods of filtering establish that the partial-sums methods reduce the number of operations to the size of the window (3, 5, 7, ...). These FPGA-based LSF are suitable for applications such as threshold detection, edge detection, or image detail enhancement.
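The partial-sums idea can be illustrated in software (the paper's FPGA design is not reproduced here): keeping a running sum lets a sliding-window filter update each output with one add and one subtract instead of recomputing the whole window. A 1-D mean filter shows the principle; the replicate padding is an illustrative border choice:

```python
def box_filter_1d(x, w):
    """Mean filter of odd window w using a running partial sum.

    After the first output, each step needs only one add and one subtract,
    instead of w additions -- the saving the partial-sums method exploits.
    """
    assert w % 2 == 1
    n, r = len(x), w // 2
    padded = [x[0]] * r + list(x) + [x[-1]] * r   # replicate-pad the borders
    s = sum(padded[:w])                           # initial partial sum
    out = [s / w]
    for i in range(1, n):
        s += padded[i + w - 1] - padded[i - 1]    # slide the window by one
        out.append(s / w)
    return out
```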
Digital Image Processing: Image Enhancement in the Spatial Domain - Mostafa G. M. Mostafa
This document discusses various image enhancement techniques in the spatial domain, including point operations, histogram equalization, and spatial filtering. Point operations include transformations like thresholding, negatives, power-law and gamma corrections that manipulate individual pixel intensities. Histogram equalization improves contrast by spreading out the most frequent intensity values. Spatial filtering techniques like smoothing, sharpening and edge detection use small filters to modify pixel values based on neighboring areas.
https://telecombcn-dl.github.io/2018-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
This document provides an overview of digital image fundamentals and operations. It defines what a digital image is, how it is represented as a matrix, and common image types like RGB, grayscale, and binary. Pixels, resolution, neighborhoods, and basic relationships between pixels are discussed. The document also covers different types of image operations including point, local, and global operations as well as examples like arithmetic, logical, and geometric transformations. Finally, it introduces concepts of linear and nonlinear operations and announces the topic of the next lecture on image enhancement in the spatial domain.
We present Graph Convolutional Networks which, unlike classic deep learning models, allow supervised learning that exploits both individual node features and their relationships with other nodes in the network.
RP BASED OPTIMIZED IMAGE COMPRESSING TECHNIQUE - prj_publication
The document describes an optimized technique for compressing color images using colorization-based coding.
[1] Colorization-based coding works by extracting representative pixels (RP) from an original color image that contain color information, and using colorization to restore the full color image at the decoder.
[2] Previous methods obtained redundant RPs and did not remove unnecessary ones. The presented technique formulates RP extraction as an optimization problem (L1 minimization) to obtain a sparse set of high-quality RPs.
[3] A colorization matrix is constructed using multiscale mean-shift clustering of the luminance channel. The RP set is then extracted by solving the optimization problem using this matrix.
Invited talk at USTC and SJTU, discussing recent progress in object re-identification against very large repositories, especially the problems of fast keypoint detection, feature repeatability prediction, aggregation, and object repository indexing and search.
This document summarizes a lecture on rate-distortion optimization in video coding. It discusses various rate-distortion optimization problems including operational rate-distortion theory, joint source-channel coding optimization, storage constraint allocation, delay-constrained allocation, buffer-constrained allocation, and the multi-user problem. It also covers convex optimization techniques like the Lagrangian method that can help solve some of these optimization problems.
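For discrete operating points, the Lagrangian method mentioned above reduces to picking the point that minimizes J = D + λR; sweeping λ traces the convex hull of the rate-distortion curve. The (rate, distortion) points below are hypothetical:

```python
def lagrangian_choice(points, lam):
    """Pick the (rate, distortion) operating point minimizing J = D + lam * R."""
    return min(points, key=lambda rd: rd[1] + lam * rd[0])

# illustrative operating points as (rate, distortion) pairs
points = [(1.0, 9.0), (2.0, 4.0), (4.0, 1.0)]
low_rate = lagrangian_choice(points, lam=10.0)  # large lambda favors low rate
high_q   = lagrangian_choice(points, lam=0.5)   # small lambda favors low distortion
```

A bisection search on λ then finds the operating point that meets a given rate budget, which is how the constrained allocation problems in the lecture are typically solved.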
This document provides an overview of the Scale Invariant Feature Transform (SIFT) algorithm for feature detection and matching across images. It begins by introducing SIFT and its applications in computer vision. The document then outlines the key steps of the SIFT algorithm, including constructing scale space, approximating the Laplacian of Gaussian, finding keypoints, removing low-contrast keypoints, assigning orientations to keypoints, and generating SIFT features. Details are provided for each step, with examples to illustrate the process. The goal of SIFT is to detect features that are invariant to scale, rotation, illumination and viewpoint changes.
This document summarizes a presentation on augmenting descriptors for fine-grained visual categorization using polynomial embedding. The presentation introduces polynomial embedding as a method to exploit co-occurrence information between neighboring local descriptors. Polynomial embedding compresses polynomials of neighboring local feature vectors with supervised dimensionality reduction to obtain discriminative latent descriptors. Experiments on fine-grained categorization datasets show that polynomial embedding improves classification accuracy over baselines and state-of-the-art methods. However, the method is less effective for object and scene categorization problems.
SIFT is a method to automatically detect distinctive keypoints in images that are invariant to scale, orientation, and illumination changes. It works by identifying locations and scales that remain consistent across different views of the same object using scale-space analysis and rejecting unstable points. It then assigns a consistent orientation and creates a keypoint descriptor for each point based on local gradient orientations. These keypoints can then be used to reliably match different views of an object or scene.
Feature Matching using SIFT algorithm; co-authored presentation on Photogrammetry studio by Sajid Pareeth, Gabriel Vincent Sanya, Sonam Tashi and Michael Mutale
SIFT extracts distinctive invariant features from images to enable object recognition despite variations in scale, rotation, and illumination. The algorithm involves:
1) Constructing scale-space images from differences of Gaussians to identify keypoints.
2) Detecting stable local extrema across scales as candidate keypoints.
3) Filtering out low contrast keypoints and those poorly localized along edges.
4) Assigning orientations based on local gradient directions.
5) Computing descriptors by sampling gradients around keypoints for matching between images.
SIFT is a scale-invariant feature transform algorithm used to detect and describe local features in images. It detects keypoints that are invariant to scale, rotation, and partially invariant to illumination and viewpoint changes. The algorithm involves 4 main steps: (1) scale-space extrema detection, (2) keypoint localization, (3) orientation assignment, and (4) keypoint descriptor generation. SIFT descriptors provide a feature vector for each keypoint that is highly distinctive and partially invariant to remaining variations.
The document compares and summarizes different local feature descriptors, including SIFT. SIFT detects and describes local features to identify objects across images and is scale, rotation, and illumination invariant. It performs well compared to other descriptors like SURF, GLOH, and LESH under various image transformations such as rotation and blur, though it is computationally expensive. SURF is similar in performance to SIFT but faster, while GLOH and LESH also show good performance for shape-based tasks.
This document discusses image enhancement techniques in the spatial domain. It defines spatial domain processing as the direct manipulation of pixel values, as opposed to frequency domain processing which modifies the Fourier transform. The key techniques discussed are:
- Linear and non-linear transformations which map input pixel values to new output values.
- Spatial filters which operate on neighborhoods of pixels, including smoothing filters to reduce noise and sharpening filters to enhance edges.
- Histogram processing techniques like equalization to improve contrast in low contrast images.
The document provides examples of each technique and discusses their applications in image enhancement.
PCA-SIFT: A More Distinctive Representation for Local Image Descriptors (wolf)
PCA-SIFT is a modification of SIFT that uses principal component analysis (PCA) to build more distinctive local image descriptors. It constructs a projection matrix from a large set of image patches, then projects each keypoint descriptor through this matrix to a compact vector of the top n principal components. This provides a more discriminative representation than SIFT while reducing descriptor dimensionality, leading to improved matching accuracy and efficiency. Evaluation on controlled transformation and graffiti datasets shows PCA-SIFT achieves higher recall rates at equivalent or lower false positive rates compared to SIFT.
A COMPARATIVE ANALYSIS OF RETRIEVAL TECHNIQUES IN CONTENT BASED IMAGE RETRIEVAL (cscpconf)
Basic visual techniques such as color, shape, and texture are used in Content Based Image Retrieval (CBIR) to match a query image, or a sub-region of an image, against similar images in an image database. To improve query results, relevance feedback is often used in CBIR to help users express their preferences. In this paper, a new approach for image retrieval is proposed based on features such as Color Histogram, Eigen Values, and Match Point. Images from various types of databases are first identified using edge detection techniques. Once an image is identified, it is searched for in the particular database and all related images are displayed, which saves retrieval time. Further, to retrieve the precise query image, any of the three techniques can be used, and a comparison is made w.r.t. average retrieval time. The Eigen-value technique was found to be the best of the three.
The document discusses objective and subjective quality assessment methods for evaluating the rate-distortion performance of JPEG-XR image compression. It provides details on various objective metrics used to compare JPEG-XR to JPEG2000 and JPEG, including PSNR, SSIM, VIF, and others. It also proposes a methodology for subjective testing using a continuous quality scale to validate the results of objective metrics. Preliminary results are presented comparing different JPEG-XR implementations using the objective metrics.
This document discusses single object tracking and velocity determination. It begins with an introduction and objectives of the project which is to develop an algorithm for tracking a single object and determining its velocity in a sequence of video frames. It then provides details on preprocessing techniques like mean filtering, Gaussian smoothing and median filtering to reduce noise. It describes segmentation methods including histogram-based, single Gaussian background and frame difference approaches. Feature extraction methods like edges, bounding boxes and color are explained. Object detection using optical flow and block matching is covered. Finally, it discusses tracking and calculating velocity of the moving object. MATLAB is introduced as a technical computing language for solving these types of problems.
The document describes a pipeline for 3D object recognition and 6-DOF pose estimation from an RGB-D image. It involves generating synthetic views for training, extracting global features combining color and geometry, recognizing objects by matching features, and optimizing the estimated pose using ICP. The feature descriptor encodes relationships between point pairs and correlates color and geometry information across segmented regions. The approach achieves 94% recognition rate on online objects and 100% on synthetic views.
Distributed Formal Concept Analysis Algorithms Based on an Iterative MapReduc... (Ruairi de Frein)
An article from the Telecommunications Software & Systems Group, Waterford Institute of Technology, Ireland describing algorithms for distributed Formal Concept Analysis
ABSTRACT
While many existing formal concept analysis algorithms are efficient, they are typically unsuitable for distributed implementation. Taking the MapReduce (MR) framework as our inspiration we introduce a distributed approach for performing formal concept mining. Our method has its novelty in that we use a light-weight MapReduce runtime called Twister which is better suited to iterative algorithms than recent distributed approaches. First, we describe the theoretical foundations underpinning our distributed formal concept analysis approach. Second, we provide a representative exemplar of how a classic centralized algorithm can be implemented in a distributed fashion using our methodology: we modify Ganter's classic algorithm by introducing a family of MR* algorithms, namely MRGanter and MRGanter+ where the prefix denotes the algorithm's lineage. To evaluate the factors that impact distributed algorithm performance, we compare our MR* algorithms with the state-of-the-art. Experiments conducted on real datasets demonstrate that MRGanter+ is efficient, scalable and an appealing algorithm for distributed problems.
Accepted for publication at the International Conference for Formal Concept Analysis 2012.
Project participants: Biao Xu, Ruairí de Fréin, Eric Robson, Mícheál Ó Foghlú
Ruairí de Fréin: rdefrein (at) gmail (dot) com
bibtex:
@incollection{
year={2012},
isbn={978-3-642-29891-2},
booktitle={Formal Concept Analysis},
volume={7278},
series={Lecture Notes in Computer Science},
editor={Domenach, Florent and Ignatov, DmitryI. and Poelmans, Jonas},
doi={10.1007/978-3-642-29892-9_26},
title={Distributed Formal Concept Analysis Algorithms Based on an Iterative MapReduce Framework},
url={http://dx.doi.org/10.1007/978-3-642-29892-9_26},
publisher={Springer Berlin Heidelberg},
keywords={Formal Concept Analysis; Distributed Mining; MapReduce},
author={Xu, Biao and Fréin, Ruairí and Robson, Eric and Ó Foghlú, Mícheál},
pages={292-308}
}
DOWNLOAD
The article Arxiv: http://arxiv.org/abs/1210.2401
3D Reconstruction from Multiple Uncalibrated 2D Images of an Object (Ankur Tyagi)
3D reconstruction is the process of capturing the shape and appearance of real objects. In this project we use passive methods, which only use sensors to measure the radiance reflected or emitted by the object's surface to infer its 3D structure.
Computer Vision, Summer Training Programme (Praveen Pandey)
This document outlines the topics covered in a course on computer vision and image processing using MATLAB. The course covers fundamentals of MATLAB, image processing techniques, thumb detection, fingerprint matching and project development. Specific topics include image types in MATLAB, image arithmetic, filtering, transforms, segmentation, texture analysis, noise removal, and compression. Morphology, discrete transforms, and color image processing are also discussed.
Query Image Searching With Integrated Textual and Visual Relevance Feedback f... (IJERA Editor)
Many researchers have studied relevance feedback in the content-based image retrieval (CBIR) literature, but no CBIR search engines support it because of scalability, effectiveness, and efficiency issues. Here, we implemented integrated relevance feedback for retrieving web images, concentrating on the integration of both textual-feature (TF) and visual-feature (VF) based relevance feedback (RF), while also testing them individually. The TF-based RF employs an effective search result clustering (SRC) algorithm to obtain salient phrases. A new user interface (UI) is then proposed to support RF. Experimental results show that the proposed algorithm is scalable, effective, and accurate.
The document summarizes the ORB (Oriented FAST and Rotated BRIEF) feature detection and description algorithm. It begins by explaining how ORB improves on SIFT and SURF by combining the FAST keypoint detector with BRIEF descriptors to provide a method that is faster and has rotation invariance. It then describes the FAST detector, BRIEF descriptors, and how ORB adds orientation to BRIEF to achieve rotational invariance. Finally, it provides an overview of the full ORB algorithm and demonstrates its applications in areas like image matching, object recognition, and robot vision.
This document describes a summer internship project on digital image processing and analysis conducted by Rajarshi Roy at the Indian Institute of Engineering Science and Technology under the guidance of Dr. Samit Biswas from May to June 2016. It includes an acknowledgment, table of contents, abstract, and analysis of various digital image processing techniques applied to images, including reading and writing images, applying filters like negative, sharpening, edge detection, transposing the image matrix, stretching images, and applying mean filtering. The document provides details on the code developed in C++ to perform these image processing functions and analyze the results.
The document summarizes a proposed system for currency recognition on mobile phones. The system has the following modules: 1) segmentation to isolate the currency from background noise, 2) feature extraction and building a visual vocabulary, 3) instance retrieval using inverted indexing and spatial reranking, 4) classification by vote counting spatially consistent features. The system was adapted for mobile by reducing complexity, such as using an inverted index, while maintaining accuracy. Performance is evaluated using metrics like accuracy and precision.
Modelling User Interaction utilising Information Foraging Theory (and a bit o... (Ingo Frommholz)
The document discusses modelling user interaction with information using Information Foraging Theory and quantum theory. It summarizes research applying IFT to content-based image recommendation and query auto-completion in image search. A quantum-inspired model is proposed to represent user interaction as a state change in a Hilbert space, where subspaces represent queries, images, and image patches. User feedback projects the information need vector onto relevant subspaces, updating probabilities. This provides a framework for multimodal query auto-completion accounting for user interaction.
The paper proposes novel kernel descriptors for visual recognition based on gradient, color, and local binary pattern (shape) features. Kernel descriptors reduce granularity of pixel features and better capture image variations compared to existing methods like SIFT. Gradient kernel descriptor performed best on four datasets for image classification, outperforming SIFT and other methods. The descriptors provide a computationally feasible way to learn high-level visual features using kernel methods.
Semantics in Digital Photos: A Contextual Analysis (AllenWu)
Interpreting the semantics of an image is a hard problem. However, for storing and indexing large multimedia collections, it is essential to build systems that can automatically extract semantics from images. In this research we show how we can fuse content and context to extract semantics from digital photographs. Our experiments show that if we can properly model the context associated with media, we can interpret semantics using only a part of the high-dimensional content data.
Video Stitching using Improved RANSAC and SIFT (IRJET Journal)
1. The document discusses techniques for stitching multiple video frames into a panoramic video using Scale-Invariant Feature Transform (SIFT) and an improved RANSAC algorithm.
2. Key points and feature descriptors are extracted from frames using SIFT to find correspondences between frames. The improved RANSAC algorithm is used to estimate homography matrices between frames and filter outlier matches.
3. Frames are blended together to compensate for exposure differences and misalignments before being mapped to a reference plane to create the panoramic video mosaic. The algorithm aims to produce a high quality panoramic video in real-time.
This document describes a search engine for images that can take a query image and find similar images from a large database. The search engine uses wavelet transforms and k-means clustering to extract feature vectors from images and group them into regions. It then computes signatures for each image by combining regional features. Distances between query and database image signatures determine similarity. The method was implemented in MATLAB and tested on a database of 1000 images, producing results similar to human perception of image similarity.
This document summarizes a lecture on entropy coding and discusses Huffman coding and Golomb coding. It begins with an overview of entropy, conditional entropy, and mutual information. It then explains Huffman coding by describing the Huffman coding procedure and properties like optimality. Golomb coding is also summarized, including the Golomb code construction and its advantages over unary coding. Implementation details are provided for Golomb encoding and decoding.
Introduction of the information theory basis for image/video coding, especially entropy, rate-distortion theory, entropy coding, Huffman coding, and arithmetic coding.
This document outlines the syllabus for a Multimedia Communication class taught by Zhu Li in spring 2016. The class will cover topics related to video coding standards, video compression techniques, and video networking. Students will complete homework assignments, two quizzes, and a project. The goal of the class is for students to understand multimedia compression theory and algorithms, and be able to apply their knowledge to solve real-world problems in media communication.
This document proposes a light weight video fingerprinting technique for verifying video playback in MPEG DASH streaming. It introduces differential eigen-appearance signatures to capture fingerprints of video sequences with high accuracy but low computational complexity. A simulation of the technique on 4000 advertisement clips streamed at different rates showed fingerprints can be computed at 200 bits/sec with less than 0.5% of decoding cost while achieving over 99% true positive rate and low false positive rates for playback verification. Future work will develop faster verification using binarized fingerprints and automatic token hashing.
This document presents a method for detecting and localizing video duplicates in large video repositories. It proposes modeling the duplicate likelihood using a Gaussian process that accounts for degradation between original and duplicate frames. It approximates the likelihood function using multi-indexed locality search to prune unlikely sequence matches. Simulation results on a 116 hour repository show the approach achieves high accuracy while scaling efficiently to large datasets. Future work aims to further improve efficiency to handle repositories with tens of thousands of hours of video.
3. Scale Space Theory - Lindeberg
Scale Space Response via Laplacian of Gaussian; the scale is controlled by σ.
LoG: ∇²g = ∂²g/∂x² + ∂²g/∂y², with Gaussian kernel g = exp(−(x² + y²)/(2σ²))
Characteristic Scale: the σ at which the LoG response of an image structure of radius r peaks (figure: responses at σ = 0.8r, σ = 1.2r, σ = 2r, with the peak marking the characteristic scale).
Image Analysis & Retrieval, 2016 p.3
4. SIFT
Use DoG to approximate LoG:
Separable Gaussian filter
Difference of images instead of difference of Gaussian kernels
Scale space construction by Gaussian filtering and image difference.
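The DoG construction above can be sketched in a few lines. This is an illustrative Python sketch (not the course's MATLAB code) using a separable 1-D Gaussian on plain nested lists; `gaussian_blur` and `dog` are our own helper names, and a real implementation would use optimized image libraries.

```python
import math

def gaussian_kernel(sigma, radius=None):
    """1-D Gaussian kernel, normalized to sum to 1 (for separable filtering)."""
    if radius is None:
        radius = int(math.ceil(3 * sigma))
    k = [math.exp(-(x * x) / (2 * sigma * sigma)) for x in range(-radius, radius + 1)]
    s = sum(k)
    return [v / s for v in k]

def convolve1d(row, kernel):
    """Convolve one row with the kernel, replicating border pixels."""
    r = len(kernel) // 2
    n = len(row)
    out = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = min(max(i + j - r, 0), n - 1)
            acc += w * row[idx]
        out.append(acc)
    return out

def gaussian_blur(img, sigma):
    """Separable Gaussian filter: blur the rows, then the columns."""
    k = gaussian_kernel(sigma)
    rows = [convolve1d(row, k) for row in img]
    cols = [convolve1d(list(col), k) for col in zip(*rows)]  # transpose, blur
    return [list(row) for row in zip(*cols)]                 # transpose back

def dog(img, sigma, k=1.6):
    """Difference of Gaussians: blur at two nearby scales and subtract the images."""
    g1 = gaussian_blur(img, sigma)
    g2 = gaussian_blur(img, k * sigma)
    return [[a - b for a, b in zip(r1, r2)] for r1, r2 in zip(g1, g2)]
```

A bright blob gives a positive DoG response at its center, since the narrower Gaussian concentrates more mass there than the wider one.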
5. Peak Strength & Edge Removal
Peak Strength: interpolate the true DoG response and pixel location by a Taylor expansion around the detected extremum.
Edge Removal: re-do a Harris-type detection on the much-reduced pixel set to remove keypoints lying on edges.
6. Scale Invariance through Dominant Orientation Coding
Voting for the dominant orientation, weighted by a Gaussian window to give more emphasis to the gradients closer to the center.
7. SIFT Matching and Repeatability Prediction
SIFT Distance: match via the nearest-neighbor distance ratio test, declaring a match when
d(s₁¹, s_k*²) / d(s₁¹, s_k²) ≤ θ
where s_k*² is the nearest neighbor of s₁¹ in the second image and s_k² the second-nearest.
Not all SIFT are created equal: repeatability can be predicted from peak strength (the DoG response at the interpolated position) and the combined scale/peak-strength pmf.
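The ratio test above can be sketched as follows; a minimal Python sketch, where the default θ = 0.8 is an assumed value and the function names are ours.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def ratio_test_match(desc, candidates, theta=0.8):
    """Nearest-neighbor distance ratio test: accept the nearest neighbor only
    if d(nearest) / d(second-nearest) <= theta; otherwise reject as ambiguous."""
    dists = sorted((euclidean(desc, c), i) for i, c in enumerate(candidates))
    (d1, best), (d2, _) = dists[0], dists[1]
    if d2 > 0 and d1 / d2 <= theta:
        return best        # index of the accepted match
    return None            # ambiguous, rejected
```

A clear nearest neighbor passes; two near-equidistant candidates are rejected as unreliable.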
8. Box Filter - CABOX work
Basic Idea: approximate the DoG kernel with a linear combination of box filters:
min over h of ‖g − B·h‖²₂ + λ‖h‖₁
Solution by LASSO.
(Figure: DoG kernel ≈ h₁·box₁ + h₂·box₂ + …)
10. Image Matching/Retrieval System
SIFT is a sub-image-level feature; we actually care more about how SIFT matches translate into image-level matching/retrieval accuracy.
Say we can compute a single distance from a collection of features:
d(I₁, I₂) = Σₖ αₖ d(Fₖ¹, Fₖ²)
Then for a database of n images, we can compute an n x n distance matrix
D(j, k) = d(Iⱼ, Iₖ)
This gives us full information about the performance of this feature/distance system. How do we characterize the performance of such an image matching and retrieval system?
11. Thresholding for Matching
Basically, for any pair of images (documents, in IR jargon), we declare:
Iⱼ, Iₖ are a match if d(Iⱼ, Iₖ) < t; otherwise they are not a match.
Then for each possible image pair (or the pairs we care about), at a given threshold t there are 4 possible outcomes:
TP pair: {Iⱼ, Iₖ} is a true matching pair, declared matching, d(Iⱼ, Iₖ) < t;
FP pair: {Iⱼ, Iₖ} is a true non-matching pair, wrongly declared matching, d(Iⱼ, Iₖ) < t;
TN pair: {Iⱼ, Iₖ} is a true non-matching pair, declared non-matching, d(Iⱼ, Iₖ) >= t;
FN pair: {Iⱼ, Iₖ} is a true matching pair, wrongly declared non-matching, d(Iⱼ, Iₖ) >= t.
12. Matching System Performance
True Positive Rate (Recall): out of all true matching pairs, how many are retrieved with distance < t.
False Positive Rate: out of all true non-matching pairs, how many are wrongly declared matching.
TPR = tp / (tp + fn)
FPR = fp / (fp + tn)
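The four counts and the two rates follow directly from the pair distances; a minimal Python sketch, assuming `d_match` and `d_nonmatch` hold the distances of the true matching and true non-matching pairs.

```python
def tpr_fpr(d_match, d_nonmatch, t):
    """Declare a pair matching when its distance < t, then count outcomes."""
    tp = sum(1 for d in d_match if d < t)        # true matches retrieved
    fn = len(d_match) - tp                       # true matches missed
    fp = sum(1 for d in d_nonmatch if d < t)     # false alarms
    tn = len(d_nonmatch) - fp
    return tp / (tp + fn), fp / (fp + tn)
```

Sweeping t from the minimum to the maximum distance traces out the ROC curve.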
13. TPR-FPR
Definition (from the actual-value point of view):
TP rate = TP/(TP+FN)
FP rate = FP/(FP+TN)
15. ROC curve (2)
Which method (A or B) is better? Compute the ROC area: the area under the ROC curve (AUC).
16. Precision, Recall, F-measure
Precision = TP/(TP + FP)
Recall = TP/(TP + FN)
F-measure = 2·(precision·recall)/(precision + recall)
Precision is the probability that a retrieved document is relevant; Recall is the probability that a relevant document is retrieved in a search.
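The three measures above computed from the outcome counts at a single threshold; an illustrative Python sketch (the function name is ours).

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall, and F-measure from TP/FP/FN counts at one threshold."""
    precision = tp / (tp + fp)   # retrieved pairs that are truly matching
    recall = tp / (tp + fn)      # truly matching pairs that were retrieved
    f = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f
```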
17. Matlab Implementation
We will compute all image pair distances D(j,k). How do we compute the TPR-FPR plot? Understand that TPR and FPR are actually functions of the threshold t; just parameterize TPR(t) and FPR(t) and obtain operating points at meaningful thresholds to generate the plot.
Matlab implementation: [tp, fp, tn, fn] = getPrecisionRecall()
% d0: distances of true matching pairs; d1: distances of true non-matching pairs
d_min = min(min(d0), min(d1));
d_max = max(max(d0), max(d1));
delta = (d_max - d_min) / npt;
for k = 1:npt
    thres = d_min + (k-1)*delta;
    tp(k) = length(find(d0 <= thres));
    fp(k) = length(find(d1 <= thres));
    tn(k) = length(find(d1 > thres));
    fn(k) = length(find(d0 > thres));
end
if dbg
    figure(22); grid on; hold on;
    plot(fp./(tn+fp), tp./(tp+fn), '.-r', 'DisplayName', 'tpr-fpr');
    legend();
end
18. TPR-FPR
Image matching performance is characterized by the function TPR(FPR). For the final retrieval set we want high precision; for a short candidate list (to be re-ranked or verified), high recall.
20. Why Aggregation?
What do (local) interest point features bring us? Scale and rotation invariance, in the form of an n_k x d matrix:
S_k | [x_k, y_k, θ_k, σ_k, h1, h2, …, h128], k = 1..n
with uncertainty in the number of detected features n_k at query time, and any permutation of the rows being the same representation.
Problems:
The feature has state (a variable-size set), so we are not able to draw decision boundaries directly
Not directly indexable/hashable
Typically very high dimensionality
21. Decision Boundary in Matching
Can we have a decision boundary function for an interest-point-based representation?
22. Curse of Dimensionality in Retrieval
What will feature dimensionality do to retrieval efficiency? Look at retrieval covering 99% of the locality per dimension, and plot the fraction of the total volume covered.
Matlab: showDimensionCurse.m
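The effect that showDimensionCurse.m illustrates can be reproduced in one line: a neighborhood covering 99% of each dimension covers a vanishing fraction of the total volume as dimensionality grows. A Python sketch (`volume_fraction` is our own name):

```python
def volume_fraction(per_dim=0.99, dims=128):
    """Fraction of the unit hypercube covered by a box spanning per_dim of each axis."""
    return per_dim ** dims

# At SIFT's 128 dimensions, 99% coverage per axis leaves only about 28% of the volume.
```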
23. Aggregation - 30,000 ft view
Bag of Words: compute k centroids in feature space, called visual words, and compute a histogram of word occurrences; a k x 1 feature, hard assignment. (Figure: example histogram with bins 0.5, 0.4, 0.05, 0.05.)
VLAD: compute k centroids in feature space and compute the aggregated differences (residuals) w.r.t. the centroids; a k x d feature, hard assignment to the nearest centroid.
Fisher Vector: compute a Gaussian Mixture Model (GMM) with 2nd-order info and compute the aggregated feature w.r.t. the mean and covariance of the GMM, with soft assignment by posterior probabilities; a 2 x k x d feature.
AKULA: adaptive centroids and feature count; improved with covariance?
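The BoW and VLAD aggregations above can be sketched compactly. An illustrative pure-Python version with hard nearest-centroid assignment; the function names are ours, and a real system would use k-means-trained centroids and optimized linear algebra.

```python
import math

def nearest(x, centroids):
    """Index of the nearest centroid (hard assignment)."""
    return min(range(len(centroids)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(x, centroids[i])))

def bow(descriptors, centroids):
    """Bag of Words: k x 1 normalized histogram of visual-word occurrences."""
    hist = [0.0] * len(centroids)
    for x in descriptors:
        hist[nearest(x, centroids)] += 1
    n = len(descriptors)
    return [h / n for h in hist]

def vlad(descriptors, centroids):
    """VLAD: aggregate residuals w.r.t. each centroid, giving a k x d vector."""
    d = len(centroids[0])
    agg = [[0.0] * d for _ in centroids]
    for x in descriptors:
        i = nearest(x, centroids)
        for j in range(d):
            agg[i][j] += x[j] - centroids[i][j]
    flat = [v for row in agg for v in row]
    norm = math.sqrt(sum(v * v for v in flat)) or 1.0
    return [v / norm for v in flat]   # L2-normalize the flattened vector
```

Note how BoW keeps only the counts, while VLAD retains where the descriptors fall relative to each centroid.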
24. Visual Key Words: main idea
Extract some local features from a number of images, e.g. SIFT descriptors: each point in descriptor space is 128-dimensional. (Slide credit: D. Nister)
25.-27. Visual words: main idea (figure-only slides; slide credit: D. Nister)
28. Visual Key Words
Each point is a local descriptor, e.g. a SIFT vector. (Slide credit: D. Nister)
30. Visual words
Example: each group of patches belongs to the same visual word. (Figure from Sivic & Zisserman, ICCV 2003)
31. Visual words
More recently used for describing scenes and objects for the sake of indexing or classification (Sivic & Zisserman 2003; Csurka, Bray, Dance, & Fan 2004; many others). (Source credit: K. Grauman, B. Leibe)
32. Object - Bag of 'words'
(ICCV 2005 short course, L. Fei-Fei)
34. Bags of visual words
Summarize the entire image based on its distribution (histogram) of word occurrences, analogous to the bag-of-words representation commonly used for documents. (Image credit: Fei-Fei Li)
36. BoW Distance Metrics
Rank images by the normalized scalar product between their (possibly weighted) occurrence counts: a nearest-neighbor search for similar images, e.g. between a database histogram d_j = [5 1 1 0] and a query histogram q = [1 8 1 4].
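The normalized scalar product is plain cosine similarity; a minimal Python sketch applied to the example count vectors above.

```python
import math

def cosine_sim(d, q):
    """Normalized scalar product between two occurrence-count vectors."""
    dot = sum(a * b for a, b in zip(d, q))
    nd = math.sqrt(sum(a * a for a in d))
    nq = math.sqrt(sum(b * b for b in q))
    return dot / (nd * nq)

sim = cosine_sim([5, 1, 1, 0], [1, 8, 1, 4])  # the slide's d_j and q
```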
37. Inverted List
Image retrieval via an inverted list: each visual word number maps to the list of image numbers in which it occurs. (Image credit: A. Zisserman)
When will this give us a significant gain in efficiency?
38. Indexing local features: inverted file index
For text documents, an efficient way to find all pages on which a word occurs is to use an index. We want to find all images in which a feature occurs, so we index each feature by the image it appears in, and also keep the number of occurrences. (Source credit: K. Grauman, B. Leibe)
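A toy inverted file index in Python (illustrative; identifiers are ours). The efficiency gain comes from touching only the posting lists of the query's visual words instead of scanning every image — significant when each word occurs in few images.

```python
from collections import defaultdict

def build_inverted_index(db):
    """db: {image_id: list of visual-word ids}. Returns word -> {image_id: count}."""
    index = defaultdict(dict)
    for img, words in db.items():
        for w in words:
            index[w][img] = index[w].get(img, 0) + 1
    return index

def query(index, words):
    """Score every image sharing at least one visual word with the query;
    only the posting lists of the query words are visited."""
    scores = defaultdict(int)
    for w in words:
        for img, cnt in index.get(w, {}).items():
            scores[img] += cnt
    return sorted(scores.items(), key=lambda kv: -kv[1])
```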
39. TF-IDF Weighting
Term Frequency - Inverse Document Frequency: describe an image by the frequency of each visual word within it, and down-weight words that appear often in the database (the standard weighting for text retrieval):
t_i = (n_id / n_d) · log(N / n_i)
where n_id is the number of occurrences of word i in document d, n_d the number of words in document d, n_i the number of occurrences of word i in the whole database, and N the total number of words in the database.
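A direct transcription of the weighting as a Python sketch, using the quantities defined above.

```python
import math

def tf_idf(n_id, n_d, n_i, N):
    """t_i = (n_id / n_d) * log(N / n_i):
    n_id: occurrences of word i in document d; n_d: words in document d;
    n_i: occurrences of word i in the whole database; N: total words in database."""
    return (n_id / n_d) * math.log(N / n_i)
```

A word that dominates the database (n_i close to N) is driven toward zero weight, however frequent it is in the document.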
40. BoW Use Case with Spatial Localization
Collecting words within a query region: pull out only the SIFT descriptors whose positions are within the query polygon.
51. Vocabulary Tree: Performance
Evaluated on large databases: indexing with up to 1M images, and online recognition for a database of 50,000 CD covers with retrieval in ~1s. It is found experimentally that large vocabularies can be beneficial for recognition. [Nister & Stewenius, CVPR'06]
52. Performance w.r.t. vocabulary size
Larger vocabularies can be advantageous… but what happens if the vocabulary is too large? (Figure: performance vs. visual word vocabulary size.)
53. Bags of words: pros and cons
Good:
+ flexible to geometry / deformations / viewpoint
+ compact summary of image content
+ provides a vector representation for sets
+ the inverted-list implementation offers a practical solution against large repositories
Bad:
- loss of information at quantization and histogram generation
- the basic model ignores geometry: must verify afterwards, or encode it via features
- background and foreground are mixed when the bag covers the whole image
- interest points or sampling: no guarantee of capturing object-level parts
(Source credit: K. Grauman, B. Leibe)
54. Can we improve BoW?
• E.g. why isn't our Bag of Words classifier at 90% instead of 70%?
• Training data
– Huge issue, but not necessarily a variable you can manipulate.
• Learning method
– BoW sits on top of any feature scheme.
• Representation
– Are we losing too much information in the process?
55. Standard Kmeans Bag of Words
BoW revisited
http://www.cs.utexas.edu/~grauman/courses/fall2009/papers/bag_of_visual_words.pdf
56. Motivation
Bag of Visual Words only counts the number of local descriptors assigned to each Voronoi region.
Why not include other statistics/information?
http://www.cs.utexas.edu/~grauman/courses/fall2009/papers/bag_of_visual_words.pdf
57. Spatial Pooling
We already looked at the Spatial Pyramid/Pooling:
level 0: 1x1, level 1: 2x2, level 2: 4x4
Key takeaway: multiple assignment? soft assignment?
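The pyramid pooling above can be sketched in plain Python (an illustrative toy; `spatial_pyramid_bow` is a hypothetical helper that assumes keypoint positions are normalized to [0, 1)). Per-cell BoW histograms from the 1x1, 2x2 and 4x4 grids are concatenated into one vector:

```python
def spatial_pyramid_bow(positions, word_ids, vocab_size, levels=(1, 2, 4)):
    """Concatenate per-cell BoW histograms over 1x1, 2x2 and 4x4 grids."""
    hist = []
    for g in levels:                         # grid resolution at this level
        cells = [[0] * vocab_size for _ in range(g * g)]
        for (x, y), w in zip(positions, word_ids):
            cx = min(int(x * g), g - 1)      # clamp boundary points into grid
            cy = min(int(y * g), g - 1)
            cells[cx * g + cy][w] += 1
        for c in cells:
            hist.extend(c)
    return hist

# two keypoints in opposite corners, vocabulary of 2 words
h = spatial_pyramid_bow([(0.1, 0.1), (0.9, 0.9)], [0, 1], vocab_size=2)
# length = (1 + 4 + 16) * vocab_size = 42
```

At level 0 the two points fall in the same cell; at the finer levels they separate, which is what gives the pyramid its loose spatial sensitivity.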
59. Motivation
Bag of Visual Words only counts the number of local descriptors assigned to each Voronoi region.
Why not include other statistics? For instance:
• mean of local descriptors
• (co)variance of local descriptors
http://www.cs.utexas.edu/~grauman/courses/fall2009/papers/bag_of_visual_words.pdf
61. Simple case: Soft Assignment
Called "Kernel codebook encoding" by Chatfield et al. 2011: cast a weighted vote into the most similar clusters.
This is fast and easy to implement (try it for Project 3!), but it does have some downsides for image retrieval – the inverted file index becomes less sparse.
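A minimal sketch of kernel-codebook soft assignment (illustrative Python; `soft_assign` is a hypothetical helper using a Gaussian kernel on squared distance with an assumed bandwidth `sigma`):

```python
import math

def soft_assign(descriptor, codebook, sigma=1.0):
    """Kernel codebook weights: weight_i proportional to exp(-d_i^2 / (2 sigma^2))."""
    weights = []
    for center in codebook:
        d2 = sum((a - b) ** 2 for a, b in zip(descriptor, center))
        weights.append(math.exp(-d2 / (2 * sigma ** 2)))
    s = sum(weights)
    return [w / s for w in weights]       # normalize so the votes sum to 1

# a descriptor sitting on the first codeword votes mostly, not only, for it
w = soft_assign([0.0, 0.0], [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]])
```

This is what makes the inverted file less sparse: every descriptor now contributes a nonzero (if tiny) weight to several words instead of exactly one.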
62. A first example: the VLAD
Given a codebook C = {c_1, …, c_K}, e.g. learned with K-means, and a set of local descriptors {x}:
• assign: each descriptor to its nearest codeword, NN(x) = argmin_i ||x - c_i||
• compute: v_i = sum of (x - c_i) over all x with NN(x) = i
• concatenate the v_i's + normalize
Jégou, Douze, Schmid and Pérez, “Aggregating local descriptors into a compact image representation”, CVPR’10.
[Diagram: ① assign descriptors to cells 1…5; ② compute residuals x - c_i; ③ v_i = sum of residuals x - c_i for cell i]
63. A first example: the VLAD
A graphical representation of the aggregated residual vectors v_i.
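The three steps can be sketched in plain Python (an illustrative hard-assignment VLAD with a final L2 normalization; `vlad_encode` is a hypothetical helper, separate from the VL_FEAT implementation that follows):

```python
def vlad_encode(descriptors, codebook):
    """Hard-assignment VLAD: sum residuals (x - c_i) per codeword, then L2-normalize."""
    k, d = len(codebook), len(codebook[0])
    v = [[0.0] * d for _ in range(k)]
    for x in descriptors:
        # step 1: assign x to its nearest codeword
        i = min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(x, codebook[j])))
        # steps 2-3: accumulate the residual x - c_i into cell i
        for t in range(d):
            v[i][t] += x[t] - codebook[i][t]
    # concatenate and L2-normalize
    flat = [val for row in v for val in row]
    norm = sum(val * val for val in flat) ** 0.5 or 1.0
    return [val / norm for val in flat]

vec = vlad_encode([[1.0, 0.0], [0.0, 1.0], [9.0, 10.0]],
                  [[0.0, 0.0], [10.0, 10.0]])  # K=2, d=2 -> 4-dim code
```

Unlike the BoW histogram, the code keeps the direction of the descriptors relative to their codeword, not just their count.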
64. VL_FEAT Implementation
Matlab:

function [vc] = vladSiftEncoding(sift, codebook)
dbg = 1;
if dbg
    if (0)  % init VL_FEAT, only need to do once
        run('../../tools/vlfeat-0.9.20/toolbox/vl_setup.m');
    end
    im = imread('../pics/flarsheim-2.jpg');
    [f, sift] = vl_sift(single(rgb2gray(im)));
    sift = single(sift');
    [indx, codebook] = kmeans(sift, 16);
    % make sift # smaller
    sift = sift(1:800, :);
end
[n, kd] = size(sift);
[m, kd] = size(codebook);
% compute assignment
dist = pdist2(codebook, sift);
mdist = mean(mean(dist));
% normalize the heat kernel s.t. mean dist is mapped to 0.5
a = -log(0.5) / mdist;
indx = exp(-a * dist);
vc = vl_vlad(sift', codebook', indx);
if dbg
    figure(41); colormap(gray);
    subplot(2, 2, 1); imshow(im); title('image');
    subplot(2, 2, 2); imagesc(dist); title('m x n distance');
    subplot(2, 2, 3); imagesc(indx); title('m x n assignment');
    subplot(2, 2, 4); imagesc(reshape(vc, [m, kd])); title('vlad code');
end
65. VLAD Code
What are the tweaks?
Codebook design
Soft assignment options
66. References
Vocabulary Tree:
David Nistér, Henrik Stewénius: Scalable Recognition with a Vocabulary Tree. CVPR (2) 2006: 2161-2168
VLAD:
Hervé Jégou, Matthijs Douze, Cordelia Schmid: Improving Bag-of-Features for Large Scale Image Search. International Journal of Computer Vision 87(3): 316-336 (2010)
Fisher Vector:
Florent Perronnin, Jorge Sánchez, Thomas Mensink: Improving the Fisher Kernel for Large-Scale Image Classification. ECCV (4) 2010: 143-156
AKULA:
Abhishek Nagar, Zhu Li, Gaurav Srivastava, Kyungmo Park: AKULA - Adaptive Cluster Aggregation for Visual Search. DCC 2014: 13-22
67. Lec 07 Summary
Image Retrieval System Metrics
What are true positives, false positives, true negatives, false negatives?
What are precision, recall, F-score?
Why Aggregation?
Decision boundary
Indexing/Hashing
Bag of Words
A histogram with visual words as bins
Variations: hierarchical assignment with a vocabulary tree
Implementation: Inverted List
VLAD
Richer encoding of the aggregated information
Soft assignment of features to codebook bins
Vectorized representation – no need for an inverted list
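The retrieval metrics recalled above can be made concrete in a few lines (illustrative Python; `retrieval_metrics` is a hypothetical helper treating one query's results and ground truth as sets of image ids):

```python
def retrieval_metrics(retrieved, relevant):
    """Precision, recall and F-score for one query.

    retrieved: set of returned image ids; relevant: set of ground-truth ids.
    """
    tp = len(retrieved & relevant)                    # true positives
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    f = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f

# 4 images returned, 2 of them relevant, 1 relevant image missed
p, r, f = retrieval_metrics({1, 2, 3, 4}, {2, 4, 5})  # p = 0.5, r = 2/3
```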