The document summarizes techniques for finding correspondences between images, including local feature detection, descriptors, and matching. It discusses how local features can be used for large scale image retrieval applications. Local feature detectors aim to find repeatable keypoints that are robust to changes in viewpoint and occlusion. Descriptors are used to represent image patches around keypoints to enable matching. Techniques like bag-of-words models and geometric verification allow matching features across large databases to perform image search and retrieval.
The document discusses content-based image retrieval. It begins with an overview of the problem of using a query image to retrieve similar images from a large dataset. Common techniques discussed include using SIFT features with bag-of-words models or convolutional neural network (CNN) features. The document outlines the classic SIFT retrieval pipeline and techniques for using features from pre-trained CNNs, such as max-pooling features from convolutional layers or encoding them with VLAD. It also discusses learning image representations specifically for retrieval using methods like the triplet loss to learn an embedding space that clusters similar images. The state-of-the-art methods achieve the best performance by learning global or regional image representations from CNNs trained on large, automatically generated datasets.
https://telecombcn-dl.github.io/2018-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
To keep up with recent research trends – focusing on Deep Learning – Hiroshi Fukui
This document summarizes key developments in deep learning for object detection from 2012 onwards. It begins with a timeline showing that 2012 was a turning point, as deep learning achieved record-breaking results in image classification. The document then provides overviews of 250+ contributions relating to object detection frameworks, fundamental problems addressed, evaluation benchmarks and metrics, and state-of-the-art performance. Promising future research directions are also identified.
ClearGrasp is a method for estimating the 3D geometry of transparent objects from a single RGB-D image using a CNN architecture. It creates both synthetic and real datasets of transparent objects with surface normals, segmentation masks and depth information. The CNN takes an RGB image as input and outputs the surface normals, segmentation masks and occlusion boundaries. A global optimization method is then used to estimate depth from these outputs. The method achieves accurate 3D shape estimation and enables improved robot grasping of transparent objects compared to without using ClearGrasp.
June 13, 2019, SSII2019 Organized Session: Multimodal 4D sensing. The current state of SLAM technology for end users. Speaker: Tomoyuki Mukasa (Research Scientist, Rakuten Institute of Technology)
https://confit.atlas.jp/guide/event/ssii2019/static/organized#OS2
NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis – taeseon ryu
This paper presents a 3D-aware model. With StyleGAN, when you want to edit a single feature, you can find the latent vector corresponding to the input and modify that latent vector to change, for example, the feature corresponding to the mouth. Adopting this concept directly, the GANSpace paper attempted to edit even spatial information given an input. Looking at the results, rotation appears reasonably well learned, but the output is sometimes perceived as a different person. This problem is described as a lack of disentanglement: instead of changing only the desired feature, other features change as well. This paper was created to make the model understand 3D more effectively and efficiently.
Transformer Architectures in Vision
[2018 ICML] Image Transformer
[2019 CVPR] Video Action Transformer Network
[2020 ECCV] End-to-End Object Detection with Transformers
[2021 ICLR] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
PCA-SIFT: A More Distinctive Representation for Local Image Descriptors – wolf
PCA-SIFT is a modification of SIFT that uses principal component analysis (PCA) to build more distinctive local image descriptors. It constructs a projection matrix from a large set of image patches, then projects each keypoint descriptor through this matrix to a compact vector of the top n principal components. This provides a more discriminative representation than SIFT while reducing descriptor dimensionality, leading to improved matching accuracy and efficiency. Evaluation on controlled transformation and graffiti datasets shows PCA-SIFT achieves higher recall rates at equivalent or lower false positive rates compared to SIFT.
This document discusses one-shot learning techniques for object recognition from few examples. It introduces the concepts of embedding spaces and similarity metrics for measuring distances between objects. Specific deep learning models are described, including Siamese networks, triplet networks, DeepFace, and FaceNet. Siamese networks aim to learn a similarity function using a contrastive loss over input pairs, while triplet networks employ a triplet loss to optimize relative distances between anchor, positive, and negative examples. DeepFace and FaceNet are state-of-the-art face recognition systems that use deep convolutional networks trained with triplet losses to learn embeddings that achieve human-level accuracy on benchmark face datasets.
1) The document discusses using data in deep learning models, including understanding the limitations of data and how it is acquired.
2) It describes techniques for image matching using multi-view geometry, including finding corresponding points across images and triangulating them to determine camera pose.
3) Recent works aim to improve localization of objects in images using multiple instance learning approaches that can learn without full supervision or through more stable optimization methods like linearizing sampling operations.
Recent Progress on Object Detection_20170331 – Jihong Kang
This slide provides a brief summary of recent progress on object detection using deep learning.
The concepts behind selected previous works (R-CNN series/YOLO/SSD) and 6 recent papers (uploaded to arXiv between Dec 2016 and Mar 2017) are introduced in this slide.
Most of these papers focus on improving small-object detection performance.
This document proposes and evaluates several deep learning models for unsupervised monocular depth estimation. It begins with background on depth estimation methods and a literature review of recent work. Four depth estimation architectures are then described: EfficientNet-B7, EfficientNet-B3, DenseNet121, and DenseNet161. These models use an encoder-decoder structure with skip connections. An unsupervised loss function is adopted that combines appearance matching, disparity smoothness, and left-right consistency losses. The models are trained on the KITTI dataset and evaluated using standard KITTI metrics, showing improved performance over baseline methods using less training data and lower input resolution.
https://telecombcn-dl.github.io/2017-dlcv/
3-d interpretation from single 2-d image IV – Yu Huang
This document summarizes several methods for monocular 3D object detection from a single 2D image for autonomous driving applications. It outlines methods that use pseudo-LiDAR representations, monocular camera space cubification with an auto-encoder, utilizing ground plane priors, predicting categorical depth distributions, dynamic message propagation conditioned on depth, and utilizing geometric constraints. The methods aim to overcome challenges of monocular 3D detection by leveraging techniques such as depth estimation, 3D feature representation learning, and integrating contextual and depth cues.
3-d interpretation from single 2-d image III – Yu Huang
This document summarizes several papers related to monocular 3D object detection for autonomous driving. The first paper proposes MoVi-3D, a single-stage architecture that leverages virtual views to reduce visual appearance variability from objects at different distances, enabling detection across depths. The second paper describes RTM3D, which predicts object keypoints and uses geometric constraints to recover 3D bounding boxes in real-time. The third paper decouples detection into structured polygon estimation and height-guided depth estimation. It predicts 2D object surfaces and uses object height to estimate depth.
This paper aims at accurate 3D hand pose estimation from a single depth map. 3D hand pose estimation is a key technology for realizing applications such as HCI and AR. Many researchers have proposed methods to improve accuracy, but accuracy has remained limited by the similar appearance of fingers, occlusions, and the complexity of diverse finger motions. To overcome the limitations of existing methods, this paper changes both the input and output representations they use. Unlike most prior methods, which take a 2D depth image as input and directly regress the 3D coordinates of hand joints, the proposed model takes a 3D voxelized depth map as input and outputs 3D heatmaps. An encoder-decoder 3D CNN is used for this, and thanks to the changed input and output representations, the proposed model achieves the best performance on three widely used 3D hand pose estimation datasets and one 3D human pose estimation dataset. It also won the HANDS 2017 challenge held at ICCV 2017.
Video Stitching using Improved RANSAC and SIFT – IRJET Journal
1. The document discusses techniques for stitching multiple video frames into a panoramic video using Scale-Invariant Feature Transform (SIFT) and an improved RANSAC algorithm.
2. Key points and feature descriptors are extracted from frames using SIFT to find correspondences between frames. The improved RANSAC algorithm is used to estimate homography matrices between frames and filter outlier matches.
3. Frames are blended together to compensate for exposure differences and misalignments before being mapped to a reference plane to create the panoramic video mosaic. The algorithm aims to produce a high quality panoramic video in real-time.
Semi-supervised concept detection by learning the structure of similarity graphs – Symeon Papadopoulos
This document proposes a semi-supervised concept detection approach based on graph structure features (GSF) extracted from image similarity graphs. GSF represents images as vectors based on eigenvectors of the graph Laplacian. Two incremental learning schemes are developed to address computational issues. Experiments on synthetic and MIR-Flickr datasets show the approach achieves performance comparable or better than state-of-the-art methods, and benefits from adding unlabeled data. The approach provides an efficient and scalable solution for concept detection in large multimedia collections.
This document summarizes deep learning techniques for 3D point clouds. It discusses methods for 3D shape classification, object detection and tracking, and segmentation. For classification, projection-based and point-based networks are examined. Point-based networks include MLP, graph-based, and convolution networks. Object detection methods include region proposal-based and single shot detection. Segmentation explores semantic, instance, and part segmentation using point-based networks.
An Assessment of Image Matching Algorithms in Depth Estimation – CSCJournals
Computer vision is often used with mobile robots for feature tracking, landmark sensing, and obstacle detection. Almost all high-end robotics systems are now equipped with pairs of cameras arranged to provide depth perception. In stereo vision applications, the disparity between the stereo images allows depth estimation within a scene. Detecting conjugate pairs in stereo images is a challenging problem known as the correspondence problem. The goal of this research is to assess the performance of SIFT, MSER, and SURF, the well-known matching algorithms, in solving the correspondence problem and then in estimating the depth within the scene. The results of each algorithm are evaluated and presented. The conclusions and recommendations for future work lead towards the improvement of these powerful algorithms to achieve a higher level of efficiency within the scope of their performance.
3-d interpretation from single 2-d image V – Yu Huang
The document outlines several approaches for monocular 3D object detection from a single 2D image for autonomous driving applications. It summarizes MonoRUn, which uses self-supervised dense correspondences and geometry along with uncertainty propagation. It also summarizes M3DSSD, which uses feature alignment and asymmetric non-local attention in a single-stage detector. Additionally, it discusses analyzing and addressing localization errors, integrating differentiable NMS into training, and a flexible framework that decouples and adapts approaches for truncated vs normal objects.
Real-time large scale dense RGB-D SLAM with volumetric fusion extends KinectFusion to larger scales. It represents the volumetric reconstruction as a rolling buffer that translates as the camera moves. It estimates camera pose through combined geometric and photometric constraints. It closes loops by non-rigidly deforming the map with constraints from loop closures and jointly optimizes the camera poses and map. Evaluation shows it produces large, globally consistent, real-time dense reconstructions.
The document discusses content-based image retrieval and various techniques used for it. It begins by defining content-based image retrieval as taking a query image and ranking images in a large dataset based on how similar they are to the query. It then covers classic pipelines using SIFT features, using off-the-shelf CNN features, and learning representations specifically for retrieval. Methods discussed include spatial pooling of CNN activations, region pooling like R-MAC, and learning embeddings or features through triplet loss or diffusion-based ranking refinement. The goal is to learn representations from data that effectively capture semantic similarity for retrieval tasks.
This document describes a project to implement real-time facial recognition using OpenCV and Python. The project uses a laptop's webcam to capture video frames and detect and recognize faces in each frame. It trains an image dataset with face images and IDs then detects faces in each new video frame. It predicts faces by comparing features to the training data and labels matches based on a confidence level threshold. The document outlines the use of Haar cascade classifiers, LBPH algorithms, and OpenCV functions to complete the facial recognition process in real-time on new video frames from the webcam.
Large Scale Image Retrieval 2022.pdf
1. Large Scale Image Retrieval and Specific Object Search
Ondra Chum
Center for Machine Perception
Czech Technical University in Prague
2. Outline
• The correspondence problem
– Local features
– Descriptors
– Matching
– Geometry
• Retrieval with local features
– Bag of Words
– Geometry in image retrieval
– Beyond visual nearest neighbour search
• Image retrieval with CNNs
– Efficient network training
– Day / Night retrieval
4. The Problem
Given a pair of images, find corresponding pixels.
Semantic correspondence: NOT in this lecture.
Applications: image stitching, 3D reconstruction, augmented reality, localization / camera position.
5. Finding correspondences is not easy
due to large viewpoint change (including scale) => the wide-baseline stereo problem.
Applications: pose estimation, 3D reconstruction, location recognition.
6. Finding correspondences is not easy
due to large viewpoint change (including scale) => the wide-baseline stereo problem.
7. Finding correspondences is not easy
due to large viewpoint change (including scale) => the wide-baseline stereo problem.
Applications: location recognition, summarization of image collections.
8. Finding correspondences is not easy
due to large time difference => the temporal-baseline stereo problem.
Applications: historical reconstruction, location recognition, photographer recognition, camera type recognition.
11. Local Features
aka feature points, key points, anchor points, distinguished regions, …
• Repeatable features
• Feature descriptor: patch to a vector
• Similar features have similar descriptors – nearest neighbour search
• Retrieval – matching millions of images at the same time
• Detect features in images independently, local = robust to occlusions
13. Local (Handcrafted) Features
1. Enumerate all regions / level sets
2. Compute responses / stability
3. Local Non-Maxima Suppression
Corners: Harris [Harris'88], Susan [Smith'97], FAST/ORB [Rosten'06][Rublee'11]
Saddle points: Hessian [Lindeberg'91], SADDLE [Aldana'16]
Blobs: Hessian, DoG [Lowe'04], MSER [Matas'02], Tuytelaars regions
Simple idea – a distinguished feature should be different (at least) from all its immediate neighbourhoods; the same enumerate-score-suppress pipeline is commonly used for deep features.
14. Deep Local Features
DELF – classification loss, landmark labelled images
[Noh, Araujo, Sim, Weyand, Han: Large-scale image retrieval with attentive deep local features. CVPR’17]
HOW – contrastive loss, image-level supervision from retrieval-SfM 3D reconstruction
[Tolias, Jenicek, Chum: Learning and aggregating deep local descriptors for instance-level recognition ECCV’20]
D2-Net – point correspondence supervision from 3D
[Dusmanu et al.: D2-net: A trainable CNN for joint detection and description of local features. CVPR’19]
R2D2 – point correspondence supervision from optical flow
[Revaud et.al., R2D2: Reliable and Repeatable Detector and Descriptor, NeurIPS 2019]
SuperPoint – synthetic images, augmentations
[DeTone, Malisiewicz, Rabinovich: SuperPoint: Self-supervised interest point detection and description, CVPRW’18]
R2D2 – Revaud 2019
DELF – Noh 2017
15. Local Features from CNN Activations
Simeoni, Avrithis, Chum: Local Features and Visual Words Emerge in Activations, CVPR 2019
Convolutional layers → activation tensor → activation channel (output of a detector)
• Treat the activation channel as an input to a handcrafted feature detector (MSER)
• Use the channel id as a descriptor (visual word)
18. Affine Shape with CNNs
Mishkin, Radenović, Matas:
Repeatability Is Not Enough: Learning Affine Regions via Discriminability, ECCV 2018
AffNet
19. Descriptors of Local Features
Direct description of a measurement region: e.g. moments
(figure: a local feature and its measurement region)
20. Descriptors of Local Features
Normalize the region (local feature + measurement region) to a canonical form first, then compute a histogram of gradients: (root)SIFT
21. Descriptors of Local Features
Yurun Tian, Bin Fan, and Fuchao Wu: L2-Net: Deep learning of discriminative patch descriptor in Euclidean space. CVPR 2017.
Anastasiya Mishchuk, Dmytro Mishkin, Filip Radenovic, Jiri Matas: Working hard to know your neighbor's margins: Local descriptor learning loss, NIPS 2017
23. Toy example for illustration: matching with OpenCV SIFT
Try yourself: https://github.com/ducha-aiki/matching-strategies-comparison
24. Toy example for illustration: matching with OpenCV SIFT
Shown: recovered 1st-to-2nd image projection, ground-truth 1st-to-2nd image projection, and inlier correspondences.
25. Nearest neighbor (NN) strategy
Features from img1 are matched to features from img2.
Note that this is asymmetric and allows “many-to-one” matches.
26. Nearest neighbor (NN) strategy
OpenCV RANSAC failed to find a good model with NN matching.
27. Mutual nearest neighbor (MNN) strategy
Features from img1 are matched to features from img2.
Only cross-consistent (mutual NN) matches are retained.
28. Mutual nearest neighbor (MNN) strategy
OpenCV RANSAC failed to find a good model with MNN matching.
No one-to-many connections, but still bad.
29. Feature space outlier rejection
• How can we tell which putative matches are more reliable?
• Heuristic: compare the distance of the nearest neighbor to that of the second nearest neighbor
– The ratio will be high for features that are not distinctive
– A threshold of 0.8 provides good separation
David Lowe. "Distinctive image features from scale-invariant keypoints.” IJCV 60 (2), pp. 91-110, 2004.
30. Second nearest neighbor ratio (SNN) strategy
Features from img1 are matched to features from img2.
– We look for the 2 nearest neighbors of each descriptor.
– If both are too similar (1stNN/2ndNN ratio > 0.8) → discard.
– If the 1st NN is much closer (1stNN/2ndNN ratio ≤ 0.8) → keep.
31. Second nearest neighbor ratio (SNN) strategy
With SNN matching (keep if 1stNN/2ndNN < 0.8), OpenCV RANSAC found a roughly correct model.
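To make the toy example concrete: the NN → ratio-test → RANSAC pipeline above maps directly onto the OpenCV API. A minimal sketch, assuming two placeholder image files and the 0.8 threshold from the slides:

```python
import cv2
import numpy as np

# Load the two views as grayscale images (file names are placeholders).
img1 = cv2.imread("img1.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("img2.jpg", cv2.IMREAD_GRAYSCALE)

# Detect keypoints and compute SIFT descriptors.
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# For each descriptor in img1, find its two nearest neighbours in img2.
matcher = cv2.BFMatcher(cv2.NORM_L2)
knn = matcher.knnMatch(des1, des2, k=2)

# Lowe's ratio test: keep a match only if the best neighbour is
# clearly closer than the second best.
good = [m for m, n in knn if m.distance < 0.8 * n.distance]

# Geometric verification: robustly fit a homography with RANSAC.
src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
print(f"{int(inlier_mask.sum())} inliers out of {len(good)} tentative matches")
```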
32. 1st geometrically inconsistent nearest neighbor ratio (FGINN) strategy
The SNN ratio is good, but what about symmetric or closely co-detected features? The ratio test will kill them.
Solution: take as the 2nd nearest neighbor the nearest one that is spatially far enough from the 1st.
Mishkin et al., “MODS: Fast and Robust Method for Two-View Matching”, CVIU 2015
33. SNN vs FGINN
SNN: roughly correct. FGINN: more correspondences, better geometry found.
Mishkin et al., “MODS: Fast and Robust Method for Two-View Matching”, CVIU 2015
34. Local Geometric Constraints
Idea: verify a tentative match by comparing neighboring features – do the features around the match in image 1 also match around it in image 2?
[Schmid and Mohr: Local Greyvalue Invariants for Image Retrieval. PAMI 1997]
35. Cosegmentation / Seed Growing
Start from a seed – a single strong match – and try to locally “grow” the match, at pixel or feature level.
[Ferrari, Tuytelaars, Van Gool, ECCV 2004]
[Cech, Matas, Perdoch CVPR 08]
[Cavalli, Larsson, Oswald, Sattler, Pollefeys: AdaLAM, ECCV'20]
Seeds – semantic objects:
Benbihi, Pradalier and Chum: Object-Guided Day-Night Visual Localization in Urban Scenes, ICPR'22
38. Robust Estimation: Hough vs. RANSAC
Voting:
• discretized parameter space
• votes for parameters consistent with the measurements
• more votes → higher support
+ multiple models
+ can be very fast
- memory demanding
- distances measured in the parameter space
RANSAC:
• hypothesize and verify loop
- randomized (unless you try it all)
- typically slower than voting
+ no extra memory required
+ measures distances in pixels!
42.–47. RANSAC (built up step by step over six slides)
• Select a sample of m points at random
• Calculate the model parameters that fit the data in the sample
• Calculate the error function for each data point
• Select the data that support the current hypothesis
• Repeat sampling
48. RANSAC – number of samples
k … number of samples drawn
m … minimal sample size
N … number of data points
I … number of inliers
p … confidence in the solution (.95)

k = log(1 − p) / log(1 − (I/N)^m)
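As a quick sanity check of the formula, a small helper that evaluates it (assuming the inlier ratio I/N is passed in directly):

```python
import math

def ransac_samples(p: float, inlier_ratio: float, m: int) -> int:
    """Number of samples k needed to draw an all-inlier sample with
    confidence p, given the inlier ratio I/N and minimal sample size m."""
    return math.ceil(math.log(1.0 - p) / math.log(1.0 - inlier_ratio ** m))

# e.g. a homography (m = 4) with 50% inliers and 95% confidence
print(ransac_samples(0.95, 0.5, 4))  # -> 47
```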
50. RANSAC [Fischler, Bolles '81]
In: U = {x_i}, set of data points, |U| = N
function f computes model parameters p given a sample S from U
a cost function evaluating a model on a single data point x
Out: p*, parameters of the model maximizing the cost function
k := 0
Repeat until P{better solution exists} < η (a function of C* and the number of steps k):
k := k + 1
I. Hypothesis
(1) select a random sample S_k ⊂ U of size m
(2) compute parameters p_k = f(S_k)
II. Verification
(3) compute the cost C_k by summing the per-point cost over U
(4) if C* < C_k then C* := C_k, p* := p_k
end
51. Advanced RANSAC
The same hypothesize-and-verify loop as in slide 50, extended with:
Non-uniform sampling
Error scale estimation
Potential degeneracy tests
Randomized verification
Preemptive scoring
Improving precision
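A compact NumPy sketch of the plain loop from slide 50, using the inlier count as the cost C_k and the adaptive stopping rule from slide 48; the fit and error callbacks are assumptions supplied by the caller, not part of the slides:

```python
import numpy as np

def ransac(points, fit, error, m, thresh, p=0.95, max_iter=1000, seed=0):
    """Hypothesize-and-verify loop. fit(sample) -> model parameters;
    error(model, points) -> per-point residuals. The cost is the inlier
    count; n_iter shrinks adaptively via k = log(1-p)/log(1-(I/N)^m)."""
    rng = np.random.default_rng(seed)
    n = len(points)
    best_model, best_inliers = None, np.zeros(n, dtype=bool)
    n_iter, k = max_iter, 0
    while k < n_iter:
        k += 1
        sample = points[rng.choice(n, size=m, replace=False)]  # (1) sample
        model = fit(sample)                                    # (2) hypothesis
        inliers = error(model, points) < thresh                # (3) support
        if inliers.sum() > best_inliers.sum():                 # (4) keep best
            best_model, best_inliers = model, inliers
            eps = inliers.sum() / n
            if eps < 1.0:
                n_iter = min(n_iter, int(np.ceil(
                    np.log(1 - p) / np.log(1 - eps ** m))))
            else:
                break  # all points are inliers, nothing better exists
    return best_model, best_inliers
```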
54. Image Retrieval
Find this … in a large (millions+) collection of images
• Find images of the same object
• What is this? Nearest neighbor classifier
• Where is this? Visual localization
• How did this look in the past?
• Is there anything interesting here?
56. Feature Based Retrieval
• Affine invariant features
• Efficient descriptors
• Corresponding regions in images have similar descriptors – measured by some distance in the feature space
• Images of the same object have many correspondences in common
57. Video Google
• Feature detection and description
• Vector quantization
• Bag of Words representation
• Scoring
• Verification
Sivic & Zisserman – ICCV 2003
Video Google: A Text Retrieval Approach to Object Matching in Videos
59. Feature Distance Approximation
Partition the feature space (k-means clustering).
Feature distance: 0 if the features fall in the same cell, ∞ if in different cells.
+ most of the features are not considered (infinitely distant)
+ near-by descriptors accessible instantly – storing a list of features for each cell
60. Feature Distance Approximation
Feature distance: 0 if the features fall in the same cell, ∞ if in different cells.
- quantization effects
- large (even unbounded) cells
61. Vector Quantization via k-Means
Initialize cluster centres; find the nearest cluster to each datapoint (slow, O(N k)); re-compute the cluster centres as centroids; iterate.
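A minimal NumPy sketch of these Lloyd iterations for building a small visual vocabulary; the brute-force assignment is exactly the slow O(N k) step that the later slides speed up:

```python
import numpy as np

def kmeans_vocabulary(descs, k, n_iter=20, seed=0):
    """Plain Lloyd iterations on a float32 N x D descriptor array:
    assign every descriptor to its nearest centre (the O(N k) step),
    then recompute each centre as the centroid of its cell."""
    rng = np.random.default_rng(seed)
    centres = descs[rng.choice(len(descs), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # squared distances of all descriptors to all centres: N x k
        d2 = ((descs[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        for j in range(k):
            members = descs[assign == j]
            if len(members):
                centres[j] = members.mean(axis=0)
    return centres, assign
```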
62. Bags of Words Image Representation
Images are represented by a vector / histogram of the visual words present in them, e.g. for a vocabulary {A, B, C, D}: one image → (1, 0, 0, 2), another → (0, 3, 0, 1).
Term-frequency (tf) – visual word D occurs twice in the first image.
The vectors are sparse.
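Assuming the vocabulary from the k-means sketch above, a tf histogram for one image could be built like this (L2-normalized so that scoring reduces to dot products):

```python
import numpy as np

def bow_histogram(descs, centres):
    """Quantize an image's descriptors to their nearest visual word and
    count term frequencies; the resulting histogram is sparse."""
    d2 = ((descs[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)
    tf = np.bincount(words, minlength=len(centres)).astype(np.float32)
    return tf / max(np.linalg.norm(tf), 1e-12)  # L2-norm for dot-product scoring
```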
64. Efficient Scoring
Bag-of-words representation (up to 1,000,000-D, sparse).
Database vectors, e.g. α1 = (1 0 0 2), α2 = (0 2 0 1), α3 = (1 0 0 0), …; query vector αq.
Score of database image i: s_i = α_i · αq.
65. BoW and Inverted File
For each visual word of the vocabulary {A, B, C, D, …}, the inverted file stores the list of ids of the database images containing it (e.g. lists such as 1 3 6 …, 2 4 10 …, 5 6 8 …, 6 7 7 … over database images 1–10).
66. BoW and Inverted File
At query time, only the inverted lists of the query's visual words are traversed; database images accumulate score from every shared word.
67. BoW and Inverted File
Efficient (fast)
Linear complexity (in # documents)
Can be interpreted as voting
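A toy sketch of the voting interpretation; the sparse per-image histograms are an assumed input format (dicts mapping visual word id to tf weight), and a real system would also apply idf weighting:

```python
from collections import defaultdict

def build_inverted_file(database_histograms):
    """database_histograms: list of sparse histograms, one per image,
    each a dict {visual word id: tf weight} (an assumed input format)."""
    inverted_file = defaultdict(list)
    for img_id, hist in enumerate(database_histograms):
        for word, weight in hist.items():
            inverted_file[word].append((img_id, weight))
    return inverted_file

def score(query_hist, inverted_file):
    """Visit only the inverted lists of the query's visual words; every
    posting casts a weighted vote, accumulating the dot-product score."""
    scores = defaultdict(float)
    for word, q_weight in query_hist.items():
        for img_id, weight in inverted_file[word]:
            scores[img_id] += q_weight * weight
    return sorted(scores.items(), key=lambda item: -item[1])
```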
68. Geometric Verification and Re-ranking
The top-ranked results for a query are spatially verified: verified images are kept (and the object localized), the rest are rejected.
Philbin, Chum, Isard, Sivic, Zisserman: Object retrieval with large vocabularies and fast spatial matching, CVPR'07
71. Hierarchical k-means
+ fast O(N log k)
+ incremental construction
- not so good quantization
- often imbalanced
Nistér & Stewénius: Scalable recognition with a vocabulary tree. CVPR 2006
72. Approximate k-means
+ fast O(N log k)
+ reasonable quantization
- can be inconsistent when ANN fails
Initialize cluster centres; find the approximate nearest cluster to each datapoint; re-compute the cluster centres as centroids; iterate.
Philbin, Chum, Isard, Sivic, and Zisserman – CVPR 2007: Object retrieval with large vocabularies and fast spatial matching
73. Hamming Embedding
+ good quantization
+ elegant idea
- huge memory footprint
Within each quantization cell, descriptors are binarized by random projections and compared by Hamming distance.
Jegou, Douze, and Schmid – ECCV 2008: Hamming embedding and weak geometric consistency for large scale image search
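A minimal sketch of the binary-signature idea: random projections of the descriptor, binarized against per-cell thresholds (learned as medians of training descriptors in the paper; assumed given here):

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.standard_normal((64, 128))  # 64 random projections of a 128-D descriptor

def he_signature(desc, thresholds):
    """Project the descriptor and binarize each coordinate against the
    per-cell thresholds (assumed precomputed), giving a 64-bit signature."""
    return (P @ desc > thresholds).astype(np.uint8)

def hamming_distance(a, b):
    return int(np.count_nonzero(a != b))
```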
74. Soft Assignment
(Approximate) k-means: on the database side / on the query side
Hierarchical k-means
Philbin, Chum, Isard, Sivic, and Zisserman – CVPR 2008: Lost in Quantization
Nistér & Stewénius – CVPR 2006: Scalable recognition with a vocabulary tree
75. Learning Fine Vocabularies
Fine vocabulary (16 million visual words).
Using wide-baseline stereo matches on 6 million images to learn what is similar.
Mikulik, Perdoch, Chum, and Matas: Learning a Fine Vocabulary, ECCV 2010
76. Appearance Variance of a Single Feature
Mikulik, Perdoch, Chum, Matas: Learning Vocabularies over a Fine Quantization, IJCV 2012
• over 5 million images
• almost 20k clusters of 750k images (visual word based)
• 733k successfully matched in WBS matching (raw descriptor based)
• over 111 M feature tracks established (12.3 M with 6+ features)
• 564 M features in the tracks (319.5 M in tracks of 6+ features)
http://cmp.felk.cvut.cz/~qqmikula/publications/ijcv2012/index.html
77. Short Codes – (Joint) Dimensionality Reduction
Jegou & Chum: Negative evidences and co-occurrences in image retrieval: the benefit of PCA and whitening, ECCV 2012
Radenovic, Jegou & Chum: Multiple Measurements and Joint Dimensionality Reduction for Large Scale Image Search with Short Vectors, ICMR 2015
78. Aggregating Local Descriptors
• High discriminability needed
• BoW increases the number of visual words; only assignments are recorded
Idea: use higher order statistics instead –
• small vocabulary (fast assignment)
• dense vectors (ANN search)
• high discriminability
VLAD descriptor [Jégou, Douze, Schmid and Pérez, CVPR'10]
Fisher Kernel approach [Perronnin and Dance, CVPR'07]
Often combined with dimensionality reduction by PCA – short codes.
79. Aggregating Local Descriptors
VLAD descriptor [Jégou, Douze, Schmid and Pérez, CVPR'10]
1. compute assignments to visual words
2. compute differences to the means (cluster centres)
3. sum the differences per visual word
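The three steps translate almost line for line into NumPy; a sketch that omits the PCA and normalization refinements used in practice:

```python
import numpy as np

def vlad(descs, centres):
    """VLAD aggregation: (1) assign descriptors to visual words,
    (2) compute residuals to the assigned centre, (3) sum residuals
    per word; finally flatten and L2-normalize."""
    d2 = ((descs[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    assign = d2.argmin(axis=1)
    K, D = centres.shape
    v = np.zeros((K, D), dtype=np.float32)
    for j in range(K):
        members = descs[assign == j]
        if len(members):
            v[j] = (members - centres[j]).sum(axis=0)
    v = v.ravel()
    return v / max(np.linalg.norm(v), 1e-12)
```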
80. Aggregating Local Descriptors
Fisher Kernel approach [Perronnin and Dance, CVPR'07]
• Fit a GMM to training data (SIFT): diagonal covariance matrices, whitened data
• An image is represented as the sum, over its features, of the gradients of the log-likelihood
• fixed size representation (#parameters)
Intuition: the direction in which the parameters λ of the general model should be modified to better fit the specific sample (the current image data).
87. Context expansion
• the model of the object is grown beyond the boundaries of the
initial query,
• a feature added into the model that is not inside the context is
inactive until confirmed by feature(s) from another image with
the same visual word and similar geometry.
• Once a feature is confirmed, it adds the neighbourhood around
its center to the context.
Chum, Mikulik, Perdoch, Matas: Total Recall II: Query Expansion Revisited, CVPR 2011
90. How Much Do We Need to See?
Oxford landmarks – 3 queries
100%, 50%, and 10% of the query bounding box
Context learned from the full bounding box
Context learned from 50% of the bounding box
Context learned from 10% of the bounding box
91. Effects of decreasing the query bounding-box size
Baseline: spatial verification + full bounding box.
Context QE reaches the baseline performance with only:
• 20% of the BB on the Paris dataset
• 40% of the BB on the Oxford dataset
94. Retrieval for Browsing
Query 1
Query 2
Mikulik, Chum, Matas: Image Retrieval for Online Browsing in Large Image Collections, SISAP 2013.
95. New Problem Formulation
Retrieve relevant images subject to a constraint
• Geometric
– Maximize number of relevant pixels
– Maximize scale change
– Change of viewpoint
• Other
– High photometric change (day / night)
96. New Problem Formulation
Results
• Low rank in standard similarity measure
– Geometry for verification and constraint enforcement
– Geometry in the inverted file (DAAT)
• Standard similarity measure can be 0
– Matching through a path of images (query expansion)
99. Highest Resolution Transform
Given a query and a dataset, for every pixel in the query image: find the database image with the maximum resolution depicting that pixel.
(example magnifications: 37.3×, 27.0×, 22.8×, 21.9×, 21.6×)
101. Level of Interest Transform
Given a query and a dataset, for every pixel in the query image: find the frequency with which it is photographed in detail.
(legend: detail size 0–1%, 1–3%, 3–10%)
104. Tight Coupling of Retrieval and SfM
Schoenberger, Radenovic, Chum, and Frahm: From Single Image Query to Detailed 3D Reconstruction, CVPR'15
105. Beyond Nearest Neighbour
Looking around the corner
• Zoom out – getting a context of the image
• All details – getting transition to the object details
• Sidewise crawl
108. Efficient Search with Global Descriptors
Find this … in a large collection of images.
Each image is mapped into a high-dimensional descriptor space R^k, k ~ 512 … 2048; image similarity = distance in the descriptor space.
110. CNN Descriptors for Image Retrieval
Image (w × h × 3) → convolutional layers → activation tensor (W × H × K) → MAC layer: max pooling + L2-norm → K × 1 MAC vector
w × h – image width and height
W × H – number of activations for feature map k ∈ {1 … K}
K – number of feature maps in the last convolutional layer
MAC – Maximum Activations of Convolutions
111. CNN Descriptors for Image Retrieval
Image (w × h × 3) → convolutional layers → activation tensor (W × H × K) → SPoC layer: sum pooling + L2-norm → K × 1 SPoC vector
SPoC – Sum-Pooled Convolutional features
112. CNN Descriptors for Image Retrieval
Image (w × h × 3) → convolutional layers → activation tensor (W × H × K) → GeM layer: generalized-mean pooling + L2-norm → K × 1 GeM vector
GeM – Generalized Mean: p = 1 gives average pooling, p → ∞ gives max pooling
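A minimal PyTorch sketch of a GeM layer over the activation tensor; p is fixed here for simplicity, although it can also be treated as a learnable parameter:

```python
import torch
import torch.nn.functional as F

def gem(x: torch.Tensor, p: float = 3.0, eps: float = 1e-6) -> torch.Tensor:
    """Generalized-mean pooling of an activation tensor x (B x K x H x W)
    into a B x K descriptor. p = 1 recovers average pooling (SPoC);
    large p approaches max pooling (MAC)."""
    pooled = x.clamp(min=eps).pow(p).mean(dim=(-2, -1)).pow(1.0 / p)
    return F.normalize(pooled, dim=-1)  # L2-normalization, as in the slides
```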
119. “Lots of Training Examples”
Large Internet photo collection + image annotations → training a Convolutional Neural Network (CNN).
120. “Lots of Training Examples”
• Image annotations: not accurate, expensive $$
• Manual cleaning of the training data done by researchers: very expensive $$$$
• Automated extraction of training data: very accurate, free $
121. • Image representation created from CNN activations
of a network pre-trained for classification task
[Gong et al. ECCV’14, Razavian et al. arXiv’14, Babenko et al.
ICCV’15, Kalantidis et al. arXiv’15, Tolias et al. ICLR’16]
+ Retrieval accuracy suggests generalization of CNNs
- Trained for image classification, NOT retrieval task
CNN Image Retrieval
122. (figure: visually dissimilar images that share the same class label; image from ImageNet.org)
123. CNN Image Retrieval
• CNN network re-trained using a dataset that contains
landmarks and buildings as object classes.
[Babenko et al. ECCV’14]
+ Training dataset closer to the target task
- Final metric different to the one actually optimized
- Constructing training datasets requires manual effort
124. (figure: same-class landmark images; image from [Babenko et al. ECCV'14])
125. CNN Image Retrieval
• NetVLAD: end-to-end fine-tuning for image retrieval.
Geo-tagged dataset for weakly supervised fine-tuning.
[Arandjelovic et al. CVPR’16]
+ Training dataset corresponds to the target task
+ Final metric corresponds to the one actually optimized
- Training dataset requires geo-tags
126. (figure: a NetVLAD training query from the geo-tagged dataset; camera orientation unknown)
127. CNN learns from BoW – Training Data
Input: Large unannotated dataset
1. Initial clusters created by grouping of spatially related images [Chum & Matas PAMI'10]
2. Clustered images used as queries for a retrieval-SfM pipeline [Schonberger et al. CVPR'15]
Output: Non-overlapping 3D models – 551 (134k images) for training / 162 (30k images) for validation.
Camera orientation known; number of inliers known.
128. CNN learns from BoW – Positives
1. Descriptor distance: the image with the lowest global descriptor distance is chosen (NetVLAD uses this)
2. Maximum inliers: the image with the highest number of co-observed 3D points with the query image is chosen
3. Relaxed inliers: a random image close to the query, with enough inliers and no extreme scale change, is chosen
129. CNN learns from BoW – Negatives
K-nearest neighbors of the query image are selected from all non-matching clusters, using different methods:
1. No constraint: the chosen images are often near identical.
2. At most one image per cluster: higher variability.
133. Day – Night Retrieval
Day–night training image pairs: sequences of images at day – evening – night.
Photometric normalization.
134. Contrast Limited Adaptive Histogram Equalization (CLAHE)
• Semi-local (windows)
• Linear interpolation
• Only values more frequent than the clipping limit are redistributed
(figure: original vs. global histogram equalization vs. CLAHE)
[Jenicek, Chum: No Fear of the Dark: Image Retrieval under Varying Illumination Conditions, ICCV 2019]
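CLAHE is available directly in OpenCV; a minimal usage sketch with a placeholder file name and commonly used (not paper-specific) parameter values:

```python
import cv2

# Photometric normalization of a night-time query image.
img = cv2.imread("night_query.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder path
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))  # clipping limit, windows
normalized = clahe.apply(img)
cv2.imwrite("night_query_clahe.jpg", normalized)
```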