Improving region based CNN object detector using Bayesian optimization
4. Background: Deformable Parts Model
• Strong low-level features based on histograms of oriented gradients (HOG)
• Efficient matching algorithms for deformable part-based models (pictorial structures)
• Discriminative learning with latent variables (latent SVM)
• Where to look? Everywhere (the sliding-window approach)
• mean Average Precision (mAP): 33.7% - 33.4%
P.F. Felzenszwalb et al., “Object Detection with Discriminatively Trained Part-Based Models”, PAMI 2010.
J.J. Lim et al., “Sketch Tokens: A Learned Mid-level Representation for Contour and Object Detection”, CVPR 2013.
X. Ren et al., “Histograms of Sparse Codes for Object Detection”, CVPR 2013.
5. Background: Selective search
• Alternative to exhaustive search with a sliding window.
• Starting with an over-segmentation, merge similar regions and produce region proposals.
van de Sande et al., “Segmentation as Selective Search for Object Recognition”, ICCV 2011.
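The merge step described on this slide can be sketched as a greedy loop: repeatedly merge the two most similar neighboring regions and record every region seen as a proposal. This is a minimal sketch, not the published algorithm: the region records, the integer ids, and the single scalar `mean` used as the similarity cue are assumptions made for illustration.

```python
from itertools import count


def merge_regions(regions, neighbors):
    """Greedy grouping over an initial over-segmentation (illustrative sketch).

    regions:   {int id: {"mean": float, "size": int, "bbox": (x0, y0, x1, y1)}}
    neighbors: set of frozenset({id_a, id_b}) for adjacent regions
    Returns the bounding boxes of every region seen, initial and merged.
    """
    regions = dict(regions)
    neighbors = set(neighbors)
    proposals = [r["bbox"] for r in regions.values()]
    new_id = count(max(regions) + 1)

    def similarity(pair):
        # Assumed similarity: closer mean intensity means more similar.
        a, b = tuple(pair)
        return -abs(regions[a]["mean"] - regions[b]["mean"])

    while neighbors:
        a, b = tuple(max(neighbors, key=similarity))
        ra, rb = regions.pop(a), regions.pop(b)
        size = ra["size"] + rb["size"]
        merged = {
            "mean": (ra["mean"] * ra["size"] + rb["mean"] * rb["size"]) / size,
            "size": size,
            "bbox": (min(ra["bbox"][0], rb["bbox"][0]),
                     min(ra["bbox"][1], rb["bbox"][1]),
                     max(ra["bbox"][2], rb["bbox"][2]),
                     max(ra["bbox"][3], rb["bbox"][3])),
        }
        m = next(new_id)
        regions[m] = merged
        proposals.append(merged["bbox"])

        # Rewire adjacency: anything that touched a or b now touches m.
        rewired = set()
        for pair in neighbors:
            rest = pair - {a, b}
            if len(rest) == 2:
                rewired.add(pair)
            elif len(rest) == 1:
                (other,) = rest
                rewired.add(frozenset((other, m)))
        neighbors = rewired
    return proposals


# Three regions in a row; the two dark ones merge first.
regions = {
    0: {"mean": 10.0, "size": 4, "bbox": (0, 0, 2, 2)},
    1: {"mean": 12.0, "size": 4, "bbox": (2, 0, 4, 2)},
    2: {"mean": 50.0, "size": 4, "bbox": (4, 0, 6, 2)},
}
neighbors = {frozenset((0, 1)), frozenset((1, 2))}
proposals = merge_regions(regions, neighbors)
```

Every intermediate region becomes a proposal, so the output covers several scales without scanning every window position.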
6. Deep Learning happened, again!
Krizhevsky et al., “ImageNet Classification with Deep Convolutional Neural Networks”, NIPS 2012.
ImageNet 2012: whole-image classification with 1000 categories (error rates)
Model                  Top-1 (val)  Top-5 (val)  Top-5 (test)
1 CNN                  40.7%        18.2%        -
5 CNNs                 38.1%        16.4%        16.4%
1 CNN (pre-trained)    39.0%        16.6%        -
7 CNNs (pre-trained)   36.7%        15.4%        15.3%
• Can it be used in object recognition?
• Problems:
• localization: Where is the object?
• annotation: Labeled data is scarce.
• Expensive computation for dense search.
7. R-CNN: Region proposals + CNN
                   localization       feature extraction    classification
Approach summary   selective search   deep learning (CNN)   binary linear SVM
13. R-CNN
Girshick et al., CVPR14.
[Figure: R-CNN pipeline]
• Input image
• Regions of Interest (RoIs) from a proposal method (~2k)
• Warped image regions
• Forward each region through the ConvNet
• Classify regions with SVMs
• Apply bounding-box regressors (Bbox reg)
17. What’s wrong with R-CNN?
• Ad hoc training objectives
• Fine-tune network with softmax classifier (log loss)
• Train post-hoc linear SVMs (hinge loss)
• Train post-hoc bounding-box regressions (least squares)
• Training is slow (84h), takes a lot of disk space
• Inference (detection) is slow
• 47s / image with VGG16 [Simonyan & Zisserman, ICLR15]
• Fixed by SPP-net [He et al., ECCV14]
~2000 ConvNet forward passes per image
23. SPP-net
He et al., ECCV14.
[Figure: SPP-net pipeline]
• Input image
• Forward whole image through ConvNet
• “conv5” feature map of image, with Regions of Interest (RoIs) from a proposal method
• Spatial Pyramid Pooling (SPP) layer
• Fully-connected layers (FCs)
• Classify regions with SVMs
• Apply bounding-box regressors (Bbox reg)
24. What’s good about SPP-net?
• Fixes one issue with R-CNN: makes testing fast
• Image-wise computation (shared): ConvNet
• Region-wise computation: FCs, SVMs, Bbox reg
25. What’s wrong with SPP-net?
• Inherits the rest of R-CNN’s problems
• Ad hoc training objectives
• Training is slow (25h), takes a lot of disk space
• Introduces a new problem: cannot update parameters below SPP layer during training
26. SPP-net: the main limitation
He et al., ECCV14.
[Figure labels: ConvNet, FCs, SVMs, Bbox reg]
• Trainable (3 layers)
• Frozen (13 layers)
• SPP is not differentiable
29. Fast R-CNN
• Fast test-time, like SPP-net
• One network, trained in one stage
• Higher mean average precision than R-CNN and SPP-net
33. Fast R-CNN (test time)
[Figure: Fast R-CNN test-time pipeline]
• Input image
• Forward whole image through ConvNet
• “conv5” feature map of image, with Regions of Interest (RoIs) from a proposal method
• “RoI Pooling” (single-level SPP) layer
• Fully-connected layers (FCs)
• Linear + softmax: softmax classifier
• Linear: bounding-box regressors
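The “RoI Pooling” layer named on this slide can be sketched as max pooling over a fixed grid of bins laid over each region of the shared feature map, so that every region yields the same output size regardless of its shape. This is a minimal sketch under stated assumptions: a single-channel feature map stored as nested lists, RoI coordinates already expressed in feature-map cells, and a 2x2 output grid chosen only to keep the example small.

```python
def roi_max_pool(feature_map, roi, out_h=2, out_w=2):
    """Max-pool one RoI of a single-channel feature map to out_h x out_w.

    feature_map: list of rows (H x W), one channel
    roi:         (x0, y0, x1, y1) in feature-map cells, x1 and y1 exclusive
    """
    x0, y0, x1, y1 = roi
    h, w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        # Integer bin edges; max(...) guarantees each bin has at least one row.
        r0 = y0 + (i * h) // out_h
        r1 = max(y0 + ((i + 1) * h) // out_h, r0 + 1)
        row = []
        for j in range(out_w):
            c0 = x0 + (j * w) // out_w
            c1 = max(x0 + ((j + 1) * w) // out_w, c0 + 1)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


feature_map = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
whole = roi_max_pool(feature_map, (0, 0, 4, 4))   # [[5, 7], [13, 15]]
partial = roi_max_pool(feature_map, (1, 1, 4, 3)) # [[5, 7], [9, 11]]
```

Because the ConvNet runs once per image and only this pooling step runs per region, the per-region cost stays small.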
36. Fast R-CNN (training)
[Figure labels: ConvNet, FCs, Linear + softmax, Linear, Trainable]
• Multi-task loss: log loss + smooth L1 loss
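The multi-task loss on this slide adds a classification term (log loss) and a box-regression term (smooth L1) for each region. The sketch below shows one way to write it for a single region; the weight `lam`, the convention that class 0 is background and receives no box loss, and the 4-value box encoding are assumptions for illustration.

```python
import math


def smooth_l1(x):
    """Quadratic near zero, linear elsewhere, so large errors are not squared."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def multi_task_loss(class_probs, true_class, pred_box, target_box, lam=1.0):
    """Log loss on the class plus smooth L1 on the box offsets for one RoI."""
    cls_loss = -math.log(class_probs[true_class])
    if true_class == 0:
        # Assumed convention: background regions have no box target.
        return cls_loss
    loc_loss = sum(smooth_l1(p - t) for p, t in zip(pred_box, target_box))
    return cls_loss + lam * loc_loss


loss_fg = multi_task_loss([0.1, 0.9], 1, (0.5, 0.0, 0.0, 2.0), (0.0, 0.0, 0.0, 0.0))
loss_bg = multi_task_loss([0.5, 0.5], 0, (0.5, 0.0, 0.0, 2.0), (0.0, 0.0, 0.0, 0.0))
```

Training both heads against one summed loss is what lets a single network replace the separate fine-tuning, SVM, and regressor stages listed earlier.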
37. What is missing from the previous architectures?
• All the previous architectures rely on an external region proposal algorithm.
• Proposed regions are independent of the network loss.
• No control over the quality of the regions.
41. Faster R-CNN
• Fast test-time, like Fast R-CNN
• One network, trained in one stage
• Higher mean average precision than R-CNN, SPP-net, Fast R-CNN
• Has a dedicated Region Proposal Network (RPN) trained to optimize the network loss.
51. Problem definition
• All region-based CNN object detectors depend on the quality of the region proposal algorithm.
• Although in Faster R-CNN the region proposal network is trained to minimize a multi-task loss function (log loss and bounding-box regression), in my experiments the best proposed regions are still ill-localized.
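A common way to quantify how well a proposed region is localized is intersection over union (IoU) against a ground-truth box. The slides do not say which measure was used in these experiments, so the sketch below is an assumption offered only to make "ill-localized" concrete; boxes are taken to be (x0, y0, x1, y1) corner coordinates.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))  # 1 / 7
```

A proposal can receive a high classification score while still having a low IoU, which is the gap the following slides try to close.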
57. Better regions with Bayesian Optimization
Now the goal becomes sampling a new solution $y_{n+1}$ that has a high chance of maximizing the value of $f_{n+1}$.
66. Better regions with Bayesian Optimization
Given the ability to query our CNN for region scores, we can repeat the following:
1. Given existing regions/scores
2. We fit a model
3. Introduce the chance utility function
4. Locate the maximum of the utility
5. Observe the new region score
6. Update the model.
7. Repeat step 2.
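The loop above can be sketched in a few dozen lines. This is an illustration under stated assumptions, not the method from the talk: a region is reduced to one number (for example a horizontal offset), the model is a zero-mean Gaussian process with an RBF kernel, the "chance utility" is read as the probability of improving on the best score so far, and `score_fn` is a hypothetical stand-in for querying the CNN.

```python
import math


def rbf(a, b, length=1.0):
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def gp_posterior(xs, ys, x_new, noise=1e-4):
    """Step 2: posterior mean and variance of the fitted model at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k = [rbf(a, x_new) for a in xs]
    mean = sum(ki * wi for ki, wi in zip(k, solve(K, ys)))
    var = rbf(x_new, x_new) - sum(ki * vi for ki, vi in zip(k, solve(K, k)))
    return mean, max(var, 1e-12)


def chance_of_improvement(mean, var, best):
    """Step 3: probability that the score at this point exceeds `best`."""
    z = (mean - best) / math.sqrt(var)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def bayes_opt(score_fn, xs, candidates, n_iter=10):
    xs = list(xs)
    ys = [score_fn(x) for x in xs]            # step 1: existing regions/scores
    for _ in range(n_iter):                   # step 7: repeat
        best = max(ys)
        remaining = [c for c in candidates if c not in xs]
        if not remaining:
            break
        # Step 4: locate the maximum of the utility over the candidates.
        x_next = max(remaining, key=lambda c: chance_of_improvement(
            *gp_posterior(xs, ys, c), best))
        xs.append(x_next)                     # step 5: observe the new score
        ys.append(score_fn(x_next))           # step 6: model refits next round
    i = max(range(len(ys)), key=ys.__getitem__)
    return xs[i], ys[i]


# Hypothetical score surface peaking at offset 7.
candidates = [float(i) for i in range(11)]
best_x, best_y = bayes_opt(lambda x: -(x - 7.0) ** 2, [0.0, 10.0],
                           candidates, n_iter=len(candidates))
```

With a small budget the loop queries only the candidates the model considers promising; the usage line lets it run over all candidates so the result does not depend on the model's behavior.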
67. Example of BO applied to R-CNN
Yuting Zhang, Kihyuk Sohn, Ruben Villegas, Gang Pan, and Honglak Lee.