This document summarizes the DeepLab models for semantic image segmentation: DeepLab v1 used atrous convolution with VGG-16 as the backbone network. DeepLab v2 improved on this with atrous spatial pyramid pooling and added ResNet-101 as an option. DeepLab v3 removed dense CRFs and introduced multi-grid atrous convolution and bootstrapping. DeepLab v3+ uses an encoder-decoder architecture with Xception or ResNet-101 as the backbone and atrous separable convolutions.
For the full video of this presentation, please visit:
http://www.embedded-vision.com/platinum-members/auvizsystems/embedded-vision-training/videos/pages/may-2016-embedded-vision-summit
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Nagesh Gupta, Founder and CEO of Auviz Systems, presents the "Semantic Segmentation for Scene Understanding: Algorithms and Implementations" tutorial at the May 2016 Embedded Vision Summit.
Recent research in deep learning provides powerful tools that begin to address the daunting problem of automated scene understanding. Modifying deep learning methods, such as CNNs, to classify pixels in a scene with the help of the neighboring pixels has provided very good results in semantic segmentation. This technique provides a good starting point towards understanding a scene. A second challenge is how such algorithms can be deployed on embedded hardware at the performance required for real-world applications. A variety of approaches are being pursued for this, including GPUs, FPGAs, and dedicated hardware.
This talk provides insights into deep learning solutions for semantic segmentation, focusing on current state of the art algorithms and implementation choices. Gupta discusses the effect of porting these algorithms to fixed-point representation and the pros and cons of implementing them on FPGAs.
This document discusses techniques for super-resolution image reconstruction from multiple low-resolution images. There are three main approaches: interpolation-based, example-learning-based, and multi-image-based. Multi-image super resolution attempts to reconstruct the original high-resolution image using information from a set of observed low-resolution images. Key steps include image registration to determine displacements between images, modeling the imaging process, and using techniques like the Irani and Peleg algorithm to estimate the blurring function and reconstruct the high-resolution image.
Semantic Segmentation Methods using Deep Learning (Sungjoon Choi)
This document discusses semantic segmentation, the task of assigning each pixel in an image to a semantic class. It introduces semantic segmentation and provides a leaderboard of top-performing models. It then details the results of various semantic segmentation models on benchmark datasets, including PSPNet, DeepLab v3+, and DeepLab v3. The models are evaluated on metrics like mean intersection over union.
DeepLab V3+: Encoder-Decoder with Atrous Separable Convolution for Semantic I... (Joonhyung Lee)
A presentation introducing DeepLab V3+, the state-of-the-art architecture for semantic segmentation. It also includes detailed descriptions of how 2D multi-channel convolutions function and a detailed explanation of depth-wise separable convolutions.
This document provides an introduction to blind source separation and non-negative matrix factorization. It describes blind source separation as a method to estimate original signals from observed mixed signals. Non-negative matrix factorization is introduced as a constraint-based approach to solving blind source separation using non-negativity. The alternating least squares algorithm is described for solving the non-negative matrix factorization problem. Experiments applying these methods to artificial and real image data are presented and discussed.
Faster R-CNN improves object detection by introducing a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, enabling nearly cost-free region proposals. The RPN slides over the feature maps and predicts object bounds and objectness at each position. During training, anchors are assigned positive or negative labels based on their Intersection over Union with ground-truth boxes. The RPN and the Fast R-CNN detector are combined into a single end-to-end network, achieving state-of-the-art detection speed and accuracy while eliminating the computationally expensive selective search for proposals.
This document provides an introduction to computer vision. It summarizes the state of the field, including popular challenges like PASCAL VOC and SRVC. It describes commonly used algorithms like SIFT for feature extraction and bag-of-words models. It also discusses machine learning methods applied to computer vision like support vector machines, randomized forests, boosting, and Viola-Jones face detection. Examples of results from applying these techniques to object classification problems are also provided.
For the full video of this presentation, please visit:
http://www.embedded-vision.com/platinum-members/embedded-vision-alliance/embedded-vision-training/videos/pages/sept-2014-member-meeting-scottkrig
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Scott Krig, author of the book "Computer Vision Metrics: Survey, Taxonomy, and Analysis," delivers the presentation "Introduction to Feature Descriptors in Vision: From Haar to SIFT" at the September 2014 Embedded Vision Alliance Member Meeting.
The document discusses image segmentation techniques. It begins by defining segmentation as partitioning an image into distinct regions that correlate with objects or features of interest. The goal of segmentation is to find meaningful groups of pixels. Several segmentation techniques are described, including region growing/shrinking, clustering methods, and boundary detection. Region growing uses homogeneity tests to merge neighboring regions, while clustering divides space based on similarity within groups. Boundary detection finds boundaries between objects. The document provides examples and details of applying these segmentation methods.
Image processing involves manipulating digital images through algorithms implemented on computers. A digital image is composed of picture elements called pixels arranged in a grid. Each pixel represents a color or intensity value. Common image processing tasks include computer vision, optical character recognition, medical imaging, and more. Key concepts in image processing include pixels, resolution, color depth, and filtering/manipulating pixel values.
Slides by Amaia Salvador at the UPC Computer Vision Reading Group.
Source document on GDocs with clickable links:
https://docs.google.com/presentation/d/1jDTyKTNfZBfMl8OHANZJaYxsXTqGCHMVeMeBe5o1EL0/edit?usp=sharing
Based on the original work:
Ren, Shaoqing, Kaiming He, Ross Girshick, and Jian Sun. "Faster R-CNN: Towards real-time object detection with region proposal networks." In Advances in Neural Information Processing Systems, pp. 91-99. 2015.
This document discusses morphological image processing using mathematical morphology. It begins with an introduction to morphology in biology and its application to image analysis using set theory. The key concepts of dilation, erosion, opening and closing are explained. Dilation expands object boundaries while erosion shrinks them. Opening performs erosion followed by dilation to smooth contours, and closing performs dilation followed by erosion to fill small holes. Structuring elements determine the shape and size of operations. Morphological operations are useful for tasks like boundary extraction, noise removal, and feature detection.
Focal loss was proposed to address the problems of one-stage networks (YOLO, SSD, etc.): the fundamental imbalance in which hard positives (objects) are vastly outnumbered by easy negatives (background), compounded by cases such as detecting large and small objects at the same time, i.e., extreme class imbalance coexisting with differences in difficulty. This is a summary of experiments applying focal loss to classification problems with extreme class imbalance (e.g., 1:10 or 1:100). The upshot: the results are very sensitive to hyperparameter settings, but used well, focal loss yields good results without data-level sampling schemes or any special handling at the classifier level to correct the class imbalance.
http://imatge-upc.github.io/telecombcn-2016-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of big annotated data and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which had been addressed until now with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or text captioning.
A Three-Dimensional Representation method for Noisy Point Clouds based on Gro...Sergio Orts-Escolano
Slides used for the thesis defense of the PhD candidate Sergio Orts-Escolano.
The research described in this thesis was motivated by the need for a robust model capable of representing 3D data obtained with 3D sensors, which are inherently noisy. In addition, time constraints have to be considered, as these sensors can provide a 3D data stream in real time. This thesis proposed the use of Self-Organizing Maps (SOMs) as a 3D representation model, in particular the Growing Neural Gas (GNG) network, which has been successfully used for clustering, pattern recognition, and topology representation of multi-dimensional data. Until now, Self-Organizing Maps have been computed primarily offline, and their application to 3D data has mainly focused on noise-free models without considering time constraints. A hardware implementation is proposed that leverages the computing power of modern GPUs under the paradigm of General-Purpose Computing on Graphics Processing Units (GPGPU). The proposed methods were applied to different problems and applications in computer vision, such as object recognition and localization, visual surveillance, and 3D reconstruction.
- Image classification involves training a classifier on labeled images, validating hyperparameters, and testing on unlabeled images.
- Nearest neighbor classification predicts labels of nearest training examples while linear classification learns weights to separate classes with a hyperplane.
- Loss functions like cross-entropy measure how well the classifier's predicted scores match the true labels and are minimized during training.
Semantic Segmentation - Fully Convolutional Networks for Semantic Segmentation (岳華 杜)
This document discusses several semantic segmentation methods using deep learning, including fully convolutional networks (FCNs), U-Net, and SegNet. FCNs were among the first to use convolutional networks for dense, pixel-wise prediction by converting classification networks to fully convolutional form and combining coarse and fine feature maps. U-Net and SegNet are encoder-decoder architectures that extract high-level semantic features from the input image and then generate pixel-wise predictions, with U-Net copying and cropping features and SegNet using pooling indices for upsampling. These methods demonstrate that convolutional networks can effectively perform semantic segmentation through dense prediction.
Machine Learning - Convolutional Neural Network (Richard Kuo)
The document provides an overview of convolutional neural networks (CNNs) for visual recognition. It discusses the basic concepts of CNNs such as convolutional layers, activation functions, pooling layers, and network architectures. Examples of classic CNN architectures like LeNet-5 and AlexNet are presented. Modern architectures such as Inception and ResNet are also discussed. Code examples for image classification using TensorFlow, Keras, and Fastai are provided.
MLPfit is a tool for designing and training multi-layer perceptrons (MLPs) for tasks like function approximation and classification. It implements stochastic minimization as well as more powerful methods like conjugate gradients and BFGS. MLPfit is designed to be simple, precise, fast and easy to use for both standalone and integrated applications. Documentation and source code are available online.
Artificial Intelligence applications such as Machine Learning and Deep Learning have become an important part of our lives. The products we buy, whether or not we qualify for a bank loan, the movies or series Netflix recommends to us, self-driving cars, object recognition, and so on: all of that information is directed at us by these algorithms.
Today these fields of study are among the most exciting and challenging in computing, given their high complexity and strong market demand. In this presentation we will get to know these concepts and learn to tell them apart, as they are unavoidable tools for improving human life.
Some of the specific topics covered:
- The context of ML and DL within Artificial Intelligence.
- Machine Learning.
- Supervised Learning.
- Unsupervised Learning.
- Deep Learning.
- Artificial Neural Network.
- Convolutional Neural Networks.
- Applications of ML and DL.
Machine Learning using Support Vector Machine (Mohsin Ul Haq)
This document provides an overview of machine learning using support vector machines (SVM). It first defines machine learning as a field that allows computers to learn without explicit programming. It then describes the main types of machine learning: supervised learning using labelled training data, unsupervised learning to find hidden patterns in unlabelled data, and reinforcement learning to maximize rewards. SVM is introduced as a classification algorithm that finds the optimal separating hyperplane between classes with the largest margin. Kernels are discussed as functions that enable SVMs to operate in high-dimensional implicit feature spaces without explicitly computing coordinates.
This document discusses digital image processing concepts including:
- Image acquisition and representation, including sampling and quantization of images. CCD arrays are commonly used in digital cameras to capture images as arrays of pixels.
- A simple image formation model where the intensity of a pixel is a function of illumination and reflectance at that point. Typical ranges of illumination and reflectance are provided.
- Image interpolation techniques like nearest neighbor, bilinear, and bicubic interpolation which are used to increase or decrease the number of pixels in a digital image. Examples of applying these techniques are shown.
- Basic relationships between pixels, including adjacency, paths, regions, boundaries, and distance measures like Euclidean, city block, and chessboard distance.
This document provides an overview of digital image processing and image compression techniques. It defines what a digital image is, discusses the advantages and disadvantages of digital images over analog images. It describes the fundamental steps in digital image processing as well as types of data redundancy that can be exploited for image compression, including coding, interpixel, and psychovisual redundancy. Common image compression models and lossless compression techniques like Lempel-Ziv-Welch coding are also summarized.
Presentation slides I made while studying for an internal study group. There may be mistakes, so please let me know and I will correct them.
*In the classical CNN architecture on slide 6 (it also recurs later), the trailing ReLU in "ReLU - Pool - ReLU" is wrong: computing ReLU again after ReLU - Pool is redundant (thanks to Kyung Mo Kweon for the feedback).
Deep Learning Into Advance - 1. Image, ConvNet (Hyojun Kim)
[This material was produced as part of the AB180 internal study group.]
It was made to build a basic understanding of deep learning, walk through applied examples, and share insights. This first installment covers how deep learning is applied to image processing and the basics of the Convolutional Neural Network (ConvNet).
* This study material draws on the Stanford course CS231n (http://cs231n.stanford.edu).
2. TL;DR
1. In semantic segmentation, FCN established a new paradigm: an Encoder (CNN)-Decoder architecture.
2. U-Net followed, adding skip connections and gradual up/down sampling to that structure; for whatever reason, many papers have since used segmentation networks under the name "U-Net architecture".
3. FCN (+U-Net) beats the older algorithms, but its biggest problem (a.k.a. room for improvement) is pooling. Pooling's role is the exponential expansion of the receptive field; its downsides are the shrinking of the feature map and the loss of positional information.
4. So replace what pooling does! → Dilated (atrous) convolution achieves the exponential expansion of the receptive field through a structural change, with no drop in performance, solving the feature-map shrinkage problem.
5. Might the positional information lost during pooling differ with the filter size? Then pool at several sizes and combine the results. → Spatial Pyramid Pooling.
6. Use all of the above (skip connections, dilated convolution, spatial pyramid pooling) together with a good encoder → DeepLab v3+ (currently first on PASCAL VOC 2012).
7. Only the barest outline is covered here, so please take it as a pointer toward the many papers worth reading.
3. Outline
Part 1: What is an encoder-decoder?
Part 2: How can positional information be preserved?
Part 3: Ingredients for end-to-end semantic segmentation
9. Fully Convolutional Network!
Encoder → Decoder
A CNN works well as an encoder, so suppose the information of each pixel is compressed into the feature map. If that compressed information is passed through a decoder, shouldn't the positional information of the pixels be recoverable? Yes!
Long et al. Fully convolutional networks for semantic segmentation. CVPR, 2015.
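To make the encoder-decoder reading concrete, here is a minimal PyTorch sketch of the FCN-32s idea (not code from the slides): a VGG16 encoder compresses the image by 32x, a 1x1 convolution scores each feature-map cell per class, and bilinear upsampling plays the role of the decoder. The 21-class count (PASCAL VOC) and the use of a recent torchvision VGG16 are assumptions for illustration.

import torch
import torch.nn as nn
import torchvision

class TinyFCN(nn.Module):
    """FCN-32s-style head: encode, score per class, upsample x32."""
    def __init__(self, num_classes=21):   # 21 = PASCAL VOC classes (assumed)
        super().__init__()
        vgg = torchvision.models.vgg16(weights=None)
        self.encoder = vgg.features                   # five conv/pool stages, stride 32
        self.score = nn.Conv2d(512, num_classes, 1)   # per-cell class scores

    def forward(self, x):
        h, w = x.shape[2:]
        f = self.encoder(x)                           # (N, 512, h/32, w/32)
        s = self.score(f)
        # "Decoder": recover per-pixel predictions by upsampling x32
        return nn.functional.interpolate(s, size=(h, w),
                                         mode="bilinear", align_corners=False)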
11. Outline - Part 2
1. What are FCN's problems?
2. A gradual en--------coder de--------coder structure (U-Net)
3. Dilated Convolution (DilatedNet, DeepLab v2)
4. Spatial Pyramid Pooling (PSPNet, DeepLab v3/v3+)
12. Fully Convolutional Network?
[Figure: the FCN output is upsampled x32]
Long et al. Fully convolutional networks for semantic segmentation. CVPR, 2015.
Each value in the final feature map corresponds to far too many pixels, so positional information is hard to preserve in fine detail.
13. The problem: pooling layers!
VGG-19 (the FCN encoder): Image → Conv/Pool → Conv/Pool → Conv/Pool → Conv/Pool → Conv/Pool → FC
Pooling's role:
- Exponential expansion of the receptive field
- Translation invariance
Pooling's problems:
- Shrinking of the feature map
- Loss of positional information
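The shrinkage is easy to see in code. A toy five-stage conv/pool stack (a stand-in for the VGG encoder, not the exact VGG-19) halves the feature map at every stage, so a 224x224 input comes out 7x7, i.e. 1/32 per side:

import torch
import torch.nn as nn

def stage(c_in, c_out):
    # one conv/pool stage: convolve, then halve the spatial size
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                         nn.ReLU(),
                         nn.MaxPool2d(2))

encoder = nn.Sequential(stage(3, 64), stage(64, 128), stage(128, 256),
                        stage(256, 512), stage(512, 512))
x = torch.randn(1, 3, 224, 224)
print(encoder(x).shape)   # torch.Size([1, 512, 7, 7]): 224 / 2^5 = 7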
14. En--------coder De--------coder (a.k.a. the U-Net architecture)
Gradual encoding and gradual decoding, with skip connections that pass earlier features forward.
Ronneberger et al. U-net: Convolutional networks for biomedical image segmentation. MICCAI, 2015.
These are good techniques, but the core problem remains: the final feature map is still far too small compared to the original image.
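A minimal sketch of one U-Net-style level (the channel widths are arbitrary assumptions): the encoder's high-resolution features are kept aside and concatenated with the upsampled decoder features, so fine positional detail skips past the bottleneck.

import torch
import torch.nn as nn

class SkipLevel(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.mid = nn.Sequential(nn.Conv2d(64, 128, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(128, 64, 2, stride=2)   # gradual upsampling
        self.dec = nn.Conv2d(64 + 64, 64, 3, padding=1)      # skip + upsampled

    def forward(self, x):
        skip = self.enc(x)                         # fine, high-resolution features
        deep = self.mid(self.down(skip))           # coarse, semantic features
        up = self.up(deep)                         # back to the skip's resolution
        return self.dec(torch.cat([up, skip], 1))  # skip connection: concatenate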
16. Dilated (Atrous) Convolution
Yu and Koltun. Multi-scale context aggregation by dilated convolutions. ICLR, 2016.
Receptive field comparison (normal vs. dilated), three stacked 3x3 convolutions:
- Normal: dilation 1 / 1 / 1 → receptive field 3x3 / 5x5 / 7x7
- Dilated: dilation 1 / 2 / 4 → receptive field 3x3 / 7x7 / 15x15
Exponential expansion of the receptive field!
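The numbers above follow from simple receptive-field arithmetic: each stride-1 layer adds (kernel - 1) x dilation to the receptive field, so doubling the dilation per layer gives exponential growth. A quick check (a hypothetical helper, not from the slides):

def receptive_fields(dilations, kernel=3):
    # each stride-1 layer adds (kernel - 1) * dilation to the receptive field
    rf, sizes = 1, []
    for d in dilations:
        rf += (kernel - 1) * d
        sizes.append(rf)
    return sizes

print(receptive_fields([1, 1, 1]))  # [3, 5, 7]   normal convolutions
print(receptive_fields([1, 2, 4]))  # [3, 7, 15]  dilated convolutions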
17. Dilated (Atrous) Convolution
Feature map comparison (normal vs. dilated): with normal convolutions the final feature map is 1/32 the size of the input; with dilated convolutions it is 1/8, preserving a feature map 4x larger than before.
Chen et al. Rethinking atrous convolution for semantic image segmentation. arXiv, 2017.
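A toy PyTorch comparison of the two output strides (an illustrative sketch, not the DeepLab network): keeping the first three halvings but swapping the last two strided layers for dilated, stride-1 ones leaves the feature map at 1/8 instead of 1/32.

import torch
import torch.nn as nn

def block(c, stride, dilation):
    return nn.Conv2d(c, c, 3, stride=stride, padding=dilation, dilation=dilation)

x = torch.randn(1, 8, 256, 256)
normal  = nn.Sequential(*[block(8, 2, 1) for _ in range(5)])     # five halvings
dilated = nn.Sequential(*[block(8, 2, 1) for _ in range(3)],     # three halvings,
                        block(8, 1, 2), block(8, 1, 4))          # then dilate instead
print(normal(x).shape)   # torch.Size([1, 8, 8, 8])    -> 1/32
print(dilated(x).shape)  # torch.Size([1, 8, 32, 32])  -> 1/8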
18. Spatial Pyramid Pooling
He et al. Spatial pyramid pooling in deep convolutional networks for visual recognition. ECCV, 2014.
Might the positional information lost during pooling differ with the filter size? Then extract information at each filter size and combine the results, minimizing the loss of positional information.
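A minimal sketch of the pooling-pyramid idea (the bin sizes 1/2/4 are assumptions): the same feature map is pooled over several grid sizes, so coarse bins keep global context while fine bins keep more positional detail, and the results are concatenated.

import torch
import torch.nn.functional as F

def spp(features, bins=(1, 2, 4)):
    # pool the same map at several grid sizes, flatten, and concatenate
    pooled = [F.adaptive_max_pool2d(features, b).flatten(1) for b in bins]
    return torch.cat(pooled, dim=1)

f = torch.randn(2, 512, 13, 13)
print(spp(f).shape)   # torch.Size([2, 10752]) = 512 * (1 + 4 + 16)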
19. Atrous Convolution + Spatial Pyramid Pooling!
Combining the two ideas gives atrous spatial pyramid pooling.
Chen et al. Rethinking atrous convolution for semantic image segmentation. arXiv, 2017.
Zhao et al. Pyramid scene parsing network. CVPR, 2017.
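A compact sketch of an ASPP-style module (the rates 1/6/12/18 follow the DeepLab v3 paper's ASPP; its extra image-level pooling branch is omitted here): parallel atrous convolutions read the same feature map through receptive fields of different sizes, and a 1x1 convolution fuses the concatenated branches.

import torch
import torch.nn as nn

class MiniASPP(nn.Module):
    def __init__(self, c_in=512, c_out=256, rates=(1, 6, 12, 18)):
        super().__init__()
        # one 3x3 branch per atrous rate; padding = rate keeps the spatial size
        self.branches = nn.ModuleList(
            nn.Conv2d(c_in, c_out, 3, padding=r, dilation=r) for r in rates)
        self.fuse = nn.Conv2d(c_out * len(rates), c_out, 1)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

print(MiniASPP()(torch.randn(1, 512, 32, 32)).shape)  # (1, 256, 32, 32)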
24. Outline - Part 3
1. Data preparation and preprocessing
2. Model selection
3. Choosing the loss and optimizer
4. Evaluation (metrics)
25. Data preprocessing
Preprocessing is nothing special compared to classification, with one exception: augmentation must be applied to image-mask pairs, as in the sketch below.
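A minimal sketch of paired augmentation with torchvision's functional transforms (the flip probability and rotation range are arbitrary assumptions): the point is that every random decision is drawn once and applied to both the image and the mask.

import random
import torchvision.transforms.functional as TF

def paired_augment(image, mask):
    if random.random() < 0.5:        # one coin flip decides for both
        image, mask = TF.hflip(image), TF.hflip(mask)
    angle = random.uniform(-10, 10)  # one angle for both
    image = TF.rotate(image, angle)
    mask = TF.rotate(mask, angle)    # default nearest interpolation keeps labels discrete
    return image, mask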
26. Loss
- Cross-entropy loss
Optimizer
- SGD with momentum (+ Nesterov)
Learning rate
- The "poly" learning rate policy (PSPNet, DeepLab v2-v3+)
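The poly policy itself is one line: lr = base_lr * (1 - iter / max_iter) ^ power, with power = 0.9 in those papers. A quick sketch:

def poly_lr(base_lr, step, max_steps, power=0.9):
    # decays smoothly from base_lr to zero over training
    return base_lr * (1 - step / max_steps) ** power

for step in (0, 5000, 9999):
    print(poly_lr(0.01, step, 10000))   # 0.01 -> ~0.0054 -> ~0.0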
27. Evaluation (pixel level)
- IoU: B / (A + C - B)
- Pixel accuracy: B / A
[Diagram: A = ground-truth region, C = predicted region, B = their overlap, i.e. the correctly predicted pixels]
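In code, with the slide's notation (A = ground-truth pixels, C = predicted pixels, B = their overlap), both metrics fall out of three counts; a sketch for a single class:

import numpy as np

def pixel_scores(gt_mask, pred_mask):
    A = gt_mask.sum()                             # ground-truth pixels
    C = pred_mask.sum()                           # predicted pixels
    B = np.logical_and(gt_mask, pred_mask).sum()  # correctly predicted pixels
    return B / (A + C - B), B / A                 # IoU, pixel accuracy

gt = np.zeros((4, 4), bool);   gt[:2] = True      # 8 ground-truth pixels
pred = np.zeros((4, 4), bool); pred[1:3] = True   # 8 predicted, 4 overlapping
print(pixel_scores(gt, pred))                     # (0.333..., 0.5)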
28. Evaluation (object level)
- Precision/Recall: a detection counts as correct when IoU >= 0.5
- AP: the area under the precision/recall curve for a given IoU criterion (0-1.0)
- mAP: the AP averaged over all classes
[Diagram: prediction A' overlaps ground truth A with IoU = 0.7, a success (TP); prediction C' overlaps ground truth C with IoU = 0.2, a fail (FN). AP per class → mAP.]
Source: https://github.com/Cartucho/mAP
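The diagram's matching rule in a few lines (a sketch assuming, as in the slide's example, one prediction per ground-truth object): a prediction whose IoU reaches the threshold is a true positive, otherwise its ground truth counts as missed.

def match_detections(ious, threshold=0.5):
    # ious: best overlap per predicted object, e.g. [0.7, 0.2] from the slide
    tp = sum(iou >= threshold for iou in ious)
    fn = len(ious) - tp   # each low-IoU match leaves its ground truth missed
    return tp, fn

print(match_detections([0.7, 0.2]))   # (1, 1): one success (TP), one fail (FN)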
29. Omitted topics
1. Post-processing (CRF, ...)
2. A closer look at dilated convolution and upsampling
3. Results combining segmentation with other areas (e.g., pix2pix)
... please fill these in!
30. References
1. Models
- He et al. Spatial pyramid pooling in deep convolutional networks for visual recognition. ECCV, 2014.
- Long et al. Fully convolutional networks for semantic segmentation. CVPR, 2015.
- Ronneberger et al. U-net: Convolutional networks for biomedical image segmentation. MICCAI, 2015.
- Yu and Koltun. Multi-scale context aggregation by dilated convolutions. ICLR, 2016.
- Zhao et al. Pyramid scene parsing network. CVPR, 2017.
- Chen et al. Rethinking atrous convolution for semantic image segmentation. arXiv, 2017.
- Chen et al. Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv, 2018.
31. References
2. Supplementary material
- FCN-PSPNet PyTorch implementations (https://github.com/ZijunDeng/pytorch-semantic-segmentation)
- Evaluation metrics in Python (https://github.com/martinkersner/py_img_seg_eval)
- DeepLab PyTorch implementation (https://github.com/doiken23/DeepLab_pytorch)
- Deconvolution explained, on Distill (https://distill.pub/2016/deconv-checkerboard/)
- A review blog covering FCN to DeepLab v3 (http://blog.qure.ai/notes/semantic-segmentation-deep-learning-review)
- PASCAL VOC 2012 semantic segmentation leaderboard (http://host.robots.ox.ac.uk:8080/leaderboard/displaylb.php?cls=mean&challengeid=11&compid=6&submid=8284)
32. References
- Dilated convolution explained (https://stackoverflow.com/questions/41178576/whats-the-use-of-dilated-convolutions)
- Spatial pyramid pooling explained (https://www.quora.com/What-is-the-difference-between-simple-max-pooling-and-spatial-pyramid-pooling-Im-seeing-these-terms-a-lot-lately-in-papers-where-the-authors-need-to-get-a-feature-vector)
- Receptive fields explained (https://medium.com/mlreview/a-guide-to-receptive-field-arithmetic-for-convolutional-neural-networks-e0f514068807)
- A performance comparison with and without dilated convolutions, and a fix for the gridding artifact they introduce (https://arxiv.org/abs/1705.09914)