This presentation was prepared for the ViPr Reading Group at Multimedia University, Cyberjaya. Its goal was to make the lab members aware of recent advancements in action recognition.
ImageNet Classification with Deep Convolutional Neural Networks (2012) - Woochul Shin
1) The document describes a study that trained one of the largest convolutional neural networks on the ImageNet dataset.
2) It implemented highly optimized GPU training of large CNNs on high resolution images and introduced features like ReLU, local response normalization, and overlapping pooling to improve performance and reduce overfitting.
3) The network architecture consisted of 5 convolutional layers and 3 fully-connected layers and was trained on two GPUs with techniques like dropout and data augmentation to reduce overfitting.
Visualizing and Understanding Convolutional Networks (2014) - Woochul Shin
(1) The document discusses techniques for visualizing and understanding convolutional networks, including deconvolutional networks to project activations back to the input space and occlusion sensitivity analysis.
(2) The approach involves using a deconvolutional network to map activations in intermediate layers back to the input pixel space to show what patterns cause activations. Training details of modifying AlexNet for dense connections are also described.
(3) Visualizing features reveals their increasing invariance at higher layers, exaggeration of discriminative parts, and evolution over training. Visualization helped select better architectures and analyze occlusion sensitivity and correspondence.
This document analyzes KinectFusion, a real-time 3D reconstruction system using a moving depth camera. It introduces SLAMBench, a benchmarking framework for KinectFusion. The document describes the KinectFusion pipeline including preprocessing, tracking, integration and raycasting steps. It evaluates several RGB-D datasets and identifies the Washington RGB-D Scenes dataset as most suitable. It notes drawbacks in KinectFusion like noisy trajectories and inconsistent models. Future work proposed is reducing tracking noise using a Kalman filter.
Real-time large scale dense RGB-D SLAM with volumetric fusion extends KinectFusion to larger scales. It represents the volumetric reconstruction as a rolling buffer that translates as the camera moves. It estimates camera pose through combined geometric and photometric constraints. It closes loops by non-rigidly deforming the map with constraints from loop closures and jointly optimizes the camera poses and map. Evaluation shows it produces large, globally consistent, real-time dense reconstructions.
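The KinectFusion integration step mentioned in the pipeline above fuses each new depth observation into the voxel volume as a weighted running average of truncated signed distances. A minimal per-voxel sketch of that idea, assuming vectorized voxel arrays; the function name, truncation distance, and weight cap are illustrative choices, not values from the papers:

```python
import numpy as np

def integrate_tsdf(tsdf, weight, depth_dist, trunc=0.1, max_w=100):
    """Fuse a new signed-distance observation per voxel into the TSDF
    volume by weighted running average (the integration step).
    depth_dist = measured surface distance minus voxel distance along the ray."""
    d = np.clip(depth_dist / trunc, -1.0, 1.0)   # truncate to [-1, 1]
    new_w = np.minimum(weight + 1, max_w)        # cap weight so old data can decay
    tsdf = (tsdf * weight + d) / new_w           # running weighted average
    return tsdf, new_w
```

Averaging many noisy observations per voxel is what gives the fused model its smoothness; the weight cap keeps the volume responsive when the scene changes.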
FastCampus 2018 SLAM Workshop
You can find the code diagrams via the link below.
https://www.dropbox.com/sh/u76i5hzdecd4ey7/AADgs9XzXt6k1j971vyBrFTea?dl=0
MediaEval 2016 - Placing Images with Refined Language Models and Similarity Search - multimediaeval
Presenter: Giorgos Kordopatis-Zilos
Placing Images with Refined Language Models and Similarity Search with PCA-reduced VGG Features. In Working Notes Proceedings of the MediaEval 2016 Workshop, Hilversum, Netherlands, October 20-21, CEUR-WS.org (2016). By Giorgos Kordopatis-Zilos, Adrian Popescu, Symeon Papadopoulos, Yiannis Kompatsiaris.
Paper: http://ceur-ws.org/Vol-1739/MediaEval_2016_paper_13.pdf
Video: https://youtu.be/WR4I3CWjcR4
Abstract: We describe the participation of the CERTH/CEA-LIST team in the MediaEval 2016 Placing Task. We submitted five runs to the estimation-based sub-task: one based only on text by employing a Language Model-based approach with several refinements, one based on visual content, using geospatial clustering over the most visually similar images, and three based on a hybrid scheme exploiting both visual and textual cues from the multimedia items, trained on datasets of different size and origin. The best results were obtained by a hybrid approach trained with external training data and using two publicly available gazetteers.
Fast Full Search for Block Matching Algorithms - ijsrd.com
This project introduces a configurable motion estimation architecture for a wide range of fast block-matching algorithms (BMAs). Contemporary motion estimation architectures are either too rigid to support multiple BMAs, or their flexibility comes at the cost of reduced performance. In block-based motion estimation, a BMA searches the reference frame for the block that best matches the current macroblock. During the search, the checking point yielding the minimum block distortion (MBD) determines the displacement of the best matching block.
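The exhaustive full search described above can be sketched in a few lines: every candidate displacement within the search range is tested, and the one yielding the minimum SAD block distortion wins. This is a hedged illustration only; the function name, block size, and search range `p` are arbitrary choices, not taken from the paper:

```python
import numpy as np

def full_search(ref, cur, block_xy, b=8, p=4):
    """Test every displacement in [-p, p] x [-p, p] and return the motion
    vector minimising the SAD block distortion (the MBD checking point)."""
    by, bx = block_xy
    cur_blk = cur[by:by+b, bx:bx+b].astype(np.int64)
    best_mv, best_sad = (0, 0), None
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            y, x = by + dy, bx + dx
            if y < 0 or x < 0 or y + b > ref.shape[0] or x + b > ref.shape[1]:
                continue                       # candidate falls outside the frame
            sad = int(np.abs(ref[y:y+b, x:x+b].astype(np.int64) - cur_blk).sum())
            if best_sad is None or sad < best_sad:
                best_mv, best_sad = (dy, dx), sad
    return best_mv, best_sad
```

The cost is (2p+1)^2 SAD evaluations per macroblock, which is exactly what the fast BMAs discussed here try to avoid.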
Motion and Feature Based Person Tracking in Surveillance Videos - Shiva Kumar Cheruku
The document summarizes and compares two common algorithms for person tracking in surveillance videos: background subtraction and frame difference. It then proposes a moving target detection algorithm based on background subtraction with a dynamic background. The background image is updated over time through superimposition of the current frame with the previous background image. This allows objects that remain stationary for a period of time to become part of the background. Experimental results showed this algorithm can detect and extract moving targets more effectively and precisely.
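The dynamic background update described above, where the current frame is superimposed onto the previous background so that stationary objects gradually fade in, amounts to a running average. A minimal sketch, assuming grayscale float frames; the learning rate `alpha` and the threshold are illustrative values, not taken from the paper:

```python
import numpy as np

def update_background(bg, frame, alpha=0.05):
    """Superimpose the current frame onto the previous background model;
    objects that stay stationary gradually become part of the background."""
    return (1 - alpha) * bg + alpha * frame

def detect_moving(bg, frame, thresh=30):
    """Background subtraction: pixels differing from the background model
    by more than `thresh` are flagged as moving foreground."""
    return np.abs(frame.astype(np.float64) - bg) > thresh
```

A larger `alpha` absorbs stopped objects into the background faster but makes slow-moving targets harder to detect.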
This document proposes an adaptive rood pattern search (ARPS) algorithm for fast block matching motion estimation. ARPS uses adjustable search patterns centered around a predicted motion vector value to quickly locate the minimum matching error point. It employs a checking bit-map and zero-motion prejudgment to avoid duplicate computations and benefit sequences with small motion. Experimental results show ARPS achieves 94-447x speedup over full search with less than 0.12dB drop in PSNR quality compared to full search for most sequences.
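A simplified sketch of the adaptive rood pattern idea described above: stage one tests a plus-shaped pattern whose arm length is taken from the predicted motion vector (plus the predicted point itself), and stage two refines with a unit rood pattern until the centre point wins. The paper's checking bit-map and exact thresholds are omitted; the default arm length and helper names are assumptions:

```python
import numpy as np

def block_sad(ref, cur, by, bx, dy, dx, b=8):
    """SAD block distortion for displacement (dy, dx); inf when out of bounds."""
    y, x = by + dy, bx + dx
    if y < 0 or x < 0 or y + b > ref.shape[0] or x + b > ref.shape[1]:
        return np.inf
    return np.abs(ref[y:y+b, x:x+b].astype(np.int64)
                  - cur[by:by+b, bx:bx+b].astype(np.int64)).sum()

def arps(ref, cur, by, bx, pred=(0, 0), b=8):
    """Adaptive rood pattern search (simplified sketch)."""
    if block_sad(ref, cur, by, bx, 0, 0, b) == 0:   # zero-motion prejudgment
        return (0, 0)
    arm = max(abs(pred[0]), abs(pred[1])) or 2      # default arm: an assumption
    cands = [(0, 0), (arm, 0), (-arm, 0), (0, arm), (0, -arm), tuple(pred)]
    best = min(cands, key=lambda mv: block_sad(ref, cur, by, bx, mv[0], mv[1], b))
    while True:                                     # unit-rood refinement
        ring = [best] + [(best[0] + d, best[1] + e)
                         for d, e in ((1, 0), (-1, 0), (0, 1), (0, -1))]
        nxt = min(ring, key=lambda mv: block_sad(ref, cur, by, bx, mv[0], mv[1], b))
        if nxt == best:
            return best
        best = nxt
```

Only a handful of checking points are evaluated per block, which is where the large speedup over full search comes from.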
Introduction to Wavelet Transform and Two Stage Image De-noising Using Principal Component Analysis - ijsrd.com
Over the past two decades, various techniques have been developed to support a wide variety of image processing applications, including medical, satellite, space, transmission and storage, radar, and sonar imaging. Noise degrades all of these applications, so it is necessary to remove it, and many methods exist for doing so. The wavelet transform (WT) has proved effective for noise removal, but it has shortcomings that a PCA-based method can overcome. This paper presents an efficient image de-noising scheme using principal component analysis (PCA) with local pixel grouping (LPG), which better preserves local image structures. A pixel and its nearest neighbors are modeled as a vector variable whose training samples are selected from the local window using block-matching-based LPG. In image de-noising, a compromise must be found between noise reduction and the preservation of significant image detail. PCA is a standard statistical technique for simplifying a dataset by reducing it to lower dimensions, commonly used for data reduction in statistical pattern recognition and signal processing. The LPG-PCA procedure is iterated a second time to further improve de-noising performance, with the noise level adaptively adjusted in the second stage.
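The PCA shrinkage at the heart of such an LPG-PCA scheme can be illustrated on a stack of grouped patch vectors: transform to the principal component basis, attenuate each coefficient according to the estimated noise variance, and transform back. This is a sketch under simplifying assumptions (Wiener-style shrinkage; the paper's exact estimator and the LPG grouping step itself are not reproduced here):

```python
import numpy as np

def pca_denoise(patches, noise_var):
    """Denoise rows of `patches` (one grouped patch vector per row) by
    PCA with Wiener-style shrinkage of the transform coefficients."""
    mean = patches.mean(axis=0)
    X = patches - mean
    cov = X.T @ X / len(patches)                 # sample covariance
    eigval, eigvec = np.linalg.eigh(cov)         # PCA basis
    Y = X @ eigvec                               # PCA coefficients
    signal_var = np.maximum(eigval - noise_var, 0.0)
    shrink = signal_var / (signal_var + noise_var + 1e-12)
    return (Y * shrink) @ eigvec.T + mean        # shrink and invert
```

Components whose variance is mostly noise get shrunk toward zero, while strong structural components pass through nearly unchanged, which is how local structure is preserved.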
This document provides an overview and summary of a presentation on Simultaneous Localization and Mapping (SLAM). It introduces the speaker, Dong-Won Shin, and his background and research in SLAM. The contents of the presentation are then outlined, including an introduction to SLAM, traditional SLAM approaches like Extended Kalman Filter SLAM and FastSLAM, efforts towards large-scale mapping like graph-based SLAM and loop closure detection, modern state-of-the-art systems like ORB SLAM, KinectFusion and Lidar SLAM, and applications of SLAM. Key algorithms in visual odometry, backend optimization, and loop closure detection are also summarized.
SeedNet: Automatic Seed Generation with Deep Reinforcement Learning for Robust Interactive Segmentation - NAVER Engineering
This paper proposes a seed generation technique that uses deep reinforcement learning to solve the interactive segmentation problem. One of the key issues in interactive segmentation is minimizing user intervention. The proposed system generates artificial seeds on the user's behalf; the user only needs to provide the initial seed information. Because the ambiguity in defining an optimal seed point makes supervised learning difficult, we overcome this with reinforcement learning: we define an MDP tailored to the seed generation problem and successfully train a deep Q-network. Trained on the MSRA10K dataset, the method shows superior performance compared to the inaccurate initial results of existing segmentation algorithms.
This document summarizes a project that used a deep learning model to predict depth images from single RGB images. It discusses existing solutions using stereo cameras or Kinect devices. The project used the NYU Depth V2 dataset, splitting it into training, validation, and test sets. It implemented a model based on previous work, training it on RGB-D image pairs for 35 epochs but achieving only moderate results due to limited training data. The code and results are available online for further exploration.
The document discusses background subtraction techniques for detecting moving objects in video frames. It introduces the mixture of Gaussians approach, which models each pixel as a combination of Gaussian distributions to determine if it belongs to the background or foreground. The key advantages of this approach are its robustness to repetitive motions and changes in lighting/weather. The document compares various techniques, then covers implementation details and challenges of applying mixture of Gaussians to an outdoor scene with moving vehicles and foliage.
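The per-pixel mixture model described above can be sketched for a single grayscale pixel: each incoming value either updates the Gaussian it matches or replaces the weakest component, and membership in a high-weight Gaussian marks the pixel as background. This is a simplified Stauffer-Grimson-style illustration; the learning rate, thresholds, and the single-weight background test are assumptions, not the document's exact method:

```python
import numpy as np

class PixelMoG:
    """Mixture of K Gaussians for one grayscale pixel (sketch)."""
    def __init__(self, k=3, lr=0.05, match_sigma=2.5):
        self.w = np.ones(k) / k               # component weights
        self.mu = np.linspace(0.0, 255.0, k)  # component means
        self.var = np.full(k, 400.0)          # component variances
        self.lr, self.match_sigma = lr, match_sigma

    def update(self, x, w_thresh=0.25):
        """Fold observation x into the mixture; True means background."""
        d2 = (x - self.mu) ** 2
        matched = d2 < (self.match_sigma ** 2) * self.var
        if matched.any():
            i = int(np.argmin(np.where(matched, d2, np.inf)))
            m = np.zeros_like(self.w); m[i] = 1.0
            self.w += self.lr * (m - self.w)            # weight update
            self.mu[i] += self.lr * (x - self.mu[i])    # mean update
            self.var[i] += self.lr * (d2[i] - self.var[i])
            return bool(self.w[i] > w_thresh)           # high weight => background
        j = int(np.argmin(self.w))     # no match: replace the weakest Gaussian
        self.mu[j], self.var[j], self.w[j] = x, 400.0, 0.05
        self.w /= self.w.sum()
        return False                   # newly created mode => foreground
```

Keeping several Gaussians per pixel is what makes the model robust to repetitive motion such as swaying foliage: each recurring intensity mode earns its own background component.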
Primal-dual coding photography is a new photographic technique that uses coded illumination and exposure to selectively record user-defined subsets of light paths, generalizing conventional photography. It modulates the contribution of specific light paths using a "probing matrix" of primal codes for illumination and dual codes for exposure over multiple frames. This allows effects like enhancing direct light, capturing indirect light of different ranges, separating light transport effects, and making 3D regions invisible or color-coded in the photo. The technique provides guarantees of optimality and convergence for reconstructing images from the coded photo measurements.
IRJET - Dehazing of Single Nighttime Haze Image using Superpixel Method - IRJET Journal
This document presents a new superpixel-based algorithm for removing haze from single nighttime images. It first decomposes the input hazy nighttime image into a glow image and a glow-free hazy image using their relative smoothness. It then uses superpixel segmentation to compute the atmospheric light and dark channel values for each pixel in the glow-free image. The transmission map is estimated from the dark channel using a weighted guided image filter. Compared to patch-based methods, using superpixels reduces morphological artifacts and allows a smaller filter radius that better preserves details. The proposed method is tested on nighttime hazy images and effectively removes haze, restoring clear nighttime scenes.
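The per-superpixel dark channel mentioned above, i.e. the minimum intensity over a segment's pixels and colour channels, can be sketched as follows. The superpixel segmentation itself (e.g. SLIC) is assumed to be given as an integer label map, and the function name is illustrative:

```python
import numpy as np

def dark_channel_per_segment(img, labels):
    """Per-superpixel dark channel: for each segment, the minimum
    intensity over all of its pixels and colour channels, broadcast
    back to image shape. `labels` holds a segment id per pixel."""
    per_pixel_min = img.min(axis=2)               # min over the RGB channels
    n = int(labels.max()) + 1
    seg_min = np.full(n, np.inf)
    np.minimum.at(seg_min, labels.ravel(), per_pixel_min.ravel())
    return seg_min[labels]                        # look up each pixel's segment
```

Because superpixels follow object boundaries, this avoids the blocky halos that fixed square patches produce around edges.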
PhD defence public presentation, Bayesian methods for inverse problems with point clouds: applications to single-photon lidar, ENSEEIHT, Toulouse, France
Architecture Design for Deep Neural Networks I - Wanjin Yu
This document summarizes Gao Huang's presentation on neural architectures for efficient inference. The presentation covered three parts: 1) macro-architecture innovations in convolutional neural networks (CNNs) such as ResNet, DenseNet, and multi-scale networks; 2) micro-architecture innovations including group convolution, depthwise separable convolution, and attention mechanisms; and 3) moving from static networks to dynamic networks that can adaptively select simpler or more complex models based on input complexity. The key idea is to enable faster yet accurate inference by matching computational cost to input difficulty.
Deep Local Parametric Filters for Image Enhancement - Sean Moran
This document presents DeepLPF, a neural network architecture that can regress the parameters of learnable image filters to retouch and enhance input images. DeepLPF predicts the parameters of three types of filters - elliptical, graduated, and polynomial filters - that emulate common image editing tools. It achieves state-of-the-art performance on benchmark datasets while using a small number of neural network weights. The filters allow for interpretable, spatially localized adjustments to images.
Presenter: 이준태 (PhD student, Korea University)
Date: September 2017
Overview:
An algorithm for semantic line detection and its applications will be presented. First, I will introduce the concept of a semantic line. Second, the semantic line detection method will be described. Then, two applications will be presented: composition enhancement and image simplification.
The document discusses human action recognition using spatio-temporal features. It proposes using optical flow and shape-based features to form motion descriptors, which are then classified using Adaboost. Targets are localized using background subtraction. Optical flows within localized regions are organized into a histogram to describe motion. Differential shape information is also captured. The descriptors are used to train a strong classifier with Adaboost that can recognize actions in testing videos.
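The histogram-of-flow motion descriptor described above can be sketched as an orientation histogram of the flow vectors inside the localized region, weighted by flow magnitude. The bin count and normalization are illustrative choices, not taken from the document:

```python
import numpy as np

def flow_histogram(u, v, bins=8):
    """Orientation histogram of optical-flow vectors (u, v) within a
    localized target region, weighted by flow magnitude, as a compact
    motion descriptor for a boosted classifier."""
    mag = np.hypot(u, v)                               # flow magnitude
    ang = np.arctan2(v, u) % (2 * np.pi)               # orientation in [0, 2pi)
    idx = np.minimum((ang / (2 * np.pi) * bins).astype(int), bins - 1)
    hist = np.bincount(idx.ravel(), weights=mag.ravel(), minlength=bins)
    return hist / (hist.sum() + 1e-12)                 # normalize the descriptor
```

Per-frame descriptors like this (concatenated with shape features) would then form the feature vectors fed to Adaboost.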
This document discusses a new technique called Perceptual Mixture of Gaussians (PMOG) for background subtraction in video analytics. PMOG incorporates characteristics of human visual perception to improve detection quality. It uses a realistic background value prediction based on the most recent observation. Detection threshold is set based on Weber's law and a just-noticeable difference threshold. Frame-level features are extracted from foreground blobs and transformed into temporal features for event detection. Experiments show PMOG has higher stability across environments and superior detection quality compared to other techniques. It provides an effective approach for abnormal event detection without requiring context information.
Presenter: 고영준 (PhD student, Korea University)
Date: June 2017
Overview:
Algorithms to segment objects in a video sequence will be presented.
First, I will introduce a primary object segmentation algorithm based on region augmentation and reduction. Second, collaborative detection, tracking, and segmentation for online multiple object segmentation will be presented.
Summary:
There are three parts in this presentation.
A. Why do we need Convolutional Neural Networks
- Problems we face today
- Solutions for problems
B. LeNet Overview
- The origin of LeNet
- The result after using LeNet model
C. LeNet Techniques
- LeNet structure
- Function of every layer
In the GitHub link below there is a repository where I rebuilt LeNet without any deep learning package. I hope this helps you better understand the basics of convolutional neural networks.
Github Link : https://github.com/HiCraigChen/LeNet
LinkedIn : https://www.linkedin.com/in/YungKueiChen
(Paper Review) U-GAT-IT: Unsupervised Generative Attentional Networks with Adaptive Layer-Instance Normalization for Image-to-Image Translation - Myeonggyu Lee
The document introduces a new paper titled "U-GAT-IT: Unsupervised Generative Attentional Networks with Adaptive Layer-Instance Normalization for Image-to-Image Translation". It proposes a model that uses attention mechanisms to focus on discriminative regions between domains. A new normalization method called AdaLIN is also introduced to flexibly control the degree of shape and texture transformation without changing the model structure or hyperparameters. The model aims to learn mapping functions between unpaired source and target domains for tasks like selfie to anime image translation.
3D Shape and Indirect Appearance by Structured Light Transport - Matthew O'Toole
3D Shape and Indirect Appearance by Structured Light Transport
Matthew O'Toole, John Mather, and Kiriakos N. Kutulakos. CVPR, 2014.
Abstract:
We consider the problem of deliberately manipulating the direct and indirect light flowing through a time-varying, fully-general scene in order to simplify its visual analysis. Our approach rests on a crucial link between stereo geometry and light transport: while direct light always obeys the epipolar geometry of a projector-camera pair, indirect light overwhelmingly does not. We show that it is possible to turn this observation into an imaging method that analyzes light transport in real time in the optical domain, prior to acquisition. This yields three key abilities that we demonstrate in an experimental camera prototype: (1) producing a live indirect-only video stream for any scene, regardless of geometric or photometric complexity; (2) capturing images that make existing structured-light shape recovery algorithms robust to indirect transport; and (3) turning them into one-shot methods for dynamic 3D shape capture.
Multi-phase-field simulations with OpenPhase - PFHub
The document describes OpenPhase, an open-source phase field modeling toolbox for simulating microstructure evolution. OpenPhase uses a multi-phase field approach and includes modules for simulating processes like coarsening, diffusion, deformation, plasticity, damage, and fluid flow. It has been under development for over 10 years. The document provides an overview of OpenPhase capabilities and includes an example of using it to simulate Mg-Al alloy solidification, showing the effect of cooling rate on microstructure. It also gives details about setting up and running a simulation using the OpenPhase modules in C++.
The document provides tips for good decoding and comprehension when reading. It advises that good decoders sound out words, think of possible words based on letters, and use context clues. It also recommends that good comprehenders use prior knowledge, make connections, ask questions, visualize, infer, summarize, evaluate, and synthesize what they read by using multiple strategies together.
This document outlines 10 benefits of reading books on a daily basis, including mental stimulation, stress reduction, gaining knowledge, expanding vocabulary, improving memory, strengthening analytical thinking skills, improving focus and concentration, better writing skills, tranquility, and free entertainment. It encourages daily reading by stating that a reader lives many lives through books before dying, compared to a non-reader who only lives one life.
3D Shape and Indirect Appearance by Structured Light TransportMatthew O'Toole
3D Shape and Indirect Appearance by Structured Light Transport
Matthew O'Toole, John Mather, and Kiriakos N. Kutulakos. CVPR, 2014.
Abstract:
We consider the problem of deliberately manipulating the direct and indirect light flowing through a time-varying, fully-general scene in order to simplify its visual analysis. Our approach rests on a crucial link between stereo geometry and light transport: while direct light always obeys the epipolar geometry of a projector-camera pair, indirect light overwhelmingly does not. We show that it is possible to turn this observation into an imaging method that analyzes light transport in real time in the optical domain, prior to acquisition. This yields three key abilities that we demonstrate in an experimental camera prototype: (1) producing a live indirect-only video stream for any scene, regardless of geometric or photometric complexity; (2) capturing images that make existing structured-light shape recovery algorithms robust to indirect transport; and (3) turning them into one-shot methods for dynamic 3D shape capture.
Multi-phase-field simulations with OpenPhasePFHub PFHub
The document describes OpenPhase, an open-source phase field modeling toolbox for simulating microstructure evolution. OpenPhase uses a multi-phase field approach and includes modules for simulating processes like coarsening, diffusion, deformation, plasticity, damage, and fluid flow. It has been under development for over 10 years. The document provides an overview of OpenPhase capabilities and includes an example of using it to simulate Mg-Al alloy solidification, showing the effect of cooling rate on microstructure. It also gives details about setting up and running a simulation using the OpenPhase modules in C++.
The document provides tips for good decoding and comprehension when reading. It advises that good decoders sound out words, think of possible words based on letters, and use context clues. It also recommends that good comprehenders use prior knowledge, make connections, ask questions, visualize, infer, summarize, evaluate, and synthesize what they read by using multiple strategies together.
This document outlines 10 benefits of reading books on a daily basis, including mental stimulation, stress reduction, gaining knowledge, expanding vocabulary, improving memory, strengthening analytical thinking skills, improving focus and concentration, better writing skills, tranquility, and free entertainment. It encourages daily reading by stating that a reader lives many lives through books before dying, compared to a non-reader who only lives one life.
IE Presentation on the Benefits of Readingdevaratth
Reading provides numerous cognitive, social, and personal benefits. It develops mental capacity and vocabulary, allowing deeper understanding of various cultures. Reading improves concentration and self-esteem. Personal development comes from identifying with characters and learning from true stories. Professionally, reading aids growth and helps leaders develop others. Overall, reading is a tool for learning, self-discovery, and assisting others.
Teachers should focus on improving students' reading skills as it is important for developing other language abilities. There are three stages for teaching reading: pre-reading, while-reading, and post-reading. Each stage has specific strategies to prepare students, aid comprehension during reading, and check understanding after reading. Some examples include making predictions, using context clues, and summarizing. Following this structured approach can help students learn to independently comprehend and analyze texts.
The document discusses reading skills and difficulties. It covers three main components of reading: decoding, comprehension, and retention. Decoding involves translating printed words to sounds, comprehension is understanding the text, and retention is keeping or remembering the information read. Some common reading difficulties include dyslexia, vocabulary issues, memory problems, attention problems, and difficulties with decoding, comprehension, or retention.
This document discusses action recognition in videos. It begins by defining action recognition and describing its applications such as surveillance, video search, and medical monitoring. Challenges of action recognition like scale variations, camera motion, and human pose differences are presented. The document reviews papers on local space-time features with SVMs and two-stream convolutional networks. It shows that local features combined with SVMs achieved the best results on a dataset of human actions. Two-stream ConvNets, which use spatial and temporal streams, became the state-of-the-art by capturing shape from frames and motion from optical flow. Future work may explore deeper ConvNets with larger datasets.
The document presents a system for detecting complex events in unconstrained videos using pre-trained deep CNN models. Frame-level features extracted from various CNNs are fused to form video-level descriptors, which are then classified using SVMs. Evaluation on a large video corpus found that fusing different CNNs outperformed individual CNNs, and no single CNN worked best for all events as some are more object-driven while others are more scene-based. The best performance was achieved by learning event-dependent weights for different CNNs.
The slides for the techniques used in the Temporal Segment Network (TSN), including the basic ideas, recall of BN-Inception, optical flow and tricks in application. Used in group paper reading in University of Sydney.
Yen-Yu Lin presents research on video synthesis through frame interpolation. His lab uses deep learning models like DVF to predict intermediate frames between two consecutive frames. However, existing methods produce artifacts or over-smoothed results. The proposed approach uses a two-stage training procedure with cycle consistency loss to address this. It first pre-trains DVF, then fine-tunes with cycle loss to make the model robust to lack of data and produce higher quality frames. Experimental results show the approach outperforms state-of-the-art methods on standard datasets.
161209 Unsupervised Learning of Video Representations using LSTMsJunho Cho
This document summarizes an research paper that proposes three models for unsupervised learning of video representations using LSTMs: 1) an LSTM autoencoder model that reconstructs the input video sequence in reverse order, 2) an LSTM future predictor model that predicts subsequent frames in the video, and 3) a composite model that combines the autoencoder and predictor models to learn representations that capture both static and dynamic information. Experimental results on video datasets show the composite model performs best and its learned representations provide improved initialization for supervised action recognition tasks.
John W. Vinti Particle Tracker Final PresentationJohn Vinti
The document describes a project to develop a particle tracker software for use in a micro-particle image velocimetry (μPIVOT) system. The μPIVOT system is used to study particle behavior in fluids but faces bottlenecks around calibration, cell movement during experiments, and limited data extraction during experiments. The goals of the project are to create software that provides real-time video, image processing, particle tracking, deformation analysis, and variable output capabilities to address these bottlenecks. The proposed particle tracker software would utilize MATLAB for a graphical user interface and leverage existing licensing, while minimizing additional toolbox needs. It is intended to improve experiments on non-Newtonian fluids and provide an educational tool.
Video Classification: Human Action Recognition on HMDB-51 datasetGiorgio Carbone
Two-stream CNNs for video action recognition using Stacked Optical Flow, implemented in Keras, on HMDB-51 dataset.
We use spatial (ResNet-50 finetuned) and temporal stream cnn (stacked Optical Flows) under the Keras framework to perform Video-Based Human Action Recognition on HMDB-51 dataset.
Vision and Multimedia Reading Group: DeCAF: a Deep Convolutional Activation F...Simone Ercoli
I presented an interesting paper during the Vision and Multimedia Reading Group about DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition (pdf).
It is a complete evaluation about features extracted from the activation of a deep convolutional network trained with a large scale dataset.
This a work of Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, Trevor Darrell from Berkeley University
Human Action Recognition in Videos Employing 2DPCA on 2DHOOF and Radon TransformFadwa Fouad
This document provides an overview of a Masters thesis that proposes algorithms for human action recognition. It begins with an introduction that discusses the importance of human action recognition, challenges in the field, and differences between actions and activities. It then presents an agenda that outlines an introduction, overview, and details of two proposed algorithms: 2DHOOF/2DPCA contour-based optical flow and human gesture recognition using Radon transform/2DPCA. The overview section describes the general structure of action recognition systems from video capture to classification. Experimental results on benchmark datasets demonstrate the effectiveness of the proposed algorithms.
ntroduction of Signal such as sinosoidal signals, definition of signalsDrAjayKumarYadav4
This document outlines a course on signals and systems. It begins with the course outcomes, which are to identify different signals and systems, apply concepts like impulse response and convolution, perform spectral analysis using Fourier transforms, analyze discrete-time signals using Z-transforms, and demonstrate skills in sampling and reconstructing signals. It then lists the textbook and an NPTEL course on the subject. The document proceeds to define what a signal and system are, provide examples, and discuss important applications in areas like communication, control, MEMS, remote sensing, and biomedical signal processing.
The document describes a proposed approach for human action recognition using an attention based spatiotemporal graph convolutional network. The approach uses both temporal and spatial attention modules to select important frames and joints from skeletal data. The temporal attention module identifies informative frames, while the spatial attention module highlights significant joints within selected frames. Both attention modules help capture temporal and spatial relationships in skeletal data to improve action recognition accuracy. The proposed network incorporates temporal and spatial attention mechanisms into a graph convolutional network to efficiently recognize human actions from skeleton sequences.
Deep learning fundamental and Research project on IBM POWER9 system from NUSGanesan Narayanasamy
Moving object recognition (MOR) corresponds to the localisation and classification of moving objects in videos. Discriminating moving objects from static objects and background in videos is an essential task for many computer vision applications. MOR has widespread applications in intelligent visual surveillance, intrusion detection, anomaly detection and monitoring, industrial sites monitoring, detection-based tracking, autonomous vehicles, etc. In this session, Murari is going to talk about the deep learning algorithms to identify both locations and corresponding categories of moving objects with a convolutional network. The challenges in developing such algorithms will be discussed. The discourse will also include the implementation details of these models in both conventional and UAV videos.
On the Influence Propagation of Web Videosabidhavp
The document proposes a method called Noise-reductive Local-and-Global Learning (NLGL) to model the propagation and influence of web videos. NLGL aims to predict how videos spread across communities on the internet and their influence. It creates a Unified Virtual Community Space (UVCS) to capture a video's propagation history. NLGL then uses an iterative algorithm to jointly perform dimension reduction, predict labels for unlabeled training data, and learn a classifier, in order to better model propagation patterns and influence rankings. The method aims to handle large datasets and predict rankings for new videos.
This document discusses Motaz El Saban's research experience and interests which focus on analyzing, modeling, learning from, and predicting digital media content such as text, images, and speech. Some key areas of research include real-time video stitching, annotating mobile videos, object and activity recognition from videos, and facial expression recognition using deep learning techniques. The document also outlines El Saban's educational background and provides an agenda for his upcoming presentation.
We presents a deep architecture for dense semantic correspondence, called pyramidal affine regression networks (PARN), that estimates locally-varying affine transformation fields across images.
To deal with intra-class appearance and shape variations that commonly exist among different instances within the same object category,
we leverage a pyramidal model where affine transformation fields are progressively estimated in a coarse-to-fine manner so that the smoothness constraint is naturally imposed within deep networks.
PARN estimates residual affine transformations at each level and composes them to estimate final affine transformations.
Furthermore, to overcome the limitations of insufficient training data for semantic correspondence, we propose a novel weakly-supervised training scheme that generates progressive supervisions by leveraging a correspondence consistency across image pairs.
Our method is fully learnable in an end-to-end manner and does not require quantizing infinite continuous affine transformation fields.
Silhouette analysis based action recognition via exploiting human posesAVVENIRE TECHNOLOGIES
We propose a novel scheme for human action recognition that combines the advantages of both local and global representations.
We explore human silhouettes for human action representation by taking into account the correlation between sequential poses in an action.
The document proposes a framework for recognizing actions across cameras by exploring correlation subspaces. It first learns a joint subspace using Canonical Correlation Analysis (CCA) on unlabeled multi-view data. It then trains a Support Vector Machine (SVM) in this subspace with a novel correlation regularizer that favors dimensions with higher correlation between views, improving generalization to target views. Experiments on the IXMAS dataset show the method outperforms baselines, with the regularizer successfully suppressing weights for less correlated dimensions.
Large-scale Video Classification with Convolutional Neural Net.docxcroysierkathey
Large-scale Video Classification with Convolutional Neural Networks
Andrej Karpathy1,2 George Toderici1 Sanketh Shetty1
[email protected][email protected][email protected]
Thomas Leung1 Rahul Sukthankar1 Li Fei-Fei2
[email protected][email protected][email protected]
1Google Research 2Computer Science Department, Stanford University
http://cs.stanford.edu/people/karpathy/deepvideo
Abstract
Convolutional Neural Networks (CNNs) have been es-
tablished as a powerful class of models for image recog-
nition problems. Encouraged by these results, we pro-
vide an extensive empirical evaluation of CNNs on large-
scale video classification using a new dataset of 1 million
YouTube videos belonging to 487 classes. We study mul-
tiple approaches for extending the connectivity of a CNN
in time domain to take advantage of local spatio-temporal
information and suggest a multiresolution, foveated archi-
tecture as a promising way of speeding up the training.
Our best spatio-temporal networks display significant per-
formance improvements compared to strong feature-based
baselines (55.3% to 63.9%), but only a surprisingly mod-
est improvement compared to single-frame models (59.3%
to 60.9%). We further study the generalization performance
of our best model by retraining the top layers on the UCF-
101 Action Recognition dataset and observe significant per-
formance improvements compared to the UCF-101 baseline
model (63.3% up from 43.9%).
1. Introduction
Images and videos have become ubiquitous on the in-
ternet, which has encouraged the development of algo-
rithms that can analyze their semantic content for vari-
ous applications, including search and summarization. Re-
cently, Convolutional Neural Networks (CNNs) [15] have
been demonstrated as an effective class of models for un-
derstanding image content, giving state-of-the-art results
on image recognition, segmentation, detection and retrieval
[11, 3, 2, 20, 9, 18]. The key enabling factors behind these
results were techniques for scaling up the networks to tens
of millions of parameters and massive labeled datasets that
can support the learning process. Under these conditions,
CNNs have been shown to learn powerful and interpretable
image features [28]. Encouraged by positive results in do-
main of images, we study the performance of CNNs in
large-scale video classification, where the networks have
access to not only the appearance information present in
single, static images, but also their complex temporal evolu-
tion. There are several challenges to extending and applying
CNNs in this setting.
From a practical standpoint, there are currently no video
classification benchmarks that match the scale and variety
of existing image datasets because videos are significantly
more difficult to collect, annotate and store. To obtain suffi-
cient amount of data needed to train our CNN architectures,
we collected a new Sports-1M dataset, which consists of 1
million YouTube videos belonging to a taxonomy ...
2. My Introduction
Saimunur Rahman
Graduate Research Assistant (JS)
Centre of Visual Computing
Multimedia University, Cyberjaya Campus
Facebook: fb.me/saimunur.rahman
Web: http://saimunur.github.io
3. Today’s Agenda
• Talk on “Action Recognition with Trajectory-Pooled Deep-Convolutional Descriptors” by L. Wang, Y. Qiao, and X. Tang, published at CVPR 2015.
Let's begin with some vision-based action recognition basics
10. Research Trend in Action Recognition
• Hand-crafted method
‒ Holistic or global method
‒ Localized method
• Unsupervised method
‒ Deep learning
• Fusion
11. Action Recognition with Trajectory-
Pooled Deep-Convolutional Descriptors
Limin Wang1,2,Yu Qiao2, Xiaoou Tang1,2
1The Chinese University of Hong Kong
2Shenzhen Institutes of Advanced Technology
CVPR 2015 Poster
Total Citation : 40 (18 in 2016)
12. Main Idea
• Utilize deep architectures to learn conv. feature maps
• Apply trajectory-based pooling to aggregate conv. features into effective descriptors
• Aims to combine the benefits of both hand-crafted and deep-learned features.
13. Motivations
• Hand-crafted methods lack discriminative capacity
• Current deep learning methods do not differentiate between the spatial and temporal domains
‒ The temporal dimension is treated as feature channels when an image-trained ConvNet is used to model videos
14. Contributions
• Modified two-stream CNN model [Simonyan and Zisserman, NIPS 2014] trained on UCF-101 [Soomro et al., CoRR 2012]
• Two CNN feature-map normalization methods
• Thorough evaluation of the later convolution layers (conv3, 4, 5)
• Multi-scale extension
16. Trajectory extraction
• Used improved dense trajectory (iDT) [Wang et al., ICCV 13]
• Camera motion removal
‒ Compute optical flow
‒ Homography estimation using RANSAC [Fischler & Bolles. 1981]
‒ SURF and Optical flow (OF) for similarity between two frames
‒ Re-compute the optical flow – warped flow
• Trajectory estimation
‒ Trajectories using dense trajectories [Wang et al. 11]
‒ Track points with original spatial scale (results 2-3% less
than multi-scale) [Wang et al. 11]
Image reproduced from Wang et al. 2013
(Figure: input video and detected trajectories; input video source: YouTube)
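The tracking step above can be sketched in numpy. This is a minimal illustration under stated assumptions, not the iDT code: the (warped) dense flow fields are assumed to be precomputed elsewhere, and `track_points` is a name of my choosing.

```python
import numpy as np

def track_points(points, flows):
    """Track points through a sequence of dense flow fields, in the
    spirit of dense trajectories: P_{t+1} = P_t + flow_t(P_t)."""
    traj = [np.asarray(points, dtype=float)]   # (N, 2) array of (x, y)
    for flow in flows:                         # each flow: (H, W, 2)
        h, w, _ = flow.shape
        pts = traj[-1]
        # round to the nearest pixel and clamp inside the frame
        xi = np.clip(np.round(pts[:, 0]).astype(int), 0, w - 1)
        yi = np.clip(np.round(pts[:, 1]).astype(int), 0, h - 1)
        traj.append(pts + flow[yi, xi])
    return np.stack(traj)                      # (L + 1, N, 2)
```

Fed with warped flow, the same loop yields camera-motion-compensated trajectories.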
17. Feature map extraction
Two-stream network [Simonyan and Zisserman, NIPS 2014], Use CNN-M-2048 model [Chatfield et al, BMVC 2014]
Proposed network model of both spatial and temporal stream
18. Feature map extraction (2)
• Spatial-net: frame-by-frame
• Temporal-net: stacked optical flow volume (one frame is replicated)
• Trajectory mapping:
‒ Zero-padding of k/2, where k is the kernel size in the conv and pooling layers
‒ Trajectory point mapping: (x, y, t) → (r·x, r·y, t), where r is the feature-map size ratio w.r.t. the input image
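The point mapping above amounts to one scaling per spatial coordinate; a sketch (helper name is mine), assuming r is the feature-map/input size ratio:

```python
def map_to_feature_map(x, y, t, r):
    """(x, y, t) -> (r*x, r*y, t): map a video-space trajectory point
    to feature-map coordinates; zero-padding by k/2 keeps r exact."""
    return (int(round(r * x)), int(round(r * y)), t)
```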
19. Trajectory pooled descriptor (TDD)
• Local trajectory-aligned descriptor computed in a 3D volume around the trajectory.
• The volume size is N × N × P, where N is the spatial size and P is the trajectory length.
• Feature normalization (ensures all channels share the same range and contribute equally)
‒ Spatiotemporal normalization: C_st(x, y, t, n) = C(x, y, t, n) / max_{x,y,t} C(x, y, t, n)
‒ Channel normalization: C_ch(x, y, t, n) = C(x, y, t, n) / max_n C(x, y, t, n)
• TDD estimation is done by sum-pooling the normalized maps over the trajectory:
D(T_k, C_m) = Σ_{p=1}^{P} C_m(r_m · x_p^k, r_m · y_p^k, t_p^k)
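A numpy sketch of the two normalizations and the trajectory sum-pooling (function names are mine; the feature maps are assumed stored as an (H, W, T, N) array with N channels):

```python
import numpy as np

def spatiotemporal_normalize(C, eps=1e-8):
    # divide each channel n by its max over space and time, so all
    # channels share the same value range
    return C / (C.reshape(-1, C.shape[-1]).max(axis=0) + eps)

def channel_normalize(C, eps=1e-8):
    # divide each position (x, y, t) by its max over the N channels
    return C / (C.max(axis=-1, keepdims=True) + eps)

def tdd(C_norm, traj, r):
    """Sum-pool a normalized feature map over one trajectory.
    traj: (P, 3) array of (x, y, t) video-space points;
    r: feature-map/input size ratio."""
    H, W, T, N = C_norm.shape
    d = np.zeros(N)
    for x, y, t in traj:
        xi = min(int(round(r * x)), W - 1)
        yi = min(int(round(r * y)), H - 1)
        d += C_norm[yi, xi, int(t)]
    return d          # one descriptor of length N per trajectory
```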
20. Multi-scale TDD
1. Build multi-scale pyramid representations of the video frames and optical flow fields.
2. Feed the pyramid representations into the two-stream ConvNets to obtain multi-scale feature maps.
3. Calculate the multi-scale TDD: (x, y, t) → (r_m × s × x, r_m × s × y, t), where s is the feature scale, s ∈ {1/2, 1/√2, 1, √2, 2}
(Figure: spatial-net pyramid and temporal-net pyramid)
21. Datasets
• HMDB51 [Kuehne et al., ICCV 2011]
• 6,766 video clips from 51 action categories
• 3 splits for evaluation; each split has 70% training and 30% testing samples
51 action classes
22. Datasets
• UCF-101
• 13,320 video clips from 101 action categories
• THUMOS13 challenge evaluation scheme with three training/testing splits
101 action classes
23. Implementation - ConvNet Training
• Spatial Net
1. UCF-101 first split → resize frames to 256×256 → random crop 224×224 → random horizontal flip
2. Pre-train the network with the publicly available model from Chatfield et al. (BMVC 2014)
3. Fine-tune the model parameters on the full UCF-101 dataset
• Temporal Net
1. 3D volume → resize to 256×256×.. → random crop 224×224×20 → random horizontal flip → selection of 10 frames (to balance performance and efficiency)
2. Train the temporal net on UCF-101 from scratch
3. High dropout ratio for FC6 and FC7 to improve the generalization capacity of the trained model (the training dataset is relatively small!!)
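The spatial-net augmentation above (random 224×224 crop of a 256×256 frame plus a random horizontal flip) can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(frame, crop=224):
    """Random crop plus random horizontal flip of one frame."""
    h, w = frame.shape[:2]
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    patch = frame[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]     # horizontal flip
    return patch
```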
24. Implementation – Feature Encoding
• Used Fisher vector (FV) encoding [Sanchez et al., IJCV 2013]
• GMM with K = 256 clusters
• PCA to reduce the descriptor dimensionality; the FV is 2KD-dimensional, where D is the (reduced) feature dimension
• Linear SVM as the classifier (C = 100)
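A minimal numpy sketch of the FV encoding (first- and second-order statistics of local descriptors under a diagonal GMM, following the standard Sanchez et al. formulation; function and variable names are mine), which makes the 2KD output dimension explicit:

```python
import numpy as np

def fisher_vector(X, w, mu, sigma):
    """FV of local descriptors X (n, D) w.r.t. a diagonal GMM with
    weights w (K,), means mu (K, D), variances sigma (K, D).
    Output length is 2*K*D, matching the 2KD figure on the slide."""
    n, _ = X.shape
    # posterior responsibilities gamma (n, K) under the GMM
    log_p = (-0.5 * (((X[:, None, :] - mu) ** 2) / sigma
                     + np.log(2 * np.pi * sigma)).sum(-1) + np.log(w))
    log_p -= log_p.max(axis=1, keepdims=True)
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)
    diff = (X[:, None, :] - mu) / np.sqrt(sigma)            # (n, K, D)
    fv1 = (gamma[..., None] * diff).sum(0) / (n * np.sqrt(w)[:, None])
    fv2 = (gamma[..., None] * (diff ** 2 - 1)).sum(0) / (n * np.sqrt(2 * w)[:, None])
    return np.concatenate([fv1.ravel(), fv2.ravel()])
```

With K = 256 and PCA-reduced descriptors of dimension D, the encoding is 512·D-dimensional before it reaches the linear SVM.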
25. Experimental Results
• Shape is important!! See iDT vs. HOF+MBH
• Motion performance is better in the two-stream ConvNet
‒ See the temporal net
• Early conv layers are better for both nets
• Spatial conv4+5 is slightly better for UCF-101
• Temporal conv4+5 is better for HMDB51
• iDT can further boost TDD:
• 63.2% → 65.9% (HMDB51)
• 90.3% → 91.5% (UCF-101)
27. ConvNet Layer performance
• Conv1 and Conv2 are outputs of the max-pooling layers after the convolution operations
• Conv3, Conv4 and Conv5 are outputs of ReLU activations
• Observation: earlier layers perform better than later ones, e.g. conv3 in the temporal ConvNet
29. Conclusions
• An idea for exploiting 2D CNN models for action recognition
• Exploited raw image values and optical flow for model training
• Normalization of feature maps increases performance
• Single-scale trajectory features are good enough to achieve competitive performance
• Late conv layers offer more discriminative features
• Hand-crafted features can help to boost performance
30. Few important information about TDD
• The spatial (pre-trained and fine-tuned) and temporal models are available online
• Dense optical flow and trajectory code is also available online
• A ready-to-go main script (MATLAB) for Linux is also available online
• For the CNNs, the Caffe toolbox (Python) was used!!