Augmented reality (AR) is a field that combines real-world and computer-generated data. AR enhances our senses by overlaying graphics, sounds, and other enhancements onto real-world environments in real-time. Key components of AR systems include head-mounted displays, tracking and orientation sensors, and mobile computing devices. AR has applications in areas like maintenance, military, medicine, media, and gaming. While AR faces challenges like accurate tracking, future applications may allow expanding computer screens into real environments and replacing devices like phones with virtual equivalents.
Empathic Computing and Collaborative Immersive Analytics - Mark Billinghurst
This document discusses empathic computing and collaborative immersive analytics. It notes that while fields like scientific and information visualization are well established, little research has looked at collaborative visualization specifically. Collaborative immersive analytics combines mixed reality, visual analytics and computer-supported cooperative work. Empathic computing aims to develop systems that allow sharing experiences, emotions and perspectives using technologies like virtual and augmented reality with physiological sensors. Applying these concepts could enhance communication and understanding for collaborative immersive analytics tasks.
This document summarizes research on deep learning approaches for face recognition. It describes the DeepFace model from Facebook, which used a deep convolutional network trained on 4.4 million faces to achieve state-of-the-art accuracy on the Labeled Faces in the Wild (LFW) dataset. It also summarizes the DeepID2 and DeepID3 models from Chinese University of Hong Kong, which employed joint identification-verification training of convolutional networks and achieved performance comparable or superior to DeepFace on LFW. Evaluation metrics for face verification and identification tasks are also outlined.
This document provides an overview of a course on computer vision called CSCI 455: Intro to Computer Vision. It acknowledges that many of the course slides were modified from other similar computer vision courses. The course will cover topics like image filtering, projective geometry, stereo vision, structure from motion, face detection, object recognition, and convolutional neural networks. It highlights current applications of computer vision like biometrics, mobile apps, self-driving cars, medical imaging, and more. The document discusses challenges in computer vision like viewpoint and illumination variations, occlusion, and local ambiguity. It emphasizes that perception is an inherently ambiguous problem that requires using prior knowledge about the world.
This document discusses research on human action recognition using skeleton data. It introduces issues with skeleton-based action recognition, such as variable scales, view orientations, noise and rate/intra-action variations. It then reviews previous work on skeleton-based action recognition using hand-crafted features and deep learning models. The document proposes two ensemble deep learning models called Ensemble TS-LSTM v1 and v2 that use temporal sliding LSTMs to capture short, medium and long-term dependencies from skeleton sequences for action recognition. Experimental results on standard datasets demonstrate the models outperform previous methods.
This document provides an overview of computer vision techniques including:
1. Using pre-trained CNN models for tasks like classification and object detection. Popular models discussed include AlexNet, VGG, ResNet, YOLO, and DenseNet.
2. Basic CNN operations like convolution, pooling, dropout, and normalization. Feature extraction using CNNs and techniques like transfer learning and fine-tuning pretrained models.
3. Additional computer vision tasks covered include object detection using Haar cascades, stereo vision, pattern detection, and reconstructing images from CNN features. Frameworks like PyTorch and libraries like TensorFlow are also mentioned.
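The convolution and pooling operations mentioned in item 2 can be sketched directly in NumPy. This is a minimal illustration of the mechanics, not how frameworks like PyTorch implement them (they use vectorized, padded, strided variants):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D cross-correlation (what deep-learning 'convolution' usually computes)."""
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            # Each output value is the windowed element-wise product sum.
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool(x, size=2):
    """Non-overlapping max pooling; trims edges that do not fit a full window."""
    h, w = (x.shape[0] // size) * size, (x.shape[1] // size) * size
    x = x[:h, :w].reshape(h // size, size, w // size, size)
    return x.max(axis=(1, 3))
```

Feature extraction for transfer learning then amounts to running such layers with pretrained weights and training only a new classifier head on the outputs.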
Structure from motion is a computer vision technique used to recover the three-dimensional structure of a scene and the camera motion from a set of images. It can be used to build 3D models of scenes without any prior knowledge of the camera parameters or 3D locations of the scene points. Structure from motion involves detecting feature points in multiple images, matching the features between images, estimating the fundamental matrices between image pairs, and then optimizing a bundle adjustment problem to simultaneously compute the 3D structure and camera motion parameters. Some applications of structure from motion include 3D modeling, surveying, robot navigation, virtual and augmented reality, and visual effects.
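The core geometric step in the pipeline above, estimating the fundamental matrix between an image pair from matched feature points, can be sketched with the classic eight-point algorithm. This is a bare NumPy illustration under ideal, noise-free assumptions; practical systems add Hartley normalization and RANSAC for outlier rejection:

```python
import numpy as np

def estimate_fundamental(pts1, pts2):
    """Eight-point algorithm: find F such that x2^T F x1 = 0 for all matches.

    pts1, pts2: (N, 2) arrays of matched image coordinates, N >= 8.
    """
    x1, y1 = pts1[:, 0], pts1[:, 1]
    x2, y2 = pts2[:, 0], pts2[:, 1]
    # Each correspondence gives one linear constraint on the 9 entries of F.
    A = np.column_stack([x2*x1, x2*y1, x2, y2*x1, y2*y1, y2,
                         x1, y1, np.ones(len(x1))])
    # Solve Af = 0: f is the right singular vector of the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    # A valid fundamental matrix is singular, so enforce rank 2.
    U, S, Vt = np.linalg.svd(F)
    S[2] = 0
    return U @ np.diag(S) @ Vt
```

Bundle adjustment then refines the camera parameters and 3D points jointly, using pairwise estimates like this one for initialization.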
The document discusses the fundamental steps in digital image processing, which include image acquisition, restoration, enhancement, representation and description, segmentation, recognition, morphological processing, color processing, compression, and problem domains. It provides examples and brief explanations of each step. For example, it states that image restoration aims to recover a degraded image using knowledge of the degradation process, while enhancement manipulates an image to be more suitable for a specific application based on human preferences. The document also presents a model of the image degradation and restoration process.
The document provides an overview of computer vision including:
- It defines computer vision as using observed image data to infer something about the world.
- It briefly discusses the history of computer vision from early projects in 1966 to David Marr establishing the foundations of modern computer vision in the 1970s.
- It lists several related fields that computer vision draws from including artificial intelligence, information engineering, neurobiology, solid-state physics, and signal processing.
- It provides examples of applications of computer vision such as self-driving vehicles, facial recognition, augmented reality, and uses in smartphones, the web, VR/AR, medical imaging, and insurance.
This document provides an introduction to multiple object tracking (MOT). It discusses the goal of MOT as detecting and linking target objects across frames. It describes common MOT approaches including using boxes or masks to represent objects. The document also categorizes MOT based on factors like whether it tracks a single or multiple classes, in 2D or 3D, using a single or multiple cameras. It reviews old and new evaluation metrics for MOT and highlights state-of-the-art methods on various MOT datasets. In conclusion, it notes that while MOT research is interesting, standardized evaluation metrics and protocols still need improvement.
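The detect-and-link idea at the heart of MOT can be illustrated with the simplest association strategy: greedily matching each existing track to the detection with the highest box overlap (IoU). This sketch is illustrative only; the methods the document surveys use far stronger association (motion models, appearance embeddings, Hungarian matching):

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def link(prev_tracks, detections, thresh=0.3):
    """Greedily assign each track the unused detection with best IoU above thresh."""
    links, used = {}, set()
    for tid, tbox in prev_tracks.items():
        best, best_j = thresh, None
        for j, d in enumerate(detections):
            score = iou(tbox, d)
            if j not in used and score > best:
                best, best_j = score, j
        if best_j is not None:
            links[tid] = best_j
            used.add(best_j)
    return links
```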
A completed modeling of local binary pattern operator - Win Yu
This document presents the completed local binary pattern (CLBP) operator for texture classification. CLBP generalizes and completes the local binary pattern (LBP) by using a local difference sign-magnitude transform to encode the missing texture information not captured by LBP. The CLBP operator fuses three codes - CLBP_C for the center pixel, CLBP_S for the signs of differences, and CLBP_M for the magnitudes. Experiments on the Outex texture database show CLBP achieves much better classification accuracy than LBP and other state-of-the-art methods.
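The sign/magnitude decomposition that CLBP builds on can be sketched for a single 3x3 neighborhood. This is a simplified illustration: the paper thresholds magnitudes against a global image statistic and uses rotation-invariant uniform codes, whereas here the local mean stands in for that threshold:

```python
import numpy as np

def clbp_codes(patch):
    """Sign (CLBP_S) and magnitude (CLBP_M) codes for a 3x3 patch.

    The local differences d_p between neighbors and center are split into
    signs (the classic LBP bits) and magnitudes, each binarized into a code.
    """
    center = patch[1, 1]
    # The 8 neighbors in a fixed clockwise order starting at the top-left.
    neighbors = np.array([patch[0, 0], patch[0, 1], patch[0, 2],
                          patch[1, 2], patch[2, 2], patch[2, 1],
                          patch[2, 0], patch[1, 0]], dtype=float)
    d = neighbors - center
    s_bits = (d >= 0).astype(int)                          # CLBP_S: classic LBP
    m_bits = (np.abs(d) >= np.abs(d).mean()).astype(int)   # CLBP_M (local-mean threshold)
    to_code = lambda bits: int(sum(b << i for i, b in enumerate(bits)))
    return to_code(s_bits), to_code(m_bits)
```

CLBP_C additionally binarizes the center pixel against the global mean, and the three codes are fused into joint histograms for classification.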
The document summarizes digital image processing. It discusses the origins of digital images in the 1920s for transmitting newspaper photographs via transatlantic cable. The field grew with early computer processing of images from space missions in the 1960s. The document outlines fundamental steps in digital image processing, including image acquisition, enhancement, restoration, color processing, wavelets, compression, morphology, segmentation, and representation. It provides an abstract for a seminar report on digital image processing.
This document summarizes a presentation on augmented reality. It discusses what augmented reality is and the components and technologies used, including displays, tracking and input devices. Examples of medical, manufacturing, education and military applications are provided. Recent innovations in augmented reality apps and future developments like AR glasses are outlined. The educational benefits of augmented reality are explained. In conclusion, while augmented reality is still developing, applications in areas like aircraft manufacturing show promise.
The document discusses using facial recognition for attendance tracking in a school setting. It proposes developing a system that uses real-time face detection and Principal Component Analysis to match detected faces to staff members and automatically record their attendance. This would eliminate the manual and time-consuming process of logging attendance. The system would enroll staff faces during a one-time process and then identify and update their attendance in a database system in real-time. Research shows this type of automatic attendance tracking outperforms manual systems and provides more efficient leave and interface management.
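The PCA-based matching the proposal describes follows the classic eigenfaces recipe: project enrolled face images onto a small set of principal components and identify a new face by its nearest neighbor in that subspace. A minimal NumPy sketch (the names and shapes here are illustrative, not the proposal's actual implementation):

```python
import numpy as np

def train_pca(faces, n_components):
    """Eigenfaces enrollment: PCA basis from flattened face images (one per row)."""
    mean = faces.mean(axis=0)
    centered = faces - mean
    # Principal directions are the top right singular vectors of the data.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    basis = Vt[:n_components]
    enrolled = centered @ basis.T          # low-dimensional codes of enrolled faces
    return mean, basis, enrolled

def identify(face, mean, basis, enrolled, names):
    """Return the name of the nearest enrolled face in PCA space."""
    proj = (face - mean) @ basis.T
    dists = np.linalg.norm(enrolled - proj, axis=1)
    return names[int(np.argmin(dists))]
```

In a real attendance system the identified name would then trigger a database update; a distance threshold is also needed to reject unknown faces.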
The document discusses augmented reality (AR) and its potential applications. It begins by defining AR as enhancing one's current perception of reality by overlaying digital information. The technology aims to seamlessly blend virtual objects with the real world by tracking a user's movements and positioning graphics accordingly. Some key points:
- AR is still in the early research phase but may become widely available by the next decade in the form of glasses.
- It has applications in education, gaming, military, and more by providing contextual information about one's surroundings.
- The main components of an AR system are head-mounted displays, tracking systems, and mobile computing power.
- There are two main types of head-mounted displays.
This document discusses augmented reality technology and visual tracking methods. It covers how humans perceive reality through their senses like sight, hearing, touch, etc. and how virtual reality systems use input and output devices. There are different types of visual tracking including marker-based tracking using artificial markers, markerless tracking using natural features, and simultaneous localization and mapping which builds a model of the environment while tracking. Common tracking technologies involve optical, magnetic, ultrasonic, and inertial sensors. Optical tracking in augmented reality uses computer vision techniques like feature detection and matching.
COMP4010 Lecture 4 - VR Technology - Visual and Haptic Displays. Lecture about VR visual and haptic display technology. Taught on August 16th 2016 by Mark Billinghurst from the University of South Australia
Attendance system based on face recognition using python - Raihan Sikdar
The document discusses face recognition technology for use in an automatic attendance system. It first defines biometrics and face recognition, explaining that face recognition identifies individuals using facial features. It then covers how face recognition systems work by detecting nodal points on faces to create unique face prints. The document proposes using such a system to take student attendance in online classes during the pandemic, noting advantages like ease of use, increased security, and cost effectiveness. It provides examples of how the system would capture images, analyze features, and recognize enrolled students to record attendance automatically.
Lec13: Clustering Based Medical Image Segmentation Methods - Ulaş Bağcı
Clustering
– K-means
– FCM (fuzzy c-means)
– SMC (simple membership based clustering)
– AP (affinity propagation)
– FLAB (fuzzy locally adaptive Bayesian)
– Spectral clustering methods
Shape Modeling
– M-reps
– Active Shape Models (ASM)
– Oriented Active Shape Models (OASM)
– Application in anatomy recognition and segmentation
– Comparison of ASM and OASM
Active Contour (Snake) • Level Set • Applications
Enhancement, Noise Reduction, and Signal Processing • Medical Image Registration • Medical Image Segmentation • Medical Image Visualization • Machine Learning in Medical Imaging • Shape Modeling/Analysis of Medical Images • Deep Learning in Radiology
Fuzzy Connectivity (FC)
– Affinity functions
• Absolute FC
• Relative FC (and Iterative Relative FC)
• Successful example applications of FC in medical imaging
• Segmentation of airway and airway walls using an RFC-based method
Energy functional – data and smoothness terms
• Graph Cut (min cut / max flow)
• Applications in radiology images
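The simplest method in the clustering family above, k-means on pixel intensities, can be sketched in a few lines. This is only an illustration of the idea; the lecture's clinical methods add spatial regularization, fuzzy memberships (FCM), and more:

```python
import numpy as np

def kmeans_segment(image, k=2, iters=20):
    """Segment an image by clustering pixel intensities with k-means.

    Centers are initialized at evenly spaced intensity quantiles, then
    refined by alternating nearest-center assignment and mean updates.
    """
    pix = image.reshape(-1).astype(float)
    centers = np.quantile(pix, np.linspace(0, 1, k))
    for _ in range(iters):
        labels = np.argmin(np.abs(pix[:, None] - centers[None, :]), axis=1)
        for c in range(k):
            members = pix[labels == c]
            if members.size:
                centers[c] = members.mean()
    return labels.reshape(image.shape), centers
```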
Semantic Segmentation on Satellite Imagery - RAHUL BHOJWANI
This is an image semantic segmentation project targeting satellite imagery. The goal was to predict a pixel-wise segmentation map for various objects in satellite imagery, including buildings, water bodies, roads, etc. The data was taken from the Kaggle competition <https://www.kaggle.com/c/dstl-satellite-imagery-feature-detection>.
We implemented FCN, U-Net and Segnet Deep learning architectures for this task.
The document discusses human action recognition using spatio-temporal features. It proposes using optical flow and shape-based features to form motion descriptors, which are then classified using AdaBoost. Targets are localized using background subtraction. Optical flows within localized regions are organized into a histogram to describe motion. Differential shape information is also captured. The descriptors are used to train a strong classifier with AdaBoost that can recognize actions in test videos.
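The histogram-of-flow idea described above can be sketched as binning flow vectors by direction, weighted by magnitude. This is a generic motion descriptor of that kind; the document's exact binning and weighting scheme may differ:

```python
import math

def flow_histogram(flows, bins=8):
    """Magnitude-weighted orientation histogram of optical-flow vectors.

    flows: iterable of (dx, dy) displacements within a localized region.
    Returns a normalized histogram over `bins` direction sectors.
    """
    hist = [0.0] * bins
    for dx, dy in flows:
        mag = math.hypot(dx, dy)
        if mag == 0:
            continue  # zero flow carries no direction
        angle = math.atan2(dy, dx) % (2 * math.pi)
        hist[int(angle / (2 * math.pi) * bins) % bins] += mag
    total = sum(hist)
    return [h / total for h in hist] if total else hist
```

Descriptors like this, computed per frame or per region, are what the AdaBoost stage would consume as weak-learner features.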
Computer vision is the study of analyzing images and videos to understand and interpret visual content. It involves developing techniques to achieve tasks like object detection and recognition. Computer vision has many applications including optical character recognition, face detection, 3D modeling, robotics, medical imaging, and self-driving cars. OpenCV is a popular open source library for computer vision that contains over 2500 optimized algorithms and supports languages like C++, Python, and Java.
The document describes the Marching Cubes algorithm, which was developed in 1987 to construct 3D models from medical imaging data like CT scans. It works by dividing the volume into cubes and using the pixel values at the cube vertices to determine triangles that approximate the surface. There are 256 possible cases but they can be reduced to 14 basic patterns. The algorithm calculates surface normals to improve image quality and has been used to generate 3D models from various medical imaging modalities.
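The cube classification step described above is an 8-bit index: each of the cube's eight corners contributes one bit depending on whether its sampled value lies inside the surface. A minimal sketch of just that step (the triangle lookup table it indexes into is omitted):

```python
def cube_index(corner_values, isolevel):
    """Classify a cube into one of the 256 marching-cubes cases.

    corner_values: the 8 scalar samples at the cube's vertices, in a fixed
    vertex order. Each vertex below the isolevel (inside the surface) sets
    one bit; the resulting index selects a triangle pattern from a
    precomputed lookup table.
    """
    index = 0
    for bit, v in enumerate(corner_values):
        if v < isolevel:
            index |= 1 << bit
    return index
```

Indices 0 and 255 (all corners outside or all inside) produce no triangles; every other case emits a small, table-driven set of triangles whose vertices are interpolated along the crossed edges.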
18th BOAZ Big Data Conference - [Tracking 24 Hours]: Development of a User Tracking System for Fully Automated Payment in Unmanned Stores - BOAZ Bigdata
The Tracking 24 Hours team carried out the following data analysis project:
Development of a user tracking system for fully automated payment in unmanned stores
19th cohort: Min Kyung-won, Korea University, Department of Industrial and Management Engineering
19th cohort: Shin Jae-wook, Yonsei University, Department of Industrial Engineering
19th cohort: Lee Yu-bin, Seoul Women's University, Department of Software Convergence
19th cohort: Choi Ga-hee, Kongju National University, Department of Industrial and Systems Engineering
Silhouette analysis based action recognition via exploiting human poses - AVVENIRE TECHNOLOGIES
We propose a novel scheme for human action recognition that combines the advantages of both local and global representations.
We explore human silhouettes for human action representation by taking into account the correlation between sequential poses in an action.
HUMAN ACTION RECOGNITION IN VIDEOS USING STABLE FEATURES - sipij
Human action recognition is still a challenging problem, and researchers are investigating it using different techniques. We propose a robust approach for human action recognition, achieved by extracting stable spatio-temporal features in the form of a pairwise local binary pattern (P-LBP) and the scale-invariant feature transform (SIFT). These features are used to train an MLP neural network during the training stage, and the action classes are inferred from the test videos during the testing stage. The proposed features capture both the motion of individuals and its consistency, yielding higher accuracy on a challenging benchmark dataset commonly used for human action recognition. In addition, we show that our approach outperforms the individual features, i.e., using only spatial or only temporal features.
The document discusses the fundamental steps in digital image processing, which include image acquisition, restoration, enhancement, representation and description, segmentation, recognition, morphological processing, color processing, compression, and problem domains. It provides examples and brief explanations of each step. For example, it states that image restoration aims to recover an degraded image using knowledge of the degradation process, while enhancement manipulates an image to be more suitable for a specific application based on human preferences. The document also presents a model of the image degradation and restoration process.
The document provides an overview of computer vision including:
- It defines computer vision as using observed image data to infer something about the world.
- It briefly discusses the history of computer vision from early projects in 1966 to David Marr establishing the foundations of modern computer vision in the 1970s.
- It lists several related fields that computer vision draws from including artificial intelligence, information engineering, neurobiology, solid-state physics, and signal processing.
- It provides examples of applications of computer vision such as self-driving vehicles, facial recognition, augmented reality, and uses in smartphones, the web, VR/AR, medical imaging, and insurance.
This document provides an introduction to multiple object tracking (MOT). It discusses the goal of MOT as detecting and linking target objects across frames. It describes common MOT approaches including using boxes or masks to represent objects. The document also categorizes MOT based on factors like whether it tracks a single or multiple classes, in 2D or 3D, using a single or multiple cameras. It reviews old and new evaluation metrics for MOT and highlights state-of-the-art methods on various MOT datasets. In conclusion, it notes that while MOT research is interesting, standardized evaluation metrics and protocols still need improvement.
A completed modeling of local binary pattern operatorWin Yu
This document presents the completed local binary pattern (CLBP) operator for texture classification. CLBP generalizes and completes the local binary pattern (LBP) by using a local difference sign-magnitude transform to encode the missing texture information not captured by LBP. The CLBP operator fuses three codes - CLBP_C for the center pixel, CLBP_S for the signs of differences, and CLBP_M for the magnitudes. Experiments on the Outex texture database show CLBP achieves much better classification accuracy than LBP and other state-of-the-art methods.
The document summarizes digital image processing. It discusses the origins of digital images in the 1920s for transmitting newspaper photographs via transatlantic cable. The field grew with early computer processing of images from space missions in the 1960s. The document outlines fundamental steps in digital image processing, including image acquisition, enhancement, restoration, color processing, wavelets, compression, morphology, segmentation, and representation. It provides an abstract for a seminar report on digital image processing.
This document summarizes a presentation on augmented reality. It discusses what augmented reality is, the components and technologies used including displays, tracking and input devices. Examples of medical, manufacturing, education and military applications are provided. Recent innovations in augmented reality apps and future innovations like AR glasses are outlined. The educational benefits of augmented reality are explained. In conclusion, while augmented reality is still developing, applications in areas like aircraft manufacturing show promise.
The document discusses using facial recognition for attendance tracking in a school setting. It proposes developing a system that uses real-time face detection and Principal Component Analysis to match detected faces to staff members and automatically record their attendance. This would eliminate the manual and time-consuming process of logging attendance. The system would enroll staff faces during a one-time process and then identify and update their attendance in a database system in real-time. Research shows this type of automatic attendance tracking outperforms manual systems and provides more efficient leave and interface management.
The document discusses augmented reality (AR) and its potential applications. It begins by defining AR as enhancing one's current perception of reality by overlaying digital information. The technology aims to seamlessly blend virtual objects with the real world by tracking a user's movements and positioning graphics accordingly. Some key points:
- AR is still in the early research phase but may become widely available by the next decade in the form of glasses.
- It has applications in education, gaming, military, and more by providing contextual information about one's surroundings.
- The main components of an AR system are head-mounted displays, tracking systems, and mobile computing power.
- There are two main types of head-mounted
This document discusses augmented reality technology and visual tracking methods. It covers how humans perceive reality through their senses like sight, hearing, touch, etc. and how virtual reality systems use input and output devices. There are different types of visual tracking including marker-based tracking using artificial markers, markerless tracking using natural features, and simultaneous localization and mapping which builds a model of the environment while tracking. Common tracking technologies involve optical, magnetic, ultrasonic, and inertial sensors. Optical tracking in augmented reality uses computer vision techniques like feature detection and matching.
COMP4010 Lecture 4 - VR Technology - Visual and Haptic Displays. Lecture about VR visual and haptic display technology. Taught on August 16th 2016 by Mark Billinghurst from the University of South Australia
Attendance system based on face recognition using python by Raihan Sikdarraihansikdar
The document discusses face recognition technology for use in an automatic attendance system. It first defines biometrics and face recognition, explaining that face recognition identifies individuals using facial features. It then covers how face recognition systems work by detecting nodal points on faces to create unique face prints. The document proposes using such a system to take student attendance in online classes during the pandemic, noting advantages like ease of use, increased security, and cost effectiveness. It provides examples of how the system would capture images, analyze features, and recognize enrolled students to record attendance automatically.
Lec13: Clustering Based Medical Image Segmentation MethodsUlaş Bağcı
Clustering – K-means
– FCM (fuzzyc-means)
– SMC (simple membership based clustering) – AP(affinity propagation)
– FLAB(fuzzy locally adaptive Bayesian)
– Spectral Clustering Methods
ShapeModeling – M-reps – Active Shape Models (ASM) – Oriented Active Shape Models (OASM) – Application in anatomy recognition and segmentation – Comparison of ASM and OASM ActiveContour(Snake) • LevelSet • Applications Enhancement, Noise Reduction, and Signal Processing • MedicalImageRegistration • MedicalImageSegmentation • MedicalImageVisualization • Machine Learning in Medical Imaging • Shape Modeling/Analysis of Medical Images Deep Learning in Radiology Fuzzy Connectivity (FC) – Affinity functions • Absolute FC • Relative FC (and Iterative Relative FC) • Successful example applications of FC in medical imaging • Segmentation of Airway and Airway Walls using RFC based method Energy functional – Data and Smoothness terms • GraphCut – Min cut – Max Flow • ApplicationsinRadiologyImages
Semantic Segmentation on Satellite ImageryRAHUL BHOJWANI
This is an Image Semantic Segmentation project targeted on Satellite Imagery. The goal was to detect the pixel-wise segmentation map for various objects in Satellite Imagery including buildings, water bodies, roads etc. The data for this was taken from the Kaggle competition <https://www.kaggle.com/c/dstl-satellite-imagery-feature-detection>.
We implemented FCN, U-Net and Segnet Deep learning architectures for this task.
The document discusses human action recognition using spatio-temporal features. It proposes using optical flow and shape-based features to form motion descriptors, which are then classified using Adaboost. Targets are localized using background subtraction. Optical flows within localized regions are organized into a histogram to describe motion. Differential shape information is also captured. The descriptors are used to train a strong classifier with Adaboost that can recognize actions in testing videos.
Computer vision is the study of analyzing images and videos to understand and interpret visual content. It involves developing techniques to achieve tasks like object detection and recognition. Computer vision has many applications including optical character recognition, face detection, 3D modeling, robotics, medical imaging, and self-driving cars. OpenCV is a popular open source library for computer vision that contains over 2500 optimized algorithms and supports languages like C++, Python, and Java.
The document describes the Marching Cubes algorithm, which was developed in 1987 to construct 3D models from medical imaging data like CT scans. It works by dividing the volume into cubes and using the pixel values at the cube vertices to determine triangles that approximate the surface. There are 256 possible cases but they can be reduced to 14 basic patterns. The algorithm calculates surface normals to improve image quality and has been used to generate 3D models from various medical imaging modalities.
제 18회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [추적 24시] : 완전 자동결제를 위한 무인점포 이용자 Tracking System 개발BOAZ Bigdata
데이터 분석 프로젝트를 진행한 추적24시 팀에서는 아래와 같은 프로젝트를 진행했습니다.
완전 자동결제를 위한 무인점포 이용자 Tracking System 개발
19기 민경원 고려대학교 산업경영공학부
19기 신재욱 연세대학교 산업공학과
19기 이유빈 서울여자대학교 소프트웨어융합학과
19기 최가희 국립공주대학교 산업시스템공학과
Silhouette analysis based action recognition via exploiting human posesAVVENIRE TECHNOLOGIES
We propose a novel scheme for human action recognition that combines the advantages of both local and global representations.
We explore human silhouettes for human action representation by taking into account the correlation between sequential poses in an action.
HUMAN ACTION RECOGNITION IN VIDEOS USING STABLE FEATURES sipij
Human action recognition remains a challenging problem, and researchers are investigating it using different techniques. We propose a robust approach for human action recognition, achieved by extracting stable spatio-temporal features in the form of the pairwise local binary pattern (P-LBP) and the scale-invariant feature transform (SIFT). These features are used to train an MLP neural network during the training stage, and the action classes are inferred from the test videos during the testing stage. The proposed features match the motion of individuals well, and their consistency yields higher accuracy on a challenging dataset. The experimental evaluation is conducted on a benchmark dataset commonly used for human action recognition. In addition, we show that our approach outperforms the individual features, i.e., using only spatial or only temporal features.
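The classification stage of such a pipeline can be sketched roughly as follows; the P-LBP/SIFT feature extraction is omitted and replaced by synthetic descriptors, so this only illustrates the MLP train/infer split, not the paper's actual features:

```python
# Sketch: train an MLP on fixed-length per-video descriptors, then infer
# action labels. Synthetic data stands in for pooled P-LBP + SIFT features.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n_videos, feat_dim, n_actions = 120, 64, 4

X = rng.normal(size=(n_videos, feat_dim))      # stand-in descriptors
y = rng.integers(0, n_actions, size=n_videos)  # action class per video
X += y[:, None] * 0.8                          # make classes separable

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
clf.fit(X, y)
print(clf.predict(X[:3]).shape)  # one predicted action per test video
```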
Deconstructing SfM-Net architecture and beyond
"SfM-Net, a geometry-aware neural network for motion estimation in videos that decomposes frame-to-frame pixel motion in terms of scene and object depth, camera motion and 3D object rotations and translations. Given a sequence of frames, SfM-Net predicts depth, segmentation, camera and rigid object motions, converts those into a dense frame-to-frame motion field (optical flow), differentiably warps frames in time to match pixels and back-propagates."
Alternative download:
https://www.dropbox.com/s/aezl7ro8sy2xq7j/sfm_net_v2.pdf?dl=0
Announcing the Final Examination of Mr. Paul Smith for the ... - butest
Mr. Paul Smith will defend his dissertation titled "Multizoom Activity Recognition Using Machine Learning" on November 21, 2005 at 10:00 am in room CSB 232. His dissertation presents a system for detecting events in video using a multiview approach to detect and track heads and hands across cameras. It then demonstrates a machine learning algorithm called TemporalBoost that can recognize events using activity features extracted from multiple zoom levels. Mr. Smith received his B.Sc. from the University of Central Florida in 2000 and M.Sc. from the same institution in 2002. His dissertation committee is co-chaired by Drs. Mubarak Shah and Niels da Vitoria Lobo.
Falling costs with rising quality via hardware innovations and deep learning.
Technical introduction for scanning technologies from Structure-from-Motion (SfM), Range sensing (e.g. Kinect and Matterport) to Laser scanning (e.g. LiDAR), and the associated traditional and deep learning-based processing techniques.
Note: due to the small font size and poor rendering by SlideShare, it is better to download the slides locally to your device.
Alternative download link for the PDF:
https://www.dropbox.com/s/eclyy45k3gz66ve/proptech_emergingScanningTech.pdf?dl=0
This document discusses Motaz El Saban's research experience and interests which focus on analyzing, modeling, learning from, and predicting digital media content such as text, images, and speech. Some key areas of research include real-time video stitching, annotating mobile videos, object and activity recognition from videos, and facial expression recognition using deep learning techniques. The document also outlines El Saban's educational background and provides an agenda for his upcoming presentation.
Integrated Hidden Markov Model and Kalman Filter for Online Object Tracking - ijsrd.com
A visual prior learned from generic real-world images can be used to represent the objects in a scene. Existing work presents an online tracking algorithm that transfers a visual prior learned offline to online object tracking: a complete dictionary representing the prior is learned from a collection of real-world images. This prior knowledge is generic, and the training image set contains no observations of the target object. The learned prior is transferred to construct the object representation using sparse coding and multiscale max pooling, and a linear classifier is learned online to distinguish the target from the background and to track appearance variations of both over time. Tracking is carried out within a Bayesian inference framework, with the learned classifier used to construct the observation model; a particle filter estimates the tracking result sequentially, but works inefficiently in noisy scenes, and time-shift variance is not well suited to tracking the target using prior information about the object's structure. We propose an HMM-based Kalman filter to improve online target tracking in noisy sequential image frames: the covariance vector is measured to identify noisy scenes, and discrete time steps are evaluated to separate the target object from the background. Experiments are conducted on challenging scene sequences, and the tracking algorithm is evaluated in terms of tracking success rate, center location error, number of scenes, learned object sizes, and tracking latency.
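A minimal 1-D constant-velocity Kalman filter, the standard building block such trackers extend, might look like the sketch below; the HMM-based noisy-scene detection is omitted, and all noise parameters are illustrative assumptions:

```python
# Minimal 1-D constant-velocity Kalman filter (illustrative sketch only).
import numpy as np

F = np.array([[1.0, 1.0], [0.0, 1.0]])  # state transition: pos += vel
H = np.array([[1.0, 0.0]])              # we observe position only
Q = np.eye(2) * 1e-3                    # process noise (assumed)
R = np.array([[0.5]])                   # measurement noise (assumed)

x = np.array([0.0, 0.0])  # state: [position, velocity]
P = np.eye(2)             # state covariance

def kalman_step(x, P, z):
    # Predict
    x = F @ x
    P = F @ P @ F.T + Q
    # Update with measurement z
    y = z - H @ x                    # innovation
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)   # Kalman gain
    x = x + (K @ y).ravel()
    P = (np.eye(2) - K @ H) @ P
    return x, P

# Track an object moving ~1 unit/frame with noisy position readings.
for z in [1.1, 1.9, 3.2, 4.0, 5.1]:
    x, P = kalman_step(x, P, np.array([z]))
print(round(float(x[0]), 1))  # estimated position, near the last reading
print(round(float(x[1]), 1))  # estimated velocity, near 1
```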
This document proposes methods for action recognition using context and appearance distribution features. It first extracts interest points from videos and represents each video using two types of distribution features: 1) multiple Gaussian mixture models (GMMs) that characterize the distribution of relative coordinates (context) between interest point pairs at different space-time scales, and 2) a GMM that models the distribution of local appearance features around interest points. It then introduces a new learning method called Multiple Kernel Learning with Augmented Features (AFMKL) to effectively fuse the two heterogeneous distribution features for classification. Experimental results on benchmark datasets demonstrate the effectiveness of the proposed approach.
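The appearance-distribution idea can be sketched with scikit-learn's `GaussianMixture`; the descriptors below are synthetic stand-ins for local appearance features, and the AFMKL fusion step is not shown:

```python
# Sketch: fit a GMM over local descriptors extracted around interest
# points, then summarize a video by its average posterior responsibilities.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic stand-in for local appearance descriptors (two clusters).
descriptors = np.vstack([
    rng.normal(loc=-2.0, size=(200, 8)),
    rng.normal(loc=+2.0, size=(200, 8)),
])

gmm = GaussianMixture(n_components=2, random_state=0).fit(descriptors)
# Per-video representation: average soft-assignment over all descriptors.
video_feature = gmm.predict_proba(descriptors).mean(axis=0)
print(video_feature.shape)  # one weight per mixture component
```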
Sparse representation-based human action recognition using an action region-aware dictionary - Wesley De Neve
This document presents a paper on sparse representation-based human action recognition using an action region-aware dictionary. It introduces the challenges of existing action recognition methods, including the lack of a general action detection method and the varying usefulness of context information depending on the action. The paper proposes constructing a dictionary containing separate context and action region information from training videos. It then presents a method to use this dictionary to adaptively classify human actions based on whether context region information is concentrated in the true class. The paper describes experiments on the UCF Sports Action dataset to evaluate the proposed method compared to existing sparse representation approaches.
Automatic Foreground Object Detection Using Visual and Motion Saliency - IJERD Editor
This paper presents a saliency-based video object extraction (VOE) framework. The proposed framework aims to automatically extract foreground objects of interest without any user interaction or the use of any training data (i.e., not limited to any particular type of object). To separate foreground and background regions within and across video frames, the proposed method utilizes visual and motion saliency information extracted from the input video. A conditional random field is applied to effectively combine the saliency induced features, which allows us to deal with unknown pose and scale variations of the foreground object (and its articulated parts). Based on the ability to preserve both spatial continuity and temporal consistency in the proposed VOE framework, experiments on a variety of videos verify that our method is able to produce quantitatively and qualitatively satisfactory VOE results.
Image restoration techniques covered such as denoising, deblurring and super-resolution for 3D images and models.
From classical computer vision techniques to contemporary deep learning based processing for both ordered and unordered point clouds, depth maps and meshes.
Human Action Recognition in Videos Employing 2DPCA on 2DHOOF and Radon Transform - Fadwa Fouad
This document provides an overview of a Masters thesis that proposes algorithms for human action recognition. It begins with an introduction that discusses the importance of human action recognition, challenges in the field, and differences between actions and activities. It then presents an agenda that outlines an introduction, overview, and details of two proposed algorithms: 2DHOOF/2DPCA contour-based optical flow and human gesture recognition using Radon transform/2DPCA. The overview section describes the general structure of action recognition systems from video capture to classification. Experimental results on benchmark datasets demonstrate the effectiveness of the proposed algorithms.
This document discusses action recognition in videos. It begins by defining action recognition and describing its applications such as surveillance, video search, and medical monitoring. Challenges of action recognition like scale variations, camera motion, and human pose differences are presented. The document reviews papers on local space-time features with SVMs and two-stream convolutional networks. It shows that local features combined with SVMs achieved the best results on a dataset of human actions. Two-stream ConvNets, which use spatial and temporal streams, became the state-of-the-art by capturing shape from frames and motion from optical flow. Future work may explore deeper ConvNets with larger datasets.
The International Journal of Engineering & Science is aimed at providing a platform for researchers, engineers, scientists, or educators to publish their original research results, to exchange new ideas, to disseminate information in innovative designs, engineering experiences and technological skills. It is also the Journal's objective to promote engineering and technology education. All papers submitted to the Journal will be blind peer-reviewed. Only original articles will be published.
The papers for publication in The International Journal of Engineering & Science are selected through rigorous peer review to ensure originality, timeliness, relevance, and readability.
HiPEAC 2019 Workshop - Real-Time Modelling Visual Scenes with Biological Insp... - Tulipp.Eu
- Computer vision has improved with more data and processing power, but global scene understanding remains challenging.
- The document proposes a multidisciplinary approach combining CNNs and human visual cognition to better model scene understanding, with the goal of applications like autonomous vehicles.
- It describes experiments observing how humans and primates recognize scenes to inform modeling, incorporating global and local descriptors with relationships. This approach aims to advance scene understanding capabilities.
The document provides an introduction to computer vision. It discusses key topics including:
- What computer vision is and why it is useful. It uses mathematical and computational tools to extract information from images and improve human vision.
- Some basic concepts in computer vision including digital images, sampling, noise removal, segmentation, and feature extraction techniques.
- Where computer vision is used such as healthcare, autonomous vehicles, augmented/virtual reality, industry, social media, security, agriculture, and fashion.
- A brief history of computer vision including classical approaches and the revolution enabled by advances in artificial intelligence and deep learning.
This document discusses a real-time moving object detection algorithm using Speeded Up Robust Features (SURF). The algorithm first uses SURF to extract features from video frames and stitch them together to create a single background image. It then takes the difference between each video frame and the background image to detect moving objects. Experimental results show it can effectively detect moving objects in videos from both stable and moving cameras, outperforming traditional background subtraction techniques. The algorithm solves issues with modeling changing backgrounds in real-time by creating a static background image using frame registration with SURF.
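The detection step, differencing each frame against the composed background image, can be sketched as below; the SURF-based stitching that builds the background is omitted, and the threshold is an illustrative assumption:

```python
# Sketch: once a single background image exists, moving objects are found
# by thresholding the absolute per-pixel difference against each frame.
import numpy as np

background = np.zeros((40, 40), dtype=np.uint8)

frame = background.copy()
frame[10:20, 10:20] = 200  # a bright moving object enters the scene

diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
mask = (diff > 30).astype(np.uint8)  # threshold of 30 is a tunable guess

print(int(mask.sum()))  # number of pixels flagged as moving -> 100
```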
Similar to Action Genome: Action As Composition of Spatio Temporal Scene Graphs (20)
The document discusses recent developments in video transformers. It summarizes several recent works that employ spatial backbones like ViT or ResNet combined with temporal transformers for video classification. Examples mentioned include VTN, TimeSformer, STAM, and ViViT. The document also discusses common practices in video transformer inference, like using multiple clips/crops and averaging predictions. Design choices covered include number of frames, spatial dimensions, and multi-view inference techniques.
An Empirical Study of Training Self-Supervised Vision Transformers - Sangmin Woo
Chen, Xinlei, Saining Xie, and Kaiming He. "An empirical study of training self-supervised vision transformers." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.
Zellers, Rowan, et al. "From recognition to cognition: Visual commonsense reasoning." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019.
This document discusses video grounding, which aims to localize moments in video corresponding to natural language descriptions. It defines related tasks like natural language video localization and video moment retrieval. It lists keywords for these tasks and provides examples of approaches, applications, and GitHub resources for temporal language grounding in videos.
This document summarizes several action recognition datasets for human activities. It describes both single-label datasets that classify entire videos, as well as multi-label datasets that temporally localize actions within videos. It also categorizes datasets as generic, instructional, ego-centric, compositional, multi-view, or multi-modal depending on the type of activities and data modalities included. Several prominent multi-modal datasets are highlighted, such as PKU-MMD, NTU RGB+D, MMAct, and HOMAGE, which provide video alongside additional modalities like depth, infrared, audio, and sensor data.
Chen, X., & He, K. (2021). Exploring Simple Siamese Representation Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 15750-15758).
[2020 ICLR] Reformer: The Efficient Transformer
[2020 ICML] Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
[2020 NIPS] Big Bird: Transformers for Longer Sequences
[2021 ICLR] Rethinking Attention with Performers
Transformer Architectures in Vision
[2018 ICML] Image Transformer
[2019 CVPR] Video Action Transformer Network
[2020 ECCV] End-to-End Object Detection with Transformers
[2021 ICLR] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Neural Motifs: Scene Graph Parsing with Global Context - Sangmin Woo
Zellers, Rowan, et al. "Neural Motifs: Scene Graph Parsing With Global Context." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
Attentive Relational Networks for Mapping Images to Scene Graphs - Sangmin Woo
M. Qi, W. Li, Z. Yang, Y. Wang, and J. Luo: Attentive relational networks for mapping images to scene graphs. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
Action Genome: Action As Composition of Spatio Temporal Scene Graphs
1. 2020 IRRLAB Presentation
IRRLAB
Action Genome:
Actions as Compositions of Spatio-
temporal Scene Graphs
Sangmin Woo
2020.07.09
Jingwei Ji Ranjay Krishna Li Fei-Fei Juan Carlos Niebles
Stanford University
CVPR 2020
3. 3 / 21
Why Action Genome?
Scene Graph
• The scene graph representation provides a scaffold that allows vision models to tackle complex inference tasks by breaking scenes into their corresponding objects and visual relationships.
[1] Zellers, R., Yatskar, M., Thomson, S., & Choi, Y. (2018). Neural Motifs: Scene Graph Parsing with Global Context. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition (pp. 5831-5840).
4. 4 / 21
Why Action Genome?
Versatility of Scene Graph
• A number of works have proposed deep network-based approaches for generating scene graphs, confirming their importance to the field
• e.g., Image retrieval, Visual Question Answering, Image Captioning, Image
Generation, Video Understanding
[2] Ashual, O., & Wolf, L. (2019). Specifying Object Attributes and Relations in Interactive Scene Generation. In Proceedings of the IEEE
International Conference on Computer Vision (pp. 4561-4569).
5. 5 / 21
Why Action Genome?
Scene Graph in Video?!
• However, decompositions of temporal events have not been explored much, even though representing events with structured representations could lead to more accurate and grounded action understanding.
• Meanwhile, people segment events into consistent groups and actively
encode those ongoing activities in a hierarchical part structure.
6. 6 / 21
Why Action Genome?
Benefits?
• Such decompositions can enable machines to predict future and past
scene graphs with objects and relationships as an action occurs
• we can predict that the person is about to sit on the sofa when we see
them move in front of it.
• Similarly, such decomposition can also enable machines to learn from
few examples
• we can recognize the same action when we see a different person
move towards a different chair.
7. 7 / 21
Why Action Genome?
Benefits?
• Okay, so such decompositions can provide the scaffolds to improve vision models. Then how can we correctly create representative hierarchies for a wide variety of complex actions?
8. 8 / 21
Why Action Genome?
• Action Genome!
9. 9 / 21
Action Genome
Action Genome
• A representation that decomposes actions into spatio-temporal scene
graphs.
• Object detection faced a similar challenge: large variation within any object category. Progress in 2D perception was catalyzed by taxonomies, partonomies, and ontologies.
• In the same way, Action Genome aims to improve temporal understanding through partonomy.
Data Statistics
• Built upon Charades, Action Genome provides
• 476K object bounding boxes of 35 object classes (excluding “person”)
• 1.72M relationships of 25 relationship classes
• 234K video frames
• 157 action categories
10. 10/ 21
Action Genome
Charades vs. Action Genome
• Charades provides an injective mapping from each action to a verb
• Charades' verbs are clip-level labels, such as "awaken"
• Action Genome decomposes them into frame-level human-object relationships, such as the sequence <person – lying on – bed>, <person – sitting on – bed>, and <person – not contacting – bed>.
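This decomposition can be made concrete with a toy data structure; the listing below is purely illustrative and is not Action Genome's actual annotation format:

```python
# Illustrative only: a clip-level Charades verb ("awaken") decomposed into
# the kind of frame-level <person - relationship - object> triplets that
# Action Genome annotates, one scene graph per sampled frame.
awaken_clip = [
    [("person", "lying on", "bed")],
    [("person", "sitting on", "bed")],
    [("person", "not contacting", "bed")],
]

# The action is recoverable from the *sequence* of relationships,
# not from any single frame.
relationships = [t[1] for frame in awaken_clip for t in frame]
print(relationships)
```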
11. 11 / 21
Action Genome
Action Genome
• Most perspectives on action decomposition converge on the prototypical
unit of action-object couplets.
• Action-object couplets refer to transitive actions performed on objects
(e.g. “moving a chair” or “throwing a ball”) and intransitive self-actions
(e.g. “moving towards the sofa”).
• Action Genome’s dynamic scene graph representations capture both such
types of events and as such, represent the prototypical unit.
• This representation enables the study for tasks such as:
• Spatio-temporal scene graph prediction
• Improving action recognition
• Few-shot action detection
12. 12/ 21
Action Genome
Three types of relationships in Action Genome
• Attention: which objects people are looking at.
• Spatial: how objects are laid out spatially.
• Contact: semantic relationships involving people manipulating objects.
13. 13/ 21
Action Genome
Action Genome’s annotation pipeline
• For every action, 5 frames are uniformly sampled across the action.
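Uniform sampling of 5 frames across an action's temporal span can be sketched as below (the frame indices are illustrative; Action Genome samples within each annotated action interval):

```python
# Sketch: pick k frame indices spread evenly over an action's [start, end].
import numpy as np

def sample_frames(start, end, k=5):
    return np.linspace(start, end, k).round().astype(int).tolist()

print(sample_frames(0, 100))  # [0, 25, 50, 75, 100]
```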
14. 14/ 21
Method (for Action Recognition)
Action Genome for Action Recognition
• Scene Graph Feature Banks (SGFB): accumulate the features of a long video into a fixed-size representation for action recognition.
1. SGFB generates a spatio-temporal scene graph for every frame.
2. The scene graphs are then encoded to form a feature bank.
3. The feature bank is combined with 3D CNN features using feature bank operators (FBO), which can be instantiated as mean/max pooling or non-local blocks.
4. The final aggregated feature is used to predict action labels for the video.
[3] Wu, C. Y., Feichtenhofer, C., Fan, H., He, K., Krahenbuhl, P., & Girshick, R. (2019). Long-Term Feature Banks for Detailed Video
Understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 284-293).
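The four steps above can be sketched numerically with mean pooling as the FBO; all dimensions and the concatenation-based fusion are illustrative assumptions, not the paper's exact architecture:

```python
# Numerical sketch of the SGFB pipeline with mean pooling as the FBO.
import numpy as np

rng = np.random.default_rng(0)
T, d_sg, d_3d, n_actions = 32, 128, 256, 157  # assumed shapes

# 1-2. One scene-graph feature per frame, stacked into a feature bank.
scene_graph_bank = rng.normal(size=(T, d_sg))

# 3. FBO: mean pooling over time, then combine with a clip-level 3D CNN
#    feature (non-local blocks are the paper's stronger alternative).
pooled = scene_graph_bank.mean(axis=0)          # (d_sg,)
clip_feature = rng.normal(size=(d_3d,))         # from the 3D CNN backbone
fused = np.concatenate([pooled, clip_feature])  # (d_sg + d_3d,)

# 4. A linear head predicts multi-label action scores for the video.
W = rng.normal(size=(n_actions, fused.shape[0]))
scores = W @ fused
print(scores.shape)  # one score per Charades action class
```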
16. 16/ 21
Experiments
Action Recognition on Charades
• The Charades dataset contains 9,848 crowdsourced videos with an average length of 30 seconds.
• SGFB outperformed all existing methods when simultaneously predicting scene graphs and performing action recognition (44.3% mAP).
• Utilizing ground-truth scene graphs (Oracle) significantly boosts performance (60.3% mAP).
17. 17/ 21
Experiments
Few-shot action recognition
• Intuitively, predicting actions should be easier (and generalize better to rare actions) from a symbolic embedding of scene graphs than from pixels.
1) The 157 action classes are split into a base set of 137 classes and a novel set of 20 classes.
2) A backbone feature extractor (R101-I3D-NL) is first trained on all video examples of the base classes; it is shared by the baseline LFB, SGFB, and the SGFB oracle.
3) Next, each model is trained with only k examples from each novel class, where k = 1, 5, 10, for 50 epochs.
4) Finally, the models are evaluated on all examples of the novel classes in the Charades validation set.
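The base/novel split in step 1 can be sketched as below; which classes land in the novel set is arbitrary here, unlike the fixed split used in the benchmark:

```python
# Sketch of the 137/20 base/novel class split (class ids are placeholders).
import numpy as np

rng = np.random.default_rng(0)
classes = np.arange(157)                         # Charades action classes
novel = rng.choice(classes, size=20, replace=False)
base = np.setdiff1d(classes, novel)              # the remaining 137

print(len(base), len(novel))  # 137 20
```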
18. 18/ 21
Experiments
Spatio-temporal scene graph prediction
• Unlike image-based scene graph prediction, which takes only a single image as input, this task expects a video as input and can therefore utilize temporal information from neighboring frames to strengthen its predictions.
• There is significant room for improvement, especially since these existing
methods were designed to be conditioned on a single frame and do not
consider the entire video sequence as a whole.
19. 19/ 21
Experiments
Summary
1. Predicting scene graphs can benefit action recognition
• Improving the state-of-the-art on the Charades dataset from 42.5% to
44.3% and to 60.3% when using oracle scene graphs.
2. Compositional understanding of actions induces better generalization.
• In few-shot action recognition, it achieves 42.7% mAP using as few
as 10 training examples.
3. In new task of spatio-temporal scene graph prediction, there is significant
room for improvement
• Existing methods were designed to be conditioned on a single frame
and do not consider the entire video sequence as a whole.
20. 20/ 21
Future Work
Future Research Directions
• Spatio-temporal action localization
• The majority of action localization methods focus on localizing the person
performing the action but ignore the objects.
• Action Genome can enable research on localization of both actors and objects.
• Explainable action models
• Action Genome provides frame-level labels of attention in the form of objects
that a person performing the action is either looking at or interacting with.
• These labels can be used to further train explainable models.
• Video generation from spatio-temporal scene graphs
Let’s consider the action of “sitting on a sofa”. The person initially starts off next to the sofa, moves in front of it, and finally sits atop it.
There are three types of relationships in Action Genome: attention relationships report which objects people are looking at, spatial relationships indicate how objects are laid out spatially, and contact relationships are semantic relationships involving people manipulating objects.
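The "sitting on a sofa" example and the three relationship types can be captured in a small data structure. The classes and names below are illustrative, not Action Genome's actual annotation schema:

```python
from dataclasses import dataclass, field

# The three relationship types described above.
ATTENTION, SPATIAL, CONTACT = "attention", "spatial", "contact"

@dataclass
class Relationship:
    kind: str        # one of ATTENTION, SPATIAL, CONTACT
    predicate: str   # e.g. "looking_at", "in_front_of", "sitting_on"
    obj: str         # the object the person relates to

@dataclass
class FrameSceneGraph:
    frame_idx: int
    relationships: list = field(default_factory=list)

# "sitting on a sofa": the person looks at the sofa, moves in front of it,
# and finally sits atop it — one scene graph per sampled frame.
frames = [
    FrameSceneGraph(0, [Relationship(ATTENTION, "looking_at", "sofa")]),
    FrameSceneGraph(1, [Relationship(SPATIAL, "in_front_of", "sofa")]),
    FrameSceneGraph(2, [Relationship(CONTACT, "sitting_on", "sofa")]),
]
```

Representing an action as this sequence of typed person-object relationships is what makes the compositional view of actions explicit.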
Action Genome’s annotation pipeline: For every action, we uniformly sample 5 frames across the action and annotate the person performing the action along with the objects they interact with. We also annotate the pairwise relationships between the person and those objects. Here, we show a video with 4 actions labelled, resulting in 20 (= 4 × 5) frames annotated with scene graphs. The objects are grounded back in the video as bounding boxes.
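The uniform sampling in the pipeline above amounts to picking evenly spaced frame indices across each action interval. A minimal sketch (function name is hypothetical):

```python
def uniform_sample(start, end, n=5):
    """Uniformly sample n frame indices across an action interval [start, end]."""
    if n == 1:
        return [start]
    step = (end - start) / (n - 1)  # include both endpoints
    return [round(start + i * step) for i in range(n)]

# an action spanning frames 10..90 yields 5 evenly spaced frames
uniform_sample(10, 90)  # [10, 30, 50, 70, 90]
```

With 4 labelled actions this yields the 20 annotated frames mentioned in the caption.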
Such a boost in performance shows the potential of Action Genome and compositional action understanding when video-based scene graph models are utilized to improve scene graph prediction.
When trained with very few examples, compositional action understanding with additional knowledge of scene graphs should outperform methods that treat actions as a monolithic concept.
The small gap in performance between the task of PredCls and SGCls suggests that these models suffer from not being able to accurately detect the objects in the video frames. Improving object detectors designed specifically for videos could improve performance.
Amongst numerous techniques, saliency prediction has emerged as a key mechanism for interpreting machine learning models.