This document proposes a key frame extraction method for efficient and accurate activity recognition in videos using deep learning. It introduces a multi-stream convolutional neural network architecture that takes in multiple representations of video data as input, including different color spaces (RGB, YCbCr) and optical flow. Key frames are extracted from videos based on optical flow, representing important moments while reducing computational requirements compared to using all frames. The method is evaluated on the UCF-101 dataset and shows promising results for video classification using different combinations of color spaces and optical flow as input to the multi-stream CNN model.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2021/10/person-re-identification-and-tracking-at-the-edge-challenges-and-techniques-a-presentation-from-the-university-of-auckland/
Morteza Biglari-Abhari, Senior Lecturer at the University of Auckland, presents the “Person Re-Identification and Tracking at the Edge: Challenges and Techniques” tutorial at the May 2021 Embedded Vision Summit.
Numerous video analytics applications require understanding how people are moving through a space, including the ability to recognize when the same person has moved outside of the camera’s view and then back into the camera’s view, or when a person has passed from the view of one camera to the view of another. This capability is referred to as person re-identification and tracking. It’s an essential technique for applications such as surveillance for security, health and safety monitoring in healthcare and industrial facilities, intelligent transportation systems and smart cities. It can also assist in gathering business intelligence such as monitoring customer behavior in shopping environments. Person re-identification is challenging.
In this talk, Biglari-Abhari discusses the key challenges and current approaches for person re-identification and tracking, as well as his initial work on multi-camera systems and techniques to improve accuracy, especially fusing appearance and spatio-temporal models. He also briefly discusses privacy-preserving techniques, which are critical for some applications, as well as challenges for real-time processing at the edge.
An Analysis of Various Deep Learning Algorithms for Image Processing (vivatechijri)
The many applications of image processing have given it wide scope in data analysis. Machine learning algorithms provide a powerful environment for effectively training models to identify entities in images and segment them accordingly. One can observe that although classifiers such as Support Vector Machines (SVMs) or Random Forests do justice to the task, deep learning algorithms such as Artificial Neural Networks (ANNs) and their descendants, most notably the powerful Convolutional Neural Network (CNN), bring a new dimension to the image processing domain, with far higher accuracy and computational power for classifying images and segregating their entities as individual components of the working region. The major focus is on the Region-based Convolutional Neural Network (R-CNN) algorithm and how well it provides pixel-level segmentation through its successors, the Fast, Faster, and Mask R-CNN variants.
Semantic Concept Detection in Video Using Hybrid Model of CNN and SVM Classif... (CSCJournals)
In today's era of digitization and fast internet, many videos are uploaded to websites, so a mechanism is required to access them accurately and efficiently. Semantic concept detection achieves this task and is used in many applications such as multimedia annotation, video summarization, indexing and retrieval. Video retrieval based on semantic concepts is an efficient and challenging research area. Semantic concept detection bridges the semantic gap between the low-level features extracted from a key frame or shot of a video and their high-level interpretation as semantics, automatically assigning labels to videos from a predefined vocabulary. This task is treated as a supervised machine learning problem. The support vector machine (SVM) emerged as the default classifier choice for this task, but recently deep convolutional neural networks (CNNs) have shown exceptional performance in this area; CNNs, however, require large datasets for training. In this paper, we present a framework for semantic concept detection using a hybrid model of SVM and CNN. Global features such as color moments, HSV histogram, wavelet transform, grey-level co-occurrence matrix and edge orientation histogram are selected as low-level features extracted from the annotated ground-truth video dataset of TRECVID. In a second pipeline, deep features are extracted using a pretrained CNN. The dataset is partitioned into three segments to deal with the data imbalance issue. Two classifiers are trained separately on all segments, and a fusion of their scores is performed to detect the concepts in the test dataset. System performance is evaluated using Mean Average Precision on the multi-label dataset, and the performance of the proposed hybrid SVM-CNN framework is comparable to existing approaches.
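To make the two-pipeline idea concrete, here is a minimal sketch of late score fusion between classifiers trained on different feature views, assuming the low-level and deep features have already been extracted; the classifier choice and fusion weight are illustrative, not the paper's exact configuration.

```python
import numpy as np
from sklearn.svm import SVC

def fused_predictions(X_low, X_deep, y, X_low_test, X_deep_test, alpha=0.5):
    """Train one classifier per feature view and fuse their scores."""
    clf_low = SVC(probability=True).fit(X_low, y)    # global low-level features
    clf_deep = SVC(probability=True).fit(X_deep, y)  # pretrained-CNN features
    scores = alpha * clf_low.predict_proba(X_low_test) \
        + (1.0 - alpha) * clf_deep.predict_proba(X_deep_test)
    return scores.argmax(axis=1)                     # fused concept decision
```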
PREVENTING COPYRIGHTS INFRINGEMENT OF IMAGES BY WATERMARKING IN TRANSFORM DOM... (ijistjournal)
Images are undoubtedly the most effective and easiest means of communicating an idea, and they are an indispensable part of human life. The trend of sharing images of all kinds on the internet, from technical figures and artists' masterpieces to photos from a recent picnic, is spreading virally. Checking the privacy and security of our personal digital images before making them public is a mandatory requirement, since there is always a threat of original images being illegally reproduced or distributed elsewhere. To prevent misuse and protect copyrights, an efficient solution is presented that can withstand many attacks. This paper encodes the host image prior to watermark embedding to enhance security. A fast and effective full counter-propagation neural network enables successful watermark embedding without deteriorating the perceived image quality. Earlier techniques embedded the watermark in the image itself, but it has been observed that the synapses of a neural network provide a better platform for reducing distortion and increasing message capacity.
Modified Skip Line Encoding for Binary Image Compression (idescitation)
Image compression is an important issue in internet, mobile communication, digital library, digital photography, multimedia, teleconferencing and other applications, all of which face the problem of optimizing storage space and transmission bandwidth. In this paper, a modified form of skip line encoding is proposed to further reduce the redundancy in the image. Its performance is found to be better than standard skip-line encoding.
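As a point of reference, the basic skip-line idea (before the paper's modification) can be sketched as follows: a row identical to the previous row is replaced by a skip marker instead of being re-encoded. This is a generic illustration, not the proposed modified scheme.

```python
import numpy as np

def skip_line_encode(binary_img):
    """Encode each row as either a skip marker or a raw bit-packed line."""
    encoded, prev = [], None
    for row in binary_img:
        if prev is not None and np.array_equal(row, prev):
            encoded.append(('S',))                    # skip: repeat previous line
        else:
            encoded.append(('R', np.packbits(row)))   # raw line
        prev = row
    return encoded
```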
Efficient Image Compression Technique using Clustering and Random Permutation (IJERA Editor)
Multimedia data compression is challenging because of the possibility of data loss and the large amount of storage required; minimizing storage space and transmitting these data properly both call for compression. In this work we propose a block-based DWT image compression technique using a genetic algorithm and an HCC code matrix. The HCC code matrix is separated into two sets, redundant and non-redundant, which generate similar patterns of block coefficients. The similar block coefficients are generated by particle swarm optimization, which selects the optimal blocks of the DWT transform function. For the experiments we used standard images such as Lena, Barbara and Cameraman at a resolution of 256×256; the source of the images is Google.
Encryption-Decryption RGB Color Image Using Matrix Multiplication (ijcsit)
An enhanced technique for color image encryption based on random matrix key encoding is proposed. To encrypt the color image, a separation into red, green and blue (R, G, B) channels is applied. Each channel is encrypted using a technique called double random matrix key encoding, and three new encoded image matrices are constructed. To obtain a reconstructed image identical to the original on the receiving side, simple extraction and decryption operations are performed. The results show that the proposed technique is powerful for color image encryption and decryption; MATLAB simulations were used to obtain the results. The technique has strong security features because each color component is treated separately with its own randomly generated double random matrix key, which makes hacking the three keys very difficult.
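A minimal numpy sketch of the per-channel idea, assuming square channels and treating "double random matrix key encoding" as multiplication by two random invertible key matrices; the paper's exact encoding rule may differ.

```python
import numpy as np

rng = np.random.default_rng(1)
img = rng.integers(0, 256, size=(8, 8, 3)).astype(np.float64)        # toy RGB image
keys = [(rng.random((8, 8)), rng.random((8, 8))) for _ in range(3)]  # key pair per channel

enc = np.stack([k1 @ img[..., i] @ k2
                for i, (k1, k2) in enumerate(keys)], axis=-1)        # encrypt
dec = np.stack([np.linalg.inv(k1) @ enc[..., i] @ np.linalg.inv(k2)
                for i, (k1, k2) in enumerate(keys)], axis=-1)        # decrypt
assert np.allclose(dec, img)   # exact recovery up to floating-point precision
```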
Comparison of different Fingerprint Compression Techniques (sipij)
The important features of the wavelet transform and different methods for compressing fingerprint images have been implemented. Image quality is measured objectively using peak signal-to-noise ratio (PSNR) and mean square error (MSE). A comparative study is carried out between the discrete-cosine-transform-based Joint Photographic Experts Group (JPEG) standard, wavelet-based basic Set Partitioning in Hierarchical Trees (SPIHT) and Modified SPIHT. The comparison shows that Modified SPIHT offers better compression than basic SPIHT and JPEG. The results will help application developers choose a good wavelet compression system for their applications.
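The two objective quality measures used in the comparison are standard and easy to reproduce; a small sketch:

```python
import numpy as np

def mse(a, b):
    return np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)

def psnr(a, b, peak=255.0):
    m = mse(a, b)
    return float('inf') if m == 0 else 10.0 * np.log10(peak ** 2 / m)
```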
HARDWARE SOFTWARE CO-SIMULATION FOR TRAFFIC LOAD COMPUTATION USING MATLAB SIM... (ijcsity)
Due to the increasing number of vehicles, traffic is a major problem faced in urban areas throughout the world. This document presents a newly developed MATLAB Simulink model to compute traffic load for real-time traffic signal control. Signal processing, video and image processing, and the Xilinx Blockset have been used extensively for traffic load computation. The approach uses an edge detection operation in which edges are extracted to identify the number of vehicles. The developed model computes the results with a high degree of accuracy and can be used to set the green signal duration so as to release traffic dynamically at junctions.

Xilinx System Generator (XSG) provides a Simulink Blockset for several hardware operations that can be implemented on various Xilinx field-programmable gate arrays (FPGAs). The method described in this paper involves object feature identification and detection. Xilinx System Generator provides blocks to transfer data from the software side of the simulation environment to the hardware side, in our case from MATLAB Simulink to System Generator blocks; this is an important concept to understand in the design process using XSG. The Xilinx System Generator, embedded in MATLAB Simulink, is used to program the model, which is then tested on the FPGA board using hardware co-simulation tools.
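The edge-detection step can be approximated in software with OpenCV; this is only a rough software analogue of the Simulink/XSG pipeline, with the edge-pixel fraction standing in for traffic load.

```python
import cv2

def traffic_load(frame_bgr, roi):
    """Fraction of edge pixels inside a road region of interest (x, y, w, h)."""
    x, y, w, h = roi
    gray = cv2.cvtColor(frame_bgr[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 100, 200)
    return float((edges > 0).mean())
```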
The growing trend of online image sharing and downloading today mandates better encoding and decoding schemes. This paper looks into this issue of image coding. Multiple Description Coding is an encoding and decoding scheme specially designed to provide more error resilience for data transmission; its main concern is lossy transmission channels. This work attempts to reconstruct a high-quality image using just one descriptor rather than both conventional descriptors, and compares the use of Type I and Type II quantizers. We propose and compare four coders by examining the quality of the reconstructed images: the JPEG HH (Horizontal Pixel Interleaving with Huffman Coding) model, JPEG HA (Horizontal Pixel Interleaving with Arithmetic Encoding) model, JPEG VH (Vertical Pixel Interleaving with Huffman Encoding) model, and JPEG VA (Vertical Pixel Interleaving with Arithmetic Encoding) model. The findings suggest that the choice of horizontal versus vertical pixel interleaving does not affect the results much, whereas the choice of quantizer greatly affects performance.
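The pixel interleaving at the heart of the four coders can be sketched as follows; each description would then be JPEG-coded with Huffman or arithmetic coding (omitted here), and a lost description is crudely approximated by pixel repetition.

```python
import numpy as np

def split_descriptions(img, vertical=False):
    """Even/odd pixel interleaving along one axis -> two descriptions."""
    if vertical:
        return img[0::2, :], img[1::2, :]      # vertical: interleave rows
    return img[:, 0::2], img[:, 1::2]          # horizontal: interleave columns

def reconstruct_from_one(desc, vertical=False):
    """Estimate the full image from a single surviving description."""
    return np.repeat(desc, 2, axis=0 if vertical else 1)
```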
DeepVO - Towards Visual Odometry with Deep Learning (Jacky Liu)
Authors: Sen Wang (1,2), Ronald Clark (2), Hongkai Wen (2) and Niki Trigoni (2)
1. Edinburgh Centre for Robotics, Heriot-Watt University, UK
2. University of Oxford, UK
Download this paper: http://senwang.gitlab.io/DeepVO/#paper
Watch video: http://senwang.gitlab.io/DeepVO/#video
Video feature extraction based on modified LLE using ada... (IJAEMS, Sept. 2015; INFOGAIN PUBLICATION)
Locally linear embedding (LLE) is an unsupervised learning algorithm that computes low-dimensional, neighborhood-preserving embeddings of high-dimensional data. LLE attempts to discover non-linear structure in high-dimensional data by exploiting the local symmetries of linear reconstructions. In this paper, video feature extraction is performed using a modified LLE together with an adaptive nearest-neighbor approach to find the nearest neighbors and the connected components. The proposed feature extraction method is applied to a video, and the resulting video feature description provides a new tool for video analysis.
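For orientation, standard LLE (without the paper's adaptive-neighbor modification) is available in scikit-learn; a minimal sketch on placeholder frame features:

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

frames = np.random.rand(200, 4096)                  # e.g. flattened frame features
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2)
embedding = lle.fit_transform(frames)               # low-dimensional video features
```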
Automated License Plate Recognition for Toll Booth Application (IJERA Editor)
This paper describes the Smart Vehicle Screening System, which can be installed in a tollbooth for automated recognition of vehicle license plate information from a photograph of the vehicle. Such an automated system could then be implemented to control the payment of fees for parking areas, highways, bridges, tunnels, and so on. An approach is presented that identifies a vehicle by recognizing its license plate using image fusion, neural networks and thresholding techniques, along with experimental results showing successful license plate recognition.
Robust Image Watermarking Scheme Based on Wavelet Technique (CSCJournals)
In this paper, an image watermarking scheme based on a multi-band wavelet transformation method is proposed. The scheme is first tested in the spatial domain (for both non-blind and semi-blind techniques) in order to compare its results with the frequency domain. In the frequency domain, an adaptive scheme is designed and implemented based on band-selection criteria for embedding the watermark; these criteria depend on the number of wavelet passes. Three embedding methods are developed: one-band (LL|HH|HL|LH), two-band (LL&HH | LL&HL | LL&LH | HL&LH | HL&HH | LH&HH) and three-band (LL&HL&LH | LL&HH&HL | LL&HH&LH | LH&HH&HL) selection. The analysis indicates that the non-blind scheme performs much better than the semi-blind scheme in terms of similarity of the extracted watermark, while the security of the semi-blind scheme is relatively high. The results show that, in the frequency domain, adding the watermark to the two bands HL and LH with three wavelet passes leads to good correlation between the original and extracted watermark (similarity = 99%) and to reconstructed images of good objective quality (PSNR = 24 dB) after a JPEG compression attack (QF = 25). The disadvantage of the scheme is the involvement of a large number of wavelet bands in the embedding process.
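A hedged PyWavelets sketch of additive embedding in the detail sub-bands after three passes, echoing the best-performing configuration above; the additive rule and scaling factor are assumptions, not the paper's exact scheme.

```python
import numpy as np
import pywt

def embed_hl_lh(image, watermark, alpha=0.05, wavelet='haar', level=3):
    coeffs = pywt.wavedec2(image.astype(np.float64), wavelet, level=level)
    cH, cV, cD = coeffs[1]                   # detail bands at the deepest pass
    w = np.resize(watermark, cH.shape)       # naively fit watermark to band size
    coeffs[1] = (cH + alpha * w, cV + alpha * w, cD)  # add into LH and HL
    return pywt.waverec2(coeffs, wavelet)
```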
High Speed and Area Efficient 2D DWT Processor Based Image Compression (sipij)
This paper presents a high-speed and area-efficient DWT-processor-based design for image compression applications. A pipelined, partially serial architecture is used to enhance speed along with optimal utilization of the resources available on the target FPGA. The proposed model has been designed and simulated using Simulink and System Generator blocks, synthesized with the Xilinx Synthesis Tool (XST), and implemented on the Spartan-2 and Spartan-3 based XC2S100-5tq144 and XC3S500E-4fg320 target devices. The results show that the design can operate at a maximum frequency of 231 MHz on the Spartan-3, consuming 117 mW at a 28 °C junction temperature. The comparison of results shows a 15% improvement in speed.
Steganography is an effective method for communicating information secretly during data transfer, and images are a suitable carrier for protecting such material. Several schemes, such as color image steganography and grayscale image steganography, operate on color and store data in different ways. Color images can carry very large amounts of secret data using their three main color components, and different color models exist, such as HSV (hue, saturation, value), RGB (red, green, blue), YCbCr (luminance and chrominance), YUV and YIQ. This paper uses an unusual model to hide data: an adaptive procedure that increases the security level when hiding a secret binary image in an RGB color image by implementing the steganography in the YCbCr space. We perform Exclusive-OR (XOR) operations between the binary image and the RGB color image in the YCbCr space, so the byte stored in the 8-bit LSB is not the actual byte; rather, it is obtained by translating to another color space and applying the XOR operation. The technique is applied to different groups of images, and the adaptive approach yields good results for peak signal-to-noise ratio (PSNR) and mean square error (MSE). When compared with our previous work and other existing techniques, it performs best in both error and message capacity. The technique is easy to model, simple to use, and provides strong security against unauthorized access.
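A rough numpy/OpenCV sketch of the idea: XOR the secret bits with a key derived from the cover in YCbCr space and store the result in the Y-channel LSBs. Deriving the XOR key from the second-least-significant bit is our simplification, and the stego image is kept in YCbCr to avoid LSB loss from color-space rounding; the paper's actual translation rule differs.

```python
import cv2
import numpy as np

def hide(cover_bgr, secret_bits):
    """Embed a flat array of 0/1 bits into the Y-channel LSBs after an XOR."""
    ycc = cv2.cvtColor(cover_bgr, cv2.COLOR_BGR2YCrCb)
    y = ycc[..., 0].flatten()
    n = secret_bits.size
    key = (y[:n] >> 1) & 1                            # key bits from the cover itself
    y[:n] = (y[:n] & 0xFE) | (secret_bits.astype(np.uint8) ^ key)
    ycc[..., 0] = y.reshape(ycc.shape[:2])
    return ycc                                        # stego kept in YCbCr

def reveal(stego_ycc, n):
    y = stego_ycc[..., 0].flatten()
    return (y[:n] & 1) ^ ((y[:n] >> 1) & 1)           # undo the XOR
```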
Secure IoT Systems Monitor Framework using Probabilistic Image Encryption (IJAEMSJORNAL)
In recent years, the modeling of human behaviors and activity patterns for recognizing or detecting special events has attracted considerable research interest. Various methods abound for building intelligent vision systems aimed at understanding a scene and making correct semantic inferences from the observed dynamics of moving targets. Many systems include detection, storage of video information, and human-computer interfaces. Here we present not only an update that expands previous similar surveys but also an emphasis on contextual abnormal human activity detection, especially in video surveillance applications. The main purpose of this survey is to identify existing methods extensively and to characterize the literature in a manner that brings key challenges to attention.
Coronary heart disease is the disease with the highest mortality rate in the world. This makes the development of diagnostic systems a very interesting topic in biomedical informatics, aiming to detect whether a heart is normal or not. The literature contains diagnostic system models that combine dimension reduction and data mining techniques; unfortunately, no review papers to date discuss and analyze these themes. This study reviews articles from the period 2009-2016, focusing on dimension reduction methods and data mining techniques validated on datasets from the UCI repository. Dimension reduction methods use feature selection and feature extraction techniques, while data mining techniques include classification, prediction, clustering, and association rules.
Key frame extraction is an essential technique in the computer vision field. The extracted key frames should summarize the salient events with excellent feasibility, great efficiency, and a high level of robustness. This is not an easy problem to solve because it depends on many visual features. This paper addresses the problem by investigating the relationship between the detection of these features and the accuracy of key frame extraction techniques using TRIZ. An improved key frame extraction algorithm is then proposed based on accumulative optical flow with a self-adaptive threshold (AOF_ST), as recommended by the TRIZ inventive principles. Several video shots, including original and forgery videos with complex conditions, are used to verify the experimental results. Comparison with state-of-the-art algorithms shows that the proposed extraction algorithm accurately summarizes the videos and generates a meaningful, compact number of key frames. Moreover, in terms of compression rate on the KTH dataset, our algorithm achieves 124.4 in the best case and 31.4 in the worst case, while the state-of-the-art algorithms achieve 8.90 in the best case.
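A simplified sketch of the accumulative-optical-flow idea with a self-adjusting threshold, in the spirit of AOF_ST; the Farneback parameters and the threshold rule (a multiple of the running mean motion) are our assumptions rather than the paper's exact algorithm.

```python
import cv2
import numpy as np

def key_frames(video_path, factor=2.0):
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    motions, acc, keys, idx = [], 0.0, [0], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        idx += 1
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        motion = np.linalg.norm(flow, axis=2).mean()   # mean flow magnitude
        motions.append(motion)
        acc += motion
        if acc >= factor * np.mean(motions):           # self-adaptive threshold
            keys.append(idx)                           # emit key frame, reset
            acc = 0.0
        prev_gray = gray
    cap.release()
    return keys
```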
A Survey On Thresholding Operators of Text Extraction In Videos (CSCJournals)
Video indexing is an important problem that has interested the visual information and image processing communities. The detection and extraction of scene and caption text from unconstrained, general-purpose video is an important research problem in the context of content-based retrieval and summarization. This paper presents a technique for detecting text in video frames. Finding textual content in images is a challenging and promising research area in information technology; consequently, text detection and recognition in multimedia has become one of the most important fields in computer vision due to its valuable uses in a variety of recent technical applications. The work in this paper consists of using morphological operations to extract the text appearing in video frames, with preprocessing to separate text from background where the similarity between the two is high. Experimental results show that the resulting image contains only text. The evaluation criteria are applied to the resulting image and to those obtained by different operators.
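A compact OpenCV sketch of the morphological step: a morphological gradient highlights stroke edges, Otsu thresholding binarizes them, and a wide closing merges characters into text-like regions. The kernel sizes are illustrative assumptions.

```python
import cv2

def text_candidates(gray):
    grad = cv2.morphologyEx(gray, cv2.MORPH_GRADIENT,
                            cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3)))
    _, bw = cv2.threshold(grad, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    return cv2.morphologyEx(bw, cv2.MORPH_CLOSE,
                            cv2.getStructuringElement(cv2.MORPH_RECT, (9, 1)))
```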
A Framework for Human Action Detection via Extraction of Multimodal Features (CSCJournals)
This work discusses the application of an artificial intelligence technique called data extraction, together with a process-based ontology, for constructing experimental qualitative models for video retrieval and detection. We present a framework architecture that uses multimodal features as the knowledge representation scheme to model the behaviors of a number of human actions in video scenes. The main focus of this paper is the design of the two main components (model classifier and inference engine) of a tool abbreviated VASD (Video Action Scene Detector) for retrieving and detecting human actions from video scenes. The discussion starts by presenting the workflow of the retrieval and detection process and the logic for automated construction of the model classifier. We then demonstrate how the constructed classifiers can be used with multimodal features to detect human actions; finally, behavioral explanation manifestation is discussed. The simulator is implemented bilingually: MATLAB and C++ at the backend supply data and theories, while Java handles the front-end GUI and action pattern updating. To assess the usefulness of the proposed framework, several experiments were conducted; results were obtained using visual features only (77.89% precision; 72.10% recall), audio features only (62.52% precision; 48.93% recall), and combined audiovisual features (90.35% precision; 90.65% recall).
A Hardware Model to Measure Motion Estimation with Bit Plane Matching Algorithm (TELKOMNIKA JOURNAL)
Motion estimation is a multistep approach involving a combination of techniques. The proposed approach is an adaptive control system that measures motion from a starting point to the limit of the search. Motion patterns are used to analyze and avoid stationary regions of the image. The proposed algorithm is robust and efficient, and the calculations justify its advantages. The motivation of the work is to maximize encoding speed and visual quality with the help of a motion vector algorithm. In this work, a hardware model is developed in which frames of pictures are captured and sent via a serial port to the system. The MATLAB simulation tool is used to detect motion among the picture frames; once any motion is detected, a signal is sent to the hardware, which gives the appropriate indication. The system is developed on two platforms (hardware as well as software) to estimate and measure the motion vectors.
An automated traffic sign board classification system is one of the key technologies of Intelligent Transportation Systems (ITS). Traffic surveillance systems are becoming more and more important with growing urban scale and an increasing number of vehicles. This paper presents an intelligent sign board classification method based on blob analysis in traffic surveillance. Processing is done in three main steps: moving object segmentation, blob analysis, and classification. A sign board is modelled as a rectangular patch and classified via blob analysis; by processing the blobs of sign boards, meaningful features are extracted. Tracking moving targets is achieved by comparing the extracted features with training data. After classifying the sign boards, the system notifies the user through alarms and sound. The experimental results show that the proposed system can provide real-time, useful information for traffic surveillance.
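The blob-analysis step can be sketched with OpenCV connected components on a foreground mask; the area and aspect-ratio limits for a "rectangular patch" are illustrative assumptions.

```python
import cv2
import numpy as np

def rectangular_blobs(mask, min_area=500):
    """Return bounding boxes of rectangular-ish blobs in a binary mask."""
    n, _, stats, _ = cv2.connectedComponentsWithStats(mask.astype(np.uint8))
    boxes = []
    for i in range(1, n):                            # label 0 is the background
        x, y, w, h, area = stats[i]
        if area >= min_area and 0.8 < w / float(h) < 3.0:
            boxes.append((x, y, w, h))
    return boxes
```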
Video Shot Boundary Detection Using The Scale Invariant Feature Transform and... (IJECEIAES)
Segmentation of a video sequence by detecting shot changes is essential for video analysis, indexing and retrieval. In this context, a shot boundary detection algorithm based on the scale-invariant feature transform (SIFT) is proposed in this paper. The first step of our method is a top-down search scheme that detects the locations of transitions by comparing the ratio of matched features extracted via SIFT for every RGB channel of the video frames; this overview step provides the locations of the boundaries. Second, a moving-average calculation is performed to determine the type of transition. The proposed method can detect gradual transitions and abrupt changes without requiring any advance training on the video content. Experiments conducted on a multi-type video database show that the algorithm performs well.
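The matched-feature ratio that drives the detector can be sketched with OpenCV SIFT and Lowe's ratio test on grayscale frames; the per-RGB-channel comparison and the moving average for transition typing are omitted here.

```python
import cv2

def match_ratio(frame1_gray, frame2_gray):
    sift = cv2.SIFT_create()
    k1, d1 = sift.detectAndCompute(frame1_gray, None)
    k2, d2 = sift.detectAndCompute(frame2_gray, None)
    if d1 is None or d2 is None:
        return 0.0
    pairs = cv2.BFMatcher().knnMatch(d1, d2, k=2)
    good = [p for p in pairs if len(p) == 2 and
            p[0].distance < 0.75 * p[1].distance]     # Lowe's ratio test
    return len(good) / max(len(k1), 1)                # low ratio -> shot boundary
```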
Attention correlated appearance and motion feature followed temporal learning... (IJECEIAES)
Recent advances in deep neural networks have demonstrated fairly good accuracy for multi-class activity identification. However, existing methods have limitations in capturing complex spatial-temporal dependencies. In this work, we design a two-stream fusion attention (2SFA) network connected to a temporal one-layer bidirectional gated recurrent unit (GRU), classified by a prediction voting classifier (PVC), to recognize the action in a video. In the proposed deep neural network (DNN), 2SFA captures appearance information from red-green-blue (RGB) frames and motion from optical flow, with both streams correlated by the proposed fusion attention (FA) as the input to a temporal network. A bidirectional temporal layer using a single GRU layer is preferred for temporal understanding because it yields practical merits against six temporal network topologies on the UCF101 dataset. Meanwhile, the proposed classifier scheme, PVC, employs multiple nearest class mean (NCM) classifiers and the softmax function over the features output by the temporal network, then votes on their predictions for high-performance classification. The experiments achieve a best average accuracy of 70.8% on HMDB51 and 91.9%, the second best, on UCF101 in terms of 2D ConvNet action recognition.
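A minimal PyTorch sketch of the temporal part: a one-layer bidirectional GRU over fused per-frame features followed by a linear classifier. The fusion attention and PVC voting stages are not reproduced, and the dimensions are placeholders.

```python
import torch
import torch.nn as nn

class TemporalHead(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, num_classes=101):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, num_layers=1,
                          batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):                  # x: (batch, time, feat_dim)
        out, _ = self.gru(x)
        return self.fc(out[:, -1])         # class scores at the last time step
```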
Residual balanced attention network for real-time traffic scene semantic segm... (IJECEIAES)
Intelligent transportation systems (ITS) are among the most intensively researched topics of this century. Autonomous driving involves advanced road-safety tasks, including identifying dangers on the road and protecting pedestrians. In the last few years, deep learning (DL) approaches, and especially convolutional neural networks (CNNs), have been used extensively to solve ITS problems such as traffic scene semantic segmentation and traffic sign classification. Semantic segmentation is an important task that has been addressed in computer vision (CV); traffic scene semantic segmentation using CNNs requires high precision with few computational resources to perceive and segment the scene in real time. However, related work often focuses on only one aspect, either precision or the number of computational parameters. In this regard, we propose RBANet, a robust and lightweight CNN that uses a newly proposed balanced attention module and a newly proposed residual module. We have evaluated our proposed RBANet with three loss functions to find the best combination, using only 0.74M parameters. RBANet has been evaluated on CamVid, the most widely used semantic segmentation dataset, and performs well in terms of parameter requirements and precision compared to related work.
In this work, we define a new method for indexing and retrieving non-geotagged video sequences based on visual content only, using the Local Binary Pattern (LBP) and Singular Value Decomposition (SVD) techniques. The main question for our system: is it possible to determine the geographic location of a video on the GIS map from its frame pixels alone? The proposed system is introduced to answer such questions. The GIS database is constructed by storing reference images at the intersections between road segments on the map. The Local Binary Pattern (LBP) is used to extract features from images, and the Singular Value Decomposition (SVD) technique is used to compress the feature vectors and index the images in the database. The input to the system is a video taken from a forward-facing camera mounted on a vehicle; the output is the geolocation of the video's key frames, corresponding to the geotagged images retrieved from the GIS database.
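A small sketch of the indexing side, assuming scikit-image for the LBP histogram and a truncated SVD over the stacked reference descriptors; the parameter choices are illustrative.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram(gray, P=8, R=1.0):
    lbp = local_binary_pattern(gray, P, R, method='uniform')
    hist, _ = np.histogram(lbp, bins=P + 2, range=(0, P + 2), density=True)
    return hist

def compress_index(features, k=16):
    """Truncated SVD: project descriptors onto the top-k singular directions."""
    _, _, Vt = np.linalg.svd(features, full_matrices=False)
    return features @ Vt[:k].T          # k-dim indexed descriptors
```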
Key Frame Extraction for Salient Activity Recognition
Sourabh Kulhare∗, Shagan Sah†, Suhas Pillai‡, Raymond Ptucha∗
∗Computer Engineering, †Center for Imaging Science, ‡Computer Science
Rochester Institute of Technology, Rochester, USA
Email: sk1846@rit.edu
Abstract—Surveillance cameras have become big business, with most metropolitan cities spending millions of dollars to watch residents from street corners, public transportation hubs, and body cameras on officials. Watching and processing the petabytes of streaming video is a daunting task, making automated and user-assisted methods of searching and understanding videos critical to their success. Although numerous techniques have been developed, large scale video classification remains a difficult task due to excessive computational requirements. In this paper, we conduct an in-depth study to investigate effective architectures and semantic features for efficient and accurate solutions to activity recognition. We investigate different color spaces and optical flow, and introduce a novel deep learning fusion architecture for multi-modal inputs. The introduction of key frame extraction, instead of using every frame or a random representation of video data, makes our methods computationally tractable. Results further indicate that transforming the image stream into a compressed color space reduces computational requirements with minimal effect on accuracy.
I. INTRODUCTION
Decreasing hardware costs, advanced functionality and prolific use of video in the judicial system have recently caused video surveillance to spread from traditional military, retail, and large scale metropolitan applications to everyday activities. For example, most homeowner security systems come with video options, cameras in transportation hubs and highways report congestion, retail surveillance has been expanded for targeted marketing, and even small suburbs, such as the quiet town of Elk Grove, CA, utilize cameras to detect and deter petty crimes in parks and pathways.
Detection and recognition of objects and activities in video is critical to expanding the functionality of these systems. Advanced activity understanding can significantly enhance security details in airports, train stations, markets, and sports stadiums, and can provide peace of mind to homeowners, Uber drivers, and officials with body cameras. Security officers can do an excellent job of detecting and annotating relevant information; however, they simply cannot keep up with the terabytes of video being uploaded on a daily basis. Automated activity analysis can scrutinize every frame, databasing a plethora of object, activity, and scene based information for later analysis. To achieve this goal, there is a substantial need for the development of effective and efficient automated tools for video understanding.
Conventional methods use hand-crafted features such as motion SIFT [1] or HOG [2] to classify actions of small temporal extent. Recent successes of deep learning [3][4][5] in the still image domain have influenced video research. Researchers have introduced varying color spaces [6], optical flow [7], and implemented clever architectures [8] to fuse disparate inputs. This study analyzes the usefulness of the varying input channels, utilizes key frame extraction for efficiency, and introduces a multi-modal deep learning fusion architecture for state-of-the-art activity recognition.

Fig. 1. Illustration of the key frame extraction workflow using optical flow. The key frames are inputs to independent color and motion CNNs.
Large scale video classification remains a difficult task
due to excessive computational requirements. Karpathy et al.
[8] proposed a number of techniques for fusion of temporal
information. However, these techniques process sample frames selected randomly from full-length videos. Such random selection of samples may not take into account all useful motion
and spatial information. Simonyan and Zisserman [7] used
optical flow to represent the motion information to achieve
high accuracy, but with steep computational requirements. For
example, they reported that the optical flow data on 13K video
snippets was 1.5 TB.
To minimize compute resources, this work first computes key frames of a video, and then analyzes key frames and their neighbors as a surrogate for analyzing all frames in a video. Key frames, or the important frames in a video, can form a storyboard, in which a subset of frames is used to represent the content of a video. We hypothesize that deep learning networks can learn the context of the video using the neighbors of key frames. Voting on key frame regions then determines the temporal activity of a video snippet. Some events can be represented by fewer key frames, whereas complex activities might require significantly more. The main advantage of this approach is that the selection of frames depends on the context of the video, which overcomes the requirement to train a network on every frame of a video.
In this work, we also experimented with multi-stream Convolutional Neural Network (CNN) architectures. Our multi-stream CNN architecture is biologically inspired by the human primary visual cortex. The human mind has always been a prime source of inspiration for various effective architectures such as the Neocognitron [9] and HMAX models [10]. These models use pre-defined spatio-temporal filters in the first layer and later combine them to form spatial (ventral-like) and temporal (dorsal-like) recognition systems. Similarly, in multi-stream networks, each individual slice of a convolution layer is dedicated to one type of data representation and passed concurrently with all other representations.
From the experimental perspective, we are interested in answering the following questions: 1) Does the fusion of multiple color spaces perform better than a single color space? 2) How can one process less data while maintaining model performance? 3) What is the best combination of color spaces and optical flow for activity recognition? We address these questions by utilizing deep learning techniques for large-scale video classification. Our work does not focus on competing with state-of-the-art accuracy; rather, we are interested in evaluating architectural performance when combining different color spaces over key-frame-based video frame selection. We extended the two-stream CNN implementation
proposed by Simonyan and Zisserman [7] to a multi-stream
architecture. Streams are categorized into color streams and temporal streams, where color streams are further divided based on color spaces. The color streams use the RGB and YCbCr color spaces. The YCbCr color space has been extremely useful for video/image compression techniques. In the first spatial stream, we process the luma and chroma components of the key frames. Chroma components are optionally downsampled and integrated into the network at a later stage. The architecture is defined such that both luma and chroma components train a layer of convolutional filters together as a concatenated array before the fully connected layers. Apart from color information, optical flow data is used to represent motion. Optical flow has been a widely accepted representation of motion; our multi-stream architecture contains a dedicated stream for optical flow data.
The main contributions of this paper are to understand the individual and combined contributions of different video data representations while efficiently processing only frames around key frames instead of randomly or sequentially selected frames. We evaluated our methods on various CNN architectures to account for color, object, and motion contributions from video data.
The rest of the paper is organized as follows. After the introduction, Section 2 overviews related work in activity recognition and various deep learning approaches for motion estimation and video analysis. Section 3 introduces our salient activity recognition framework and multi-stream architectures with multiple combinations of different data representations. Section 4 presents the experimental results, Section 5 contains experimental findings and analysis of potential improvements, and Section 6 contains concluding remarks.
II. RELATED WORK
Video classification has been a longstanding research topic
in the multimedia processing and computer vision fields.
Efficient and accurate classification performance relies on the
extraction of salient video features. Conventional methods for
video classification [11] involve generation of video descrip-
tors that encode both spatial and motion variance information
into hand-crafted features such as Histogram of Oriented
Gradients (HOG) [2], Histogram of Optical Flow (HOF) [12],
and spatio-temporal interest points [13]. These features are
then encoded as a global representation through bag of words
[13] or fisher vector based encoding [14] and then passed to
a classifier [15].
Video classification research has been influenced by recent trends in machine learning which utilize deep architectures for image classification [3], object detection [16] [17], scene labeling [18], and action classification [19] [8]. A common workflow in these works is to use a group of frames as the input to the network, whereby the model is expected to learn spatial, color, spatio-temporal, and motion characteristics. Ji et al. [20] introduce 3D CNN models for action recognition with temporal data, extracting features from both the spatial and temporal domains by performing 3D convolution. Karpathy et al. [8] compared different fusion combinations for temporal data on very large datasets. They state that stacking frames over time gives results similar to treating them individually, indicating that spatial and temporal data may need to be treated separately. Unsupervised learning methods such as Convolutional Gated Restricted Boltzmann Machines [21] and Independent Subspace Analysis [22] have also shown promising results. Recent work by
Simonyan and Zisserman [7] decomposes video into spatial
and temporal components. The spatial component works with
scene and object information in each frame. The temporal
component signifies motion across frames. Ng et al. [23] evaluated the effect of different color space representations on the classification of gender. Interestingly, they found that grayscale performed better than the RGB and YCbCr spaces.
III. METHODOLOGY
In this section, we describe our learning model for large-scale video classification, including pre-processing, multi-stream CNN, key frame selection, and the training procedure in detail. At test time, only the key frames of a test video are passed through the CNN and classified into one of the activities. This not only shows that key frames capture the important parts of the video, but also makes testing faster compared to passing all frames through the CNN.
A. Color Stream
Video data can be naturally decomposed into spatial and temporal information. The most common spatial representation of video frames is RGB (3-channel) data. In this study, we compare it with the luminance-chrominance color space and combinations thereof. The YCbCr space separates color into a luminance channel (Y), a blue-difference channel (Cb), and a red-difference channel (Cr).
The spatial representation of an activity contains global scene and object attributes such as shape, color, and texture. The CNN filters in the color stream learn the color and edge features of the scene. The human visual system has lower acuity for color differences than for luminance detail. Image and video compression techniques take advantage of this phenomenon, where the conversion of RGB primaries to luma and chroma allows for chroma sub-sampling. We use this concept while formulating our multi-stream CNN architectures: we sub-sample the chrominance channels by factors of 4 and 16 to test the contribution of color to the framework.
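As an illustrative sketch of this chroma sub-sampling, assuming OpenCV and NumPy (the helper name and the resize-based interpolation are our illustrative choices, not our exact pipeline):

import cv2
import numpy as np

def split_luma_chroma(frame_bgr, factor=4):
    # Convert to YCrCb (OpenCV channel order is Y, Cr, Cb) and
    # separate the luma channel from the two chroma channels.
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    y, cr, cb = ycrcb[:, :, 0], ycrcb[:, :, 1], ycrcb[:, :, 2]
    h, w = y.shape
    # Sub-sample the chroma channels by the given factor (4 or 16
    # in our experiments); area interpolation is an assumption.
    size = (w // factor, h // factor)
    cb_s = cv2.resize(cb, size, interpolation=cv2.INTER_AREA)
    cr_s = cv2.resize(cr, size, interpolation=cv2.INTER_AREA)
    return y, np.stack([cb_s, cr_s], axis=-1)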
B. Motion Stream
Motion is an intrinsic property of a video that describes an action by a sequence of frames, where optical flow depicts the temporal change of motion. We use an OpenCV implementation [24] of optical flow to estimate motion in a video. Similar to [25], we stack the optical flow in the x- and y-directions. We scale these by a factor of 16 and stack the magnitude as the third channel.
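A minimal sketch of this motion-stream input, assuming the OpenCV Farnebäck implementation [24]; the flow parameters shown are illustrative defaults rather than our tuned settings.

import cv2
import numpy as np

def flow_channels(prev_gray, next_gray, scale=16.0):
    # Dense Farneback optical flow between two grayscale frames.
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, next_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    fx, fy = flow[..., 0], flow[..., 1]
    mag = np.sqrt(fx ** 2 + fy ** 2)
    # Stack x-flow, y-flow, and magnitude as a 3-channel image;
    # whether the magnitude is also scaled is our assumption.
    return np.stack([fx * scale, fy * scale, mag], axis=-1)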
C. Key Frame Extraction
We use the optical flow displacement fields between consecutive frames and detect motion stillness to identify key frames. A hierarchical time constraint ensures that fast-movement activities are not omitted. The first step in identifying key frames is to calculate optical flow for the entire video and estimate the magnitude of motion using a motion metric as a function of time [26]. The metric is calculated by aggregating the optical flow in the horizontal and vertical directions over all the pixels in each frame, as represented in (1).
M(t) = Σ_i Σ_j ( |OF_x(i, j, t)| + |OF_y(i, j, t)| )   (1)
where OF_x(i, j, t) is the x component of the optical flow at pixel (i, j) in frame t, and similarly for the y component. As optical flow tracks all points over time, the sum is an estimate of the amount of motion between frames. The gradient of this function is the change of motion between consecutive frames, and hence its local minima and maxima represent stillness or important activities between sequences of actions. An example of this gradient change from a UCF-101 [27] video is shown in Figure 2. To capture fast-moving activities, a temporal constraint between two selected frames is applied during selection [28]. Frames are dynamically selected depending on the content of the video; hence, complex activities or events have more key frames, whereas simpler ones may have fewer.
Fig. 2. Motion metric gradient for a boxing video from the UCF-101 dataset. The red dots are local minima, while the green dots are local maxima.
Fig. 3. Example of selected key frames for a boxing video from the UCF-101 dataset.
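To make the selection concrete, the following sketch, assuming NumPy, computes M(t) per (1), takes its gradient, and keeps local extrema subject to a minimum temporal gap [28]; the sign-change extremum test and the gap value are simplifications for illustration.

import numpy as np

def motion_metric(flows):
    # flows: list of HxWx2 optical flow fields; M(t) per Eq. (1).
    return np.array([np.abs(f[..., 0]).sum() + np.abs(f[..., 1]).sum()
                     for f in flows])

def select_key_frames(metric, min_gap=10):
    g = np.gradient(metric)
    keys = []
    for t in range(1, len(g) - 1):
        # Local maxima and minima of the gradient mark important
        # activity or stillness between sequences of actions.
        is_max = g[t - 1] < g[t] and g[t] > g[t + 1]
        is_min = g[t - 1] > g[t] and g[t] < g[t + 1]
        # Enforce a temporal constraint between selected frames [28].
        if (is_max or is_min) and (not keys or t - keys[-1] >= min_gap):
            keys.append(t)
    return keys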
D. Early Fusion
We consider a video as a bag of short clips, where each clip is a collection of 10 frames. While preparing the input data for the CNN models, we stacked these 10 frames in RGB/Y/CbCr to generate 30/10/20 channels, respectively. We select a key frame, then define a clip as the four preceding frames and five succeeding frames relative to the key frame. The early fusion technique combines information across the entire 10-frame time window in the filters of the first convolution layer of the CNN. We adapt the time constraint by modifying the dimension of these filters to F × F × CT, where F is the filter dimension, T is the time window (10), and C is the number of channels in the input (3 for RGB). This is an alternate representation of the more common 4-dimensional convolution.
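A minimal sketch, assuming PyTorch, of how a 10-frame RGB clip is stacked into the channel dimension so that the first-layer filters effectively span F × F × CT:

import torch
import torch.nn as nn

T, C, F = 10, 3, 11                      # time window, channels, filter size
clip = torch.randn(1, T * C, 224, 224)   # ten stacked RGB frames -> 30 channels

# The first convolution fuses the entire time window at once; the filter
# count and stride mirror the baseline architecture in Section III-F.
conv1 = nn.Conv2d(in_channels=T * C, out_channels=64, kernel_size=F, stride=4)
features = conv1(clip)                   # shape (1, 64, 54, 54)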
E. Multi-Stream Architecture
We propose a multi-stream architecture which combines
the color-spatial and motion channels. Figure 4 illustrates
an example of the multi-stream architecture. The individual
streams have multi-channel inputs, both in terms of color
channels and time windows. We let the first three convolutional
layers learn independent features but share the filters for the
last two layers.
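The following PyTorch sketch illustrates this arrangement, with per-stream stems for the first three convolutional layers and shared layers afterwards; the exact fusion point, layer sizes, and the use of PyTorch are assumptions for illustration rather than a description of our trained models.

import torch
import torch.nn as nn

def stem(in_ch):
    # First three convolutional layers, learned independently per stream.
    return nn.Sequential(
        nn.Conv2d(in_ch, 64, 11, stride=4), nn.BatchNorm2d(64), nn.ReLU(),
        nn.MaxPool2d(3, 2),
        nn.Conv2d(64, 192, 5, padding=2), nn.BatchNorm2d(192), nn.ReLU(),
        nn.MaxPool2d(3, 2),
        nn.Conv2d(192, 384, 3, padding=1), nn.BatchNorm2d(384), nn.ReLU())

class MultiStreamCNN(nn.Module):
    def __init__(self, stream_channels=(10, 20, 3), num_classes=101):
        # Default streams: Y (10 ch), CbCr (20 ch), optical flow (3 ch).
        super().__init__()
        self.stems = nn.ModuleList(stem(c) for c in stream_channels)
        self.shared = nn.Sequential(
            nn.Conv2d(384 * len(stream_channels), 256, 3, padding=1),
            nn.BatchNorm2d(256), nn.ReLU(), nn.MaxPool2d(3, 2),
            nn.Flatten(),
            nn.Linear(256 * 5 * 5, 4096), nn.ReLU(),
            nn.Linear(4096, num_classes))

    def forward(self, streams):
        # Concatenate per-stream feature maps before the shared layers.
        feats = [s(x) for s, x in zip(self.stems, streams)]
        return self.shared(torch.cat(feats, dim=1))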
F. Training
Our baseline architecture is similar to [3], but accepts inputs with multiple stacked frames. Consequently, our CNN models accept data that has temporal information stored in the third dimension. For example, the luminance stream accepts input
as a short clip of dimensions 224×224×10. The architec-
ture can be represented as C(64,11,4)-BN-P-C(192,5,1)-BN-P-
C(384,3,1)-BN-C(256,3,1)-BN-P-FC(4096)-FC(4096), where
C(d,f,s) indicates a convolution layer with d number of filters
of size f×f with stride of s. P signifies max pooling layer with
3×3 region and stride of 2. BN denotes batch normalization
[29] layers. The learning rate was initialized at 0.001 and adaptively updated based on the loss per mini-batch. The momentum and weight decay were 0.9 and 5e-4, respectively.
The native resolution of the videos was 320 × 240. Each
frame was center cropped to 240 × 240, then resized to 224
× 224. Each sample was normalized by mean subtraction and
divided by standard deviation across all channels.
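As an illustrative sketch of this preprocessing, assuming OpenCV, with the channel-wise mean and standard deviation computed from the training data:

import cv2
import numpy as np

def preprocess(frame, mean, std):
    # Native resolution is 320 x 240: center-crop the width to 240,
    # giving a 240 x 240 frame, then resize to 224 x 224.
    h, w = frame.shape[:2]
    x0 = (w - 240) // 2
    crop = frame[:, x0:x0 + 240]
    resized = cv2.resize(crop, (224, 224))
    # Normalize by mean subtraction and standard deviation.
    return (resized.astype(np.float32) - mean) / std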
IV. RESULTS
A. Dataset
Experiments were performed on UCF-101 [27], one of the
largest annotated video datasets with 101 different human
actions. It contains 13K videos, comprising 27 hours of video
data. The dataset contains realistic videos with natural variance
in camera motion, object appearance, pose and object scale.
It is a challenging dataset composed of unconstrained videos
downloaded from YouTube which incorporate real world
challenges such as poor lighting, cluttered background and
severe camera motion. We used UCF-101 split-1 to validate our methodologies. Experiments deal with two classes of data representation: key frame data and sequential data. Key frame data comprises clips extracted around key frames, whereas sequential data comprises 12 clips extracted around 12 equally spaced frames across the video. Twelve equally spaced frames were chosen because that was the average number of key frames extracted per video. We use the terms key frame data and sequential data to denote these two frame-location strategies.
B. Evaluation
The model generates a predicted activity at each selected frame location, and voting amongst all locations in a video clip is used for video-level accuracy. Although transfer learning boosted RGB and optical flow data performance, no high-performing YCbCr transfer learning models were available. To ensure a fair comparison among methods, all model results were initialized with random weights.
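A minimal sketch of this video-level voting, assuming per-clip class scores collected in a NumPy array:

import numpy as np

def video_prediction(clip_scores):
    # clip_scores: (num_clips, num_classes) scores, one row per
    # selected frame location; majority vote gives the video label.
    votes = clip_scores.argmax(axis=1)
    return np.bincount(votes).argmax()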
The first set of experiments quantifies the value of using key frames. Table I shows that key frame data consistently outperforms the sequential data representation. Table II, which uses two-stream architectures, similarly shows that key frame data is able to understand video content more accurately than sequential data. These experiments validate that there is significant informative motion and spatial information available around key frames.
TABLE I
SINGLE STREAM EXPERIMENT RESULTS (ACCURACY).

  Data          Sequential   Key Frames
1 Y-only        39.72 %      42.04 %
2 CbCr-only     35.04 %      35.04 %
3 RGB-only      38.44 %      46.04 %
4 OF-only       42.90 %      46.54 %
Table I shows that optical flow data is perhaps the single best predictor. Optical flow data contains very rich information for motion estimation, which is important for activity recognition. Parameter training with the three-channel optical flow representation required fewer computational resources because it represents the information of 10 video frames with only 224×224×3 of data. The ten-frame stacked RGB-only model (10× the first-layer memory of OF-only) resulted in similar accuracy, but took three more days to train than the optical flow model. The luminance-only and chrominance-only models gave less promising results.
TABLE II
TWO STREAM EXPERIMENT RESULTS (ACCURACY).

  Data          Sequential   Key Frames
1 Y+CbCr        45.30 %      47.13 %
2 Y+CbCr/4      -            43.40 %
3 Y+CbCr/16     -            42.77 %
4 Y+OF          41.68 %      44.24 %
TABLE III
MULTI-STREAM EXPERIMENT RESULTS (ACCURACY). *EPOCHS = 15

  Data          Sequential   Key Frames
1 Y+CbCr+OF     48.13 %      49.23 %
2 Y+OF+RGB      45.33 %*     46.46 %*
Table II presents the two-stream results. The fusion of luminance data with chrominance data is the best-performing dual-stream model. CNNs can take weeks to learn over large datasets, even when using optimized GPU implementations. One particular factor strongly correlated with training time is pixel resolution. It has long been known that humans see high-resolution luminance and low-resolution chrominance. To determine if CNNs can learn with low-resolution chrominance, the chrominance channels were subsampled by factors of four and sixteen. As shown in Table II, lowering chrominance resolution did not have a big impact on accuracy. Despite this small change in accuracy, the training time was reduced dramatically.
Fig. 4. Multi-stream architecture overview. The inputs are 224 × 224 images with different numbers of channels. Ti denotes the temporal input channels.
To further understand which combination of channel representations provides the best activity understanding, Table III contrasts three-stream CNN architectures. Once again, the usage of YCbCr is superior to RGB, with a 47.73% top-1 accuracy on UCF-101.
C. Visualization of Temporal Convolution Filters
Fig. 5. Visualization of six 11 × 11 convolutional filters from the first layer. (a) shows luminance filters and (b) shows filters from optical flow. Rows correspond to separate filters with separate channels (10 for luminance and 3 for optical flow).
Figure 5 illustrates examples of trained (11×11) filters in the first convolutional layer. The luminance filters have 10 channels, and the optical flow filters correspond to the x-, y-, and magnitude channels. It can be observed that the filters capture the motion change along the x- and y-directions. These filters allow the network to precisely detect local motion direction and velocity.
V. DISCUSSION
Deep learning models contain a large number of parameters and as a result are prone to overfitting. A dropout [30] ratio of 0.5 was used in all experiments to reduce the impact of overfitting. A higher dropout ratio may help the model generalize better, as our learning curves indicate the model may be overfitting the UCF-101 data. We used batch normalization [31], which has been shown to train large networks faster with higher accuracy. As the data flows through the deep network, the weights and parameters adjust the data, which can introduce internal covariate shift between layers. Batch normalization reduces this internal covariate shift by normalizing the data at every mini-batch, giving a boost in training accuracy, especially on large datasets.
For multi-stream experiments, we experimented with transfer learning and fine-tuned the last few layers of the network. Unfortunately, there were no pretrained models for YCbCr data, and a color conversion of pretrained RGB filters to YCbCr filters yielded low YCbCr accuracy. As a result, we trained all models from scratch for a fair comparison.
We also experimented with Motion History Images (MHI) in place of optical flow. An MHI template collapses motion information into a single gray-scale frame, where the intensity of a pixel is directly related to recent pixel motion. Single-stream MHI resulted in 26.7% accuracy. This lower accuracy might be improved by changing the fixed time parameter used when estimating the motion images; we used ten frames to generate one motion image.
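For reference, a simple MHI computation under our assumptions looks like the following: pixel intensity decays linearly unless fresh motion (a frame difference above a threshold) resets it; the threshold value is illustrative.

import numpy as np

def motion_history(frames, tau=10, thresh=30):
    # frames: list of grayscale uint8 frames (we used ten per template).
    mhi = np.zeros(frames[0].shape, dtype=np.float32)
    for prev, curr in zip(frames[:-1], frames[1:]):
        diff = np.abs(curr.astype(np.int16) - prev.astype(np.int16))
        moving = diff > thresh
        # Reset moving pixels to tau; decay the rest toward zero, so
        # intensity is directly related to how recently a pixel moved.
        mhi = np.where(moving, float(tau), np.maximum(mhi - 1.0, 0.0))
    return (mhi * (255.0 / tau)).astype(np.uint8)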
Our main goal was to experiment with different fusion techniques and key frames, so we did not apply any data augmentation. All results in Tables I through III, except for Y+OF+RGB, were trained for 30 epochs so that performance could be compared on the same scale. The Y+OF+RGB model was trained for 15 epochs. We did observe that training for a higher number of epochs increased accuracy significantly; for example, the single-stream OF-only model with key frames in Table I jumped to 57.8% after 117 epochs.
VI. CONCLUSION
We propose a novel approach to fuse color spaces and optical flow information in a single convolutional neural network architecture for state-of-the-art activity recognition. This study shows that color and motion cues are necessary and that their combination is preferred for accurate action detection. We studied the performance of key frames over sequentially selected video clips for large-scale human activity classification. We experimentally support that smartly selected key frames add valuable data to CNNs and hence perform better than conventional sequential or randomly selected clips. Using key frames not only provides better results but can significantly reduce the amount of data being processed. To further reduce computational resources, our multi-stream experiments suggest that lowering the resolution of the chrominance data stream does not harm performance significantly. Our results indicate that passing optical flow and YCbCr data into our multi-stream architecture at key frame locations of videos offers comprehensive feature learning, which may lead to better understanding of human activity.
REFERENCES
[1] D. G. Lowe, “Object recognition from local scale-invariant features,” in Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, vol. 2. IEEE, 1999, pp. 1150–1157.
[2] N. Dalal and B. Triggs, “Histograms of oriented gradients for human
detection,” in Computer Vision and Pattern Recognition, 2005. CVPR
2005. IEEE Computer Society Conference on, vol. 1. IEEE, 2005, pp.
886–893.
[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification
with deep convolutional neural networks,” in Advances in neural infor-
mation processing systems, 2012, pp. 1097–1105.
[4] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,
V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,”
in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2015, pp. 1–9.
[5] A. Nguyen, J. Yosinski, and J. Clune, “Deep neural networks are easily
fooled: High confidence predictions for unrecognizable images,” in Com-
puter Vision and Pattern Recognition (CVPR), 2015 IEEE Conference
on. IEEE, 2015, pp. 427–436.
[6] S. Chaabouni, J. Benois-Pineau, O. Hadar, and C. B. Amar, “Deep
learning for saliency prediction in natural video,” 2016.
[7] K. Simonyan and A. Zisserman, “Two-stream convolutional networks
for action recognition in videos,” in Advances in Neural Information
Processing Systems, 2014, pp. 568–576.
[8] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and
L. Fei-Fei, “Large-scale video classification with convolutional neural
networks,” in Proceedings of the IEEE conference on Computer Vision
and Pattern Recognition, 2014, pp. 1725–1732.
[9] K. Fukushima, “Neocognitron: A self-organizing neural network model
for a mechanism of pattern recognition unaffected by shift in position,”
Biological cybernetics, vol. 36, no. 4, pp. 193–202, 1980.
[10] M. Riesenhuber and T. Poggio, “Hierarchical models of object recog-
nition in cortex,” Nature neuroscience, vol. 2, no. 11, pp. 1019–1025,
1999.
[11] J. Liu, J. Luo, and M. Shah, “Recognizing realistic actions from videos
in the wild,” in Computer Vision and Pattern Recognition, 2009. CVPR
2009. IEEE Conference on. IEEE, 2009, pp. 1996–2003.
[12] R. Chaudhry, A. Ravichandran, G. Hager, and R. Vidal, “Histograms of
oriented optical flow and binet-cauchy kernels on nonlinear dynamical
systems for the recognition of human actions,” in Computer Vision and
Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE,
2009, pp. 1932–1939.
[13] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, “Learning
realistic human actions from movies,” in Computer Vision and Pattern
Recognition, 2008. CVPR 2008. IEEE Conference on, June 2008, pp.
1–8.
[14] H. Wang and C. Schmid, “Action recognition with improved trajecto-
ries,” in 2013 IEEE International Conference on Computer Vision, Dec
2013, pp. 3551–3558.
[15] H. Wang, A. Kläser, C. Schmid, and C. L. Liu, “Action recognition by dense trajectories,” in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, June 2011, pp. 3169–3176.
[16] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature
hierarchies for accurate object detection and semantic segmentation,”
in Proceedings of the IEEE conference on computer vision and pattern
recognition, 2014, pp. 580–587.
[17] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. Le-
Cun, “Overfeat: Integrated recognition, localization and detection using
convolutional networks,” arXiv preprint arXiv:1312.6229, 2013.
[18] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, “Learning hierarchical
features for scene labeling,” Pattern Analysis and Machine Intelligence,
IEEE Transactions on, vol. 35, no. 8, pp. 1915–1929, 2013.
[19] H. Jhuang, T. Serre, L. Wolf, and T. Poggio, “A biologically inspired system for action recognition,” in Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on. IEEE, 2007, pp. 1–8.
[20] S. Ji, W. Xu, M. Yang, and K. Yu, “3d convolutional neural networks for
human action recognition,” Pattern Analysis and Machine Intelligence,
IEEE Transactions on, vol. 35, no. 1, pp. 221–231, 2013.
[21] G. W. Taylor, R. Fergus, Y. LeCun, and C. Bregler, “Convolutional
learning of spatio-temporal features,” in Computer Vision–ECCV 2010.
Springer, 2010, pp. 140–153.
[22] Q. V. Le, W. Y. Zou, S. Y. Yeung, and A. Y. Ng, “Learning hierarchical
invariant spatio-temporal features for action recognition with indepen-
dent subspace analysis,” in Computer Vision and Pattern Recognition
(CVPR), 2011 IEEE Conference on. IEEE, 2011, pp. 3361–3368.
[23] C.-B. Ng, Y.-H. Tay, and B.-M. Goi, “Comparing image representa-
tions for training a convolutional neural network to classify gender,”
in Artificial Intelligence, Modelling and Simulation (AIMS), 2013 1st
International Conference on. IEEE, 2013, pp. 29–33.
[24] G. Farnebäck, “Two-frame motion estimation based on polynomial expansion,” in Image Analysis. Springer, 2003, pp. 363–370.
[25] G. Gkioxari and J. Malik, “Finding action tubes,” in Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, 2015,
pp. 759–768.
[26] W. Wolf, “Key frame selection by motion analysis,” in Acoustics, Speech,
and Signal Processing, 1996. ICASSP-96. Conference Proceedings.,
1996 IEEE International Conference on, vol. 2. IEEE, 1996, pp. 1228–
1231.
[27] K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A dataset of 101 human actions classes from videos in the wild,” arXiv preprint arXiv:1212.0402, 2012.
[28] A. Girgensohn and J. Boreczky, “Time-constrained keyframe selection
technique,” in Multimedia Computing and Systems, 1999. IEEE Inter-
national Conference on, vol. 1. IEEE, 1999, pp. 756–761.
[29] S. Ioffe and C. Szegedy, “Batch normalization: Accelerat-
ing deep network training by reducing internal covariate
shift,” CoRR, vol. abs/1502.03167, 2015. [Online]. Available:
http://arxiv.org/abs/1502.03167
[30] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhut-
dinov, “Dropout: A simple way to prevent neural networks from over-
fitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp.
1929–1958, 2014.
[31] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep
network training by reducing internal covariate shift,” arXiv preprint
arXiv:1502.03167, 2015.