For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2021/10/person-re-identification-and-tracking-at-the-edge-challenges-and-techniques-a-presentation-from-the-university-of-auckland/
Morteza Biglari-Abhari, Senior Lecturer at the University of Auckland, presents the “Person Re-Identification and Tracking at the Edge: Challenges and Techniques” tutorial at the May 2021 Embedded Vision Summit.
Numerous video analytics applications require understanding how people are moving through a space, including the ability to recognize when the same person has moved outside of the camera’s view and then back into the camera’s view, or when a person has passed from the view of one camera to the view of another. This capability is referred to as person re-identification and tracking. It’s an essential technique for applications such as surveillance for security, health and safety monitoring in healthcare and industrial facilities, intelligent transportation systems and smart cities. It can also assist in gathering business intelligence such as monitoring customer behavior in shopping environments. Person re-identification is challenging.
In this talk, Biglari-Abhari discusses the key challenges and current approaches for person re-identification and tracking, as well as his initial work on multi-camera systems and techniques to improve accuracy, especially fusing appearance and spatio-temporal models. He also briefly discusses privacy-preserving techniques, which are critical for some applications, as well as challenges for real-time processing at the edge.
Dissertation synopsis for image denoising (noise reduction) using non local me... — Arti Singh
Dissertation report on image denoising using the non-local means algorithm, with a discussion of the noise-reduction subproblem and a description of the image-noise problem.
Deconstructing SfM-Net architecture and beyond
"SfM-Net, a geometry-aware neural network for motion estimation in videos that decomposes frame-to-frame pixel motion in terms of scene and object depth, camera motion and 3D object rotations and translations. Given a sequence of frames, SfM-Net predicts depth, segmentation, camera and rigid object motions, converts those into a dense frame-to-frame motion field (optical flow), differentiably warps frames in time to match pixels and back-propagates."
Alternative download:
https://www.dropbox.com/s/aezl7ro8sy2xq7j/sfm_net_v2.pdf?dl=0
A NOVEL BACKGROUND SUBTRACTION ALGORITHM FOR PERSON TRACKING BASED ON K-NN — csandit
Object tracking can be defined as the process of detecting an object of interest in a video scene and keeping track of its motion, orientation, occlusion, etc. in order to extract useful information. It is a challenging problem and an important task, and the field of computer vision, specifically object tracking in video surveillance, is attracting many researchers. The main purpose of this paper is to give the reader an overview of the present state of the art in object tracking, together with the steps involved in background subtraction and their techniques. In the related literature we found three main methods of object tracking: the first is optical flow; the second is background subtraction, which is divided into two types presented in this paper; and the last is temporal differencing. We present a novel approach to background subtraction that compares the current frame with a background model set beforehand, so that each pixel of the image can be classified as a foreground or background element; the tracking step then represents our object of interest, a person, by its centroid. The tracking step is divided into two different methods, the surface method and the K-NN method, both explained in the paper. Our proposed method is implemented and evaluated using the CAVIAR database.
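The per-pixel classification and centroid step described in the abstract can be sketched as follows. This is a minimal illustration assuming a simple static background model and a fixed difference threshold, not the paper's exact method:

```python
import numpy as np

def subtract_background(frame, background, threshold=25):
    """Classify each pixel as foreground (True) or background (False)
    by comparing the current frame against the background model."""
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    return diff > threshold

def centroid(mask):
    """Centroid (row, col) of the foreground pixels, or None if empty."""
    ys, xs = np.nonzero(mask)
    if len(ys) == 0:
        return None
    return float(ys.mean()), float(xs.mean())
```

A real system would also update the background model over time; here it is fixed for clarity.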
The determination of Region-of-Interest has been recognised as an important means by which unimportant image content can be identified and excluded during image compression or image modelling. However, existing Region-of-Interest detection methods are computationally expensive and are therefore mostly unsuitable for managing large numbers of images, or for image compression in real-time video applications. This paper therefore proposes an unsupervised algorithm that takes advantage of the high computation speed offered by Speeded-Up Robust Features (SURF) and Features from Accelerated Segment Test (FAST) to achieve fast and efficient Region-of-Interest detection.
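As a rough illustration of the idea, the sketch below uses a heavily simplified FAST-style corner test (only a 4-pixel compass check on a radius-3 circle, not the full 16-pixel segment test, and without SURF) and takes the bounding box of the detected keypoints as the ROI. All thresholds and the ROI rule are illustrative assumptions, not the paper's algorithm:

```python
import numpy as np

def fast_like_corners(img, t=20):
    """Simplified FAST-style test: a pixel is a candidate keypoint if at
    least 3 of the 4 compass pixels at radius 3 are all brighter (or all
    darker) than the centre by more than threshold t."""
    img = img.astype(np.int16)
    h, w = img.shape
    mask = np.zeros((h, w), dtype=bool)
    for y in range(3, h - 3):
        for x in range(3, w - 3):
            c = img[y, x]
            ring = [img[y - 3, x], img[y, x + 3], img[y + 3, x], img[y, x - 3]]
            brighter = sum(p > c + t for p in ring)
            darker = sum(p < c - t for p in ring)
            if brighter >= 3 or darker >= 3:
                mask[y, x] = True
    return mask

def roi_bounding_box(mask):
    """Bounding box (y0, x0, y1, x1) of the detected keypoints as the ROI."""
    ys, xs = np.nonzero(mask)
    if len(ys) == 0:
        return None
    return int(ys.min()), int(xs.min()), int(ys.max()), int(xs.max())
```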
A common goal of the engineering field of signal processing is to reconstruct a signal from a series of sampling measurements. In general, this task is impossible because there is no way to reconstruct a signal during the times that the signal is not measured. Nevertheless, with prior knowledge or assumptions about the signal, it turns out to be possible to perfectly reconstruct a signal from a series of measurements. Over time, engineers have improved their understanding of which assumptions are practical and how they can be generalized. An early breakthrough in signal processing was the Nyquist–Shannon sampling theorem. It states that if the signal's highest frequency is less than half of the sampling rate, then the signal can be reconstructed perfectly. The main idea is that with prior knowledge about constraints on the signal's frequencies, fewer samples are needed to reconstruct the signal. Sparse sampling (also known as compressive sampling or compressed sensing) is a signal processing technique for efficiently acquiring and reconstructing a signal by finding solutions to underdetermined linear systems. It is based on the principle that, through optimization, the sparsity of a signal can be exploited to recover it from far fewer samples than required by the Shannon–Nyquist sampling theorem. There are two conditions under which recovery is possible [1]. The first is sparsity, which requires the signal to be sparse in some domain. The second is incoherence, which is applied through the restricted isometry property and is sufficient for sparse signals. This opens the possibility of compressed data-acquisition protocols that directly acquire just the important information. Sparse sampling (CS) is a fast-growing area of research. It avoids an extravagant acquisition process by taking fewer measurements to reconstruct the image or signal. Sparse sampling has been adopted successfully in various fields of image processing and has proved its efficiency. Some image processing applications, such as face recognition, video encoding, and image encryption and reconstruction, are presented here.
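As a minimal illustration of recovering a sparse signal from an underdetermined linear system, the sketch below uses Orthogonal Matching Pursuit, one common greedy recovery algorithm. The random Gaussian measurement matrix and the known sparsity level are illustrative assumptions:

```python
import numpy as np

def omp(A, y, sparsity):
    """Orthogonal Matching Pursuit: greedily recover a sparse x with
    A @ x ~ y, selecting one dictionary column per iteration and
    refitting by least squares on the selected support."""
    residual = y.astype(float).copy()
    support = []
    x = np.zeros(A.shape[1])
    for _ in range(sparsity):
        # pick the column most correlated with the current residual
        idx = int(np.argmax(np.abs(A.T @ residual)))
        if idx not in support:
            support.append(idx)
        # least-squares fit restricted to the selected support
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        x = np.zeros(A.shape[1])
        x[support] = coef
        residual = y - A @ x
    return x
```

With 30 random measurements of a length-60 signal that has only 2 non-zero entries, OMP typically recovers the signal exactly, far below the Nyquist count.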
NEURAL NETWORKS FOR HIGH PERFORMANCE TIME-DELAY ESTIMATION AND ACOUSTIC SOURC... — csandit
Time-delay estimation is an essential building block of many signal processing applications. This paper follows up on earlier work on acoustic source localization and time-delay estimation using pattern recognition techniques in adverse environments such as reverberant rooms or underwater; it presents unprecedented high-performance results obtained with supervised training of neural networks, which challenge the state of the art, and compares their performance to that of well-known methods such as the Generalized Cross-Correlation or Adaptive Eigenvalue Decomposition.
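The Generalized Cross-Correlation baseline mentioned above can be sketched with the common PHAT weighting (a textbook formulation, not the paper's neural-network approach; the epsilon guard is an implementation assumption):

```python
import numpy as np

def gcc_phat(sig, ref, fs=1):
    """Estimate the delay of `sig` relative to `ref` (in samples / fs)
    using Generalized Cross-Correlation with Phase Transform."""
    n = len(sig) + len(ref)            # zero-pad to avoid circular overlap
    S = np.fft.rfft(sig, n=n)
    R = np.fft.rfft(ref, n=n)
    cross = S * np.conj(R)
    cross /= np.abs(cross) + 1e-12     # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    # rearrange so that zero delay sits in the middle of the array
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    delay = np.argmax(np.abs(cc)) - max_shift
    return delay / fs
```

For a noise-free signal that is an exact delayed copy of the reference, the PHAT spectrum is a pure linear phase and the correlation peak is a sharp spike at the true delay.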
Satellite Image Classification with Deep Learning Survey — ijtsrd
Satellite imagery is important for many applications, including disaster response, law enforcement and environmental monitoring. These applications currently require the manual identification of objects in the imagery. Because the geographic area to be covered is very large and the analysts available to conduct the searches are few, automation is required; yet traditional object detection and classification algorithms are too inaccurate and unreliable to solve the problem. Deep learning is part of a broader family of machine learning methods that have shown promise for the automation of such tasks, and it has achieved success in image understanding by means of convolutional neural networks. The problem of object and facility recognition in satellite imagery is considered. The system consists of an ensemble of convolutional neural networks and additional neural networks that integrate satellite metadata with image features. Roshni Rajendran | Liji Samuel, "Satellite Image Classification with Deep Learning: Survey", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-4, Issue-2, February 2020.
URL: https://www.ijtsrd.com/papers/ijtsrd30031.pdf
Paper Url : https://www.ijtsrd.com/engineering/computer-engineering/30031/satellite-image-classification-with-deep-learning-survey/roshni-rajendran
A colour-based particle filter can achieve the goal of effective target tracking, but it has some drawbacks in situations such as the target and its background having similar colours, occlusion in
complex backgrounds, and deformation of the target. To deal with these problems, an improved particle
filter tracking system based on colour and moving-edge information is proposed in this study to provide
more accurate results in long-term tracking. In this system, the moving-edge information is used to ensure
that the target can be enclosed by the bounding box when encountering the problems mentioned above to
maintain the correctness of the target model. Using 100 targets in 10 video clips captured indoors and outdoors as the test data, the experimental results show that the proposed system can track the targets
effectively to achieve an accuracy rate of 94.6%, higher than that of the colour-based particle filter
tracking system proposed by Nummiaro et al. (78.3%) [10]. For the case of occlusion, the former can also
achieve an accuracy rate of 91.8%, much higher than that of the latter (67.6%). The experimental results
reveal that using the target’s moving-edge information can enhance the accuracy and robustness of a
particle filter tracking system.
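The basic particle filter iteration (predict, weight, estimate, resample) can be sketched in one dimension as follows. This is only the common skeleton, not the colour-and-moving-edge observation model of the proposed system, and the motion and observation noise parameters are illustrative:

```python
import numpy as np

def particle_filter_1d(observations, n_particles=500, motion_std=1.0,
                       obs_std=2.0, seed=0):
    """Minimal 1-D particle filter: particles carry a position estimate,
    are diffused by a random-walk motion model, weighted by a Gaussian
    likelihood of the observation, and resampled each step."""
    rng = np.random.default_rng(seed)
    particles = rng.uniform(0, 100, n_particles)
    estimates = []
    for z in observations:
        particles += rng.normal(0, motion_std, n_particles)   # predict
        weights = np.exp(-0.5 * ((z - particles) / obs_std) ** 2)
        weights /= weights.sum()
        estimates.append(float(np.sum(weights * particles)))  # posterior mean
        idx = rng.choice(n_particles, n_particles, p=weights)  # resample
        particles = particles[idx]
    return estimates
```

In a colour-based tracker the Gaussian likelihood would be replaced by a histogram-similarity score in the particle's candidate region, and (as the paper argues) augmented with moving-edge information.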
Research Inventy: International Journal of Engineering and Science is published by a group of young academic and industrial researchers, with 12 issues per year. It is an online as well as print-version open access journal that provides rapid monthly publication of articles in all areas of the subject, such as civil, mechanical, chemical, electronic and computer engineering, as well as production and information technology. The journal welcomes the submission of manuscripts that meet the general criteria of significance and scientific excellence. Papers will be published by a rapid process within 20 days after acceptance, and the peer review process takes only 7 days. All articles published in Research Inventy will be peer-reviewed.
An Efficient Block Matching Algorithm Using Logical Image — IJERA Editor
Motion estimation, which has been widely used in various image sequence coding schemes, plays a key role in the transmission and storage of video signals at reduced bit rates. There are two classes of motion estimation methods: block matching algorithms (BMA) and pel-recursive algorithms (PRA). Due to their implementation simplicity, block matching algorithms have been widely adopted by various video coding standards such as CCITT H.261, ITU-T H.263, and MPEG. In BMA, the current image frame is partitioned into fixed-size rectangular blocks, and the motion vector for each block is estimated by finding the best-matching block of pixels within a search window in the previous frame according to a matching criterion. The goal of this work is to find a fast method for motion estimation and motion segmentation using the proposed model. Communication between endpoints is facilitated by recent developments in wired and wireless networks, and it is a challenge to transmit large data files over a limited-bandwidth channel. Block matching algorithms are very useful in achieving efficient and acceptable compression, since the block matching algorithm determines the total computation cost and the effective bit budget. Different approaches can be followed to obtain motion estimation efficiently, but the above constraints should be kept in mind. This paper presents a novel method using the three step and diamond algorithms with a modified search pattern based on a logical image for block-based motion estimation. It has been found that the proposed algorithm improves the PSNR value while achieving better (faster) computation time compared to the original Three Step Search (3SS/TSS) method. Experimental results based on a number of video sequences are presented to demonstrate the advantages of the proposed motion estimation technique.
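The block matching step with the classic Three Step Search can be sketched as follows. This is a minimal SAD-based version without the logical-image modification proposed in the paper; the step size and block geometry are illustrative:

```python
import numpy as np

def sad(a, b):
    """Sum of absolute differences matching criterion."""
    return int(np.abs(a.astype(np.int32) - b.astype(np.int32)).sum())

def three_step_search(cur_block, ref, top, left, step=4):
    """Three Step Search: evaluate the 9 positions around the current
    centre, move the centre to the best-matching one, halve the step,
    and repeat until the step reaches 1."""
    bh, bw = cur_block.shape
    center = (top, left)                       # block position in current frame
    while step >= 1:
        candidates = [(center[0] + dy, center[1] + dx)
                      for dy in (-step, 0, step) for dx in (-step, 0, step)]
        candidates = [(y, x) for y, x in candidates
                      if 0 <= y <= ref.shape[0] - bh and 0 <= x <= ref.shape[1] - bw]
        center = min(candidates,
                     key=lambda p: sad(cur_block, ref[p[0]:p[0] + bh, p[1]:p[1] + bw]))
        step //= 2
    # motion vector relative to the block's position in the current frame
    return center[0] - top, center[1] - left
```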
A UTILIZATION OF CONVOLUTIONAL MATRIX METHODS ON SLICED HIPPOCAMPAL NEURON RE... — ijscai
Present methodologies for cell segmentation on hippocampal neuron regions contain excess information, leading to the creation of unwanted noise. To distinctly draw boundaries around the cells in each of the channels, such as DAPI, Cy5, TRITC and FITC, it is pertinent to start by denoising the data and cropping the relevant ROI for analysis to remove excess background information. Existing edge detection methodologies such as Canny edge detection create black-and-white outputs, and it is difficult to accurately perform edge detection with colour throughout an entire image. As such, we utilized a more involved approach that uses pixel-level comparisons to determine the existence of edge points. By extrapolating all the available edge points, our algorithms are able to detect general edges throughout an image. To streamline the process, it is accompanied by a GUI that allows freehand crops. This information is stored in a downloadable txt file, which provides the necessary input for the thresholding and final cropping. Together, the interface works to create clean data which is ready for further analysis with algorithms like FRCNN and YOLOv3.
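A minimal version of the pixel-level colour comparison idea might look like the following; the Euclidean colour distance and the fixed threshold are illustrative assumptions, not the paper's algorithm:

```python
import numpy as np

def colour_edges(img, threshold=30.0):
    """Mark a pixel as an edge point when the Euclidean colour distance
    to its right or lower neighbour exceeds the threshold."""
    img = img.astype(float)
    # colour differences between horizontal and vertical neighbours
    dx = np.linalg.norm(img[:, 1:] - img[:, :-1], axis=-1)
    dy = np.linalg.norm(img[1:, :] - img[:-1, :], axis=-1)
    edges = np.zeros(img.shape[:2], dtype=bool)
    edges[:, :-1] |= dx > threshold
    edges[:-1, :] |= dy > threshold
    return edges
```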
A system was developed that is able to retrieve specific documents from a document collection. In this system the query is given as text by the user and then transformed into an image. Appropriate features were selected in order to capture the general shape of the query and ignore details due to noise or different fonts. In order to demonstrate the effectiveness of our system, we used a collection of noisy documents and compared our results with those of a commercial OCR package.
An Image Enhancement Approach to Achieve High Speed using Adaptive Modified B... — TELKOMNIKA JOURNAL
For real-time image processing scenarios, satellite imagery has drawn growing interest from researchers due to the informative nature of the images. Satellite images are captured from space using high-quality on-board cameras. Wrong ISO settings, camera vibrations or wrong sensor settings cause noise. A degraded image can lead to less reliable results during visual perception, which is a challenging issue for researchers. Noise also corrupts the image during acquisition and transmission, through interference or dust particles on the scanner screen, on the way from the satellite to the earth stations. If quality-degraded images are used for further processing, wrong information may be extracted. To address this issue, an image filtering or denoising approach is required.
Since remote sensing images are captured from space using on-board cameras, a high-speed processing device is required that can provide good reconstruction quality with low power consumption. Recently, various approaches have been proposed for image filtering. Key challenges with these approaches are reconstruction quality, operating speed, and preserving edge information in the image.
The proposed approach, named the modified bilateral filter, combines a bilateral filter with kernel schemes. To overcome the drawbacks above, the modified bilateral filter is implemented on an FPGA to exploit parallelism in the denoising process.
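For reference, a naive software formulation of the bilateral filter being modified might look like the sketch below; the paper's kernel schemes and FPGA parallelisation are not reflected here, and the window size and sigmas are illustrative:

```python
import numpy as np

def bilateral_filter(img, radius=2, sigma_s=2.0, sigma_r=25.0):
    """Naive bilateral filter: each output pixel is a weighted average of
    its neighbourhood, with weights combining spatial closeness (sigma_s)
    and intensity similarity (sigma_r), so edges are preserved."""
    img = img.astype(float)
    h, w = img.shape
    pad = np.pad(img, radius, mode='edge')
    out = np.zeros_like(img)
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial = np.exp(-(ys ** 2 + xs ** 2) / (2 * sigma_s ** 2))
    for i in range(h):
        for j in range(w):
            patch = pad[i:i + 2 * radius + 1, j:j + 2 * radius + 1]
            # range weight: penalise neighbours with dissimilar intensity
            rangew = np.exp(-((patch - img[i, j]) ** 2) / (2 * sigma_r ** 2))
            wgt = spatial * rangew
            out[i, j] = (wgt * patch).sum() / wgt.sum()
    return out
```

A flat region is left unchanged, while pixels across a strong step edge get negligible range weight, which is exactly the edge-preserving property the abstract refers to.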
Integrated Hidden Markov Model and Kalman Filter for Online Object Tracking — ijsrd.com
Visual priors learned from generic real-world images can represent the objects in a scene. Existing work presented an online tracking algorithm that transfers a visual prior learned offline to online object tracking, learning a complete dictionary to represent the visual prior from a collection of real-world images. This prior knowledge of objects is generic, and the training image set does not contain any observation of the target object. The learned visual prior is transferred to construct the object representation using sparse coding and multiscale max pooling. A linear classifier is learned online to distinguish the target from the background and to handle target and background appearance variations over time. Tracking is carried out within a Bayesian inference framework, and the learned classifier is used to construct the observation model. A particle filter is used to estimate the tracking result sequentially; however, it is unable to work efficiently in noisy scenes, and time-shift variance was not appropriate for tracking the target object given prior information about the object structure. We propose an HMM-based Kalman filter to improve online target tracking in noisy sequential image frames. The covariance vector is measured to identify noisy scenes, and discrete time steps are evaluated to separate the target object from the background. Experiments are conducted on challenging scene sequences to evaluate the performance of the object tracking algorithm in terms of tracking success rate, centre location error, number of scenes, learned object sizes, and tracking latency.
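The Kalman filter component can be illustrated with a standard 1-D constant-velocity formulation. This is a textbook sketch; the paper's HMM integration and covariance-based noise detection are not shown, and the noise variances are illustrative:

```python
import numpy as np

def kalman_track(measurements, dt=1.0, process_var=1e-4, meas_var=1.0):
    """1-D constant-velocity Kalman filter: the state is [position,
    velocity], only position is measured; returns filtered positions."""
    F = np.array([[1.0, dt], [0.0, 1.0]])      # state transition
    H = np.array([[1.0, 0.0]])                 # measurement model
    Q = process_var * np.eye(2)                # process noise covariance
    R = np.array([[meas_var]])                 # measurement noise covariance
    x = np.array([[measurements[0]], [0.0]])   # initial state
    P = np.eye(2)                              # initial state covariance
    out = []
    for z in measurements:
        # predict
        x = F @ x
        P = F @ P @ F.T + Q
        # update with the new position measurement
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ (np.array([[z]]) - H @ x)
        P = (np.eye(2) - K @ H) @ P
        out.append(float(x[0, 0]))
    return out
```

On a noise-free linear trajectory the filter quickly learns the velocity and tracks the position closely, which is the behaviour the HMM extension aims to preserve in noisy scenes.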
INFORMATION SATURATION IN MULTISPECTRAL PIXEL LEVEL IMAGE FUSIONIJCI JOURNAL
The availability of imaging sensors operating in multiple spectral bands has led to the requirement of
image fusion algorithms that would combine the image from these sensors in an efficient way to give an
image that is more informative as well as perceptible to human eye. Multispectral image fusion is the
process of combining images from different spectral bands that are optically acquired. In this paper, we
used a pixel-level image fusion based on principal component analysis that combines satellite images of the
same scene from seven different spectral bands. The purpose of using principal component analysis
technique is that it is best method for Grayscale image fusion and gives better results. The main aim of
PCA technique is to reduce a large set of variables into a small set which still contains most of the
information that was present in the large set. The paper compares different parameters namely, entropy,
standard deviation, correlation coefficient etc. for different number of images fused from two to seven.
Finally, the paper shows that the information content in an image gets saturated after fusing four images.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
HOL, GDCT AND LDCT FOR PEDESTRIAN DETECTIONcsandit
In this paper, we present and analyze different approaches implemented here to resolve pedestrian detection problem. Histograms of Oriented Laplacian (HOL) is a descriptor of
characteristic, it aims to highlight objects in digital images, Discrete Cosine Transform DCT with its two version global (GDCT) and local (LDCT), it changes image's pixel into frequencies coefficients and then we use them as a characteristics in the process. We implemented
independently these methods and tried to combine it and used there outputs in a classifier, the new generated classifier has proved it efficiency in certain cases. The performance of those
methods and their combination is tested on most popular Dataset in pedestrian detection, which
are INRIA and Daimler.
HOL, GDCT AND LDCT FOR PEDESTRIAN DETECTIONcscpconf
In this paper, we present and analyze different approaches implemented here to resolve pedestrian detection problem. Histograms of Oriented Laplacian (HOL) is a descriptor of characteristic, it aims to highlight objects in digital images, Discrete Cosine Transform DCT
with its two version global (GDCT) and local (LDCT), it changes image's pixel into frequencies coefficients and then we use them as characteristics in the process. We implemented independently these methods and tried to combine it and used their outputs in a classifier, the newly generated classifier has proved its efficiency in certain cases. The performance of those methods and their combination is tested on most popular Dataset in pedestrian detection, which is INRIA and Daimler.
A Pattern Classification Based approach for Blur Classificationijeei-iaes
Blur type identification is one of the most crucial step of image restoration. In case of blind restoration of such images, it is generally assumed that the blur type is known prior to restoration of such images. However, it is not practical in real applications. So, blur type identification is extremely desirable before application of blind restoration technique to restore a blurred image. An approach to categorize blur in three classes namely motion, defocus, and combined blur is presented in this paper. Curvelet transform based energy features are utilized as features of blur patterns and a neural network is designed for classification. The simulation results show preciseness of proposed approach.
Detecting and Shadows in the HSV Color Space using Dynamic Thresholds IJECEIAES
The detection of moving objects in a video sequence is an essential step in almost all the systems of vision by computer. However, because of the dynamic change in natural scenes, the detection of movement becomes a more difficult task. In this work, we propose a new method for the detection moving objects that is robust to shadows, noise and illumination changes. For this purpose, the detection phase of the proposed method is an adaptation of the MOG approach where the foreground is extracted by considering the HSV color space. To allow the method not to take shadows into consideration during the detection process, we developed a new shade removal technique based on a dynamic thresholding of detected pixels of the foreground. The calculation model of the threshold is established by two statistical analysis tools that take into account the degree of the shadow in the scene and the robustness to noise. Experiments undertaken on a set of video sequences showed that the method put forward provides better results compared to existing methods that are limited to using static thresholds.
Fingerprints are imprints formed by friction
ridges of the skin and thumbs. They have long been used for
identification because of their immutability and individuality.
Immutability refers to the permanent and unchanging character
of the pattern on each finger. Individuality refers to the
uniqueness of ridge details across individuals; the probability
that two fingerprints are alike is about 1 in 1.9x1015. In despite of
this improvement which is adopted by the Federal Bureau of
Investigation (FBI), the fact still is “The larger the fingerprint
files became, the harder it was to identify somebody from their
fingerprints alone. Moreover, the fingerprint requires one of the
largest data templates in the biometric field”. The finger data
template can range anywhere from several hundred bytes to over
1,000 bytes depending upon the level of security that is required
and the method that is used to scan one's fingerprint. For these
reasons this work is motivated to present another way to tackle
the problem that is relies on the properties of Vector
Quantization coding algorithm.
Proposed Multi-object Tracking Algorithm Using Sobel Edge Detection operatorQUESTJOURNAL
ABSTRACT:Tracking of moving objects that is called video tracking is used for measuring motion parameters and obtaining a visual record of the moving objects, it is an important area of application in image processing. In general there are two different approaches to obtain object tracking: the first is Recognition-based Tracking, and the second is the Motion-based Tracking. Video tracking system raises a wide possibility in today’s society. This system is used in various applications such as military, security, monitoring, robotic, and nowadays in dayto-day applications. However the video tracking systems still have many open problems and various research activities in a video tracking system are explores. This paper presents an algorithm for video tracking of any moving targets with the uses of contour based detection technique that depends on the sobel operator. The proposed system is suitable for indoor and outdoor applications. Our approach has the advantage of extending the applicability of tracking system and also, as presented here improves the performance of the tracker making feasible high frame rate video tracking. The goal of the tracking system is to analyze the video frames and estimate the position of a part of the input video frame (usually a moving object), our approach can detect, tracked any object more than one object and calculate the position of the moving objects. Therefore, the aim of this paper is to construct a motion tracking system for moving objects. Where, at the end of this paper, the detail outcome and result are discussed using experiments results of the proposed technique
In this paper the process of 3D modelling from video is presented. Analysed previous research related to
this process, and specifically described algorithms for detecting and matching key points. We described
their advantages and disadvantages, and made a critical analysis of algorithms. In this paper, the three
detectors (SUSAN, Plessey and Förstner) are tested and compare. We used video taken with hand held
camera of a cube and compare these detectors on it (taking into account their parameters of accuracy and
repeatability). In conclusion, we practically made 3D model of the cube from video used these detectors in
the first step of the process and three algorithms (RANSAC, MSAC and MLESAC) for matching data.
Efficient resampling features and convolution neural network model for image ...IJEECSIAES
The extended utilization of picture-enhancing or manipulating tools has led to ease of manipulating multimedia data which includes digital images. These manipulations will disturb the truthfulness and lawfulness of images, resulting in misapprehension, and might disturb social security. The image forensic approach has been employed for detecting whether or not an image has been manipulated with the usage of positive attacks which includes splicing, and copy-move. This paper provides a competent tampering detection technique using resampling features and convolution neural network (CNN). In this model range spatial filtering (RSF)-CNN, throughout preprocessing the image is divided into consistent patches. Then, within every patch, the resampling features are extracted by utilizing affine transformation and the Laplacian operator. Then, the extracted features are accumulated for creating descriptors by using CNN. A wide-ranging analysis is performed for assessing tampering detection and tampered region segmentation accuracies of proposed RSF-CNN based tampering detection procedures considering various falsifications and post-processing attacks which include joint photographic expert group (JPEG) compression, scaling, rotations, noise additions, and more than one manipulation. From the achieved results, it can be visible the RSF-CNN primarily based tampering detection with adequately higher accurateness than existing tampering detection methodologies.
Efficient resampling features and convolution neural network model for image ...nooriasukmaningtyas
The extended utilization of picture-enhancing or manipulating tools has led to ease of manipulating multimedia data which includes digital images. These manipulations will disturb the truthfulness and lawfulness of images, resulting in misapprehension, and might disturb social security. The image forensic approach has been employed for detecting whether or not an image has been manipulated with the usage of positive attacks which includes splicing, and copy-move. This paper provides a competent tampering detection technique using resampling features and convolution neural network (CNN). In this model range spatial filtering (RSF)-CNN, throughout preprocessing the image is divided into consistent patches. Then, within every patch, the resampling features are extracted by utilizing affine transformation and the Laplacian operator. Then, the extracted features are accumulated for creating descriptors by using CNN. A wide-ranging analysis is performed for assessing tampering detection and tampered region segmentation accuracies of proposed RSF-CNN based tampering detection procedures considering various falsifications and post-processing attacks which include joint photographic expert group (JPEG) compression, scaling, rotations, noise additions, and more than one manipulation. From the achieved results, it can be visible the RSF-CNN primarily based tampering detection with adequately higher accurateness than existing tampering detection methodologies.
Fast Feature Pyramids for Object Detection
Piotr Dollar, Ron Appel, Serge Belongie, and Pietro Perona
Abstract—Multi-resolution image features may be approximated via extrapolation from nearby scales, rather than being computed explicitly. This fundamental insight allows us to design object detection algorithms that are as accurate as, and considerably faster than, the state of the art. The computational bottleneck of many modern detectors is the computation of features at every scale of a finely-sampled image pyramid. Our key insight is that one may compute finely sampled feature pyramids at a fraction of the cost, without sacrificing performance: for a broad family of features, we find that features computed at octave-spaced scale intervals are sufficient to approximate features on a finely-sampled pyramid. Extrapolation is inexpensive compared to direct feature computation. As a result, our approximation yields considerable speedups with negligible loss in detection accuracy. We modify three diverse visual recognition systems to use fast feature pyramids and show results on both pedestrian detection (measured on the Caltech, INRIA, TUD-Brussels and ETH data sets) and general object detection (measured on the PASCAL VOC). The approach is general and widely applicable to vision algorithms requiring fine-grained multi-scale analysis. Our approximation is valid for images with broad spectra (most natural images) and fails for images with narrow band-pass spectra (e.g., periodic textures).

Index Terms—Visual features, object detection, image pyramids, pedestrian detection, natural image statistics, real-time systems
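The extrapolation idea summarized in the abstract can be sketched in a few lines. The power-law relation f(I_s) ≈ f(I)·s^(−λ) is the paper's core approximation; the toy gradient-energy feature, the helper names, and the placeholder λ value below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def feature_energy(img):
    """Toy channel feature: mean gradient magnitude (a stand-in for
    the channel features used in the paper)."""
    gy, gx = np.gradient(img.astype(float))
    return np.hypot(gx, gy).mean()

def approx_feature(f_base, s, lam=0.1):
    """Power-law extrapolation f(I_s) ~ f(I) * s**(-lambda).

    lambda is channel-specific and estimated empirically in the paper;
    0.1 here is an arbitrary placeholder. s is the resampling factor
    (s < 1 for downsampling)."""
    return f_base * s ** (-lam)

# With features computed only at octave scales (s = 1, 1/2, 1/4, ...),
# intermediate scales can be filled in by extrapolation instead of
# recomputing features on a resampled image.
```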
1 INTRODUCTION
MULTI-RESOLUTION multi-orientation decompositions are one of the foundational techniques of image analysis. The idea of analyzing image structure separately at every scale and orientation originated from a number of sources: measurements of the physiology of mammalian visual systems [1], [2], [3], principled reasoning about the statistics and coding of visual information [4], [5], [6], [7] (Gabors, DOGs, and jets), harmonic analysis [8], [9] (wavelets), and signal processing [9], [10] (multirate filtering). Such representations have proven effective for visual processing tasks such as denoising [11], image enhancement [12], texture analysis [13], stereoscopic correspondence [14], motion flow [15], [16], attention [17], boundary detection [18] and recognition [19], [20], [21].
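As a concrete reference point for the multi-resolution decompositions discussed above, an octave-spaced Gaussian pyramid can be built with repeated blur-and-subsample steps. This is a minimal sketch: the 3-tap binomial kernel and the helper names are illustrative choices, not taken from the paper.

```python
import numpy as np

def blur3(img):
    """Separable 3-tap binomial blur [1, 2, 1] / 4 with reflect padding."""
    k = np.array([1.0, 2.0, 1.0]) / 4.0
    p = np.pad(img, 1, mode="reflect")
    h = k[0] * p[:, :-2] + k[1] * p[:, 1:-1] + k[2] * p[:, 2:]   # horizontal pass
    return k[0] * h[:-2, :] + k[1] * h[1:-1, :] + k[2] * h[2:, :]  # vertical pass

def gaussian_pyramid(img, levels=4):
    """Octave-spaced pyramid: blur, then subsample by 2, per level."""
    pyr = [img.astype(float)]
    for _ in range(levels - 1):
        pyr.append(blur3(pyr[-1])[::2, ::2])
    return pyr
```

Features computed on each level of such a pyramid give a coarse, octave-spaced feature pyramid; the paper's contribution is showing that the intermediate scales need not be computed this way at all.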
It has become clear that such representations are best at extracting visual information when they are overcomplete, i.e., when one oversamples scale, orientation and other kernel properties. This was suggested by the architecture of the primate visual system [22], where striate cortical cells (roughly equivalent to a wavelet expansion of an image) outnumber retinal ganglion cells (a representation close to image pixels) by a factor ranging from 10^2 to 10^3. Empirical studies in computer vision provide increasing evidence in favor of overcomplete representations [21], [23], [24], [25], [26]. Most likely the robustness of these representations with respect to changes in viewpoint, lighting, and image deformations is a contributing factor to their superior performance.
To understand the value of richer representations, it is instructive to examine the reasons behind the breathtaking progress in visual category detection during the past ten years. Take, for instance, pedestrian detection. Since the groundbreaking work of Viola and Jones (VJ) [27], [28], false positive rates have decreased by two orders of magnitude. At an 80 percent detection rate on the INRIA pedestrian data set [21], VJ outputs over 10 false positives per image (FPPI), HOG [21] outputs ~1 FPPI, and more recent methods [29], [30] output well under 0.1 FPPI (data from [31], [32]). In comparing the different detection schemes one notices that the representations at the front end are progressively enriched (e.g., more channels, finer scale sampling, enhanced normalization schemes); this has helped fuel the dramatic improvements in detection accuracy witnessed over the course of the last decade.
Unfortunately, improved detection accuracy has been accompanied by increased computational costs. The VJ detector ran at ~15 frames per second (fps) over a decade ago; in contrast, most recent detectors require multiple seconds to process a single image as they compute richer image representations [31]. This has practical importance: in many applications of visual recognition, such as robotics, human-computer interaction, automotive safety, and mobile devices, fast detection rates and low computational requirements are of the essence.

Thus, while increasing the redundancy of the representation offers improved detection and false-alarm rates, it is paid for by increased computational costs. Is this a necessary tradeoff? In this work we offer the hoped-for but surprising answer: no.
We demonstrate how to compute richer representations
without paying a large computational price. How is this
possible? The key insight is that natural images have fractal
statistics [7], [33], [34] that we can exploit to reliably predict image structure across scales. Our analysis and experiments show that this makes it possible to inexpensively estimate features at a dense set of scales by extrapolating computations carried out expensively, but infrequently, at a coarsely sampled set of scales.

P. Dollar is with the Interactive Visual Media Group at Microsoft Research, One Microsoft Way, Redmond, WA 98052. E-mail: pdollar@microsoft.com.
R. Appel and P. Perona are with the Department of Electrical Engineering, California Institute of Technology, Pasadena, CA. E-mail: {appel, perona}@caltech.edu.
S. Belongie is with Cornell NYC Tech and the Cornell Computer Science Department.
Manuscript received 14 Feb. 2013; revised 15 Dec. 2013; accepted 8 Jan. 2014. Date of publication 15 Jan. 2014; date of current version 10 July 2014.
Recommended for acceptance by D. Forsyth.
For information on obtaining reprints of this article, please send e-mail to: reprints@ieee.org, and reference the Digital Object Identifier below.
Digital Object Identifier no. 10.1109/TPAMI.2014.2300479
1532 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 36, NO. 8, AUGUST 2014
0162-8828 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Our insight leads to considerably decreased run-times
for state-of-the-art object detectors that rely on rich repre-
sentations, including histograms of gradients [21], with neg-
ligible impact on their detection rates. We demonstrate the
effectiveness of our proposed fast feature pyramids with
three distinct detection frameworks including integral chan-
nel features (ICF) [29], aggregate channel features (a novel
variant of integral channel features), and deformable part
models (DPM) [35]. We show results for both pedestrian
detection (measured on the Caltech [31], INRIA [21], TUD-
Brussels [36] and ETH [37] data sets) and general object
detection (measured on the PASCAL VOC [38]). Demon-
strated speedups are significant and impact on accuracy is
relatively minor.
Building on our work on fast feature pyramids (first pre-
sented in [39]), a number of systems show state-of-the-art
accuracy while running at frame rate on 640 × 480 images.
Aggregate channel features, described in this paper, operate
at over 30 fps while achieving top results on pedestrian
detection. Crosstalk cascades [40] use fast feature pyramids
and couple detector evaluations of nearby windows to
achieve speeds of 35-65 fps. Benenson et al. [30] imple-
mented fast feature pyramids on a GPU, and with addi-
tional innovations achieved detection rates of over 100 fps.
In this work we examine and analyze feature scaling and its
effect on object detection in far more detail than in our pre-
vious work [39].
The remainder of this paper is organized as follows.
We review related work in Section 2. In Section 3 we
show that it is possible to create high fidelity approxima-
tions of multiscale gradient histograms using gradients
computed at a single scale. In Section 4 we generalize this
finding to a broad family of feature types. We describe
our efficient scheme for computing finely sampled feature
pyramids in Section 5. In Section 6 we show applications
of fast feature pyramids to object detection, resulting in
considerable speedups with minor loss in accuracy. We
conclude in Section 7.
2 RELATED WORK
Significant research has been devoted to scale space theory
[41], including real time implementations of octave and
half-octave image pyramids [42], [43]. Sparse image pyra-
mids often suffice for certain approximations, e.g., [42]
shows how to recover a disk’s characteristic scale using
half-octave pyramids. Although only loosely related, these
ideas provide the intuition that finely sampled feature pyra-
mids can perhaps be approximated.
Fast object detection has been of considerable interest in
the community. Notable recent efforts for increasing detec-
tion speed include work by Felzenszwalb et al. [44] and
Pedersoli et al. [45] on cascaded and coarse-to-fine deform-
able part models, respectively, Lampert et al.’s [46] applica-
tion of branch and bound search for detection, and Dollar
et al.’s work on crosstalk cascades [40]. Cascades [27], [47],
[48], [49], [50], coarse-to-fine search [51], distance trans-
forms [52], etc., all focus on optimizing classification speed
given precomputed image features. Our work focuses on
fast feature pyramid construction and is thus complemen-
tary to such approaches.
An effective framework for object detection is the sliding
window paradigm [27], [53]. Top performing methods on
pedestrian detection [31] and the PASCAL VOC [38] are
based on sliding windows over multiscale feature pyramids
[21], [29], [35]; fast feature pyramids are well suited for such
sliding window detectors. Alternative detection paradigms
have been proposed [54], [55], [56], [57], [58], [59]. Although
a full review is outside the scope of this work, the approxi-
mations we propose could potentially be applicable to such
schemes as well.
As mentioned, a number of state-of-the-art detectors
have recently been introduced that exploit our fast feature
pyramid construction to operate at frame rate including
[40] and [30]. Alternatively, parallel implementation using
GPUs [60], [61], [62] can achieve fast detection while using
rich representations but at the cost of added complexity
and hardware requirements. Zhu et al. [63] proposed fast
computation of gradient histograms using integral histo-
grams [64]; the proposed system was real time for single-
scale detection only. In scenarios such as automotive appli-
cations, real time systems have also been demonstrated
[65], [66]. The insights outlined in this paper allow for real
time multiscale detection in general, unconstrained settings.
3 MULTISCALE GRADIENT HISTOGRAMS
We begin by exploring a simple question: given image gra-
dients computed at one scale, is it possible to approximate gradient
histograms at a nearby scale solely from the computed gradients?
If so, then we can avoid computing gradients over a finely
sampled image pyramid. Intuitively, one would expect this
to be possible, as significant image structure is preserved
when an image is resampled. We begin with an in-depth
look at a simple form of gradient histograms and develop a
more general theory in Section 4.
A gradient histogram measures the distribution of the gradient angles within an image. Let $I(x, y)$ denote an $m \times n$ discrete signal, and $\partial I/\partial x$ and $\partial I/\partial y$ denote the discrete derivatives of $I$ (typically 1D centered first differences are used). Gradient magnitude and orientation are defined by: $M(i,j)^2 = \frac{\partial I}{\partial x}(i,j)^2 + \frac{\partial I}{\partial y}(i,j)^2$ and $O(i,j) = \arctan\big(\frac{\partial I}{\partial y}(i,j) \big/ \frac{\partial I}{\partial x}(i,j)\big)$. To compute the gradient histogram of an image, each pixel casts a vote, weighted by its gradient magnitude, for the bin corresponding to its gradient orientation. After the orientation $O$ is quantized into $Q$ bins so that $O(i,j) \in \{1, \ldots, Q\}$, the $q$th bin of the histogram is defined by: $h_q = \sum_{i,j} M(i,j)\,\mathbf{1}[O(i,j) = q]$, where $\mathbf{1}$ is the indicator function. In the following everything that holds for global histograms also applies to local histograms (defined identically except for the range of the indices $i$ and $j$).
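As a concrete sketch of the definitions above (our own Python/NumPy rendering, not the authors' toolbox code; the orientation quantization is one common choice):

```python
import numpy as np

def gradient_histogram(I, Q=6):
    """Compute a Q-bin gradient histogram h_q = sum_ij M(i,j) * 1[O(i,j) == q].

    I: 2D float array (grayscale image). Uses centered first differences
    for the discrete derivatives, as in the text.
    """
    dIdy, dIdx = np.gradient(I)                  # centered differences
    M = np.sqrt(dIdx ** 2 + dIdy ** 2)           # gradient magnitude
    O = np.arctan2(dIdy, dIdx) % np.pi           # orientation folded into [0, pi)
    q = np.minimum((O / np.pi * Q).astype(int), Q - 1)  # quantize into Q bins
    h = np.zeros(Q)
    np.add.at(h, q, M)                           # each pixel votes with weight M
    return h
```

For a pure horizontal ramp, all the gradient mass falls into the first orientation bin.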
3.1 Gradient Histograms in Upsampled Images
Intuitively the information content of an upsampled image is similar to that of the original, lower-resolution image (upsampling does not create new structure). Assume $I$ is a continuous signal, and let $I'$ denote $I$ upsampled by a factor of $k$: $I'(x, y) \triangleq I(x/k, y/k)$. Using the definition of a derivative, one can show that $\frac{\partial I'}{\partial x}(i, j) = \frac{1}{k}\frac{\partial I}{\partial x}(i/k, j/k)$, and likewise for $\frac{\partial I'}{\partial y}$, which simply states the intuitive fact that the rate of change in the upsampled image is $k$ times slower than the rate of change in the original image. While not exact, the above also holds approximately for interpolated discrete signals.
DOLLAR ET AL.: FAST FEATURE PYRAMIDS FOR OBJECT DETECTION 1533
Let $M'(i,j) \approx \frac{1}{k} M(\lceil i/k \rceil, \lceil j/k \rceil)$ denote the gradient magnitude in an upsampled discrete image. Then:

$$\sum_{i=1}^{kn} \sum_{j=1}^{km} M'(i,j) \approx \sum_{i=1}^{kn} \sum_{j=1}^{km} \frac{1}{k} M(\lceil i/k \rceil, \lceil j/k \rceil) = k^2 \sum_{i=1}^{n} \sum_{j=1}^{m} \frac{1}{k} M(i,j) = k \sum_{i=1}^{n} \sum_{j=1}^{m} M(i,j). \qquad (1)$$

Thus, the sum of gradient magnitudes in the original and upsampled image should be related by about a factor of $k$.
Angles should also be mostly preserved since $\frac{\partial I'}{\partial x}(i,j) \big/ \frac{\partial I'}{\partial y}(i,j) \approx \frac{\partial I}{\partial x}(i/k, j/k) \big/ \frac{\partial I}{\partial y}(i/k, j/k)$. Therefore, according to the definition of gradient histograms, we expect the relationship between $h_q$ (computed over $I$) and $h'_q$ (computed over $I'$) to be: $h'_q \approx k h_q$. This allows us to approximate gradient histograms in an upsampled image using gradients computed at the original scale.
Experiments. One may verify experimentally that in images of natural scenes, upsampled using bilinear interpolation, the approximation $h'_q \approx k h_q$ is reasonable. We use two sets of images for these experiments, one class specific and one class independent. First, we use the 1,237 cropped pedestrian images from the INRIA pedestrians training data set [21]. Each image is 128 × 64 and contains a pedestrian approximately 96 pixels tall. The second image set contains 128 × 64 windows cropped at random positions from the 1,218 images in the INRIA negative training set. We sample 5,000 windows but exclude nearly uniform windows, i.e., those with average gradient magnitude under 0.01, resulting in 4,280 images. We refer to the two sets as 'pedestrian images' and 'natural images', although the latter is biased toward scenes that may (but do not) contain pedestrians.

In order to measure the fidelity of this approximation, we define the ratio $r_q = h'_q / h_q$ and quantize orientation into $Q = 6$ bins. Fig. 1a shows the distribution of $r_q$ for one bin on the 1,237 pedestrian and 4,280 natural images given an upsampling of $k = 2$ (results for other bins were similar). In both cases the mean is $\mu \approx 2$, as expected, and the variance is relatively small, meaning the approximation is unbiased and reasonable.
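A quick numerical check of this relation (equivalently, that the total gradient magnitude grows by about $k$ under upsampling, per Eq. (1)) can be sketched as follows; the bilinear helper and the smooth test image are our own choices, not the paper's experimental code:

```python
import numpy as np

def bilinear_upsample(I, k=2):
    """Upsample 2D image I by integer factor k with bilinear interpolation."""
    n, m = I.shape
    y = np.clip((np.arange(n * k) + 0.5) / k - 0.5, 0, n - 1)
    x = np.clip((np.arange(m * k) + 0.5) / k - 0.5, 0, m - 1)
    y0, x0 = np.floor(y).astype(int), np.floor(x).astype(int)
    y1, x1 = np.minimum(y0 + 1, n - 1), np.minimum(x0 + 1, m - 1)
    wy, wx = (y - y0)[:, None], (x - x0)[None, :]
    return ((1 - wy) * (1 - wx) * I[np.ix_(y0, x0)]
            + (1 - wy) * wx * I[np.ix_(y0, x1)]
            + wy * (1 - wx) * I[np.ix_(y1, x0)]
            + wy * wx * I[np.ix_(y1, x1)])

def grad_mag_sum(I):
    """Sum of gradient magnitudes M(i, j) over the whole image."""
    dIdy, dIdx = np.gradient(I)
    return np.sqrt(dIdx ** 2 + dIdy ** 2).sum()

# For a smooth image, the sum of gradient magnitudes in the k-times
# upsampled image should be roughly k times that of the original.
I = np.add.outer(np.sin(np.arange(64) / 8.0), np.cos(np.arange(64) / 8.0))
ratio = grad_mag_sum(bilinear_upsample(I, 2)) / grad_mag_sum(I)
```

Boundary effects and interpolation blur make the ratio only approximately 2, mirroring the small variance observed in Fig. 1a.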
Thus, although individual gradients may change, gradi-
ent histograms in an upsampled and original image will be
related by a multiplicative constant roughly equal to the
scale change between them. We examine gradient histo-
grams in downsampled images next.
3.2 Gradient Histograms in Downsampled Images
While the information content of an upsampled image is
roughly the same as that of the original image, information
is typically lost during downsampling. However, we find
that the information loss is consistent and the resulting
approximation takes on a similarly simple form.
If $I$ contains little high frequency energy, then the approximation $h'_q \approx k h_q$ derived in Section 3.1 should apply. In general, however, downsampling results in loss of high frequency content which can lead to measured gradients undershooting the extrapolated gradients. Let $I'$ now denote $I$ downsampled by a factor of $k$. We expect that $h_q$ (computed over $I$) and $h'_q$ (computed over $I'$) will satisfy $h'_q \lesssim h_q / k$. The question we seek to answer here is whether the information loss is consistent.

Fig. 1. Behavior of gradient histograms in images resampled by a factor of two. (a) Upsampling gradients. Given images $I$ and $I'$ where $I'$ denotes $I$ upsampled by two, and corresponding gradient magnitude images $M$ and $M'$, the ratio $\Sigma M' / \Sigma M$ should be approximately 2. The middle/bottom panels show the distribution of this ratio for gradients at fixed orientation over pedestrian/natural images. In both cases the mean $\mu \approx 2$, as expected, and the variance is relatively small. (b) Downsampling gradients. Given images $I$ and $I'$ where $I'$ denotes $I$ downsampled by two, the ratio $\Sigma M' / \Sigma M \approx 0.34$, not 0.5 as might be expected from (a), as downsampling results in loss of high frequency content. (c) Downsampling normalized gradients. Given normalized gradient magnitude images $\widetilde{M}$ and $\widetilde{M}'$, the ratio $\Sigma \widetilde{M}' / \Sigma \widetilde{M} \approx 0.27$. Instead of trying to derive analytical expressions governing the scaling properties of various feature types under different resampling factors, in Section 4 we describe a general law governing feature scaling.

Experiments. As before, define $r_q = h'_q / h_q$. In Fig. 1b we show the distribution of $r_q$ for a single bin on the pedestrian and natural images given a downsampling factor of $k = 2$. Observe that the information loss is consistent: $r_q$ is normally distributed around $\mu \approx 0.34 < 0.5$ for natural images (and similarly $\mu \approx 0.33$ for pedestrians). This implies that $h'_q \approx \mu h_q$ could serve as a reasonable approximation for gradient histograms in images downsampled by $k = 2$.

In other words, similarly to upsampling, gradient histograms computed over original and half resolution images tend to differ by a multiplicative constant (although the constant is not the inverse of the sampling factor). In Fig. 2 we show the quality of the above approximations on example images. The agreement between predictions and observations is accurate for typical images (but fails for images with atypical statistics).
3.3 Histograms of Normalized Gradients
Suppose we replaced the gradient magnitude $M$ by the normalized gradient magnitude $\widetilde{M}$ defined as $\widetilde{M}(i,j) = M(i,j) / (\bar{M}(i,j) + 0.005)$, where $\bar{M}$ is the average gradient magnitude in each $11 \times 11$ image patch (computed by convolving $M$ with an L1 normalized $11 \times 11$ triangle filter). Using the normalized gradient $\widetilde{M}$ gives improved results in the context of object detection (see Section 6). Observe that we have now introduced an additional nonlinearity to the gradient computation; do the previous results for gradient histograms still hold if we use $\widetilde{M}$ instead of $M$?

In Fig. 1c we plot the distribution of $r_q = h'_q / h_q$ for histograms of normalized gradients given a downsampling factor of $k = 2$. As with the original gradient histograms, the distributions of $r_q$ are normally distributed and have similar means for pedestrian and natural images ($\mu \approx 0.26$ and $\mu \approx 0.27$, respectively). Observe, however, that the expected value of $r_q$ for normalized gradient histograms is quite different than for the original histograms (Fig. 1b).
Deriving analytical expressions governing the scaling
properties of progressively more complex feature types
would be difficult or even impossible. Instead, in Section 4
we describe a general law governing feature scaling.
4 STATISTICS OF MULTISCALE FEATURES
To understand more generally how features behave in
resampled images, we turn to the study of natural image
statistics [7], [33]. The analysis below provides a deep
understanding of the behavior of multiscale features. The
practical result is a simple yet powerful approach for pre-
dicting the behavior of gradients and other low-level fea-
tures in resampled images without resorting to analytical
derivations that may be difficult except under the simplest
conditions.
We begin by defining a broad family of features. Let $V$ be any low-level shift invariant function that takes an image $I$ and creates a new channel image $C = V(I)$, where a channel $C$ is a per-pixel feature map such that output pixels in $C$ are computed from corresponding patches of input pixels in $I$ (thus preserving overall image layout). $C$ may be downsampled relative to $I$ and may contain multiple layers $k$. We define a feature $f_V(I)$ as a weighted sum of the channel $C = V(I)$: $f_V(I) = \sum_{ijk} w_{ijk}\, C(i,j,k)$. Numerous local and global features can be written in this form including gradient histograms, linear filters, color statistics, and others [29]. Any such low-level shift invariant $V$ can be used, making this representation quite general.

Fig. 2. Approximating gradient histograms in images resampled by a factor of two. For each image set, we take the original image (green border) and generate an upsampled (blue) and downsampled (orange) version. At each scale we compute a gradient histogram with eight bins, multiplying each bin by 0.5 and 1/0.34 in the upsampled and downsampled histogram, respectively. Assuming the approximations from Section 3 hold, the three normalized gradient histograms should be roughly equal (the blue, green, and orange bars should have the same height at each orientation). For the first four cases, the approximations are fairly accurate. In the last two cases, showing highly structured Brodatz textures with significant high frequency content, the downsampling approximation fails. The first four images are representative, the last two are carefully selected to demonstrate images with atypical statistics.
Let $I_s$ denote $I$ at scale $s$, where the dimensions $h_s \times w_s$ of $I_s$ are $s$ times the dimensions of $I$. For $s > 1$, $I_s$ (which denotes a higher resolution version of $I$) typically differs from $I$ upsampled by $s$, while for $s < 1$ an excellent approximation of $I_s$ can be obtained by downsampling $I$. Next, for simplicity we redefine $f_V(I_s)$ as¹:

$$f_V(I_s) \triangleq \frac{1}{h_s w_s k} \sum_{ijk} C_s(i,j,k) \quad \text{where } C_s = V(I_s). \qquad (2)$$

In other words $f_V(I_s)$ denotes the global mean of $C_s$ computed over locations $ij$ and layers $k$. Everything in the following derivations based on global means also holds for local means (e.g., local histograms).

Our goal is to understand how $f_V(I_s)$ behaves as a function of $s$ for any choice of shift invariant $V$.
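A minimal sketch of this family (the channel choice and the names are ours, not the paper's): a channel $V$ maps an image to a per-pixel feature map, and $f_V$ of Eq. (2) is simply its global mean:

```python
import numpy as np

def grad_mag_channel(I):
    """A simple shift-invariant channel V: per-pixel gradient magnitude."""
    dIdy, dIdx = np.gradient(I)
    return np.sqrt(dIdx ** 2 + dIdy ** 2)

def f_V(I, V=grad_mag_channel):
    """f_V(I): global mean of the channel image C = V(I), as in Eq. (2)."""
    C = V(I)
    return C.mean()
```

Any shift-invariant channel (linear filters, color transforms, local statistics) can be substituted for `grad_mag_channel` without changing `f_V`.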
4.1 Power Law Governs Feature Scaling
Ruderman and Bialek [33], [67] explored how the statistics of natural images behave as a function of the scale at which an image ensemble was captured, i.e., the visual angle corresponding to a single pixel. Let $f(I)$ denote an arbitrary (scalar) image statistic and $E[\cdot]$ denote expectation over an ensemble of natural images. Ruderman and Bialek made the fundamental discovery that the ratio of $E[f(I_{s_1})]$ to $E[f(I_{s_2})]$, computed over two ensembles of natural images captured at scales $s_1$ and $s_2$, respectively, depends only on the ratio $s_1/s_2$ and is independent of the absolute scales $s_1$ and $s_2$ of the ensembles.

Ruderman and Bialek's findings imply that $E[f(I_s)]$ follows a power law²:

$$E[f(I_{s_1})] \,/\, E[f(I_{s_2})] = (s_1/s_2)^{-\lambda_f}. \qquad (3)$$

Every statistic $f$ will have its own corresponding $\lambda_f$. In the context of our work, for any channel type $V$ we can use the scalar $f_V(I)$ in place of $f(I)$ and $\lambda_V$ in place of $\lambda_f$. While Eq. (3) gives the behavior of $f_V$ w.r.t. scale over an ensemble of images, we are interested in the behavior of $f_V$ for a single image.
We observe that a single image can itself be considered an ensemble of image patches (smaller images). Since $V$ is shift invariant, we can interpret $f_V(I)$ as computing the average of $f_V(I^k)$ over every patch $I^k$ of $I$ and therefore Eq. (3) can be applied directly for a single image. We formalize this below.

We can decompose an image $I$ into $K$ smaller images $I^1 \ldots I^K$ such that $I = [I^1 \cdots I^K]$. Given that $V$ must be shift invariant and ignoring boundary effects gives $V(I) = V([I^1 \cdots I^K]) \approx [V(I^1) \cdots V(I^K)]$, and substituting into Eq. (2) yields $f_V(I) \approx \sum_k f_V(I^k)/K$. However, we can consider $I^1 \cdots I^K$ as a (small) image ensemble, and $f_V(I) \approx E[f_V(I^k)]$ an expectation over that ensemble. Therefore, substituting $f_V(I_{s_1}) \approx E[f_V(I^k_{s_1})]$ and $f_V(I_{s_2}) \approx E[f_V(I^k_{s_2})]$ into Eq. (3) yields:

$$f_V(I_{s_1}) \,/\, f_V(I_{s_2}) = (s_1/s_2)^{-\lambda_V} + \mathcal{E}, \qquad (4)$$

where we use $\mathcal{E}$ to denote the deviation from the power law for a given image. Each channel type $V$ has its own corresponding $\lambda_V$, which we can determine empirically.

In Section 4.2 we show that on average Eq. (4) provides a remarkably good fit for multiple channel types and image sets (i.e., we can fit $\lambda_V$ such that $E[\mathcal{E}] \approx 0$). Additionally, experiments in Section 4.3 indicate that the magnitude of deviation for individual images, $E[\mathcal{E}^2]$, is reasonable and increases only gradually as a function of $s_1/s_2$.
4.2 Estimating $\lambda$
We perform a series of experiments to verify Eq. (4) and estimate $\lambda_V$ for numerous channel types $V$.

To estimate $\lambda_V$ for a given $V$, we first compute:

$$\mu_s = \frac{1}{N} \sum_{i=1}^{N} f_V(I^i_s) \,/\, f_V(I^i_1), \qquad (5)$$

for $N$ images $I^i$ and multiple values of $s < 1$, where $I^i_s$ is obtained by downsampling $I^i_1 = I^i$. We use two image ensembles, one of $N = 1{,}237$ pedestrian images and one of $N = 4{,}280$ natural images (for details see Section 3.1). According to Eq. (4), $\mu_s = s^{-\lambda_V} + E[\mathcal{E}]$. Our goal is to fit $\lambda_V$ accordingly and verify the fidelity of Eq. (4) for various channel types $V$ (i.e., verify that $E[\mathcal{E}] \approx 0$).

For each $V$, we measure $\mu_s$ according to Eq. (5) across three octaves with eight scales per octave for a total of 24 measurements at $s = 2^{-1/8}, \ldots, 2^{-24/8}$. Since image dimensions are rounded to the nearest integer, we compute and use $s' = \sqrt{h_s w_s / (hw)}$, where $h \times w$ and $h_s \times w_s$ are the dimensions of the original and downsampled images, respectively.

In Fig. 3 we plot $\mu_s$ versus $s'$ using a log-log plot for six channel types for both the pedestrian and natural images.³ In all cases $\mu_s$ follows a power law with all measurements falling along a line on the log-log plots, as predicted. However, close inspection shows $\mu_s$ does not start exactly at 1 as expected: downsampling introduces a minor amount of blur even for small downsampling factors. We thus expect
$\mu_s$ to have the form $\mu_s = a_V s^{-\lambda_V}$, with $a_V \neq 1$ as an artifact of the interpolation. Note that $a_V$ is only necessary for estimating $\lambda_V$ from downsampled images and is not used subsequently. To estimate $a_V$ and $\lambda_V$, we use a least squares fit of $\log_2(\mu_{s'}) = a'_V - \lambda_V \log_2(s')$ to the 24 measurements computed over natural images (and set $a_V = 2^{a'_V}$). Resulting estimates of $\lambda_V$ are given in plot legends in Fig. 3.

1. The definition of $f_V(I_s)$ in Eq. (2) differs from our previous definition in [39], where $f(I, s)$ denoted the channel sum after resampling by $2^s$. The new definition and notation allow for a cleaner derivation, and the exponential scaling law becomes a more intuitive power law.

2. Let $F(s) = E[f(I_s)]$. We can rewrite the observation by saying there exists a function $R$ such that $F(s_1)/F(s_2) = R(s_1/s_2)$. Applying repeatedly gives $F(s_1)/F(1) = R(s_1)$, $F(1)/F(s_2) = R(1/s_2)$, and $F(s_1)/F(s_2) = R(s_1/s_2)$. Therefore $R(s_1/s_2) = R(s_1)R(1/s_2)$. Next, let $R'(s) = R(e^s)$ and observe that $R'(s_1 + s_2) = R'(s_1)R'(s_2)$ since $R(s_1 s_2) = R(s_1)R(s_2)$. If $R'$ is also continuous and non-zero, then it must take the form $R'(s) = e^{-\lambda s}$ for some constant $\lambda$ [68]. This implies $R(s) = R'(\ln(s)) = e^{-\lambda \ln(s)} = s^{-\lambda}$. Therefore, $E[f(I_s)]$ must follow a power law (see also Eq. (9) in [67]).

3. Fig. 3 generalizes the results shown in Fig. 1. However, by switching from channel sums to channel means, $\mu_{1/2}$ in Figs. 3a and 3b is $4\times$ larger than $\mu$ in Figs. 1b and 1c, respectively.
There is strong agreement between the resulting best-fit lines and the observations. In legend brackets in Fig. 3 we report the expected error $|E[\mathcal{E}]| = |\mu_s - a_V s^{-\lambda_V}|$ for both natural and pedestrian images averaged over $s$ (using $a_V$ and $\lambda_V$ estimated using natural images). For basic gradient histograms $|E[\mathcal{E}]| = 0.018$ for natural images and $|E[\mathcal{E}]| = 0.037$ for pedestrian images. Indeed, for every channel type Eq. (4) is an excellent fit to the observations $\mu_s$ for both image ensembles.

The derivation of Eq. (4) depends on the distribution of image statistics being stationary with respect to scale; that this holds for all channel types tested, and with nearly an identical constant for both pedestrian and natural images, shows the estimate of $\lambda_V$ is robust and generally applicable.
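The fitting procedure of this section can be sketched as follows (a simplified Python stand-in for the toolbox's chnsScaling.m; function names and the synthetic test are our own assumptions):

```python
import numpy as np

def estimate_lambda(images, f_V, downsample, n_octaves=3, per_octave=8):
    """Estimate lambda_V by a least squares fit of log2(mu_s) = a'_V - lambda_V*log2(s).

    images: image ensemble; f_V(I): scalar channel feature as in Eq. (2);
    downsample(I, s): resample I by factor s < 1. Returns (a_V, lambda_V).
    """
    scales = 2.0 ** -(np.arange(1, n_octaves * per_octave + 1) / per_octave)
    # mu_s of Eq. (5): mean feature ratio between downsampled and original images
    mus = np.array([np.mean([f_V(downsample(I, s)) / f_V(I) for I in images])
                    for s in scales])
    # columns [1, -log2(s)] so the solution is [a'_V, lambda_V]
    A = np.vstack([np.ones_like(scales), -np.log2(scales)]).T
    a_prime, lam = np.linalg.lstsq(A, np.log2(mus), rcond=None)[0]
    return 2.0 ** a_prime, lam
```

On a synthetic feature that follows an exact power law the fit recovers the exponent exactly; on real channels one would plug in, e.g., a gradient-histogram feature and bilinear downsampling.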
4.3 Deviation for Individual Images
In Section 4.2 we verified that Eq. (4) holds for an ensem-
ble of images; we now examine the magnitude of devia-
tion from the power law for individual images. We study
the effect this has in the context of object detection in
Section 6.
Plots of $f_V(I_{s_1})/f_V(I_{s_2})$ for randomly selected images are shown as faint gray lines in Fig. 3. The individual curves are relatively smooth and diverge only somewhat from the best-fit line. We quantify their deviation by defining $\sigma_s$ analogously to $\mu_s$ in Eq. (5):

$$\sigma_s = \mathrm{stdev}\big[f_V(I^i_s) \,/\, f_V(I^i_1)\big] = \mathrm{stdev}[\mathcal{E}], \qquad (6)$$

where 'stdev' denotes the sample standard deviation (computed over $N$ images) and $\mathcal{E}$ is the error associated with each image and scaling factor as defined in Eq. (4). In Section 4.2 we confirmed that $E[\mathcal{E}] \approx 0$; our goal now is to understand how $\sigma_s = \mathrm{stdev}[\mathcal{E}] \approx \sqrt{E[\mathcal{E}^2]}$ behaves.

In Fig. 4 we plot $\sigma_s$ as a function of $s$ for the same channels as in Fig. 3. In legend brackets we report $\sigma_s$ for $s = \frac{1}{2}$ for both natural and pedestrian images; for all channels studied $\sigma_{1/2} \lesssim 0.2$. In all cases $\sigma_s$ increases gradually with increasing $s$ and the deviation is low for small $s$. The expected magnitude of $\mathcal{E}$ varies across channels; for example, histograms of normalized gradients (Fig. 4b) have lower $\sigma_s$ than their unnormalized counterparts (Fig. 4a). The trivial grayscale channel (Fig. 4d) has $\sigma_s = 0$ as the approximation is exact.

Observe that often $\sigma_s$ is greater for natural images than for pedestrian images. Many of the natural images contain relatively little structure (e.g., a patch of sky); for such images $f_V(I)$ is small for certain $V$ (e.g., simple gradient histograms), resulting in more variance in the ratio in Eq. (4). For HOG channels (Fig. 4f), which have additional normalization, this effect is minimized.
4.4 Miscellanea
We conclude this section with additional observations.
Interpolation method. Varying the interpolation algorithm for image resampling does not have a major effect. In Fig. 5a, we plot $\mu_{1/2}$ and $\sigma_{1/2}$ for normalized gradient histograms computed using nearest neighbor, bilinear, and bicubic interpolation. In all three cases both $\mu_{1/2}$ and $\sigma_{1/2}$ remain essentially unchanged.

Window size. All preceding experiments were performed on 128 × 64 windows. In Fig. 5b we plot the effect of varying the window size. While $\mu_{1/2}$ remains relatively constant,
$\sigma_{1/2}$ increases with decreasing window size (see also the derivation of Eq. (4)).

Fig. 3. Power law feature scaling. For each of six channel types we plot $\mu_s = \frac{1}{N} \sum f_V(I^i_s)/f_V(I^i_1)$ for $s \approx 2^{-1/8}, \ldots, 2^{-24/8}$ on a log-log plot for both pedestrian and natural image ensembles. Plots of $f_V(I_{s_1})/f_V(I_{s_2})$ for 20 randomly selected pedestrian images are shown as faint gray lines. Additionally the best-fit line to $\mu_s$ for the natural images is shown. The resulting $\lambda$ and expected error $|E[\mathcal{E}]|$ are given in the plot legends. In all cases the $\mu_s$ follow a power law as predicted by Eq. (4) and are nearly identical for both pedestrian and natural images, showing the estimate of $\lambda$ is robust and generally applicable. The tested channels are: (a) histograms of gradients described in Section 3; (b) histograms of normalized gradients described in Section 3.3; (c) a difference of gaussian (DoG) filter (with inner and outer $\sigma$ of 0.71 and 1.14, respectively); (d) grayscale images (with $\lambda = 0$ as expected); (e) pixel standard deviation computed over local $5 \times 5$ neighborhoods, $C(i,j) = \sqrt{E[I(i,j)^2] - E[I(i,j)]^2}$; (f) HOG [21] with $4 \times 4$ spatial bins (results were averaged over HOG's 36 channels). Code for generating such plots is available (see chnsScaling.m in Piotr's Toolbox).
Upsampling. The power law can predict features in higher resolution images but not upsampled images. In practice, though, we want to predict features in higher resolution as opposed to (smooth) upsampled images.

Robust estimation. In preceding derivations, when computing $f_V(I_{s_1})/f_V(I_{s_2})$ we assumed that $f_V(I_{s_2}) \neq 0$. For the $V$'s considered this was the case after windows of near uniform intensity were excluded (see Section 3.1). Alternatively, we have found that excluding $I$ with $f_V(I) \approx 0$ when estimating $\lambda$ results in more robust estimates.

Sparse channels. For sparse channels where frequently $f_V(I) \approx 0$, e.g., the output of a sliding-window object detector, $\sigma$ will be large. Such channels may not be good candidates for the power law approximation.

One-shot estimates. We can estimate $\lambda$ as described in Section 4.2 using a single image in place of an ensemble ($N = 1$). Such estimates are noisy but not entirely unreasonable; e.g., on normalized gradient histograms (with $\lambda \approx 0.101$) the mean of 4,280 single image estimates of $\lambda$ is 0.096 and the standard deviation of the estimates is 0.073.

Scale range. We expect the power law to break down at extreme scales not typically encountered under natural viewing conditions (e.g., under high magnification).
5 FAST FEATURE PYRAMIDS
We introduce a novel, efficient scheme for computing fea-
ture pyramids. First, in Section 5.1 we outline an approach
for scaling feature channels. Next, in Section 5.2 we show its
application to constructing feature pyramids efficiently and
we analyze computational complexity in Section 5.3.
5.1 Feature Channel Scaling
We propose an extension of the power law governing feature scaling introduced in Section 4 that applies directly to channel images. As before, let $I_s$ denote $I$ captured at scale $s$ and $R(I, s)$ denote $I$ resampled by $s$. Suppose we have computed $C = V(I)$; can we predict the channel image $C_s = V(I_s)$ at a new scale $s$ using only $C$?

The standard approach is to compute $C_s = V(R(I, s))$, ignoring the information contained in $C = V(I)$. Instead, we propose the following approximation:

$$C_s \approx R(C, s) \cdot s^{-\lambda_V}. \qquad (7)$$

A visual demonstration of Eq. (7) is shown in Fig. 6.
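Eq. (7) translates directly into code. The sketch below uses nearest-neighbor resampling for $R$ and our own helper names (the paper's toolbox uses bilinear resampling):

```python
import numpy as np

def resample_nn(C, s):
    """R(C, s): nearest-neighbor resampling of channel C by factor s (a simplification)."""
    n, m = C.shape
    rows = np.minimum((np.arange(max(1, round(n * s))) / s).astype(int), n - 1)
    cols = np.minimum((np.arange(max(1, round(m * s))) / s).astype(int), m - 1)
    return C[np.ix_(rows, cols)]

def approx_channel(C, s, lam):
    """Approximate C_s = V(I_s) via Eq. (7): R(C, s) * s**(-lambda_V)."""
    return resample_nn(C, s) * s ** -lam
```

For example, halving the scale of a channel with $\lambda_V = 1$ doubles its predicted values, since $0.5^{-1} = 2$.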
Fig. 5. Effect of the interpolation algorithm and window size on channel scaling. We plot $\mu_{1/2}$ (bar height) and $\sigma_{1/2}$ (error bars) for normalized gradient histograms (see Section 3.3). (a) Varying the interpolation algorithm for resampling does not have a major effect on either $\mu_{1/2}$ or $\sigma_{1/2}$. (b) Decreasing window size leaves $\mu_{1/2}$ relatively unchanged but results in increasing $\sigma_{1/2}$.

Fig. 6. Feature channel scaling. Suppose we have computed $C = V(I)$; can we predict $C_s = V(I_s)$ at a new scale $s$? Top: the standard approach is to compute $C_s = V(R(I, s))$, ignoring the information contained in $C = V(I)$. Bottom: instead, based on the power law introduced in Section 4, we propose to approximate $C_s$ by $R(C, s) \cdot s^{-\lambda_V}$. This approach is simple, general, and accurate, and allows for fast feature pyramid construction.

Fig. 4. Power law deviation for individual images. For each of the six channel types described in Fig. 3 we plot $\sigma_s$ versus $s$, where $\sigma_s = \sqrt{E[\mathcal{E}^2]}$ and $\mathcal{E}$ is the deviation from the power law for a single image as defined in Eq. (4). In brackets we report $\sigma_{1/2}$ for both natural and pedestrian images. $\sigma_s$ increases gradually as a function of $s$, meaning that not only does Eq. (4) hold for an ensemble of images but also the deviation from the power law for individual images is low for small $s$.
Eq. (7) follows from Eq. (4). Setting s₁ = s, s₂ = 1, and
rearranging Eq. (4) gives f_Ω(I_s) ≈ f_Ω(I) · s^(−λ_Ω). This relation
must hold not only for the original images but also for any
pair of corresponding windows w_s and w in I_s and I, respec-
tively. Expanding yields:

  f_Ω(I_s^{w_s}) ≈ f_Ω(I^w) · s^(−λ_Ω)
  (1/|w_s|) Σ_{i,j ∈ w_s} C_s(i, j) ≈ (1/|w|) Σ_{i,j ∈ w} C(i, j) · s^(−λ_Ω)
  C_s ≈ R(C, s) · s^(−λ_Ω).

The final line follows because if Σ_{w_s} C′/|w_s| ≈ Σ_w C/|w| for all
corresponding windows, then C′ ≈ R(C, s).
On a per-pixel basis, the approximation of C_s in Eq. (7)
may be quite noisy. The standard deviation σ_s of the ratio
f_Ω(I_s^{w_s}) / f_Ω(I^w) depends on the size of the window w: σ_s
increases as w decreases (see Fig. 5b). Therefore, the accu-
racy of the approximation for C_s will improve if information
is aggregated over multiple pixels of C_s.
A simple strategy for aggregating over multiple pixels,
and thus improving robustness, is to downsample and/or
smooth C_s relative to I_s (each pixel in the resulting channel
will be a weighted sum of pixels in the original full-resolu-
tion channel). Downsampling C_s also allows for faster pyra-
mid construction (we return to this in Section 5.2). For
object detection, we typically downsample channels by 4×
to 8× (e.g., HOG [21] uses 8 × 8 bins).
5.2 Fast Feature Pyramids
A feature pyramid is a multi-scale representation of an
image I in which channels C_s = Ω(I_s) are computed at every
scale s. Scales are sampled evenly in log-space, starting at
s = 1, with typically four to 12 scales per octave (an octave
is the interval between one scale and another with half or
double its value). The standard approach to constructing a
feature pyramid is to compute C_s = Ω(R(I, s)) for every s,
see Fig. 7 (top).
The approximation in Eq. (7) suggests a straightforward
method for efficient feature pyramid construction. We begin
by computing C_s = Ω(R(I, s)) at just one scale per octave
(s ∈ {1, 1/2, 1/4, ...}). At intermediate scales, C_s is
computed using C_s ≈ R(C_{s′}, s/s′) · (s/s′)^(−λ_Ω), where
s′ ∈ {1, 1/2, 1/4, ...} is the nearest scale for which we have
C_{s′} = Ω(I_{s′}), see Fig. 7 (bottom).
Computing C_s = Ω(R(I, s)) at one scale per octave
provides a good tradeoff between speed and accuracy.
The cost of evaluating Ω is within 33 percent of comput-
ing Ω(I) at the original scale (see Section 5.3), and chan-
nels do not need to be approximated beyond half an
octave (keeping error low, see Section 4.3). While the
number of evaluations of R is constant (evaluations of
R(I, s) are replaced by R(C, s)), if each C_s is down-
sampled relative to I_s as described in Section 5.1, evalu-
ating R(C, s) is faster than R(I, s).
Alternate schemes, such as interpolating between two
nearby scales s′ for each intermediate scale s or evaluating
Ω more densely, could result in even higher pyramid accu-
racy (at increased cost). However, the proposed approach
proves sufficient for object detection (see Section 6).
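The octave-sparse construction just described can be sketched as follows. This is a hedged toy implementation: `resample` is a crude nearest-neighbor stand-in for R, `channel_fn` and `lam` are placeholders for Ω and λ_Ω, and picking the nearest already-computed (coarser-index) octave scale is a simplification of the paper's "nearest scale s′" rule.

```python
import numpy as np

def resample(a, s):
    # crude nearest-neighbor stand-in for R(., s)
    h = max(1, round(a.shape[0] * s))
    w = max(1, round(a.shape[1] * s))
    ri = np.minimum((np.arange(h) / s).astype(int), a.shape[0] - 1)
    ci = np.minimum((np.arange(w) / s).astype(int), a.shape[1] - 1)
    return a[np.ix_(ri, ci)]

def fast_pyramid(img, channel_fn, lam, per_octave=8, n_octaves=3):
    """Compute channels exactly once per octave; extrapolate the rest via Eq. (7)."""
    pyramid, real = {}, {}
    for k in range(per_octave * n_octaves):
        s = 2.0 ** (-k / per_octave)
        if k % per_octave == 0:
            real[s] = channel_fn(resample(img, s))  # C_s = Omega(R(I, s)), once per octave
            pyramid[s] = real[s]
        else:
            # nearest available octave scale s' >= s (a simplification: the
            # paper allows the nearest s' in either direction)
            sp = min(real)
            pyramid[s] = resample(real[sp], s / sp) * (s / sp) ** -lam
    return pyramid

channel_fn = lambda im: np.hypot(*np.gradient(im))  # toy channel
pyr = fast_pyramid(np.random.default_rng(1).random((64, 64)), channel_fn, lam=0.11)
```

Only three of the 24 scales here invoke `channel_fn`; the remaining 21 are cheap resample-and-rescale operations.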
5.3 Complexity Analysis
The computational savings of computing approximate fea-
ture pyramids are significant. Assume the cost of computing
Ω is linear in the number of pixels in an n × n image (as is
often the case). The cost of constructing a feature pyramid
with m scales per octave is

  Σ_{k=0}^{∞} n² · 2^(−2k/m) = n² Σ_{k=0}^{∞} (4^(−1/m))^k = n² / (1 − 4^(−1/m)) ≈ m·n² / ln 4.  (8)
The second equality follows from the formula for the sum of
a geometric series; the last approximation is valid for large m
(and follows from l'Hôpital's rule). In the proposed
approach we compute Ω once per octave (m = 1). The total
cost is (4/3)·n², which is only 33 percent more than the cost of
computing single-scale features. Typical detectors are evalu-
ated on eight to 12 scales per octave [31]; thus, according to
Eq. (8), we achieve an order of magnitude savings over comput-
ing Ω densely (and intermediate C_s are computed efficiently
through resampling afterward).
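The claims in Eq. (8) can be checked numerically: the sketch below sums the series directly and compares it against the closed form and the large-m approximation.

```python
import math

def pyramid_cost(n, m, terms=10_000):
    # left-hand side of Eq. (8): per-scale pixel cost n^2 * 2^(-2k/m), summed over scales
    return sum(n ** 2 * 2.0 ** (-2 * k / m) for k in range(terms))

n = 640
closed_form = lambda m: n ** 2 / (1 - 4.0 ** (-1 / m))  # exact sum of the geometric series
large_m = lambda m: m * n ** 2 / math.log(4)            # approximation for large m

# m = 1 (one scale per octave): total cost is exactly (4/3) n^2
assert abs(pyramid_cost(n, 1) - (4 / 3) * n ** 2) < 1e-6 * n ** 2
# the closed form matches the series; the large-m form is within ~10% already at m = 8
assert abs(pyramid_cost(n, 8) - closed_form(8)) < 1e-6 * closed_form(8)
assert abs(closed_form(8) - large_m(8)) / closed_form(8) < 0.1
```

At m = 8, the series sums to about 6.3·n², versus (4/3)·n² at m = 1, consistent with the order-of-magnitude savings stated above for typical 8-12 scales-per-octave detectors.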
6 APPLICATIONS TO OBJECT DETECTION
We demonstrate the effectiveness of fast feature pyramids
in the context of object detection with three distinct detec-
tion frameworks. First, in Section 6.1 we show the efficacy
of our approach with a simple yet state-of-the-art pedestrian
detector we introduce in this work called Aggregated Channel
Features (ACF). In Section 6.2 we describe an alternate
approach for exploiting approximate multiscale features
using integral images computed over the same channels
(integral channel features or ICF), much as in our previous
work [29], [39]. Finally, in Section 6.3 we approximate HOG
feature pyramids for use with deformable part models [35].
6.1 Aggregated Channel Features
The ACF detection framework is conceptually straightfor-
ward (Fig. 8). Given an input image I, we compute several
Fig. 7. Fast feature pyramids. Color and grayscale icons represent
images and channels; horizontal and vertical arrows denote computation
of R and Ω. Top: the standard pipeline for constructing a feature pyra-
mid requires computing I_s = R(I, s) followed by C_s = Ω(I_s) for every s.
This is costly. Bottom: we propose computing I_s = R(I, s) and
C_s = Ω(I_s) for only a sparse set of s (once per octave). Then, at interme-
diate scales C_s is computed using the approximation in Eq. (7):
C_s ≈ R(C_{s′}, s/s′) · (s/s′)^(−λ_Ω), where s′ is the nearest scale for which we
have C_{s′} = Ω(I_{s′}). In the proposed scheme, the number of computations
of R is constant while (more expensive) computations of Ω are reduced
considerably.
DOLLAR ET AL.: FAST FEATURE PYRAMIDS FOR OBJECT DETECTION 1539
channels C = Ω(I), sum every block of pixels in C, and
smooth the resulting lower resolution channels. Features
are single pixel lookups in the aggregated channels. Boost-
ing is used to train and combine decision trees over these
features (pixels) to distinguish object from background and
a multiscale sliding-window approach is employed. With
the appropriate choice of channels and careful attention to
design, ACF achieves state-of-the-art performance in pedes-
trian detection.
Channels. ACF uses the same channels as [39]: normal-
ized gradient magnitude, histogram of oriented gradients
(six channels), and LUV color channels. Prior to computing
the 10 channels, I is smoothed with a [1 2 1]/4 filter. The
channels are divided into 4 × 4 blocks and pixels in each
block are summed. Finally, the channels are smoothed, again
with a [1 2 1]/4 filter. For 640 × 480 images, computing the
channels runs at over 100 fps on a modern PC. The code is
optimized but runs on a single CPU; further gains could be
obtained using multiple cores or a GPU as in [30].
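A minimal sketch of the aggregation pipeline just described (pre-smooth, block-sum, post-smooth), with the channel set reduced to a single gradient-magnitude channel for brevity; the [1 2 1]/4 filter is applied separably along each axis.

```python
import numpy as np

def smooth_121(a):
    # separable [1 2 1]/4 triangle filter along both axes
    k = np.array([1.0, 2.0, 1.0]) / 4.0
    a = np.apply_along_axis(np.convolve, 0, a, k, 'same')
    return np.apply_along_axis(np.convolve, 1, a, k, 'same')

def block_sum(channel, block=4):
    # divide into block x block cells and sum the pixels in each cell
    h, w = channel.shape
    c = channel[:h - h % block, :w - w % block]
    return c.reshape(c.shape[0] // block, block,
                     c.shape[1] // block, block).sum(axis=(1, 3))

def acf_like_channels(img, block=4):
    img = smooth_121(img)                      # pre-smooth I
    gy, gx = np.gradient(img)
    mag = np.hypot(gx, gy)                     # one toy channel instead of the full 10
    return smooth_121(block_sum(mag, block))   # aggregate, then post-smooth

out = acf_like_channels(np.random.default_rng(2).random((64, 48)))
```

After 4×4 aggregation, a 64×48 input yields a 16×12 channel, so each feature is a single-pixel lookup into a much smaller array.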
Pyramid. Computation of feature pyramids at octave-
spaced scale intervals runs at ~75 fps on 640 × 480 images.
Meanwhile, computing exact feature pyramids with eight
scales per octave slows to ~15 fps, precluding real-time
detection. In contrast, our fast pyramid construction (see
Section 5), with seven of eight scales per octave approxi-
mated, runs at nearly 50 fps.
Detector. For pedestrian detection, AdaBoost [69] is used
to train and combine 2,048 depth-two trees over the
128 · 64 · 10 / 16 = 5,120 candidate features (channel pixel
lookups) in each 128 × 64 window. Training with multiple
rounds of bootstrapping takes ~10 minutes (a parallel
implementation reduces this to ~3 minutes). The detector
has a step size of four pixels and eight scales per octave. For
640 × 480 images, the complete system, including fast pyra-
mid construction and sliding-window detection, runs at
over 30 fps, allowing for real-time uses (with exact feature
pyramids the detector slows to 12 fps).
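The boosting step can be illustrated with a minimal discrete AdaBoost over decision stumps. This is a toy stand-in, not the ACF detector: it uses depth-one stumps rather than depth-two trees, synthetic features rather than channel pixel lookups, and (purely for brevity) hypothetical fixed median thresholds rather than a threshold search.

```python
import numpy as np

def train_stumps(X, y, rounds=20):
    """Discrete AdaBoost with depth-one stumps; X: (n, d), y in {-1, +1}."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)  # example weights, uniform at the start
    model = []
    for _ in range(rounds):
        best = None
        for f in range(d):
            thr = np.median(X[:, f])  # fixed threshold (simplification)
            for sign in (1.0, -1.0):
                pred = sign * np.where(X[:, f] > thr, 1.0, -1.0)
                err = w[pred != y].sum()  # weighted error of this stump
                if best is None or err < best[0]:
                    best = (err, f, thr, sign)
        err, f, thr, sign = best
        err = min(max(err, 1e-12), 1 - 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)  # weak-learner weight
        pred = sign * np.where(X[:, f] > thr, 1.0, -1.0)
        w *= np.exp(-alpha * y * pred)         # upweight misclassified examples
        w /= w.sum()
        model.append((alpha, f, thr, sign))
    return model

def predict(model, X):
    score = sum(a * s * np.where(X[:, f] > t, 1.0, -1.0) for a, f, t, s in model)
    return np.where(score >= 0, 1.0, -1.0)

# toy two-class data standing in for object vs. background windows
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 1.0, (50, 3)), rng.normal(3.0, 1.0, (50, 3))])
y = np.concatenate([-np.ones(50), np.ones(50)])
model = train_stumps(X, y)
```

In the real detector the same weighted-reweighting loop runs over thousands of trees and candidate features, with bootstrapping rounds to mine hard negatives.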
Code. Code for the ACF framework is available online.4
For more details on the channels and detector used in ACF,
including exact parameter settings and training framework,
we refer users to the source code.
Accuracy. We report accuracy of ACF with exact and fast
feature pyramids in Table 1. Following the methodology of
[31], we summarize performance using the log-average
miss rate (MR) between 10^−2 and 10^0 false positives per
image. Results are reported on four pedestrian data sets:
INRIA [21], Caltech [31], TUD-Brussels [36] and ETH [37].
MRs for 16 competing methods are shown. ACF outper-
forms competing approaches on nearly all datasets. When
averaged over the four data sets, the MR of ACF is 40 per-
cent with exact feature pyramids and 41 percent with fast
feature pyramids, a negligible difference, demonstrating the
effectiveness of our approach.
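The log-average MR metric of [31] can be sketched as follows: sample the miss rate at nine reference points evenly spaced in log-space over [10^−2, 10^0] false positives per image and take their geometric mean. The log-log interpolation used here is an assumption for illustration, not the benchmark's exact code.

```python
import numpy as np

def log_average_miss_rate(fppi, miss):
    # sample the (log) miss-rate curve at 9 FPPI reference points in [1e-2, 1e0],
    # then take the geometric mean of the sampled values
    refs = np.log(np.logspace(-2, 0, 9))
    sampled = np.interp(refs, np.log(fppi), np.log(miss))
    return float(np.exp(sampled.mean()))

fppi = np.logspace(-3, 1, 50)   # fake detector curve; FPPI must be increasing
miss = np.full(50, 0.4)         # constant 40% miss rate, for illustration
```

For this flat curve the metric recovers 0.40, matching the intuition that a detector with a constant 40 percent miss rate has a 40 percent log-average MR.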
Speed. MR versus speed for numerous detectors is shown
in Fig. 10. ACF with fast feature pyramids runs at ~32 fps.
The only two faster approaches are Crosstalk cascades [40]
and the VeryFast detector from Benenson et al. [30]. Their
additional speedups are based on improved cascade strate-
gies and combining multi-resolution models with a GPU
implementation, respectively, and are orthogonal to the
gains achieved by using approximate multiscale features.
Indeed, all the detectors that run at 5 fps and higher exploit
the power law governing feature scaling.
Pyramid parameters. Detection performance on INRIA [21]
with fast feature pyramids under varying settings is shown
in Fig. 11. The key result is given in Fig. 11a: when approxi-
mating seven of eight scales per octave, the MR for ACF is
0.169, which is virtually identical to the MR of 0.166 obtained
Fig. 8. Overview of the ACF detector. Given an input image I, we compute several channels C = Ω(I), sum every block of pixels in C, and smooth
the resulting lower resolution channels. Features are single pixel lookups in the aggregated channels. Boosting is used to learn decision trees over
these features (pixels) to distinguish object from background. With the appropriate choice of channels and careful attention to design, ACF achieves
state-of-the-art performance in pedestrian detection.
TABLE 1
MRs of Leading Approaches for Pedestrian Detection
on Four Data Sets
For ICF and ACF, exact and approximate detection results are shown,
with only small differences between them. For the latest pedestrian
detection results please see [32].
4. Code: http://vision.ucsd.edu/~pdollar/toolbox/doc/index.html.
using the exact feature pyramid. Even approximating 15 of
every 16 scales increases MR only somewhat. Constructing
the channels without correcting for power-law scaling, or
using an incorrect value of λ_Ω, results in markedly decreased
performance, see Fig. 11b. Finally, we observe that at least
eight scales per octave must be used for good performance
(Fig. 11c), making the proposed scheme crucial for achiev-
ing detection results that are both fast and accurate.
6.2 Integral Channel Features
Integral channel features [29] are a precursor to the ACF
framework described in Section 6.1. Both ACF and ICF
use the same channel features and boosted classifiers;
the key difference between the two frameworks is that
ACF uses pixel lookups in aggregated channels as fea-
tures while ICF uses sums over rectangular channel
regions (computed efficiently with integral images).
Accuracy of ICF with exact and fast feature pyramids is
shown in Table 1. ICF achieves state-of-the-art results: infe-
rior to ACF but otherwise outperforming most competing
approaches. The MR of ICF averaged over the four data sets
is 42 percent with exact feature pyramids and 45 percent
with fast feature pyramids. The gap of 3 percent is larger
than the 1 percent gap for ACF but still small. With fast fea-
ture pyramids ICF runs at ~16 fps, see Fig. 10. ICF is slower
than ACF due to construction of integral images and more
expensive features (rectangular sums computed via integral
images versus single pixel lookups). For more details on
ICF, see [29], [39]. The variant tested here uses identical
channels to ACF.
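The rectangular-sum features that distinguish ICF from ACF can be sketched with a standard integral image; this is textbook machinery rather than the authors' optimized implementation.

```python
import numpy as np

def integral_image(channel):
    # ii[r, c] holds the sum of channel[:r, :c]; an extra zero row/column
    # makes every rectangular sum a uniform four-lookup expression
    ii = np.zeros((channel.shape[0] + 1, channel.shape[1] + 1))
    ii[1:, 1:] = channel.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, r0, c0, r1, c1):
    # sum of channel[r0:r1, c0:c1] in exactly four lookups
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

c = np.arange(12, dtype=float).reshape(3, 4)  # toy 3x4 channel
ii = integral_image(c)
```

Once `ii` is built in a single pass, any rectangle sum costs O(1), which is why ICF pays an up-front integral-image construction cost per channel that ACF's single-pixel lookups avoid.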
Detection performance with fast feature pyramids under
varying settings is shown in Fig. 12. The plots mirror the
results shown in Fig. 11 for ACF. The key result is given in
Fig. 12a: when approximating seven of eight scales per
octave, the MR for ICF is 2 percent worse than the MR
obtained with exact feature pyramids.
The ICF framework allows for an alternate application
of the power law governing feature scaling: instead of
rescaling channels as discussed in Section 5, one can
instead rescale the detector. Using the notation from
Section 4, rectangular channel sums (features used in ICF)
can be written as A · f_Ω(I), where A denotes the rectangle area.
As such, Eq. (4) can be applied to approximate features at
nearby scales and given integral channel images computed
at one scale, detector responses can be approximated at
nearby scales. This operation can be implemented by
rescaling the detector itself, see [39]. As the approximation
degrades with increasing scale offsets, a hybrid approach
is to construct an octave-spaced feature pyramid followed
by approximating detector responses at nearby scales, see
Fig. 9. This approach was extended in [30].
6.3 Deformable Part Models
Deformable part models from Felzenszwalb et al. [35] are an
elegant approach for general object detection that have con-
sistently achieved top results on the PASCAL VOC chal-
lenge [38]. DPMs use a variant of HOG features [21] as their
image representation, followed by classification with linear
SVMs. An object model is composed of multiple parts, a
root model, and optionally multiple mixture components.
For details see [35].
Recent approaches for increasing the speed of DPMs
include work by Felzenszwalb et al. [44] and Pedersoli
et al. [45] on cascaded and coarse-to-fine deformable part
Fig. 10. Log-average miss rate on the INRIA pedestrian data set [21] versus frame rate on 640 × 480 images for multiple detectors. Method runtimes
were obtained from [31], see also [31] for citations for detectors A-L. Numbers in brackets indicate MR/fps for select approaches, sorted by speed.
All detectors that run at 5 fps and higher are based on our fast feature pyramids; these methods are also the most accurate. They include: (M)
FPDW [39] which is our original implementation of ICF, (N) ICF [Section 6.2], (O) ACF [Section 6.1], (P) crosstalk cascades [40], and (Q) the Very-
Fast detector from Benenson et al. [30]. Both (P) and (Q) use the power law governing feature scaling described in this work; the additional speedups
in (P) and (Q) are based on improved cascade strategies, multi-resolution models and a GPU implementation, and are orthogonal to the gains
achieved by using approximate multiscale features.
Fig. 9. (a) A standard pipeline for performing multiscale detection is to
create a densely sampled feature pyramid. (b) Viola and Jones [27]
used simple shift and scale invariant features, allowing a detector to be
placed at any location and scale without relying on a feature pyramid. (c)
ICF can use a hybrid approach of constructing an octave-spaced feature
pyramid followed by approximating detector responses within half an
octave of each pyramid level.
models, respectively. Our work is complementary, as we
focus on improving the speed of pyramid construction.
The current bottleneck of DPMs is the classification
stage; pyramid construction therefore accounts for only a
fraction of total runtime. However, if fast feature pyramids
are coupled with optimized classification schemes
[44], [45], DPMs have the potential to have more competi-
tive runtimes. We focus on demonstrating DPMs can
achieve good accuracy with fast feature pyramids and
leave the coupling of fast feature pyramids and optimized
classification schemes to practitioners.
DPM code is available online [35]. We tested pre-trained
DPM models on the 20 PASCAL 2007 categories using exact
HOG pyramids and HOG pyramids with nine of 10 scales
per octave approximated using our proposed approach.
Average precision (AP) scores for the two approaches,
denoted DPM and ~DPM, respectively, are shown in
Table 2. The mean AP across the 20 categories is 26.6 percent
for DPM and 24.5 percent for ~DPM. Using fast HOG fea-
ture pyramids decreased mean AP by only 2 percent, demon-
strating the validity of the proposed approach.
7 CONCLUSION
Improvements in the performance of visual recognition
systems in the past decade have in part come from the
realization that finely sampled pyramids of image fea-
tures provide a good front-end for image analysis. It is
widely believed that the price to be paid for improved
performance is sharply increased computational costs.
We have shown that this is not necessarily so. Finely
sampled pyramids may be obtained inexpensively by
extrapolation from coarsely sampled ones. This insight
decreases computational costs substantially.
Our insight ultimately relies on the fractal structure of
much of the visual world. By investigating the statistics of
natural images we have demonstrated that the behavior of
image features can be predicted reliably across scales. Our
calculations and experiments show that this makes it possi-
ble to estimate features at a given scale inexpensively by
extrapolating computations carried out at a coarsely sam-
pled set of scales. While our results do not hold under all
circumstances, for instance, on images of textures or white
noise, they do hold for images typically encountered in the
natural world.
In order to validate our findings we studied the perfor-
mance of three end-to-end object detection systems. We
found that detection rates are relatively unaffected while
computational costs decrease considerably. This has led to
the first detectors that operate at frame rate while using rich
feature representations.
Our results are not restricted to object detection nor to
visual recognition. The foundations we have developed
should readily apply to other computer vision tasks where
a fine-grained scale sampling of features is necessary as the
image processing front end.
Fig. 12. Effect of parameter settings of fast feature pyramids on the ICF detector [Section 6.2]. The plots mirror the results shown in Fig. 11 for the ACF
detector, although overall performance for ICF is slightly lower. (a) When approximating seven of every eight scales in the pyramid, the MR for ICF is
0.195, which is only slightly worse than the MR of 0.176 obtained using exact feature pyramids. (b) Computing approximate channels with an incorrect
value of λ_Ω results in decreased performance (although using a slightly larger λ_Ω than predicted appears to improve results marginally). (c) As in
the ACF framework, at least eight scales per octave are necessary to achieve good results.
Fig. 11. Effect of parameter settings of fast feature pyramids on the ACF detector [Section 6.1]. We report log-average miss rate averaged over 25 tri-
als on the INRIA pedestrian data set [21]. Orange diamonds denote default parameter settings: 7/8 scales approximated per octave, λ_Ω ≈ 0.17 for the
normalized gradient channels, and eight scales per octave in the pyramid. (a) The MR stays relatively constant as the fraction of approximated scales
increases up to 7/8, demonstrating the efficacy of the proposed approach. (b) Sub-optimal values of λ_Ω when approximating the normalized gradient
channels cause a marked decrease in performance. (c) At least eight scales per octave are necessary for good performance, making the proposed
scheme crucial for achieving detection results that are both fast and accurate.
ACKNOWLEDGMENTS
The authors thank Peter Welinder and Rodrigo Benenson
for helpful comments and suggestions. Piotr Dollar, Ron
Appel, and Pietro Perona were supported by MURI-
ONR N00014-10-1-0933 and ARO/JPL-NASA Stennis
NAS7.03001. Ron Appel was also supported by NSERC
420456-2012 and The Moore Foundation. Serge Belongie
was supported by the US National Science Foundation
(NSF) CAREER Grant 0448615, MURI-ONR N00014-08-1-
0638 and a Google Research Award.
REFERENCES
[1] D. Hubel and T. Wiesel, “Receptive Fields and Functional Archi-
tecture of Monkey Striate Cortex,” J. Physiology, vol. 195, pp. 215-
243, 1968.
[2] C. Malsburg, “Self-Organization of Orientation Sensitive Cells in
the Striate Cortex,” Biological Cybernetics, vol. 14, no. 2, pp. 85-100,
1973.
[3] L. Maffei and A. Fiorentini, “The Visual Cortex as a Spatial Fre-
quency Analyser,” Vision Research, vol. 13, no. 7, pp. 1255-1267,
1973.
[4] P. Burt and E. Adelson, “The Laplacian Pyramid as a Compact
Image Code,” IEEE Trans. Comm., vol. 31, no. 4, pp. 532-540, Apr.
1983.
[5] J. Daugman, “Uncertainty Relation for Resolution in Space, Spa-
tial Frequency, and Orientation Optimized by Two-Dimensional
Visual Cortical Filters,” J. Optical Soc. Am. A: Optics, Image Science,
and Vision, vol. 2, pp. 1160-1169, 1985.
[6] J. Koenderink and A. Van Doorn, “Representation of Local Geom-
etry in the Visual System,” Biological Cybernetics, vol. 55, no. 6,
pp. 367-375, 1987.
[7] D.J. Field, “Relations between the Statistics of Natural Images
and the Response Properties of Cortical Cells,” J. Optical Soc. Am.
A: Optics, Image Science, and Vision, vol. 4, pp. 2379-2394, 1987.
[8] S. Mallat, “A Theory for Multiresolution Signal Decomposition:
The Wavelet Representation,” IEEE Trans. Pattern Analysis and
Machine Intelligence, vol. 11, no. 7, pp. 647-693, July 1989.
[9] P. Vaidyanathan, “Multirate Digital Filters, Filter Banks, Poly-
phase Networks, and Applications: A Tutorial,” Proc. IEEE,
vol. 78, no. 1, pp. 56-93, Jan. 1990.
[10] M. Vetterli, “A Theory of Multirate Filter Banks,” IEEE Trans.
Acoustics, Speech and Signal Processing, vol. 35, no. 3, pp. 356-372,
Mar. 1987.
[11] E. Simoncelli and E. Adelson, “Noise Removal via Bayesian
Wavelet Coring,” Proc. Int’l Conf. Image Processing (ICIP), vol. 1,
1996.
[12] W.T. Freeman and E.H. Adelson, “The Design and Use of Steer-
able Filters,” IEEE Trans. Pattern Analysis and Machine Intelligence,
vol. 13, no. 9, pp. 891-906, Sept. 1991.
[13] J. Malik and P. Perona, “Preattentive Texture Discrimination with
Early Vision Mechanisms,” J. Optical Soc. Am. A, vol. 7, pp. 923-
932, May 1990.
[14] D. Jones and J. Malik, “Computational Framework for Determin-
ing Stereo Correspondence from a Set of Linear Spatial Filters,”
Image and Vision Computing, vol. 10, no. 10, pp. 699-708, 1992.
[15] E. Adelson and J. Bergen, “Spatiotemporal Energy Models for the
Perception of Motion,” J. Optical Soc. Am. A, vol. 2, no. 2, pp. 284-
299, 1985.
[16] Y. Weiss and E. Adelson, “A Unified Mixture Framework for
Motion Segmentation: Incorporating Spatial Coherence and Esti-
mating the Number of Models,” Proc. IEEE Conf. Computer Vision
and Pattern Recognition (CVPR), 1996.
[17] L. Itti, C. Koch, and E. Niebur, “A Model of Saliency-Based Visual
Attention for Rapid Scene Analysis,” IEEE Trans. Pattern Analysis
and Machine Intelligence, vol. 20, no. 11, pp. 1254-1259, Nov. 1998.
[18] P. Perona and J. Malik, “Detecting and Localizing Edges Com-
posed of Steps, Peaks and Roofs,” Proc. Third IEEE Int’l Conf. Com-
puter Vision (ICCV), 1990.
[19] M. Lades, J. Vorbruggen, J. Buhmann, J. Lange, C. von der
Malsburg, R. Wurtz, and W. Konen, “Distortion Invariant Object
Recognition in the Dynamic Link Architecture,” IEEE Trans. Com-
puters, vol. 42, no. 3, pp. 300-311, Mar. 1993.
[20] D.G. Lowe, “Object Recognition from Local Scale-Invariant
Features,” Proc. Seventh IEEE Int’l Conf. Computer Vision (ICCV),
1999.
[21] N. Dalal and B. Triggs, “Histograms of Oriented Gradients for
Human Detection,” Proc. IEEE Conf. Computer Vision and Pattern
Recognition (CVPR), 2005.
[22] R. De Valois, D. Albrecht, and L. Thorell, “Spatial Frequency
Selectivity of Cells in Macaque Visual Cortex,” Vision Research,
vol. 22, no. 5, pp. 545-559, 1982.
[23] Y. LeCun, P. Haffner, L. Bottou, and Y. Bengio, “Gradient-Based
Learning Applied to Document Recognition,” Proc. IEEE, vol. 86,
no. 11, pp. 2278-2324, Nov. 1998.
[24] M. Riesenhuber and T. Poggio, “Hierarchical Models of Object
Recognition in Cortex,” Nature Neuroscience, vol. 2, pp. 1019-1025,
1999.
[25] D.G. Lowe, “Distinctive Image Features from Scale-Invariant Key-
points,” Int’l J. Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.
[26] A. Krizhevsky, I. Sutskever, and G. Hinton, “Imagenet Classifica-
tion with Deep Convolutional Neural Networks,” Proc. Advances
in Neural Information Processing Systems (NIPS), 2012.
[27] P. Viola and M. Jones, “Rapid Object Detection Using a Boosted
Cascade of Simple Features,” Proc. IEEE Conf. Computer Vision and
Pattern Recognition (CVPR), 2001.
[28] P. Viola, M. Jones, and D. Snow, “Detecting Pedestrians Using
Patterns of Motion and Appearance,” Int’l J. Computer Vision,
vol. 63, no. 2, pp. 153-161, 2005.
[29] P. Dollar, Z. Tu, P. Perona, and S. Belongie, “Integral Channel
Features,” Proc. British Machine Vision Conf. (BMVC), 2009.
[30] R. Benenson, M. Mathias, R. Timofte, and L. Van Gool,
“Pedestrian Detection at 100 Frames per Second,” Proc. IEEE Conf.
Computer Vision and Pattern Recognition (CVPR), 2012.
[31] P. Dollar, C. Wojek, B. Schiele, and P. Perona, “Pedestrian Detec-
tion: An Evaluation of the State of the Art,”IEEE Trans. Pattern
Analysis and Machine Intelligence, vol. 34, no. 4, pp. 743–761, 2012.
[32] www.vision.caltech.edu/Image_Datasets/CaltechPedestrians/,
2014.
[33] D.L. Ruderman and W. Bialek, “Statistics of Natural Images: Scal-
ing in the Woods,” Physical Rev. Letters, vol. 73, no. 6, pp. 814-817,
Aug. 1994.
[34] E. Switkes, M. Mayer, and J. Sloan, “Spatial Frequency Analysis of
the Visual Environment: Anisotropy and the Carpentered Envi-
ronment Hypothesis,” Vision Research, vol. 18, no. 10, pp. 1393-
1399, 1978.
[35] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan,
“Object Detection with Discriminatively Trained Part Based Mod-
els,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 32,
no. 9, pp. 1627-1645, Sept. 2010.
[36] C. Wojek, S. Walk, and B. Schiele, “Multi-Cue Onboard Pedestrian
Detection,” Proc. IEEE Conf. Computer Vision and Pattern Recogni-
tion (CVPR), 2009.
[37] A. Ess, B. Leibe, and L. Van Gool, “Depth and Appearance for
Mobile Scene Analysis,” Proc. IEEE 11th Int’l Conf. Computer Vision
(ICCV), 2007.
[38] M. Everingham, L. Van Gool, C.K.I. Williams, J. Winn, and A.
Zisserman, “The PASCAL Visual Object Classes (VOC)
Challenge,” Int’l J. Computer Vision, vol. 88, no. 2, pp. 303-338,
June 2010.
[39] P. Dollar, S. Belongie, and P. Perona, “The Fastest Pedestrian
Detector in the West,” Proc. British Machine Vision Conf. (BMVC),
2010.
[40] P. Dollar, R. Appel, and W. Kienzle, “Crosstalk Cascades for
Frame-Rate Pedestrian Detection,” Proc. 12th European Conf. Com-
puter Vision (ECCV), 2012.
TABLE 2
Average Precision Scores for Deformable Part Models
with Exact (DPM) and Approximate (~DPM)
Feature Pyramids on PASCAL
[41] T. Lindeberg, “Scale-Space for Discrete Signals,” IEEE Trans. Pat-
tern Analysis and Machine Intelligence, vol. 12, no. 3, pp. 234-254,
Mar. 1990.
[42] J.L. Crowley, O. Riff, and J.H. Piater, “Fast Computation of Char-
acteristic Scale Using a Half-Octave Pyramid,” Proc. Fourth Int’l
Conf. Scale-Space Theories in Computer Vision, 2002.
[43] R.S. Eaton, M.R. Stevens, J.C. McBride, G.T. Foil, and M.S. Snorra-
son, “A Systems View of Scale Space,” Proc. IEEE Int’l Conf. Com-
puter Vision Systems (ICVS), 2006.
[44] P. Felzenszwalb, R. Girshick, and D. McAllester, “Cascade Object
Detection with Deformable Part Models,” Proc. IEEE Conf. Com-
puter Vision and Pattern Recognition (CVPR), 2010.
[45] M. Pedersoli, A. Vedaldi, and J. Gonzalez, “A Coarse-to-Fine
Approach for Fast Deformable Object Detection,” Proc. IEEE Conf.
Computer Vision and Pattern Recognition (CVPR), 2011.
[46] C.H. Lampert, M.B. Blaschko, and T. Hofmann, “Efficient Sub-
window Search: A Branch and Bound Framework for Object
Localization,” IEEE Trans. Pattern Analysis and Machine Intelligence,
vol. 31, no. 12, pp. 2129-2142, Dec. 2009.
[47] L. Bourdev and J. Brandt, “Robust Object Detection via Soft
Cascade,” Proc. IEEE Conf. Computer Vision and Pattern Recognition
(CVPR), 2005.
[48] C. Zhang and P. Viola, “Multiple-Instance Pruning for Learning
Efficient Cascade Detectors,” Proc. Advances in Neural Information
Processing Systems (NIPS), 2007.
[49] J. Sochman and J. Matas, “Waldboost—Learning for Time Con-
strained Sequential Detection,” Proc. IEEE Conf. Computer Vision
and Pattern Recognition (CVPR), 2005.
[50] H. Masnadi-Shirazi and N. Vasconcelos, “High Detection-Rate
Cascades for Real-Time Object Detection,” Proc. IEEE 11th Int’l
Conf. Computer Vision (ICCV), 2007.
[51] F. Fleuret and D. Geman, “Coarse-To-Fine Face Detection,” Int’l J.
Computer Vision, vol. 41, no. 1/2, pp. 85-107, 2001.
[52] P. Felzenszwalb and D. Huttenlocher, “Efficient Matching of Pic-
torial Structures,” Proc. IEEE Conf. Computer Vision and Pattern Rec-
ognition (CVPR), 2000.
[53] C. Papageorgiou and T. Poggio, “A Trainable System for Object
Detection,” Int’l J. Computer Vision, vol. 38, no. 1, pp. 15-33, 2000.
[54] M. Weber, M. Welling, and P. Perona, “Unsupervised Learning of
Models for Recognition,” Proc. European Conf. Computer Vision
(ECCV), 2000.
[55] S. Agarwal and D. Roth, “Learning a Sparse Representation for
Object Detection,” Proc. European Conf. Computer Vision (ECCV),
2002.
[56] R. Fergus, P. Perona, and A. Zisserman, “Object Class Recognition
by Unsupervised Scale-Invariant Learning,” Proc. IEEE Conf. Com-
puter Vision and Pattern Recognition (CVPR), 2003.
[57] B. Leibe, A. Leonardis, and B. Schiele, “Robust Object Detection
with Interleaved Categorization and Segmentation,” Int’l J. Com-
puter Vision, vol. 77, no. 1-3, pp. 259-289, May 2008.
[58] C. Gu, J.J. Lim, P. Arbelaez, and J. Malik, “Recognition Using
Regions,” Proc. IEEE Conf. Computer Vision and Pattern Recognition
(CVPR), 2009.
[59] B. Alexe, T. Deselaers, and V. Ferrari, “What Is an Object?” Proc.
IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2010.
[60] C. Wojek, G. Dorko, A. Schulz, and B. Schiele, “Sliding-Windows
for Rapid Object Class Localization: A Parallel Technique,” Proc.
30th DAGM Symp. Pattern Recognition, 2008.
[61] L. Zhang and R. Nevatia, “Efficient Scan-Window Based Object
Detection Using GPGPU,” Proc. Workshop Visual Computer Vision
on GPU’s (CVGPU), 2008.
[62] B. Bilgic, “Fast Human Detection with Cascaded Ensembles,”
master’s thesis, MIT, Feb. 2010.
[63] Q. Zhu, S. Avidan, M. Yeh, and K. Cheng, “Fast Human Detection
Using a Cascade of Histograms of Oriented Gradients,” Proc.
IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2006.
[64] F.M. Porikli, “Integral Histogram: A Fast Way to Extract Histo-
grams in Cartesian Spaces,” Proc. IEEE Conf. Computer Vision and
Pattern Recognition (CVPR), 2005.
[65] A. Ess, B. Leibe, K. Schindler, and L. Van Gool, “Robust Multi-Per-
son Tracking from a Mobile Platform,” IEEE Trans. Pattern Analy-
sis and Machine Intelligence, vol. 31, no. 10, pp. 1831-1846, Oct. 2009.
[66] M. Bajracharya, B. Moghaddam, A. Howard, S. Brennan, and L.H.
Matthies, “A Fast Stereo-Based System for Detecting and Tracking
Pedestrians from a Moving Vehicle,” Int’l J. Robotics Research,
vol. 28, pp. 1466-1485, 2009.
[67] D.L. Ruderman, “The Statistics of Natural Images,” Network: Com-
putation in Neural Systems, vol. 5, no. 4, pp. 517-548, 1994.
[68] S.G. Ghurye, “A Characterization of the Exponential Function,”
The Am. Math. Monthly, vol. 64, no. 4, pp. 255-257, 1957.
[69] J. Friedman, T. Hastie, and R. Tibshirani, “Additive Logistic
Regression: A Statistical View of Boosting,” Annals of Statistics,
vol. 38, no. 2, pp. 337-374, 2000.
[70] P. Sabzmeydani and G. Mori, “Detecting Pedestrians by Learning
Shapelet Features,” Proc. IEEE Conf. Computer Vision and Pattern
Recognition (CVPR), 2007.
[71] Z. Lin and L.S. Davis, “A Pose-Invariant Descriptor for Human
Detection and Segmentation,” Proc. 10th European Conf. Computer
Vision (ECCV), 2008.
[72] S. Maji, A. Berg, and J. Malik, “Classification Using Intersection
Kernel SVMs Is Efficient,” Proc. IEEE Conf. Computer Vision and
Pattern Recognition (CVPR), 2008.
[73] X. Wang, T.X. Han, and S. Yan, “An HOG-LBP Human Detector
with Partial Occlusion Handling,” Proc. IEEE Int’l Conf. Computer
Vision (ICCV), 2009.
[74] C. Wojek and B. Schiele, “A Performance Evaluation of Single and
Multi-Feature People Detection,” Proc. 30th DAGM Symp. Pattern
Recognition (DAGM), 2008.
[75] W. Schwartz, A. Kembhavi, D. Harwood, and L. Davis, “Human
Detection Using Partial Least Squares Analysis,” Proc. IEEE 12th
Int’l Conf. Computer Vision (ICCV), 2009.
[76] S. Walk, N. Majer, K. Schindler, and B. Schiele, “New Features and
Insights for Pedestrian Detection,” Proc. IEEE Conf. Computer
Vision and Pattern Recognition (CVPR), 2010.
[77] D. Park, D. Ramanan, and C. Fowlkes, “Multiresolution Models
for Object Detection,” Proc. 11th European Conf. Computer Vision
(ECCV), 2010.
Piotr Dollar received the master’s degree in computer science from Harvard University in 2002 and the PhD degree from the University of California, San Diego in 2007. He joined the Computational Vision Lab at the California Institute of Technology, Caltech, as a postdoctoral fellow in 2007. Upon being promoted to a senior postdoctoral fellow he realized it was time to move on, and in 2011, he joined the Interactive Visual Media Group at Microsoft Research, Redmond, Washington, where he currently resides. He has worked on object detection, pose estimation, boundary learning, and behavior recognition. His general interests include machine learning and pattern recognition, and their application to computer vision.
Ron Appel received the bachelor’s and master’s degrees in electrical and computer engineering from the University of Toronto in 2006 and 2008, respectively. He is working toward the PhD degree in the Computational Vision Lab at the California Institute of Technology, Caltech, where he currently holds an NSERC Graduate Award. He cofounded ViewGenie Inc., a company specializing in intelligent image processing and search. His research interests include machine learning, visual object detection, and algorithmic optimization.
1544 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 36, NO. 8, AUGUST 2014
Serge Belongie received the BS (with honor) in electrical engineering from the California Institute of Technology, Caltech in 1995 and the PhD degree in electrical engineering and computer science from the University of California at Berkeley in 2000. While at the University of California at Berkeley, his research was supported by the US National Science Foundation (NSF) Graduate Research Fellowship. From 2001 to 2013, he was a professor in the Department of Computer Science and Engineering at the University of California, San Diego (UCSD). He is currently a professor at Cornell NYC Tech and the Cornell Computer Science Department, Ithaca, New York. His research interests include computer vision, machine learning, crowdsourcing, and human-in-the-loop computing. He is also a cofounder of several companies including Digital Persona, Anchovi Labs (acquired by Dropbox) and Orpix. He is a recipient of the US National Science Foundation (NSF) CAREER Award, the Alfred P. Sloan Research Fellowship, and the MIT Technology Review “Innovators Under 35” Award.
Pietro Perona received the graduate degree in electrical engineering from the Università di Padova in 1985 and the PhD degree in electrical engineering and computer science from the University of California at Berkeley in 1990. After a postdoctoral fellowship at MIT in 1990-1991 he joined the faculty of the California Institute of Technology, Caltech in 1991, where he is now an Allen E. Puckett professor of electrical engineering and computation and neural systems. His current interests include visual recognition, modeling vision in biological systems, modeling and measuring behavior, and Visipedia. He has worked on anisotropic diffusion, multiresolution-multiorientation filtering, human texture perception and segmentation, dynamic vision, grouping, analysis of human motion, recognition of object categories, and modeling visual search.
For more information on this or any other computing topic,
please visit our Digital Library at www.computer.org/publications/dlib.
DOLLAR ET AL.: FAST FEATURE PYRAMIDS FOR OBJECT DETECTION 1545