International Research Journal of Computer Science (IRJCS) ISSN: 2393-9842
Issue 03, Volume 6 (March 2019) www.irjcs.com
PEDESTRIAN DETECTION IN LOW RESOLUTION
VIDEOS USING A MULTI-FRAME HOG-BASED
DETECTOR
Dr. Hisham Sager
Colorado School of Mines
hsager@mines.edu
Dr. William Hoff
Colorado School of Mines
whoff@mines.edu
Manuscript History
Number: IRJCS/RS/Vol.06/Issue03/MRCS10090
Received: 03, March 2019
Final Correction: 13, March 2019
Final Accepted: 21, March 2019
Published: March 2019
Citation: Sager & Hoff (2019). PEDESTRIAN DETECTION IN LOW RESOLUTION VIDEOS USING A MULTI-FRAME
HOG-BASED DETECTOR. IRJCS: International Research Journal of Computer Science, Volume VI, 55-71.
DOI: 10.26562/IRJCS.2019.MRCS10090
Editor: Dr.A.Arul L.S, Chief Editor, IRJCS, AM Publications, India
Copyright: ©2019 This is an open access article distributed under the terms of the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original
author and source are credited.
Abstract-- Detecting pedestrians in low resolution videos is a challenging task, due to the small size of pedestrians
in the images and the limited information they provide. In practical outdoor surveillance scenarios the pedestrian size is
usually small. Existing state-of-the-art pedestrian detection methods that use histogram of oriented gradient
(HOG) features have poor performance in this problem domain. To compensate for the lack of information in a
single frame, we propose a novel detection method that recognizes pedestrians in a short sequence of frames.
Namely, we take the single-frame HOG-based detector and extend it to multiple frames. Our detector is applied to
regions containing potential moving objects. In the case of video taken from a moving camera on an aerial
platform, video stabilization is first performed to register the frames. A classifier is then applied to features
extracted from spatio-temporal volumes surrounding the potential moving objects. On challenging stationary and
aerial video datasets, our detection accuracy outperforms several state-of-the-art algorithms.
I. INTRODUCTION
The task of detecting people in images and video is an important and active area in both academia and industry.
One scenario is the case of outdoor surveillance cameras, which are mounted in a high position and look down
upon a large public area such as a street or plaza. Another scenario is a camera mounted on a moving platform
such as a helicopter or unmanned aerial vehicle (UAV). In these scenarios, the size of people in the images is often
small, and detection can be challenging. In this paper, we focus on the problem of detecting people in low
resolution videos, where the height of the people in the images is on the order of 20 pixels. We also focus on
detecting pedestrians, by which we mean people that are walking.
This problem has many applications, including search and rescue, law enforcement, and border monitoring. When
the size of a pedestrian in an image becomes very small, many shape details are lost, and it is difficult to
distinguish a pedestrian from a non-pedestrian. Figure 1 shows an example of a pedestrian at four different
resolution levels. As described in Section 2, existing algorithms for pedestrian detection do fairly well for high
resolution images, but performance degrades dramatically when the height of pedestrians is 30 pixels or less. In
addition to the small size of the pedestrians in the images, the problem of detecting pedestrians in low resolution
video can be challenging for other reasons.
There can be a wide range of poses and appearance, including a variety of clothing. The lighting can vary and
shadows can be present. Background clutter can have a similar appearance to pedestrians. Pedestrians can be
partially occluded by other objects, or by other pedestrians.
Figure 1: An example pedestrian at four different resolution levels: The height of the pedestrian is 140, 50, 20, and
10 pixels. (Left) Images at actual size. (Right) The same images but stretched for visualization.
The problem is even more challenging in aerial videos. The effective resolution of the video is often degraded due
to motion blur and haze, further reducing the available visual information on shape and appearance. In these
scenarios, video stabilization is often used to compensate for camera motion, in order to help find moving objects
in the scene. However, stabilization is imperfect, especially in the case of rapidly moving cameras. As a result,
many false regions can be identified as moving objects. Another significant challenge is that the camera moves
around frequently and does not dwell for long on a particular portion of the scene. Thus, algorithms that rely on a
long sequence of observations to build up a motion track model may not be applicable. As an example, Figure 2
shows snapshots from a video taken from a small quadrotor UAV flying rapidly over a field [1]. The camera moves
erratically, undergoing large amplitude rotations and translations. As a result, people are usually only within the
field of view for a short time (as briefly as several seconds). Although the size of people varies due to the changing
altitude of the camera, the height of people is often as small as 20 pixels tall. Detecting people in these scenarios is
extremely difficult, due to the low resolution.
Figure 2: Example images from aerial video (4 seconds apart). Size of images is 380×640 pixels.
Although it is very difficult to recognize a person in a single low resolution image, the task is much easier when a
short sequence of images is used. For example, Figure 3 shows a single low resolution frame in which it is difficult
to recognize the object. The right portion of the figure is a sequence of frames in which a subject is performing a
recognizable movement; i.e., walking. Despite the deficiency of recognizable features in the static image, the
movement can be easily recognized when the sequence is put in motion on a screen. This phenomenon is well
known from the pioneering work of Johansson [2], whose moving light display experiments showed convincingly
that the human visual system can easily recognize people in image sequences, even though the images contained
only a few bright spots (attached to their joints). A static image of spots is meaningless to observers, while a
sequence of images creates a vivid perception of a person walking. The most common and successful existing
approaches for pedestrian detection, such as the Dalal detector [3], use histogram of oriented gradient (HOG)
features, with a support vector machine (SVM) classifier. However, these approaches perform poorly when the
height of pedestrians is 30 pixels or less [4; 37].
Figure 3: (Left) Single frame. (Right) Sequence of frames from a video of a walking person.
To compensate for the lack of information in a single frame, we propose a novel detection method that recognizes
pedestrians in a short sequence of frames. Namely, we take the single-frame HOG-based detector and extend it to
multiple frames. Our approach (Section 3) uses HOG features extracted from a spatiotemporal volume of images.
We use a volume composed of up to 32 “slices”, where each slice is a small sub-image of 32×32 pixels. This volume
represents a duration of about one second or less. The idea is that the motion of a person walking is distinctive, and
we can train a classifier to recognize the temporal sequence of feature vectors within the volume. As an example,
consider the images of a walking person shown in the top row of Figure 4. The corresponding HOG features are
shown in the second row. The third row shows images of a moving car. The sequence of corresponding HOG
features (bottom row) of the negative example is visually quite different from that of the positive example.
The main contribution of this work is the development of a novel multi-frame HOG-based pedestrian detector that
is able to detect pedestrians in video, at lower resolutions than previously reported. Our detector
achieves significantly better accuracy than existing detectors on challenging video datasets. The rest of this paper
is organized as follows. We discuss related work in Section 2. Our pedestrian detection method is presented in
Section 3. A detailed description of experimental results is presented in Section 4. Section 5 summarizes
conclusions and future work.
Figure 4: (a) Positive example (pedestrian). (b) Negative example (part of a car passing by a post)
II. RELATED WORK
There is an extensive body of literature on people detection, although there is less work on pedestrian detection in
low-resolution videos, and relatively little work on pedestrian detection in aerial videos. Comprehensive reviews
can be found in [4; 5; 6; 7]. Most work focuses on pedestrian detection in single high-resolution images. Instead of
an explicit model, an implicit representation is learned from examples, using machine learning techniques. These
approaches typically extract features from the image and then apply a classifier to decide if the image contains a
person. Typically, the detection system is applied to sub-images over the entire image, using a sliding window
approach. A multi-scale approach can be used to handle different sizes of the person in the window. Alternatively,
the detection system can be preceded by a region-of-interest selector, which generates initial object hypotheses,
using some simple and fast tests. Then the full person detection system is applied to the candidate windows. The
most common and successful approaches for single frame pedestrian detection use gradient-based features. The
Dalal-Triggs detector [3] used a histogram of oriented gradient (HOG) features, with a support vector machine
(SVM) classifier. A model for the shape of a person is learned from many training examples. The HOG + SVM
approach is still considered a competitive baseline in pedestrian detection.
Although this approach has excellent performance in high resolution images, studies [4] have shown that
performance degrades dramatically when the height of pedestrians is 30 pixels or less. A variation on this
approach is to use deformable part models for detection [9]. The part models are related to a root model using a
set of deformation models representing the expected location of body parts. Although this approach can handle a
wider variety of poses, Park et al. [10] found that the part-based model is not useful for pedestrian heights less
than 90 pixels. The same limitation applies to recent approaches such as deep convolutional neural networks
(CNNs), which have been widely adopted for pedestrian detection [38] and achieve state-of-the-art performance,
but not in low-resolution applications. Recently, a wave of deep CNN-based pedestrian detectors has achieved
good performance on several high-quality pedestrian benchmarks [41; 42], which is not the case for the low
resolution applications targeted by our work.
Contextual information can improve recognition, since in traffic scenes pedestrians are often around vehicles [12;
13; 39]. The approach of [40] proposes a segmentation and context network (SCN) structure that combines
segmentation and context information to improve the accuracy of pedestrian detection. Our work does not use
contextual information, since we wanted to make our approach more general and not limit our domain to traffic
scenes. Other approaches use features that are similar to Haar wavelets [14; 15; 16]. Viola and Jones [14]
popularized this approach and showed its applicability to face detection. The features are differences of
rectangular regions in the images. These are simple and very fast to compute. Although each feature is not very
discriminatory, a large number of features can be chained together to achieve good performance. The method of
AdaBoost is used to train the classifier and select features. In [15], Viola and Jones use Haar-like wavelets to
compute features in pairs of successive images for pedestrian detection.
Jones and Snow [16] extended the above algorithm to make use of 10 images in a sequence. This algorithm is the
closest one to our approach, since it uses a relatively long sequence. They used two types of Haar-like features:
Features applied within each frame, and differences of features between two different frames. On the PETS2001
dataset [17], their detector achieves a detection rate of 84% to 93% at a false positive rate of 10⁻⁶. They were able to
detect pedestrians down to a size of 20 pixels tall, in videos taken from stationary cameras. To get better
performance, one might try to extend the Jones and Snow method to work on longer sequences of images.
However, in this case the number of potential Haar-like features grows to an unmanageable amount. Because of
the large number of feature hypotheses that need to be examined at each stage, the training time can be quite slow
(on the order of weeks).
Other approaches also use the additional information provided by image sequences to improve detection. For
example, [8] uses a two stage classifier which uses the detection scores from previous images to improve the
classification in the current image. Optical flow information can be incorporated into a feature vector along with
image gradient information [18; 19]. In [20], gradient-weighted optical flow is computed from the first frame of the
sequence to detect objects (face or person), and is then convolved with the gradient magnitude for further tracking.
Other work [11; 22; 23] extracts spatiotemporal gradient features from the spatiotemporal volume of images. These methods
were developed to recognize actions in videos. Conceivably these approaches could also be used to detect
pedestrians. However, in low resolution image sequences, it would be difficult to extract local features, since the
volume is so small.
The work of [21] proposed a 3DHOG descriptor that characterizes motion features with co-occurrence spatio-
temporal vectors. To increase discrimination, the HOG-HOF (HOG plus Histogram of Optical Flow) and STHOG (spatio-
temporal HOG) descriptors have been proposed, at the price of very high computational cost. The optical flow-based
features appear to help in high resolution [4], but in low-resolution scenarios, detection results are poor due to
noise, camera jitter, and the limited number of pixels available. There is relatively little work focused on
pedestrian detection in low resolution aerial videos. Most previous work using aerial video performs motion
compensation by registering each image to a reference image. In this way a short term background image can be
computed, which can be used to detect foreground objects using image subtraction (e.g., [24]). The work of [25]
applied a joint global–local information algorithm to suppress background interference and enrich the
description of pedestrians. It is based on extracting features from human body parts, which are not available in low
resolution applications. Some approaches for pedestrian detection in aerial images use the same methods that
were discussed above, namely HOG-like features with an SVM classifier (e.g., [26]) or Haar-like features with an
AdaBoost classifier (e.g., [27]). Other approaches combine these features with additional information. For example
[28] uses shadows cast by people for classification. However, shadow information may not be reliable in low
resolution videos. The only approach that was found that uses image sequences for pedestrian detection in aerial
videos was [24]. This approach computes frequency measures of sub windows to detect the periodic motion of
legs, arms, etc. However, quantitative results were not presented. Some approaches integrate detection and
tracking. For example, [29] extracts hypotheses of body articulations from images; then combines those into
“tracklets” over a small number of frames, using a dynamic model of limbs and body parts.
This requires relatively high resolution images. Other approaches can detect very small pedestrians, but require a
longer sequence of frames. For example, the approach of [30] can detect and track vehicles and people only a few
pixels in size, but uses sequences ranging from tens of seconds to minutes long. The work of [31] checks 2D object
detections for consistency with scene geometry and converts them to 3D tracks. In videos taken from a rapidly
moving UAV, a pedestrian may only be visible in the field of view for a short time (a few seconds). Thus, methods
that require a long sequence of images to build up a track file in order to perform detection are not applicable. In
contrast, our system (described in the next section) only requires a short sequence of images and is applicable to
situations where the pedestrian is only visible for a short time.
III. PEDESTRIAN DETECTION METHOD
Figure 5 shows the architecture of the system, which contains two phases: a training phase and a detection phase.
In the training phase, positive training examples (i.e., volumes containing pedestrians) and negative training
examples (i.e., volumes not containing pedestrians) are created. Features are then extracted from these volumes.
In the detection phase, the binary classifier constructed during the training phase is used to scan over detected
ROIs in sequences of unseen testing images to search for pedestrians.
3.1 Video Stabilization
In the case of aerial video, where the images are taken from a moving camera, stabilization is applied to short
overlapping sequences of 32 frames. We start with the first frame of each sequence and use it as the reference
frame for the sequence. The remaining frames are registered to the first frame. The results are overlapping groups
of 32 frames, which are co-registered. This aids the next step, which is to detect ROIs containing potential moving
objects. In the case of videos taken from stationary cameras, the stabilization step can be skipped, since the
images are already co-registered. We assume that the camera is looking down at the ground, which is
approximately a planar surface. Thus, a homography (projective) transform describes the relationship between
any two images. To compute the transform, Harris corner interest points are matched between the reference
image and each subsequent image. We fit a homography to the matched points, using RANSAC to eliminate
outliers. We then apply the transform to align the current image to the reference image (Figure 6). The
assumption of a planar surface is only an approximation, although it is usually good if the camera is high above the
ground. However, any objects (such as buildings and trees) above the plane will be misregistered, and may
result in ROIs that do not correspond to actual moving objects. Our classifier will subsequently filter these out,
since motion patterns within these ROIs do not match the patterns of a walking person.
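As an illustration, a minimal Python/OpenCV sketch of this registration step follows. The helper names and parameter values are assumptions, and pyramidal Lucas-Kanade tracking is used here only as a convenient stand-in for matching the Harris corners between frames.

```python
import cv2
import numpy as np

def register_to_reference(ref_gray, cur_gray):
    """Warp cur_gray onto ref_gray using a RANSAC-fitted homography."""
    # Detect Harris corner interest points in the reference frame.
    pts_ref = cv2.goodFeaturesToTrack(ref_gray, maxCorners=500,
                                      qualityLevel=0.01, minDistance=7,
                                      useHarrisDetector=True)
    # Match the corners by tracking them into the current frame.
    pts_cur, status, _ = cv2.calcOpticalFlowPyrLK(ref_gray, cur_gray, pts_ref, None)
    ok = status.ravel() == 1
    # Fit a homography with RANSAC to eliminate outlier matches.
    H, _ = cv2.findHomography(pts_cur[ok], pts_ref[ok], cv2.RANSAC, 3.0)
    # Align the current image to the reference image.
    h, w = ref_gray.shape
    return cv2.warpPerspective(cur_gray, H, (w, h))

def stabilize_group(frames):
    """Register one overlapping group of 32 grayscale frames to its first frame."""
    return [frames[0]] + [register_to_reference(frames[0], f) for f in frames[1:]]
```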
3.2 ROI Detection
After stabilization, background subtraction is used to identify potential moving objects in the scene for subsequent
analysis.
Figure 5: The architecture of our overall pedestrian detection system.
A background model for each group of 32 frames is constructed by computing the mean of the frames. Then the
difference between the middle frame in the sequence and the background model is computed.
Initial foreground pixels are identified by thresholding the absolute value of the difference image. In our work, the
threshold was empirically set to 15–25 (depending on the image sequence). Morphological operations are then applied
to eliminate small regions and join broken regions: the opening operation removes small objects from the
foreground, placing them in the background, while the closing operation fills small holes in the foreground. A
disk-shaped structuring element with a radius of 4 or more is used. Connected components whose area lies
between 20 pixels and 500 pixels are extracted and their bounding boxes become the final ROIs. An example of an
image containing the final detected ROIs is shown in Figure 7. Although simple in design, the ROI detector
performs well in detecting potential moving objects. We deliberately tuned it to be very sensitive. For example, in
the VIRAT aerial dataset (described in Section 4), only 49 out of 1607 actual moving objects were not detected,
and in the PETS 2001 dataset, only 88 out of 1929 actual moving objects were not detected. Of course, the ROI
detector also occasionally detects non-moving objects, due to image noise and misregistration. The classifier
will subsequently filter out the non-moving objects (as well as non-pedestrians).
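A minimal sketch of this ROI detector is shown below; the threshold of 20 and disk radius of 4 fall within the ranges stated above, but the exact values and function names are illustrative.

```python
import cv2
import numpy as np

def detect_rois(frames, thresh=20, min_area=20, max_area=500, disk_radius=4):
    """Bounding boxes of potential moving objects in one group of 32 frames."""
    stack = np.stack([f.astype(np.float32) for f in frames])
    background = stack.mean(axis=0)                      # mean of the group as background
    diff = np.abs(stack[len(frames) // 2] - background)  # middle frame vs. background
    fg = (diff > thresh).astype(np.uint8) * 255          # threshold the difference image

    # Opening removes small foreground specks; closing fills small holes.
    k = cv2.getStructuringElement(cv2.MORPH_ELLIPSE,
                                  (2 * disk_radius + 1, 2 * disk_radius + 1))
    fg = cv2.morphologyEx(fg, cv2.MORPH_OPEN, k)
    fg = cv2.morphologyEx(fg, cv2.MORPH_CLOSE, k)

    # Keep connected components whose area lies between 20 and 500 pixels.
    n, _, stats, _ = cv2.connectedComponentsWithStats(fg)
    return [(x, y, w, h) for x, y, w, h, area in stats[1:]   # row 0 is the background
            if min_area <= area <= max_area]
```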
Figure 6: (Left) Reference frame (1st of sequence of 32 frames). (Right) The 16th frame in the sequence,
registered to the first.
Figure 7: An example of ROIs shown on: (Left) The binary image. (Right) The registered image.
3.3 Formation of Spatiotemporal Volumes
A sliding window of size 32×32 pixels is scanned within each ROI detected above. At each position, a
spatiotemporal volume is created by extracting a sequence of sub images (slices), at a fixed position in the
registered images, for N frames (we used up to 32 frames). The slice window size was chosen to be 32×32 pixels.
This size is large enough that a pedestrian remains within the window throughout the sequence at normal walking
speeds, which usually corresponds to about ½ pixel per frame. Since our detector is trained to detect pedestrians
with a height of approximately 20 pixels, this allows a border of about 6 pixels above and below the person (Figure
8). To handle possible variations in scale, we extract volumes at multiple scales in the image sequence by creating
a pyramid of images of different sizes. A scale factor of 0.75 is used between levels of the pyramid (for a total of 6
pyramid levels). This allows us to detect people that are taller than 20 pixels: the detector will fire at the pyramid
level where the person's height is about 20 pixels.
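The sketch below illustrates the volume formation and the pyramid under the parameters just stated; the function names are illustrative.

```python
import cv2
import numpy as np

def image_pyramid(img, levels=6, scale=0.75):
    """Pyramid with a 0.75 scale factor between its 6 levels."""
    pyr = [img]
    for _ in range(levels - 1):
        h, w = pyr[-1].shape[:2]
        pyr.append(cv2.resize(pyr[-1], (int(w * scale), int(h * scale))))
    return pyr

def extract_volume(registered_frames, x, y, size=32):
    """Stack up to 32 co-registered 32x32 slices taken at the same fixed
    (x, y) position: one spatiotemporal volume."""
    return np.stack([f[y:y + size, x:x + size] for f in registered_frames])
```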
3.4 Feature Extraction, Normalization, and Dimensionality Reduction
HOG features are then extracted from each of the slices that make up the volume. We divide each 32×32 pixel slice
into square cells (typically 4×4 pixels each), and compute a histogram of gradient directions in each cell. We use 9
bins for the gradient directions, which represent unsigned directions from 0°-180°. Following the method of [3],
cells are grouped into (possibly overlapping) blocks, where each block consists of 2×2 cells. The features from
each slice are then concatenated into a single large vector. Variations in illumination affect the magnitudes of the
gradients. The influence of large gradient magnitudes can be reduced using normalization, which can be
performed in input space or in feature space. We found that normalization in input space has little or no effect
on performance, and sometimes decreases it. Normalization is therefore performed in feature space.
Following the method of [32], we normalize the volumetric blocks using the L2-norm followed by clipping to limit
the maximum values (Lowe-style clipped L2-norm).
Figure 8: Spatiotemporal Volume. (Left) Positive example. (Middle) Gradient. (Right) Computed HOG, with a
volumetric block (shown in red) and a cell (shown in yellow).
The difference is that in our algorithm the features were normalized within each volumetric block, meaning that
the sequence of blocks across the N slices (e.g. 16 slices) is at the same place in each slice, as shown in Figure 9.
Next, the features from the volumetric block in all slices were concatenated into a single feature vector. The result
of the volumetric normalization step is a set of feature vectors that are more invariant to changes in illumination
or shadowing. Using lower dimensional features produces models with fewer parameters, which speeds up the
training and detection algorithms. Following the work of [9], we apply Principal Components Analysis (PCA) to the
feature vectors to reduce the dimensionality of the features. In the learning stage, we collect a large number of 36-
dimensional HOG features corresponding to blocks and perform PCA on them. The eigenvalues indicate that the
linear subspace spanned by the top 50% of eigenvectors captures the essential information in the features.
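The sketch below makes this feature pipeline concrete: per-slice 9-bin cell histograms, clipped-L2 normalization over a volumetric block (the same 2×2-cell block position concatenated across all slices), and PCA. Non-overlapping blocks and a clipping constant of 0.2 (in the spirit of [32]) are simplifying assumptions, not confirmed settings.

```python
import numpy as np
from sklearn.decomposition import PCA

def cell_histograms(slice32, cell=4, bins=9):
    """9-bin unsigned-gradient histograms for each 4x4 cell of a 32x32 slice."""
    gy, gx = np.gradient(slice32.astype(np.float32))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0      # unsigned directions, 0-180
    bin_idx = np.minimum((ang / (180.0 / bins)).astype(int), bins - 1)
    n = slice32.shape[0] // cell                      # 8 cells per side
    hist = np.zeros((n, n, bins))
    for i in range(n):
        for j in range(n):
            m = mag[i*cell:(i+1)*cell, j*cell:(j+1)*cell]
            b = bin_idx[i*cell:(i+1)*cell, j*cell:(j+1)*cell]
            for k in range(bins):
                hist[i, j, k] = m[b == k].sum()       # magnitude-weighted votes
    return hist

def volume_features(volume, clip=0.2):
    """Clipped-L2-normalize each volumetric block, then concatenate."""
    hists = np.stack([cell_histograms(s) for s in volume])   # (N, 8, 8, 9)
    feats = []
    for i in range(0, hists.shape[1], 2):                    # 2x2-cell blocks
        for j in range(0, hists.shape[2], 2):
            block = hists[:, i:i+2, j:j+2, :].ravel()        # same place in all slices
            block /= np.linalg.norm(block) + 1e-6            # L2 normalize
            block = np.minimum(block, clip)                  # Lowe-style clipping
            block /= np.linalg.norm(block) + 1e-6            # renormalize
            feats.append(block)
    return np.concatenate(feats)

# PCA on pooled 36-dimensional block features (2x2 cells x 9 bins); keeping
# the "top 50% of eigenvectors" is read here as n_components=18 (an assumption).
def fit_block_pca(pooled_blocks):                            # pooled_blocks: (M, 36)
    return PCA(n_components=18).fit(pooled_blocks)
```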
Figure 9: (a) Volumetric Block. (b) Spatiotemporal volume of 16 slices.
3.5 Classification
To search for pedestrians, we apply a classifier to each spatiotemporal volume within the detected ROIs. We first
train a support vector machine (SVM) on examples of positive (pedestrian) and negative (non-pedestrian) feature
vectors. To extract positive examples from the training videos, the following procedure was followed. A
pedestrian was manually selected in one of the images (in one of the detected ROIs) and a square sub window was
extracted from the image surrounding the pedestrian. This sub window was scaled such that the person was 20
pixels tall, and the sub window size was 32 × 32 pixels. Next, a sequence of sub windows was extracted from the
registered images; half of them preceding and half of them following the central image, at the same fixed place in
all images, and the sub windows were similarly scaled. A total of 32 such slices were assembled into a
spatiotemporal volume, representing a single positive example (Figure 10). Negative examples were also
extracted from the training images. These were spatiotemporal volumes of the same size as the positive examples,
but sampled randomly from completely person-free areas of detected ROIs. We used a freely available SVM-based
classifier, the OSU-SVM MATLAB toolbox [33], which is based on LIBSVM [34].
Figure 10: Four different training example sequences. Top: two sequences from VIRAT [35]. Bottom: sequences
from UCF-2009 [36]. Each sequence contains 8 slices subsampled from a sequence of 32 slices.
K-fold cross-validation is used for parameter selection, by partitioning the training data into 5 equally sized
segments, and then performing iterations of training and validation to pick the best parameters for the SVM
kernels. We experimented with two kernels – a linear kernel and a radial basis function kernel. Although the non-
linear kernel gives slightly more accurate results, for simplicity and speed we use the linear kernel as the baseline
classifier throughout this study. Finally, we perform non-maximum suppression on the detections. If the bounding
boxes of two or more detections overlap by more than 50%, they are merged into one detection by averaging their
top left coordinates.
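A minimal sketch of the training and suppression steps follows, with scikit-learn's LinearSVC standing in for the OSU-SVM/LIBSVM toolbox used here; the C grid, the fixed 32-pixel window size, and the greedy merge logic are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

def train_classifier(X, y):
    """5-fold cross-validated linear SVM over volume feature vectors.
    X: (n_examples, n_features); y: +1 (pedestrian) / -1 (non-pedestrian)."""
    grid = GridSearchCV(LinearSVC(), {"C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
    grid.fit(X, y)
    return grid.best_estimator_

def merge_detections(boxes, size=32, overlap=0.5):
    """Merge detections whose 32x32 windows overlap by more than 50%,
    averaging their top-left coordinates as described above."""
    boxes = [np.asarray(b, dtype=float) for b in boxes]      # each box is (x, y)
    merged = []
    while boxes:
        group, rest = [boxes.pop(0)], []
        for b in boxes:
            dx = max(size - abs(b[0] - group[0][0]), 0.0)
            dy = max(size - abs(b[1] - group[0][1]), 0.0)
            if dx * dy / float(size * size) > overlap:       # overlap fraction
                group.append(b)
            else:
                rest.append(b)
        boxes = rest
        merged.append(np.mean(group, axis=0))                # average top-left corners
    return merged
```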
IV.DATASETS, EXPERIMENTS, AND RESULTS
The results in this section use the following default parameters: blocks of size 2×2 cells, with no overlap, and
each cell consists of 4×4 pixels. We use 9 bins for gradient orientations, and normalize volumetric blocks using a
clipped L2-norm. The size of image slices is 32×32 pixels and a linear SVM classifier is used. In addition to
evaluating our algorithm, we compare our results to two other algorithms: the Dalal-Triggs algorithm [3], which is
among the most popular approaches for single frame pedestrian detection, and the Jones and Snow algorithm [16],
which was the best performing algorithm on low resolution pedestrians that we found. If we limit our algorithm
to use only a single image, it is essentially the same as the Dalal-Triggs algorithm. Therefore, we can directly
compare the performance of our algorithm to that of the Dalal-Triggs algorithm on each of the datasets. Although
we did not have an implementation of the Jones and Snow algorithm, they give performance results on one of the
datasets that we used, so we can compare our algorithm to theirs on that dataset. For evaluation and comparison
we use the following standard metrics, based on the possible outcomes of True Positive (TP), True Negative (TN),
False Positive (FP), and False Negative (FN). The “Detection Rate” (DR), or True Positive Rate (TPR) measures
how accurate the classifier is in sensing pedestrians. It is the proportion of positive examples (pedestrians) that
were correctly identified. It is calculated as:
DR = TP / (TP + FN) = Positives Correctly Classified / Total Positives
The “False Positive Rate” (FPR) is the proportion of negative examples (non-pedestrians) that were incorrectly
classified as positives (pedestrians). It is also known as the false positives per window (FPPW) rate. It is
calculated as:
FPR = FP / (FP + TN) = Negatives Incorrectly Classified / Total Negatives
A “Receiver Operating Characteristic” (ROC) curve depicts the tradeoff between hit rates (DR) and false alarm
rates of a classifier, as some parameters of the algorithm are varied.
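In code, the two metrics reduce to the formulas above; a trivial sketch:

```python
def detection_rate(tp, fn):
    """DR (TPR): proportion of pedestrian examples correctly identified."""
    return tp / float(tp + fn)

def false_positive_rate(fp, tn):
    """FPR (FPPW): proportion of non-pedestrian windows wrongly accepted."""
    return fp / float(fp + tn)
```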
4.1 Datasets
For evaluation of the proposed approach, we used six datasets that are representative of our application. These
are two stationary camera datasets (PETS2001 [17], VIRAT public 1.0 [35]) and three aerial datasets (VIRAT Fort
AP Hill [35], UCF-2009 [36], UCF-2007 [36]). In addition, we created our own set of aerial video sequences of
natural scenes, simulating realistic search-and-rescue and border-monitoring scenarios.
the stationary datasets, videos were collected from a stationary surveillance camera and no video stabilization is
needed. All the images were converted from color to grayscale since color information was not used during
feature extraction. In addition, grayscale images were used during image registration. All these datasets are low
resolution; however, the height of people in some images is greater than 20 pixels. Although our detector was
designed to detect people with heights of 20 pixels, it can still detect these larger pedestrians. Since an image
pyramid was used, the detector can detect people at the image level where the height was about 20 pixels. This
guarantees that at some level of the pyramid the people will be close to 20 pixels in height and can be detected by the
algorithm.
4.1.1 Stationary Datasets
The video sequences were taken by stationary cameras at the top of high buildings to record large numbers of
event instances across a very wide area while avoiding occlusion as much as possible. The cameras look down
upon a scene containing streets with buildings, trees, and parking lots. Cars and pedestrians periodically move
through the scene. The PETS 2001 dataset [17] is popular for automated surveillance research. It was also used by
Jones and Snow to evaluate their algorithm [16] (that we are comparing to). It contains 16 video sequences of
about two to four minutes in length, with a frame rate of 25 frames/second, and a frame size of 768 pixels in width and
576 pixels in height. Half the videos are designated as training, and half for testing. For this dataset, we extracted
2,560 training examples from the training videos. 960 of them were positive examples and 1600 were negative
examples. The second stationary camera dataset is the stationary VIRAT public 1.0 dataset [35], with a frame rate
of 30 frames/second, and frame size of 1280 pixels in width and 720 pixels in height. For this dataset, the training
set consisted of 1760 examples: 780 of them were positive examples and 980 were negative examples.
4.1.2 Aerial Datasets
The VIRAT Fort AP Hill aerial dataset [35] was recorded using an electro-optical sensor from a military aircraft
flying at a height up to 1000 meters. The resolution of these aerial videos is 640×480 with 30Hz frame rate, and
the typical pixel height of people in the collection is about 20 pixels. The videos include buildings and parking lots
where people and vehicles are engaged in different activities. The data is challenging in terms of low resolution,
uncontrolled background clutter, diversity in scenes, changing viewpoints, changing illumination, and low image
sharpness. For this dataset, the training set consisted of 1,280 positive examples and 1,280 negative examples.
The UCF-2009 (also known as UCF-LM) dataset [36] video sequences were obtained using an R/C-controlled
blimp with a gimbal-mounted camera, flying over a dirt parking lot near a football stadium in Florida. The
flying altitudes ranged from 400–450 feet. Actions were performed by different actors. The UCF-2009 dataset has
a resolution of 540×960 pixels with a 23Hz frame rate. For this dataset, the training set consisted of 1,000 positive
volumes and 1,000 negative volumes. The UCF-2007 dataset [36] is an earlier dataset from UCF, and is more
challenging due to large variations in camera motion, rapidly changing viewpoints, changes in object appearance,
pose, object scale, cluttered background, and illumination conditions. In addition, it suffers from interlacing,
motion blur, and poor focus. The resolution is 854×480 pixels with a 30 Hz frame rate. To remove artifacts
caused by the interlacing, we subsampled the images, so the final resolution was 427×240 pixels. For this dataset,
the training set consisted of 250 positive volumes and 250 negative volumes. The set of sequences that we
created was recorded using a DJI Phantom 3 quadcopter; from now on we will refer to this data as the DJI
dataset. This dataset consists of aerial imagery of people walking across a field, at a frame rate of 25 Hz with
image size of 720×1280 pixels. This dataset is challenging since the pedestrians are very small and there are
shadows cast by the pedestrians. In addition, there are variations in camera motion, changes in illumination
conditions, and some low image sharpness. For this dataset, the training set consisted of 780 positive volumes
and 1,260 negative volumes.
4.2 Experimental Results
Test sets were collected from portions of the videos that were different from the training sequences. ROIs were
detected in these images and the detector was scanned within each ROI. To show the main concept, Figure 11
presents an example of the results of applying the algorithm to one of the aerial datasets. After detecting ROIs, the
classifier was then applied to each volume within the ROIs. We shift volumes in the time direction every 4 frames.
This means that a pedestrian detected in frames 1:32 and again in frames 5:36 is counted as two detected instances.
This helps in recovering from missed detections and enlarges the testing set. Following [9], a detection is
considered to be correct if the area of overlap between the detection window and the ground truth window
exceeds 50%.
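A sketch of this matching criterion follows; the overlap ratio is interpreted here as intersection over union, the usual reading of the criterion in [9].

```python
def is_correct_detection(det, gt, thresh=0.5):
    """True when the overlap (intersection over union) between a detection
    window and a ground-truth window, both (x, y, w, h), exceeds 50%."""
    ax, ay, aw, ah = det
    bx, by, bw, bh = gt
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))   # intersection width
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))   # intersection height
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return union > 0 and inter / union > thresh
```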
For PETS 2001, the total number of tested examples within the ROIs was 1,235 positive examples (16 slices each)
and 8,730 negative examples (16 slices each). Using the detector with the default parameters, a detection rate of
94.7% was achieved with a false positive rate (FPR) of 10⁻⁶. At the same FPR, the Dalal algorithm [3], which uses
single images, achieved a detection rate of 73%. On the same dataset, the Jones and Snow algorithm [16] achieved
a detection rate of 93%, when 8 detectors were combined. For the stationary VIRAT dataset, the total number of
tested examples was 720 positive examples and 3860 negative examples. Using the detector with the default
parameters, we achieved a detection rate of 91% with a false positive rate of 10⁻⁶. At the same FPR, the single-
frame detector of Dalal achieved a DR of 70% on the same dataset. For the aerial VIRAT dataset, a total of 12,600
volumes were classified during the scanning over the detected ROIs, of which 5,016 were positive examples and
7,584 were negative. Using the detector with the default parameters, it achieved a DR of 78% at an FPR of 4×10⁻³.
This value of FPR means that only 4 in 1000 tested non-pedestrian volumes were falsely classified as pedestrians.
At the same FPR the single-frame Dalal detector achieved a detection rate of 39%. For UCF 2009, a total of 5,880
volumes were classified; 2,184 of them were positives, and 3,696 were negatives. Using the detector with
parameters tuned for the best performance on this dataset, DR is 92% at an FPR of 4×10⁻³. At the same FPR the Dalal
detector achieved a DR of 50%. For the UCF-2007 dataset, a total of 500 volumes were classified; half of them
positives and half of them were negatives. Using the detector with the default parameters, DR is 73% at FPR of
4×10⁻³. At the same FPR the Dalal detector achieved a DR of 41%. For the DJI dataset, a total of 1,820 volumes
were classified during the scanning over the detected ROIs, of which 740 were positive examples and 1,080 were
negative. The detector achieved a DR of 71% at an FPR of 4×10⁻³. At the same FPR the single-frame Dalal detector
achieved a detection rate of 54%. Figures 12-16 show detection examples on frames from the six datasets
discussed above. Detection results are shown as boxes, where TP is “true positive”, FP is “false positive”, TN is
“true negative”, and FN is “false negative” (to avoid cluttering the figures, we are not showing all the detected TNs).
Figure 12 (a,b) shows two example frames from the aerial VIRAT dataset with TP, FP, TN and FN results
labeled. Figure 12(c) shows the sequence of slices for one of the TP detections. Figure 12(d) shows an example of
a sequence of slices for one of TNs. TNs result from scanning the classifier around false ROIs that correspond to
motion regions resulting from non-pedestrian motion (e.g. vehicles) or from static objects due to imperfect
stabilization.
Here an ROI was detected on a static object (the corner of a building). Since the motion pattern for this object does
not match that of a pedestrian, the classifier labels this as a non-pedestrian. Corresponding results on the UCF
2009 dataset are shown in Figure 13. Examples of sequences of slices for TN detections are shown in Figure 13(c,
d). The TNs shown here resulted from scanning the classifier around false ROIs corresponding to non-pedestrian
motion, e.g. a car and a bicycle. The motion patterns here do not match those of pedestrians, so the classifier labels
them as non-pedestrians. Figure 14 shows frames from UCF-2007 dataset with detections, and Figure 15 shows
examples from the stationary VIRAT dataset and PETS 2001 dataset with detections. Corresponding results on the
DJI dataset are shown in Figure 16. Examples of sequences of slices for TN detections are shown in Figure 16(c,
d).
Figure 12: (a,b) Two frames from aerial VIRAT dataset with detections. (c) Slices for one TP in (a). (d) Slices for
the TN shown in (b).
Figure 13: Detections on two frames from UCF-2009 Dataset. (a) FP and TPs. (b) TNs. (c) and (d) Sequences for
TNs shown in (b).
The TNs shown here resulted from scanning the classifier around false ROIs corresponding to non-pedestrian
motion, e.g. part of a moving vehicle. The ROC curves are plotted in Figures 17-18. Figure 17(left) shows the ROC
curves of the three detectors: Dalal detector, Jones-Snow detector, and our multi-frame HOG detector on PETS
2001 dataset.
Figure 14: (a) and (b) Two frames from UCF-2007 dataset with example detections.
Figure 17(right) shows the ROC curves of the Dalal detector and the multi-frame HOG detector on the stationary
datasets. The multi-frame detector always outperforms the other detectors.
Figure 15: Example detections, (a) PETS2001 dataset. (b) Stationary VIRAT dataset.
Figure 16: Detections on two frames from DJI Dataset. (a) TP, TN and FP. (b) TP. (c) and (d) Sequences for TP and
FP shown in (a).
Figure 18 shows the ROC curves of the Dalal detector and the multi-frame HOG detector on the four aerial
datasets. The multi-frame detector always gives better detection rates than the Dalal detector.
4.3 Performance Study and Discussion
We hypothesized that using information from multiple images in the detector should be better than using
information from only a single image (or a few images), since the motion patterns should aid recognition. We trained and
tested the detector on volumes consisting of a different number of slices per volume, ranging from 1 to 32 slices.
Using parameters tuned for the best performance, different classifiers were trained and tested. In each
experiment, the same number of slices per volume is used in training and in testing. Figure 19(top) shows the DR
of each classifier on the stationary datasets at an FPR of 10⁻⁶, while Figure 19(bottom) shows the DR of each classifier
on the aerial datasets at an FPR of 4×10⁻³. The results confirm that the use of multiple images for detection
dramatically improves the results. The improvement in detection rate increases with the number of slices, until a
total of 16 slices is reached (corresponding to about a half of a second of walking).
Figure 17: (Left) ROC curves of the three detectors on PETS 2001 dataset. (Right) ROC curves of the two detectors
on Stationary VIRAT dataset.
Figure 18: ROC curve for the Multi-Frame HOG detector and the Dalal detector on (a) Aerial VIRAT dataset. (b)
Aerial UCF-2009 dataset. (c) Aerial UCF-2007 dataset. (d) DJI dataset.
Figures 20-21 show the ROC curves for each dataset, as the number of slices is varied. In the aerial VIRAT dataset
(Figure 20), using a single slice per volume gives a DR of 40% at an FPR of 4×10⁻³. At the same FPR, the use
of 16 slices per volume raises the DR to 78%. For the UCF-2009 dataset (Figure 21), using a single frame (i.e., one
slice per volume) gives a DR of 50% at an FPR of 4×10⁻³. At the same FPR, the use of 16 frames improves
DR to 92%. Note that the single slice case corresponds to the Dalal detector. We also studied the effect of cell size.
Figure 22 shows the ROC for UCF-2009 as the cell size is varied. Cells of size 4×4 pixels perform best. As shown
in Figure 23 (left), a person’s head, forearm, upper leg, and lower leg each appear to be approximately 4×4
pixels in size, which may allow cells of 4×4 pixels to capture the shape and motion of these parts. Figure 23(right) gives
some insight into what cues the detector uses to make its decision. It shows the weights corresponding to each
element of the feature vector, i.e., the values of the elements of w in the classifier decision function
f(x) = w^T x + b. The weights are shown for the central slice in the volume. The figure shows that the contours
of the pedestrian's head, shoulders, and lower legs have the highest weights, which represent the main cues for
detection.
Figure 19: The effect of the number of slices per volume on DR. (Top) Stationary datasets. (Bottom) Aerial datasets.
Figure 20: ROC of the classifiers with different number of slices per volume: aerial VIRAT.
Figure 21: ROC of the classifiers with different number of slices per volume: UCF-2009 Dataset.
Figure 22: ROC Curve: Cell size effect.
Figure 23: (Left) Cells of size 4×4 pixels appear to match key parts of the pedestrian’s body. (Right) SVM positive
weights.
4.4 Frame Randomization
To confirm our hypothesis that the classifier learns the characteristic motion of walking pedestrians, we
randomized the order of frames. The expectation is that giving the classifier temporally incoherent data should
reduce the performance. We followed the same procedure as before to extract ROIs and form spatiotemporal
volumes. However, this time we randomized the slices in both the training and testing volumes. Figure 24 shows
detection rates obtained from multiple tests on the two datasets.
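The randomization itself is a one-line operation; a sketch, assuming each volume is an (N, 32, 32) NumPy array:

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the ablation is repeatable

def randomize_slices(volume):
    """Shuffle slice order along the time axis, destroying temporal
    coherence while leaving each slice's appearance unchanged."""
    return volume[rng.permutation(len(volume))]
```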
Figure 24: The effect of randomizing the order of frames on classification performance.
The results show that the use of randomized frame sequences degraded detection rates by an average of 8% in the
aerial VIRAT dataset experiments and by an average of 12% in the UCF dataset experiments at an FPR of 4×10⁻³. For
example, for the UCF dataset, when the sequences were used in their normal coherent order, we obtained a
detection rate of 92%. In one of the tests in which randomized sequences were used, at the same FPR, the
detection rate degraded to 80%. These experiments indicate that the classifier is indeed learning the
characteristic motion of walking pedestrians.
V. CONCLUSIONS AND FUTURE WORK
We presented a method for detecting pedestrians in low-resolution videos, using a novel multiple-frame HOG-
based feature approach. The method is designed to detect pedestrians with heights as small as 20 pixels; taller
pedestrians are handled via the image pyramid. On five public datasets, including three challenging aerial datasets,
plus our own dataset, the method achieves excellent results: detection rates of 78%, 92%, 73%, and 71% at a false
positive rate of 4×10⁻³ on the aerial VIRAT, aerial UCF-2009, aerial UCF-2007, and aerial DJI datasets respectively.
For the stationary datasets, the method achieves detection rates of 94.7% and 91% at a false positive rate of 10⁻⁶
on the PETS 2001 and stationary VIRAT datasets respectively.
Figure 25: Example from the results of detecting pedestrians in challenging UAV videos. Detections are shown at
different pyramid levels.
We have also obtained excellent preliminary results on UAV videos [1] posted on YouTube (Figure 25). The
detector needs only a short sequence of frames to perform detection. Thus, it is applicable to situations where the
camera is moving rapidly and does not dwell on the same portion of the scene for very long. We studied the benefit
of using multiple frames on the performance of the detector. We confirmed that using additional frames improves
the performance significantly, up to about 16 frames. We also found that the detector can learn the coherence of
pedestrian motion. Future work should evaluate other classifiers to see if they perform better than the simple SVM
we used. Another direction is to train the classifier to detect multiple classes, such as fast walking, stationary, and
running people. Finally, our detector could be integrated into a standard tracker such as a multiple hypothesis
tracker or particle filter-based tracker. Even in aerial videos from rapidly moving cameras, a person is often in the
field of view for multiple seconds. Therefore, multiple detections can be associated into a single track, to improve
accuracy.
Acknowledgement
This work was partially supported by a gift from Northrop-Grumman.
REFERENCES
1. “Airsoft UAV footage”, downloaded from https://www.youtube.com/watch?v=rppbsvUSpxY, August 2016.
2. Johansson, Gunnar. 1973. "Visual perception of biological motion and a model for its analysis." Attention,
Perception, & Psychophysics 14, no. 2: 201-211.
3. Dalal, Navneet, and Bill Triggs. 2005. "Histograms of oriented gradients for human detection." In Computer
Vision and Pattern Recognition, IEEE Computer Society Conference, vol. 1, pp. 886-893.
4. Dollár, Piotr, Christian Wojek, Bernt Schiele, and Pietro Perona. 2012. "Pedestrian detection: An evaluation of
the state of the art." Pattern Analysis and Machine Intelligence, IEEE Transactions on 34, no. 4: 743-761.
5. Zhang, Shanshan, Rodrigo Benenson, Mohamed Omran, Jan Hosang, and Bernt Schiele. 2016. "How Far are
We from Solving Pedestrian Detection?." arXiv:1602.01237.
6. Enzweiler, Markus, and Dariu M. Gavrila. 2009. "Monocular pedestrian detection: Survey and
experiments." Pattern Analysis and Machine Intelligence, IEEE Transactions on 31, no. 12: 2179-2195.
7. Benenson, Rodrigo, Mohamed Omran, Jan Hosang, and Bernt Schiele. 2014. "Ten years of pedestrian
detection, what have we learned?” ECCV, CVRSUAD workshop.
8. A. González, D. Vázquez, S. Ramos, A. M. López, J. Amores. 2015. "Spatiotemporal Stacked Sequential Learning
for Pedestrian Detection". In Proceedings of the Iberian Conference on Pattern Recognition and Image
Analysis, Spain.
9. Felzenszwalb, Pedro, David McAllester, and Deva Ramanan. 2008. "A discriminatively trained, multiscale,
deformable part model." In Computer Vision and Pattern Recognition. IEEE Conference, pp. 1-8.
10. Park, Dennis, Deva Ramanan, and Charless Fowlkes .2010. "Multiresolution Models for Object Detection."
Proc. European Conf. Computer Vision (ECCV), pp. 241-254.
11. Klaser, A., Marszałek, M., and Schmid, C. 2008. "A spatio-temporal descriptor based on 3d-gradients." In BMVC
19th British Machine Vision Conference (pp. 275-1). British Machine Vision Association.
12. Yan, Junjie, Xucong Zhang, Zhen Lei, Shengcai Liao, and Stan Z. Li. 2013. "Robust multi-resolution pedestrian
detection in traffic scenes." In Computer Vision and Pattern Recognition (CVPR),IEEE Conference on, pp.
3033-3040.
13. Hu, Hai-Miao, Xiaowei Zhang, Wan Zhang, and Bo Li. 2015. "Joint global–local information pedestrian
detection algorithm for outdoor video surveillance." Journal of Visual Communication and Image
Representation 26: 168-181.
14. Viola, Paul, and Michael J. Jones. 2004. "Robust real-time face detection." International journal of computer
vision 57, no. 2: 137-154.
15. Viola, Paul, Michael J. Jones, and Daniel Snow. 2005. "Detecting pedestrians using patterns of motion and
appearance." International Journal of Computer Vision 63, no. 2: 153-161.
16. Jones, Michael J., and Daniel Snow. 2008. "Pedestrian detection using boosted features over many frames."
In Pattern Recognition. 19th International Conference, pp. 1-4. IEEE.
17. PETS dataset. http://www.cvg.cs.rdg.ac.uk/pets2001/pets2001-dataset.html.
18. Dalal, Navneet, Bill Triggs, and Cordelia Schmid. 2006. "Human detection using oriented histograms of flow
and appearance." Computer Vision–ECCV: 428-441.
19. Wojek, Christian, Stefan Walk, and Bernt Schiele. 2009. "Multi-cue onboard pedestrian detection."
In Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, pp. 794-801.
20. Mukherjee, Snehasis, and Dipti Prasad Mukherjee. 2015. "A motion-based approach to detect persons in low-
resolution video." Multimedia Tools and Applications 74, no. 21: 9475-9490.
21. Hua, C., Y. Makihara, Y. Yagi, S. Iwasaki, K. Miyagawa, and B. Li. 2015. "Onboard monocular pedestrian detection by
combining spatio-temporal HOG with structure from motion algorithm." Machine Vision and Applications
26, no. 2-3: 161-183.
22. Dollár, Piotr, Vincent Rabaud, Garrison Cottrell, and Serge Belongie. 2005. "Behavior recognition via sparse
spatio-temporal features." In Visual Surveillance and Performance Evaluation of Tracking and Surveillance.
2nd Joint IEEE International Workshop on, pp. 65-72.
23. Klaser, Alexander, Marcin Marszałek, and Cordelia Schmid. 2008. "A spatio-temporal descriptor based on
3D-gradients." In BMVC 19th British Machine Vision Conference, pp. 275-1.
24. Narayanaswami, Ranga, Anastasia Tyurina, David Diel, Raman K. Mehra, and Janice M. Chinn. 2011.
"Discrimination and tracking of dismounts using low-resolution aerial video sequences." In SPIE Optical
Engineering + Applications, pp. 81370H-81370H.
25. Hu, Hai-Miao, Xiaowei Zhang, Wan Zhang, and Bo Li. 2015. "Joint global–local information pedestrian detection
algorithm for outdoor video surveillance." Journal of Visual Communication and Image Representation
26: 168-181.
26. Oreifej, Omar, Ramin Mehran, and Mubarak Shah. 2010. "Human identity recognition in aerial images."
In Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, pp. 709-716.
27. Gaszczak, Anna, Toby P. Breckon, and Jiwan Han. 2011. "Real-time people and vehicle detection from UAV
imagery." In Proc. SPIE Conference Intelligent Robots and Computer Vision XXVIII: Algorithms and
Techniques, volume 7878, doi: 10.1117/12.876663
28. Reilly, Vladimir, Berkan Solmaz, and Mubarak Shah. 2010. "Geometric constraints for human detection in
aerial imagery." Proc. European Conf. Computer Vision (ECCV), pp. 252-265.
29. Andriluka, Mykhaylo, Stefan Roth, and Bernt Schiele. 2008. "People-tracking-by-detection and people-
detection-by-tracking." In Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, pp. 1-8.
30. Basharat, Arslan, Matt Turek, Yiliang Xu, Chuck Atkins, David Stoup, Keith Fieldhouse, Paul Tunison, and
Anthony Hoogs. 2014. "Real-time multi-target tracking in wide area motion imagery." In Applications of
Computer Vision (WACV), IEEE Winter Conference on, pp. 839-846.
31. Leibe, Bastian, Aleš Leonardis, and Bernt Schiele. 2008. "Robust object detection with interleaved
categorization and segmentation." International journal of computer vision 77, no. 1-3: 259-289.
32. Lowe, David G. 2004. "Distinctive image features from scale-invariant keypoints." International journal of
computer vision 60, no. 2: 91-110.
33. OSU-SVM Toolbox for MATLAB. Last update: 2009-07-17. http://sourceforge.net/projects/svm.
34. Chang, Chih-Chung, and Chih-Jen Lin. 2011. "LIBSVM: a library for support vector machines." ACM
Transactions on Intelligent Systems and Technology (TIST) 2, no. 3: 27.
35. Oh, Sangmin, Anthony Hoogs, Amitha Perera, Naresh Cuntoor, Chia-Chih Chen, Jong Taek Lee, Saurajit
Mukherjee et al. 2011. "A large-scale benchmark dataset for event recognition in surveillance video."
In Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, pp. 3153-3160.
36. UCF Lockheed-Martin UAV Dataset. 2009. Retrieved from http://vision.eecs.ucf.edu/aerial/index.html.
37. Sager, Hisham, and William Hoff. 2014. "Pedestrian detection in low resolution videos." In IEEE Winter
Conference on Applications of Computer Vision, pp. 668-673. IEEE.
38. Wang, Shiguang, Jian Cheng, Haijun Liu, and Ming Tang. 2018. "PCN: Part and context information for
pedestrian detection with CNNs." arXiv preprint arXiv:1804.04483.
39. Chen, Zhichang, Li Zhang, Abdul Mateen Khattak, Wanlin Gao, and Minjuan Wang. 2019. "Deep Feature Fusion
by Competitive Attention for Pedestrian Detection." IEEE Access.
40. Li, Zhaoqing, Zhenxue Chen, QM Jonathan Wu, and Chengyun Liu. 2019. "Pedestrian detection via deep
segmentation and context network." Neural Computing and Applications: 1-13.
41. Zhang, Liliang, Liang Lin, Xiaodan Liang, and Kaiming He. 2016. "Is Faster R-CNN doing well for pedestrian
detection?" In European Conference on Computer Vision, pp. 443-457. Springer, Cham.
42. Li, Chengyang, Dan Song, Ruofeng Tong, and Min Tang. 2019. "Illumination-aware faster R-CNN for robust
multispectral pedestrian detection." Pattern Recognition 85: 161-171.

Figure 1 shows an example of a pedestrian at four different resolution levels. As described in Section 2, existing algorithms for pedestrian detection do fairly well for high resolution images, but performance degrades dramatically when the height of pedestrians is 30 pixels or less. In addition to the small size of the pedestrians in the images, the problem of detecting pedestrians in low resolution video can be challenging for other reasons.
There can be a wide range of poses and appearances, including a variety of clothing. The lighting can vary and shadows can be present. Background clutter can have a similar appearance to pedestrians. Pedestrians can be partially occluded by other objects, or by other pedestrians.

Figure 1: An example pedestrian at four different resolution levels. The height of the pedestrian is 140, 50, 20, and 10 pixels. (Left) Images at actual size. (Right) The same images, stretched for visualization.

The problem is even more challenging in aerial videos. The effective resolution of the video is often degraded due to motion blur and haze, further reducing the available visual information on shape and appearance. In these scenarios, video stabilization is often used to compensate for camera motion, in order to help find moving objects in the scene. However, stabilization is imperfect, especially in the case of rapidly moving cameras. As a result, many false regions can be identified as moving objects. Another significant challenge is that the camera moves around frequently and does not dwell for long on a particular portion of the scene. Thus, algorithms that rely on a long sequence of observations to build up a motion track model may not be applicable.

As an example, Figure 2 shows snapshots from a video taken from a small quadrotor UAV flying rapidly over a field [1]. The camera moves erratically, undergoing large amplitude rotations and translations. As a result, people are usually only within the field of view for a short time (as briefly as several seconds). Although the size of people varies due to the changing altitude of the camera, the height of people is often as small as 20 pixels. Detecting people in these scenarios is extremely difficult, due to the low resolution.

Figure 2: Example images from aerial video (4 seconds apart). The size of the images is 380×640 pixels.

Although it is very difficult to recognize a person in a single low resolution image, the task is much easier when a short sequence of images is used. For example, Figure 3 shows a single low resolution frame in which it is difficult to recognize the object. The right portion of the figure is a sequence of frames in which a subject is performing a recognizable movement; i.e., walking. Despite the deficiency of recognizable features in the static image, the movement can be easily recognized when the sequence is put in motion on a screen. This phenomenon is well known from the pioneering work of Johansson [2], whose moving light display experiments showed convincingly that the human visual system can easily recognize people in image sequences, even though the images contained only a few bright spots (attached to the subjects' joints). A static image of spots is meaningless to observers, while a sequence of images creates a vivid perception of a person walking.

The most common and successful existing approaches for pedestrian detection, such as the Dalal detector [3], use histogram of oriented gradient (HOG) features with a support vector machine (SVM) classifier.
However, these approaches perform poorly when the height of pedestrians is 30 pixels or less [4; 37].
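As a concrete point of reference, this baseline can be sketched in a few lines using the pretrained 64×128-pixel pedestrian model that ships with OpenCV. This is an illustration of the generic Dalal-style pipeline, not the implementation evaluated in this paper, and the input file name is hypothetical.

```python
# Sketch of the standard single-frame HOG + SVM pedestrian detector
# (Dalal-Triggs style), using OpenCV's bundled pretrained model.
import cv2

hog = cv2.HOGDescriptor()  # default 64x128-pixel detection window
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

frame = cv2.imread("frame_0001.png")     # hypothetical input frame
boxes, weights = hog.detectMultiScale(
    frame,
    winStride=(4, 4),                    # sliding-window step
    padding=(8, 8),
    scale=1.05)                          # image-pyramid scale factor
for (x, y, w, h) in boxes:
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 1)
```

Because the detection window is 64×128 pixels, a 20-pixel-tall person must be upsampled by roughly a factor of six before it even fills the window, which mainly magnifies blur and noise; this gives some intuition for the degradation reported at 30 pixels and below.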
Figure 3: (Left) A single frame. (Right) A sequence of frames from a video of a walking person.

To compensate for the lack of information in a single frame, we propose a novel detection method that recognizes pedestrians in a short sequence of frames. Namely, we take the single-frame HOG-based detector and extend it to multiple frames. Our approach (Section 3) uses HOG features extracted from a spatiotemporal volume of images. We use a volume composed of up to 32 "slices", where each slice is a small sub-image of 32×32 pixels. This volume represents a duration of about one second or less. The idea is that the motion of a walking person is distinctive, and we can train a classifier to recognize the temporal sequence of feature vectors within the volume (a code sketch of this descriptor is given below, after the discussion of the single-frame HOG + SVM baseline). As an example, consider the images of a walking person shown in the top row of Figure 4. The corresponding HOG features are shown in the second row. The third row shows images of a moving car. The sequence of corresponding HOG features (bottom row) of the negative example is visually quite different from that of the positive example.

The main contribution of this work is the development of a novel multi-frame HOG-based pedestrian detector that is able to detect pedestrians in video at lower resolutions than has been reported in previous work. Our detector achieves significantly better accuracy than existing detectors on challenging video datasets.

The rest of this paper is organized as follows. We discuss related work in Section 2. Our pedestrian detection method is presented in Section 3. A detailed description of experimental results is presented in Section 4. Section 5 summarizes conclusions and future work.

Figure 4: (a) Positive example (pedestrian). (b) Negative example (part of a car passing by a post).

II. RELATED WORK

There is an extensive body of literature on people detection, although there is less work on pedestrian detection in low-resolution videos, and relatively little work on pedestrian detection in aerial videos. Comprehensive reviews can be found in [4; 5; 6; 7]. Most work focuses on pedestrian detection in single high-resolution images. Instead of an explicit model, an implicit representation is learned from examples, using machine learning techniques. These approaches typically extract features from the image and then apply a classifier to decide if the image contains a person. Typically, the detection system is applied to sub-images over the entire image, using a sliding window approach. A multi-scale approach can be used to handle different sizes of the person in the window. Alternatively, the detection system can be preceded by a region-of-interest selector, which generates initial object hypotheses using some simple and fast tests. Then the full person detection system is applied to the candidate windows.

The most common and successful approaches for single-frame pedestrian detection use gradient-based features. The Dalal-Triggs detector [3] used histogram of oriented gradient (HOG) features with a support vector machine (SVM) classifier.
A model for the shape of a person is learned from many training examples. The HOG + SVM approach is still considered a competitive baseline in pedestrian detection.
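The multi-frame extension described in Section 1 can be sketched as follows: HOG features are computed for each 32×32 slice of the spatio-temporal volume, concatenated in temporal order, and the resulting sequence descriptor is classified with an SVM. The slice size and volume length follow the text; the HOG cell/block parameters, the linear SVM, and the variable names are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch of a multi-frame HOG descriptor: per-slice HOG features over a
# spatio-temporal volume, concatenated in temporal order, then
# classified with a linear SVM.
import cv2
import numpy as np
from sklearn.svm import LinearSVC

# 32x32 window, 16x16 blocks, 8x8 stride and cells, 9 orientation bins
# -> 324 floats per slice (assumed parameters).
hog = cv2.HOGDescriptor((32, 32), (16, 16), (8, 8), (8, 8), 9)

def volume_descriptor(slices):
    """slices: up to 32 grayscale uint8 sub-images of 32x32 pixels,
    one per frame, centered on the candidate moving object."""
    feats = [hog.compute(s).ravel() for s in slices]  # HOG per slice
    return np.concatenate(feats)  # ordered temporal sequence of HOG vectors

# Hypothetical training on labeled volumes (pedestrian vs. non-pedestrian):
# X = np.stack([volume_descriptor(v) for v in train_volumes])
# clf = LinearSVC().fit(X, train_labels)
# score = clf.decision_function([volume_descriptor(test_volume)])
```

Concatenating in temporal order preserves the phase of the gait across slices, so the classifier can learn the progression of the walking motion rather than just its time-averaged appearance.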
Although the HOG + SVM approach has excellent performance in high resolution images, studies [4] have shown that performance degrades dramatically when the height of pedestrians is 30 pixels or less. A variation on this approach is to use deformable part models for detection [9]. The part models are related to a root model using a set of deformation models representing the expected locations of body parts. Although this approach can handle a wider variety of poses, Park et al. [10] found that the part-based model is not useful for pedestrian heights of less than 90 pixels. The same applies to some recent approaches such as deep convolutional neural networks (CNNs), which have been widely adopted for pedestrian detection [38] and achieve state-of-the-art performance, but not for low-resolution applications. Recently, a wave of deep CNN-based pedestrian detectors has achieved good performance on several high-quality pedestrian benchmarks [41; 42], but this does not carry over to the low resolution applications targeted by our work.

Contextual information can improve recognition, since in traffic scenes pedestrians are often around vehicles [12; 13; 39]. The approach of [40] proposes a segmentation and context network (SCN) structure that combines segmentation and context information to improve the accuracy of pedestrian detection. Our work does not use contextual information, since we wanted to make our approach more general and not limit our domain to traffic scenes.

Other approaches use features that are similar to Haar wavelets [14; 15; 16]. Viola and Jones [14] popularized this approach and showed its applicability to face detection. The features are differences of rectangular regions in the images. These are simple and very fast to compute. Although each feature is not very discriminatory, a large number of features can be chained together to achieve good performance. The method of AdaBoost is used to train the classifier and select features. In [15], Viola and Jones use Haar-like wavelets to compute features in pairs of successive images for pedestrian detection. Jones and Snow [16] extended this algorithm to make use of 10 images in a sequence. This algorithm is the closest one to our approach, since it uses a relatively long sequence. They used two types of Haar-like features: features applied within each frame, and differences of features between two different frames. On the PETS2001 dataset [17], their detector achieves a detection rate from 84% to 93% at a false positive rate of 10⁻⁶. They were able to detect pedestrians down to a size of 20 pixels tall, in videos taken from stationary cameras. To get better performance, one might try to extend the Jones and Snow method to work on longer sequences of images. However, in this case the number of potential Haar-like features grows to an unmanageable amount. Because of the large number of feature hypotheses that must be examined at each stage, training can be quite slow (on the order of weeks).
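The two feature types just described can be illustrated with integral images, which reduce any rectangle sum to four array lookups; the rectangle layout below is a hypothetical example, not Jones and Snow's actual feature pool.

```python
# Illustration of within-frame and between-frame Haar-like rectangle
# features, computed from integral images.
import numpy as np

def integral(img):
    # ii[y, x] = sum of img[0..y, 0..x]
    return np.cumsum(np.cumsum(img.astype(np.float64), axis=0), axis=1)

def rect_sum(ii, x, y, w, h):
    # Sum of the w x h rectangle with top-left corner (x, y).
    a = ii[y + h - 1, x + w - 1]
    b = ii[y - 1, x + w - 1] if y > 0 else 0.0
    c = ii[y + h - 1, x - 1] if x > 0 else 0.0
    d = ii[y - 1, x - 1] if x > 0 and y > 0 else 0.0
    return a - b - c + d

def within_frame_feature(ii, x, y, w, h):
    # Difference of two adjacent rectangles in one frame (edge-like cue).
    return rect_sum(ii, x, y, w, h) - rect_sum(ii, x + w, y, w, h)

def between_frame_feature(ii_a, ii_b, x, y, w, h):
    # Same rectangle compared across two frames (motion cue).
    return rect_sum(ii_a, x, y, w, h) - rect_sum(ii_b, x, y, w, h)
```

With T frames, the pool of between-frame features grows with the number of frame pairs (on the order of T²) times the number of candidate rectangles, which is why exhaustive boosting over long sequences becomes impractical, as noted above.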
Other approaches also use the additional information provided by image sequences to improve detection. For example, [8] uses a two-stage classifier in which the detection scores from previous images improve the classification in the current image. Optical flow information can be incorporated into a feature vector along with image gradient information [18; 19]. In [20], gradient-weighted optical flow is computed from the first frame of the sequence to detect objects (a face or a person); it is then convolved with the gradient magnitude for further tracking. Other work [11; 22; 23] extracts spatiotemporal gradient features from the spatiotemporal volume of images. These methods were developed to recognize actions in videos. Conceivably these approaches could also be used to detect pedestrians; however, in low resolution image sequences it would be difficult to extract local features, since the volume is so small. The work of [21] proposed a 3DHOG descriptor that characterizes motion features using co-occurrence spatio-temporal vectors. To increase discrimination, the HOG-HOF (HOG plus Histogram of Optical Flow) and STHOG (spatio-temporal HOG) descriptors have been proposed, at the price of very high computational cost. Optical flow-based features appear to help at high resolution [4], but in low-resolution scenarios detection results are poor due to noise, camera jitter, and the limited number of pixels available.
There is relatively little work focused on pedestrian detection in low resolution aerial videos. Most previous work using aerial video performs motion compensation by registering each image to a reference image. In this way a short-term background image can be computed, which can be used to detect foreground objects using image subtraction (e.g., [24]). The work of [25] applied a joint global–local information algorithm to suppress background interference and enrich the description of pedestrians. It is based on extracting features from human body parts, which are not available in low resolution applications. Some approaches for pedestrian detection in aerial images use the same methods discussed above, namely HOG-like features with an SVM classifier (e.g., [26]) or Haar-like features with an AdaBoost classifier (e.g., [27]). Other approaches combine these features with additional information; for example, [28] uses shadows cast by people for classification. However, shadow information may not be reliable in low resolution videos. The only approach we found that uses image sequences for pedestrian detection in aerial videos is [24]. It computes frequency measures of sub-windows to detect the periodic motion of legs, arms, etc.; however, quantitative results were not presented. Some approaches integrate detection and tracking. For example, [29] extracts hypotheses of body articulations from images, then combines them into "tracklets" over a small number of frames, using a dynamic model of limbs and body parts.
This requires relatively high resolution images. Other approaches can detect very small pedestrians, but require a longer sequence of frames. For example, the approach of [30] can detect and track vehicles and people only a few pixels in size, but uses sequences ranging from tens of seconds to minutes long. The work of [31] checks 2D object detections for consistency with scene geometry and converts them to 3D tracks. In videos taken from a rapidly moving UAV, a pedestrian may be visible in the field of view for only a short time (a few seconds). Thus, methods that require a long sequence of images to build up a track file before performing detection are not applicable. In contrast, our system (described in the next section) requires only a short sequence of images and is applicable to situations where the pedestrian is visible only briefly.
III. PEDESTRIAN DETECTION METHOD
Figure 5 shows the architecture of the system, which contains two phases: a training phase and a detection phase. In the training phase, positive training examples (i.e., volumes containing pedestrians) and negative training examples (i.e., volumes not containing pedestrians) are created, and features are extracted from these volumes. In the detection phase, the binary classifier constructed during training is scanned over detected ROIs in sequences of unseen test images to search for pedestrians.
3.1 Video Stabilization
In the case of aerial video, where the images are taken from a moving camera, stabilization is applied to short overlapping sequences of 32 frames. We start with the first frame of each sequence and use it as the reference frame; the remaining frames are registered to it. The result is overlapping groups of 32 co-registered frames. This aids the next step, which is to detect ROIs containing potential moving objects. For videos taken from stationary cameras, the stabilization step can be skipped, since the images are already co-registered. We assume that the camera is looking down at the ground, which is approximately a planar surface; thus, a homography (projective) transform describes the relationship between any two images. To compute the transform, Harris corner interest points are matched between the reference image and each subsequent image. We fit a homography to the matched points, using RANSAC to eliminate outliers, and then apply the transform to align the current image to the reference image (Figure 6). The assumption of a planar surface is only an approximation, although it is usually good if the camera is high above the ground. However, any objects (such as buildings and trees) above the plane will be misregistered, and may produce ROIs that do not correspond to actual moving objects. Our classifier subsequently filters these out, since the motion patterns within these ROIs do not match those of a walking person.
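The following sketch illustrates this stabilization step in Python with OpenCV; the paper does not name its implementation, so the library choice and the use of Lucas-Kanade tracking to match the Harris corners between frames are our assumptions.

```python
import cv2
import numpy as np

def stabilize_sequence(frames):
    """Register a short sequence of grayscale frames to its first frame.

    Illustrative sketch of Section 3.1: Harris corners in the reference
    frame are tracked into each later frame (LK tracking is an assumption;
    the paper only states that corners are matched), then a homography
    fitted with RANSAC warps the frame onto the reference.
    """
    ref = frames[0]
    h, w = ref.shape
    # Harris-based corner detection in the reference frame.
    pts_ref = cv2.goodFeaturesToTrack(ref, maxCorners=500, qualityLevel=0.01,
                                      minDistance=7, useHarrisDetector=True)
    registered = [ref]
    for frame in frames[1:]:
        # Match corners by tracking them into the current frame.
        pts_cur, status, _ = cv2.calcOpticalFlowPyrLK(ref, frame, pts_ref, None)
        good = status.ravel() == 1
        # Fit a projective transform, discarding outlier matches with RANSAC.
        H, _ = cv2.findHomography(pts_cur[good], pts_ref[good], cv2.RANSAC, 3.0)
        registered.append(cv2.warpPerspective(frame, H, (w, h)))
    return registered
```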
3.2 ROI Detection
After stabilization, background subtraction is used to identify potential moving objects in the scene for subsequent analysis. A background model for each group of 32 frames is constructed by computing the mean of the frames. Then the difference between the middle frame of the sequence and the background model is computed.
Figure 5: The architecture of our overall pedestrian detection system.
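A minimal sketch of this ROI detector, assuming OpenCV; the threshold, structuring-element, and connected-component parameters are taken from the ranges reported in the continuation of this section below.

```python
import cv2
import numpy as np

def detect_rois(frames, thresh=20, disk_radius=4, min_area=20, max_area=500):
    """Sketch of Section 3.2: mean-background subtraction on a registered
    32-frame group, followed by thresholding, morphology, and
    connected-component filtering. Parameter defaults mirror the reported
    ranges (threshold 15-25, disk radius >= 4, area 20-500 px)."""
    stack = np.stack(frames).astype(np.float32)
    background = stack.mean(axis=0)
    middle = stack[len(frames) // 2]
    # Threshold the absolute difference between middle frame and background.
    fg = (np.abs(middle - background) > thresh).astype(np.uint8)
    # Disk-shaped structuring element; opening removes small specks,
    # closing fills small holes.
    se = cv2.getStructuringElement(cv2.MORPH_ELLIPSE,
                                   (2 * disk_radius + 1, 2 * disk_radius + 1))
    fg = cv2.morphologyEx(fg, cv2.MORPH_OPEN, se)
    fg = cv2.morphologyEx(fg, cv2.MORPH_CLOSE, se)
    # Keep connected components whose area lies in [min_area, max_area].
    n, labels, stats, _ = cv2.connectedComponentsWithStats(fg)
    rois = [tuple(stats[i, :4]) for i in range(1, n)
            if min_area <= stats[i, cv2.CC_STAT_AREA] <= max_area]
    return rois  # list of (x, y, width, height) bounding boxes
```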
Initial foreground pixels are identified by thresholding the absolute value of the difference image. In our work, the threshold was set empirically to 15-25 (depending on the image sequence). Morphological operations are applied to eliminate small regions and join broken regions: the opening operation removes small objects from the foreground, placing them in the background, while the closing operation removes small holes in the foreground. A disk-shaped structuring element with a radius of 4 or more pixels is used. Connected components whose area lies between 20 and 500 pixels are extracted, and their bounding boxes become the final ROIs. An example of an image containing the final detected ROIs is shown in Figure 7. Although simple in design, the ROI detector performs well at detecting potential moving objects. We deliberately tuned it to be very sensitive: in the VIRAT aerial dataset (described in Section 4), only 49 out of 1,607 actual moving objects were not detected, and in the PETS 2001 dataset, only 88 out of 1,929. Of course, the ROI detector also occasionally detects non-moving objects, due to image noise and misregistration. The classifier subsequently filters out the non-moving objects (as well as non-pedestrians).
Figure 6: (Left) Reference frame (first of a sequence of 32 frames). (Right) The 16th frame in the sequence, registered to the first.
Figure 7: An example of ROIs shown on: (Left) the binary image; (Right) the registered image.
3.3 Formation of Spatiotemporal Volumes
A sliding window of size 32×32 pixels is scanned within each ROI detected above. At each position, a spatiotemporal volume is created by extracting a sequence of sub-images (slices) at a fixed position in the registered images, for N frames (we used up to 32 frames). The 32×32 pixel slice window is large enough that a pedestrian remains within the window throughout the sequence at normal walking speeds, which usually correspond to about ½ pixel per frame. Since our detector is trained to detect pedestrians approximately 20 pixels tall, this allows a border of about 6 pixels above and below the person (Figure 8). To handle variations in scale, we extract volumes at multiple scales by creating a pyramid of images of different sizes, with a scale factor of 0.75 between levels (for a total of 6 pyramid levels). This allows us to detect people taller than 20 pixels: the detector fires at the pyramid level where the person's height is about 20 pixels.
3.4 Feature Extraction, Normalization, and Dimensionality Reduction
HOG features are then extracted from each of the slices that make up the volume. We divide each 32×32 pixel slice into square cells (typically 4×4 pixels each) and compute a histogram of gradient directions in each cell. We use 9 bins for the gradient directions, representing unsigned directions from 0° to 180° (a sketch is given below).
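A sketch of the per-slice cell histogram computation just described, written directly in NumPy so the bin layout is explicit; block grouping and normalization are applied later, across the volume.

```python
import numpy as np

def cell_histograms(slice_img, cell=4, bins=9):
    """Per-slice HOG cell histograms (Section 3.4 sketch): 4x4-pixel
    cells, 9 unsigned orientation bins over 0-180 degrees."""
    img = slice_img.astype(np.float32)
    gy, gx = np.gradient(img)
    mag = np.hypot(gx, gy)
    # Unsigned orientation in [0, 180); bin width = 180 / bins degrees.
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0
    bin_idx = np.minimum((ang / (180.0 / bins)).astype(int), bins - 1)
    n = img.shape[0] // cell  # 8 cells per side for a 32x32 slice
    hist = np.zeros((n, n, bins), np.float32)
    for i in range(n):
        for j in range(n):
            sl = (slice(i * cell, (i + 1) * cell),
                  slice(j * cell, (j + 1) * cell))
            # Magnitude-weighted vote of each pixel into its orientation bin.
            np.add.at(hist[i, j], bin_idx[sl].ravel(), mag[sl].ravel())
    return hist  # shape (8, 8, 9) for a 32x32 slice
```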
Following the method of [3], cells are grouped into (possibly overlapping) blocks, where each block consists of 2×2 cells. The features from each slice are then concatenated into a single large vector. Variations in illumination affect the magnitudes of the gradients. The influence of large gradient magnitudes can be reduced by normalization, which can be performed in input space or in feature space. We found that normalization in input space has little or no effect on performance, and sometimes decreases it, so normalization is performed in feature space. Following the method of [32], we normalize the volumetric blocks using the L2-norm followed by clipping to limit the maximum values (Lowe-style clipped L2-norm).
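A sketch of this clipped L2 normalization; the clipping threshold of 0.2 follows Lowe [32], since the paper does not state the exact value used.

```python
import numpy as np

def clipped_l2_normalize(block_features, clip=0.2, eps=1e-8):
    """Lowe-style clipped L2 normalization (Section 3.4 sketch).
    `block_features` concatenates one 2x2-cell block across all N slices
    of the volume. The clip value 0.2 is an assumption taken from Lowe's
    SIFT paper [32]."""
    v = block_features / (np.linalg.norm(block_features) + eps)
    v = np.minimum(v, clip)               # limit influence of large gradients
    return v / (np.linalg.norm(v) + eps)  # renormalize after clipping
```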
Figure 8: Spatiotemporal volume. (Left) Positive example. (Middle) Gradient. (Right) Computed HOG, with a volumetric block (shown in red) and a cell (shown in yellow).
The difference is that in our algorithm the features are normalized within each volumetric block, i.e., over the sequence of blocks at the same place in each of the N slices (e.g., 16 slices), as shown in Figure 9. Next, the features from the volumetric block in all slices are concatenated into a single feature vector. The result of the volumetric normalization step is a set of feature vectors that are more invariant to changes in illumination or shadowing. Using lower-dimensional features produces models with fewer parameters, which speeds up the training and detection algorithms. Following the work of [9], we apply Principal Components Analysis (PCA) to the feature vectors to reduce their dimensionality. In the learning stage, we collect a large number of 36-dimensional HOG features corresponding to blocks and perform PCA on them. The eigenvalues indicate that the linear subspace spanned by the top 50% of the eigenvectors captures the essential information in the features.
Figure 9: (a) Volumetric block. (b) Spatiotemporal volume of 16 slices.
3.5 Classification
To search for pedestrians, we apply a classifier to each spatiotemporal volume within the detected ROIs. We first train a support vector machine (SVM) on examples of positive (pedestrian) and negative (non-pedestrian) feature vectors. Positive examples were extracted from the training videos as follows. A pedestrian was manually selected in one of the images (in one of the detected ROIs), and a square sub-window surrounding the pedestrian was extracted from the image. This sub-window was scaled such that the person was 20 pixels tall and the sub-window was 32×32 pixels. Next, a sequence of sub-windows was extracted from the registered images, half of them preceding and half of them following the central image, at the same fixed place in all images, and scaled in the same way. A total of 32 such slices were assembled into a spatiotemporal volume, representing a single positive example (Figure 10). Negative examples were extracted from the training images in the same way: spatiotemporal volumes of the same size as the positive examples, but sampled randomly from completely person-free areas of the detected ROIs. We used a freely available SVM-based classifier, the OSU-SVM MATLAB toolbox [33], which is based on LIBSVM [34]; a sketch of the training pipeline is given below.
Figure 10: Four different training example sequences. Top: two sequences from VIRAT [35]. Bottom: sequences from UCF-2009 [36]. Each sequence contains 8 slices subsampled from a sequence of 32 slices.
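A sketch of the training stage, with scikit-learn standing in for the OSU-SVM/LIBSVM toolbox actually used; the feature layout and the SVM regularization constant are our assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC

def train_detector(X, y, keep_fraction=0.5):
    """Sketch of Sections 3.4-3.5. X has one row per training volume,
    formed by concatenating clipped-L2-normalized volumetric block
    features (so each row length is a multiple of 36); y is 1 for
    pedestrian volumes and 0 otherwise."""
    n_block = 36  # 2x2 cells x 9 bins per block, as in the text
    # PCA on the pool of 36-dimensional block features; keep the top
    # 50% of eigenvectors (the fraction reported in the paper).
    blocks = X.reshape(-1, n_block)
    pca = PCA(n_components=int(n_block * keep_fraction)).fit(blocks)
    X_reduced = pca.transform(blocks).reshape(X.shape[0], -1)
    # Linear kernel, as in the paper's baseline; C=1.0 is an assumed default.
    svm = LinearSVC(C=1.0).fit(X_reduced, y)
    return pca, svm
```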
K-fold cross-validation is used for parameter selection: the training data is partitioned into 5 equally sized segments, and iterations of training and validation are performed to pick the best parameters for the SVM kernels. We experimented with two kernels, a linear kernel and a radial basis function kernel. Although the non-linear kernel gives slightly more accurate results, for simplicity and speed we use the linear kernel as the baseline classifier throughout this study. Finally, we perform non-maximum suppression on the detections: if the bounding boxes of two or more detections overlap by more than 50%, they are merged into one detection by averaging their top-left coordinates (a sketch is given below).
IV. DATASETS, EXPERIMENTS, AND RESULTS
The results in this section use the following default parameters: blocks of 2×2 cells with no overlap, each cell consisting of 4×4 pixels; 9 bins for gradient orientations; volumetric blocks normalized using a clipped L2-norm; image slices of 32×32 pixels; and a linear SVM classifier. In addition to evaluating our algorithm, we compare our results to two other algorithms: the Dalal-Triggs algorithm [3], which is among the most popular approaches for single-frame pedestrian detection, and the Jones and Snow algorithm [16], which was the best performing algorithm on low resolution pedestrians that we found. If we limit our algorithm to a single image, it is essentially the same as the Dalal-Triggs algorithm; therefore, we can directly compare the performance of our algorithm to that of the Dalal-Triggs algorithm on each of the datasets. Although we did not have an implementation of the Jones and Snow algorithm, they report performance results on one of the datasets that we used, so we can compare our algorithm to theirs on that dataset. For evaluation and comparison we use the following standard metrics, based on the possible outcomes of True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). The "Detection Rate" (DR), or True Positive Rate (TPR), measures how accurately the classifier senses pedestrians. It is the proportion of positive examples (pedestrians) that were correctly identified:
DR = TP / (TP + FN) = (positives correctly classified) / (total positives).
The "False Positive Rate" (FPR) is the proportion of negative examples (non-pedestrians) that were incorrectly classified as positives (pedestrians). It is also known as the false positives per window (FPPW) rate:
FPR = FP / (FP + TN) = (negatives incorrectly classified) / (total negatives).
A "Receiver Operating Characteristic" (ROC) curve depicts the tradeoff between the hit rate (DR) and the false alarm rate of a classifier as parameters of the algorithm are varied.
4.1 Datasets
For evaluation of the proposed approach, we used six datasets that are representative of our application. Five are publicly available: two stationary camera datasets (PETS 2001 [17], VIRAT Public 1.0 [35]) and three aerial datasets (VIRAT Fort AP Hill [35], UCF-2009 [36], UCF-2007 [36]).
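A sketch of the merging rule described above for non-maximum suppression; measuring overlap relative to the fixed window area is our assumption, since the paper does not define the overlap measure precisely.

```python
def merge_overlapping(dets, thresh=0.5):
    """Sketch of the Section 3.5 merging rule: detections whose boxes
    overlap by more than 50% are fused by averaging their top-left
    corners. `dets` is a list of (x, y, w, h) boxes with a fixed
    window size."""
    merged, used = [], [False] * len(dets)
    for i, (x, y, w, h) in enumerate(dets):
        if used[i]:
            continue
        group = [(x, y)]
        for j in range(i + 1, len(dets)):
            xj, yj, wj, hj = dets[j]
            # Intersection of the two boxes.
            ix = max(0, min(x + w, xj + wj) - max(x, xj))
            iy = max(0, min(y + h, yj + hj) - max(y, yj))
            if not used[j] and ix * iy > thresh * w * h:
                group.append((xj, yj))
                used[j] = True
        # Average the top-left coordinates of the grouped detections.
        mx = sum(g[0] for g in group) / len(group)
        my = sum(g[1] for g in group) / len(group)
        merged.append((mx, my, w, h))
    return merged
```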
In addition, we created our own set of aerial video sequences of natural scenes, designed to be realistic: they simulate search and rescue and border monitoring scenarios. In the stationary datasets, videos were collected from a stationary surveillance camera, and no video stabilization is needed. All images were converted from color to grayscale, since color information was not used during feature extraction; grayscale images were also used during image registration. All of these datasets are low resolution; however, the height of people in some images is greater than 20 pixels. Although our detector was designed to detect people 20 pixels tall, it can still detect these larger pedestrians: since an image pyramid is used, at some level of the pyramid the people will be close to 20 pixels in height and can be detected by the algorithm.
4.1.1 Stationary Datasets
The video sequences were taken by stationary cameras at the top of tall buildings, recording large numbers of event instances across a very wide area while avoiding occlusion as much as possible. The cameras look down upon a scene containing streets with buildings, trees, and parking lots; cars and pedestrians periodically move through the scene. The PETS 2001 dataset [17] is popular in automated surveillance research. It was also used by Jones and Snow to evaluate their algorithm [16] (to which we compare). It contains 16 video sequences, each about two to four minutes long, with a frame rate of 25 frames/second and a frame size of 768×576 pixels. Half the videos are designated for training and half for testing. For this dataset, we extracted 2,560 training examples from the training videos: 960 positive and 1,600 negative. The second stationary camera dataset is the stationary VIRAT Public 1.0 dataset [35], with a frame rate of 30 frames/second and a frame size of 1280×720 pixels. For this dataset, the training set consisted of 1,760 examples: 780 positive and 980 negative.
4.1.2 Aerial Datasets
The VIRAT Fort AP Hill aerial dataset [35] was recorded with an electro-optical sensor from a military aircraft flying at altitudes up to 1,000 meters. The resolution of these aerial videos is 640×480 at a 30 Hz frame rate, and the typical pixel height of people in the collection is about 20 pixels. The videos include buildings and parking lots where people and vehicles are engaged in different activities. The data is challenging in terms of low resolution, uncontrolled background clutter, diversity of scenes, changing viewpoints, changing illumination, and low image sharpness. For this dataset, the training set consisted of 1,280 positive and 1,280 negative examples. The UCF-2009 (also known as UCF-LM) dataset [36] was obtained using an R/C-controlled blimp with a gimbal-mounted camera, flown at altitudes of 400-450 feet over a dirt parking lot near the football stadium in Florida. Actions were performed by different actors. The UCF-2009 dataset has a resolution of 540×960 pixels at a 23 Hz frame rate. For this dataset, the training set consisted of 1,000 positive and 1,000 negative volumes. The UCF-2007 dataset [36] is an earlier dataset from UCF, and is more challenging due to large variations in camera motion, rapidly changing viewpoints, changes in object appearance, pose, and scale, cluttered backgrounds, and varying illumination. In addition, it suffers from interlacing, motion blur, and poor focus. The resolution is 854×480 pixels at a 30 Hz frame rate. To remove interlacing artifacts, we subsampled the images, so the final resolution was 427×240 pixels. For this dataset, the training set consisted of 250 positive and 250 negative volumes. The set of sequences that we created ourselves was recorded using a DJI Phantom 3 quadcopter; from now on we refer to it as the DJI dataset. It consists of aerial imagery of people walking across a field, at a frame rate of 25 Hz with an image size of 720×1280 pixels. This dataset is challenging because the pedestrians are very small and cast shadows; in addition, there are variations in camera motion, changes in illumination, and some loss of image sharpness. For this dataset, the training set consisted of 780 positive and 1,260 negative volumes.
4.2 Experimental Results
Test sets were collected from portions of the videos different from the training sequences. ROIs were detected in these images, and the detector was scanned within each ROI. To illustrate the main concept, Figure 11 presents an example of applying the algorithm to one of the aerial datasets. After detecting ROIs, the classifier was applied to each volume within the ROIs. We shift volumes along the time direction every 4 frames (a sketch is given below), so a pedestrian detected in frames 1:32 and in frames 5:36 is counted as two detected instances. This helps in recovering from missed detections and increases the size of the testing set.
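A small sketch of this temporal scan; `volume_start_frames` is a hypothetical helper name, not from the paper.

```python
def volume_start_frames(n_frames, length=32, stride=4):
    """Sketch of the Section 4.2 temporal scan: volumes of `length`
    slices are shifted along the time axis every `stride` frames, so a
    pedestrian present throughout the video is counted once per window."""
    return list(range(0, n_frames - length + 1, stride))

# e.g. volume_start_frames(40) -> [0, 4, 8], i.e. frames 1:32 and 5:36
# (1-based) are two separate tested volumes.
```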
Following [9], a detection is considered correct if the area of overlap between the detection window and the ground truth window exceeds 50%. For PETS 2001, the total number of tested examples within the ROIs was 1,235 positive examples (16 slices each) and 8,730 negative examples (16 slices each). Using the detector with the default parameters, a detection rate of 94.7% was achieved at a false positive rate (FPR) of 10⁻⁶. At the same FPR, the Dalal algorithm [3], which uses single images, achieved a detection rate of 73%. On the same dataset, the Jones and Snow algorithm [16] achieved a detection rate of 93% when 8 detectors were combined. For the stationary VIRAT dataset, the total number of tested examples was 720 positive and 3,860 negative. Using the detector with the default parameters, we achieved a detection rate of 91% at an FPR of 10⁻⁶; at the same FPR, the single-frame Dalal detector achieved a DR of 70% on the same dataset. For the aerial VIRAT dataset, a total of 12,600 volumes were classified while scanning over the detected ROIs, of which 5,016 were positive and 7,584 negative. Using the detector with the default parameters, it achieved a DR of 78% at an FPR of 4×10⁻³. This FPR means that only 4 in 1,000 tested non-pedestrian volumes were falsely classified as pedestrians. At the same FPR, the single-frame Dalal detector achieved a detection rate of 39%. For UCF-2009, a total of 5,880 volumes were classified, 2,184 positive and 3,696 negative. Using the detector with parameters tuned for the best performance on this dataset, the DR is 92% at an FPR of 4×10⁻³; at the same FPR, the Dalal detector achieved a DR of 50%. For the UCF-2007 dataset, a total of 500 volumes were classified, half positive and half negative. Using the detector with the default parameters, the DR is 73% at an FPR of 4×10⁻³; at the same FPR, the Dalal detector achieved a DR of 41%. For the DJI dataset, a total of 1,820 volumes were classified while scanning over the detected ROIs, of which 740 were positive and 1,080 negative. The detector achieved a DR of 71% at an FPR of 4×10⁻³; at the same FPR, the single-frame Dalal detector achieved a detection rate of 54%. Figures 12-16 show detection examples on frames from the six datasets discussed above. Detection results are shown as boxes, where TP is "true positive", FP is "false positive", TN is "true negative", and FN is "false negative" (to avoid cluttering the figures, we do not show all the detected TNs). Figure 12(a, b) shows two frames from the aerial VIRAT dataset with TP, FP, TN, and FN results labeled. Figure 12(c) shows the sequence of slices for one of the TP detections, and Figure 12(d) shows an example of a sequence of slices for one of the TNs. TNs result from scanning the classifier around false ROIs that correspond to motion regions arising from non-pedestrian motion (e.g., vehicles) or from static objects due to imperfect stabilization.
Here an ROI was detected on a static object (the corner of a building). Since the motion pattern of this object does not match that of a pedestrian, the classifier labels it as a non-pedestrian. Corresponding results on the UCF-2009 dataset are shown in Figure 13, and examples of sequences of slices for TN detections are shown in Figure 13(c, d). The TNs shown there resulted from scanning the classifier around false ROIs corresponding to non-pedestrian motion, e.g., a car and a bicycle; the motion patterns do not match those of pedestrians, so the classifier labels them as non-pedestrians. Figure 14 shows frames from the UCF-2007 dataset with detections, and Figure 15 shows examples from the stationary VIRAT dataset and the PETS 2001 dataset. Corresponding results on the DJI dataset are shown in Figure 16, with examples of sequences of slices shown in Figure 16(c, d).
Figure 12: (a, b) Two frames from the aerial VIRAT dataset with detections. (c) Slices for one TP in (a). (d) Slices for the TN shown in (b).
Figure 13: Detections on two frames from the UCF-2009 dataset. (a) FP and TPs. (b) TNs. (c) and (d) Sequences for the TNs shown in (b). These TNs resulted from scanning the classifier around false ROIs corresponding to non-pedestrian motion, e.g., part of a moving vehicle.
The ROC curves are plotted in Figures 17-18. Figure 17 (left) shows the ROC curves of the three detectors (the Dalal detector, the Jones-Snow detector, and our multi-frame HOG detector) on the PETS 2001 dataset.
Figure 14: (a) and (b) Two frames from the UCF-2007 dataset with example detections.
Figure 17 (right) shows the ROC curves of the Dalal detector and the multi-frame HOG detector on the stationary VIRAT dataset. The multi-frame detector always outperforms the other detectors.
Figure 15: Example detections: (a) PETS 2001 dataset; (b) stationary VIRAT dataset.
Figure 16: Detections on two frames from the DJI dataset. (a) TP, TN, and FP. (b) TP. (c) and (d) Sequences for the TP and FP shown in (a).
Figure 18 shows the ROC curves of the Dalal detector and the multi-frame HOG detector on the four aerial datasets. The multi-frame detector always gives better detection rates than the Dalal detector.
4.3 Performance Study and Discussion
We hypothesized that using information from multiple images in the detector should be better than using information from only a single image (or a few images), since the motion patterns should aid recognition. We trained and tested the detector on volumes consisting of different numbers of slices per volume, ranging from 1 to 32. Using parameters tuned for the best performance, different classifiers were trained and tested; in each experiment, the same number of slices per volume was used in training and in testing. Figure 19 (top) shows the DR of each classifier on the stationary datasets at an FPR of 10⁻⁶, while Figure 19 (bottom) shows the DR of each classifier on the aerial datasets at an FPR of 4×10⁻³. The results confirm that the use of multiple images for detection dramatically improves the results. The improvement in detection rate increases with the number of slices, until a total of 16 slices is reached (corresponding to about half a second of walking).
Figure 17: (Left) ROC curves of the three detectors on the PETS 2001 dataset. (Right) ROC curves of the two detectors on the stationary VIRAT dataset.
Figure 18: ROC curves for the multi-frame HOG detector and the Dalal detector on (a) the aerial VIRAT dataset, (b) the aerial UCF-2009 dataset, (c) the aerial UCF-2007 dataset, and (d) the DJI dataset.
Figures 20-21 show the ROC curves for each dataset as the number of slices is varied. In the aerial VIRAT dataset (Figure 20), a single-slice-per-volume detector gives a DR of 40% at an FPR of 4×10⁻³; at the same FPR, the use of 16 slices per volume raises the DR to 78%. For the UCF-2009 dataset (Figure 21), using a single frame (i.e., one slice per volume) gives a DR of 50% at an FPR of 4×10⁻³; at the same FPR, the use of 16 frames improves the DR to 92%. Note that the single-slice case corresponds to the Dalal detector. We also studied the effect of cell size. Figure 22 shows the ROC for UCF-2009 as the cell size is varied; cells of 4×4 pixels perform best. As shown in Figure 23 (left), the size of a person's head, forearm, upper leg, and lower leg appears to be approximately 4×4 pixels, which may allow 4×4-pixel cells to capture the shape and motion of these parts. Figure 23 (right) gives some insight into the cues the detector uses to make its decision. It shows the weight corresponding to each element of the feature vector, i.e., the elements of w in the classifier decision function f(x) = wᵀx + b (sketched below). The weights are shown for the central slice in the volume. The figure shows that the contours of the pedestrian's head, shoulders, and lower legs have the highest weights, and thus represent the main cues for detection.
Figure 19: The effect of the number of slices per volume on DR. (Top) Stationary datasets. (Bottom) Aerial datasets.
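A sketch of how a per-cell weight map like Figure 23 (right) can be produced from a linear SVM; the feature layout is our assumption, and PCA is omitted for clarity.

```python
import numpy as np

def weight_map_for_slice(w, slice_index, n_slices, cells=(8, 8), bins=9):
    """Sketch of the Figure 23 visualization: sum the positive weights
    of a linear SVM decision function f(x) = w^T x + b over the 9
    orientation bins of each cell of one slice. Assumes the feature
    vector is laid out slice-by-slice, cell-by-cell (the paper does not
    specify the layout)."""
    per_slice = w.reshape(n_slices, cells[0], cells[1], bins)
    pos = np.maximum(per_slice[slice_index], 0.0)  # keep positive weights only
    return pos.sum(axis=-1)  # one value per cell, for display as an image
```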
Figure 20: ROC of the classifiers with different numbers of slices per volume: aerial VIRAT dataset.
Figure 21: ROC of the classifiers with different numbers of slices per volume: UCF-2009 dataset.
Figure 22: ROC curve: effect of cell size.
Figure 23: (Left) Cells of size 4×4 pixels appear to match key parts of the pedestrian's body. (Right) SVM positive weights.
4.4 Frame Randomization
To confirm our hypothesis that the classifier learns the characteristic motion of walking pedestrians, we randomized the order of frames. The expectation is that giving the classifier temporally incoherent data should reduce performance. We followed the same procedure as before to extract ROIs and form spatiotemporal volumes; however, this time we randomized the order of the slices in both the training and testing volumes. Figure 24 shows the detection rates obtained from multiple tests on the two datasets.
Figure 24: The effect of randomizing the order of frames on classification performance.
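A sketch of the randomization used in this control experiment.

```python
import numpy as np

def randomize_slices(volume, rng=None):
    """Sketch of the Section 4.4 control: permute the slices of a
    spatiotemporal volume (shape: n_slices x 32 x 32) along the time
    axis, destroying temporal coherence while keeping each slice
    intact. Applied to both training and testing volumes."""
    rng = rng or np.random.default_rng()
    return volume[rng.permutation(volume.shape[0])]
```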
The results show that the use of randomized frame sequences degraded detection rates by an average of 8% in the aerial VIRAT experiments and by an average of 12% in the UCF experiments, at an FPR of 4×10⁻³. For example, for the UCF dataset, when the sequences were used in their normal, coherent order, we obtained a detection rate of 92%; in one of the tests with randomized sequences, at the same FPR, the detection rate degraded to 80%. These experiments indicate that the classifier is indeed learning the characteristic motion of walking pedestrians.
V. CONCLUSIONS AND FUTURE WORK
We presented a method for detecting pedestrians in low-resolution videos, using a novel multiple-frame HOG-based feature approach. The method is designed to detect pedestrians with heights as small as 20 pixels. On five public datasets, including three challenging aerial datasets, plus our own dataset, the method achieves excellent results: detection rates of 78%, 92%, 73%, and 71% at a false positive rate of 4×10⁻³ on the aerial VIRAT, aerial UCF-2009, aerial UCF-2007, and aerial DJI datasets, respectively. On the stationary datasets, the method achieves detection rates of 94.7% and 91% at a false positive rate of 10⁻⁶ on the PETS 2001 and stationary VIRAT datasets, respectively.
Figure 25: Example results of detecting pedestrians in challenging UAV videos. Detections are shown at different pyramid levels.
We have also obtained excellent preliminary results on UAV videos [1] posted on YouTube (Figure 25). The detector needs only a short sequence of frames to perform detection; thus, it is applicable to situations where the camera is moving rapidly and does not dwell on the same portion of the scene for long. We studied the benefit of using multiple frames on the performance of the detector and confirmed that using additional frames improves performance significantly, up to about 16 frames. We also found that the detector learns the coherence of pedestrian motion. Future work should evaluate other classifiers to see if they perform better than the simple SVM we used. Another direction is to train the classifier to detect multiple classes, such as fast-walking, stationary, and running people. Finally, our detector could be integrated into a standard tracker, such as a multiple hypothesis tracker or a particle filter-based tracker. Even in aerial videos from rapidly moving cameras, a person is often in the field of view for multiple seconds; therefore, multiple detections can be associated into a single track to improve accuracy.
Acknowledgement
This work was partially supported by a gift from Northrop-Grumman.
REFERENCES
1. "Airsoft UAV footage", downloaded from https://www.youtube.com/watch?v=rppbsvUSpxY, August 2016.
2. Johansson, Gunnar. 1973. "Visual perception of biological motion and a model for its analysis." Attention, Perception, & Psychophysics 14, no. 2: 201-211.
3. Dalal, Navneet, and Bill Triggs. 2005. "Histograms of oriented gradients for human detection."
In Computer Vision and Pattern Recognition, IEEE Computer Society Conference, vol. 1, pp. 886-893.
4. Dollár, Piotr, Christian Wojek, Bernt Schiele, and Pietro Perona. 2012. "Pedestrian detection: An evaluation of the state of the art." Pattern Analysis and Machine Intelligence, IEEE Transactions on 34, no. 4: 743-761.
5. Zhang, Shanshan, Rodrigo Benenson, Mohamed Omran, Jan Hosang, and Bernt Schiele. 2016. "How Far are We from Solving Pedestrian Detection?" arXiv:1602.01237.
6. Enzweiler, Markus, and Dariu M. Gavrila. 2009. "Monocular pedestrian detection: Survey and experiments." Pattern Analysis and Machine Intelligence, IEEE Transactions on 31, no. 12: 2179-2195.
7. Benenson, Rodrigo, Mohamed Omran, Jan Hosang, and Bernt Schiele. 2014. "Ten years of pedestrian detection, what have we learned?" ECCV, CVRSUAD workshop.
8. González, A., D. Vázquez, S. Ramos, A. M. López, and J. Amores. 2015. "Spatiotemporal Stacked Sequential Learning for Pedestrian Detection." In Proceedings of the Iberian Conference on Pattern Recognition and Image Analysis, Spain.
9. Felzenszwalb, Pedro, David McAllester, and Deva Ramanan. 2008. "A discriminatively trained, multiscale, deformable part model." In Computer Vision and Pattern Recognition, IEEE Conference, pp. 1-8.
10. Park, Dennis, Deva Ramanan, and Charless Fowlkes. 2010. "Multiresolution Models for Object Detection." Proc. European Conf. Computer Vision (ECCV), pp. 241-254.
11. Klaser, Alexander, Marcin Marszałek, and Cordelia Schmid. 2008. "A spatio-temporal descriptor based on 3D-gradients." In BMVC 19th British Machine Vision Conference, pp. 275-1. British Machine Vision Association.
12. Yan, Junjie, Xucong Zhang, Zhen Lei, Shengcai Liao, and Stan Z. Li. 2013. "Robust multi-resolution pedestrian detection in traffic scenes." In Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, pp. 3033-3040.
13. Hu, Hai-Miao, Xiaowei Zhang, Wan Zhang, and Bo Li. 2015. "Joint global–local information pedestrian detection algorithm for outdoor video surveillance." Journal of Visual Communication and Image Representation 26: 168-181.
14. Viola, Paul, and Michael J. Jones. 2004. "Robust real-time face detection." International Journal of Computer Vision 57, no. 2: 137-154.
15. Viola, Paul, Michael J. Jones, and Daniel Snow. 2005. "Detecting pedestrians using patterns of motion and appearance." International Journal of Computer Vision 63, no. 2: 153-161.
16. Jones, Michael J., and Daniel Snow. 2008. "Pedestrian detection using boosted features over many frames." In Pattern Recognition, 19th International Conference, pp. 1-4. IEEE.
17. PETS dataset. http://www.cvg.cs.rdg.ac.uk/pets2001/pets2001-dataset.html.
18. Dalal, Navneet, Bill Triggs, and Cordelia Schmid. 2006. "Human detection using oriented histograms of flow and appearance." Computer Vision–ECCV: 428-441.
19. Wojek, Christian, Stefan Walk, and Bernt Schiele. 2009. "Multi-cue onboard pedestrian detection." In Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, pp. 794-801.
20. Mukherjee, Snehasis, and Dipti Prasad Mukherjee. 2015. "A motion-based approach to detect persons in low-resolution video." Multimedia Tools and Applications 74, no. 21: 9475-9490.
21. Hua, C., Y. Makihara, Y. Yagi, S. Iwasaki, K. Miyagawa, and B. Li. 2015. "Onboard monocular pedestrian detection by combining spatio-temporal HOG with structure from motion algorithm." Machine Vision and Applications 26(2-3): 161-183.
22. Dollár, Piotr, Vincent Rabaud, Garrison Cottrell, and Serge Belongie. 2005. "Behavior recognition via sparse spatio-temporal features." In Visual Surveillance and Performance Evaluation of Tracking and Surveillance, 2nd Joint IEEE International Workshop on, pp. 65-72.
23. Klaser, Alexander, Marcin Marszałek, and Cordelia Schmid. 2008. "A spatio-temporal descriptor based on 3D-gradients." In BMVC 19th British Machine Vision Conference, pp. 275-1.
24. Narayanaswami, Ranga, Anastasia Tyurina, David Diel, Raman K. Mehra, and Janice M. Chinn. 2011. "Discrimination and tracking of dismounts using low-resolution aerial video sequences." In SPIE Optical Engineering + Applications, pp. 81370H-81370H.
25.
Hu, Hai-Miao, Xiaowei Zhang, Wan Zhang, and Bo Li. 2015. "Joint global–local information pedestrian detection algorithm for outdoor video surveillance." Journal of Visual Communication and Image Representation 26: 168-181.
26. Oreifej, Omar, Ramin Mehran, and Mubarak Shah. 2010. "Human identity recognition in aerial images." In Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, pp. 709-716.
27. Gaszczak, Anna, Toby P. Breckon, and Jiwan Han. 2011. "Real-time people and vehicle detection from UAV imagery." In Proc. SPIE Conference Intelligent Robots and Computer Vision XXVIII: Algorithms and Techniques, volume 7878. doi: 10.1117/12.876663.
28. Reilly, Vladimir, Berkan Solmaz, and Mubarak Shah. 2010. "Geometric constraints for human detection in aerial imagery." Proc. European Conf. Computer Vision (ECCV), pp. 252-265.
29. Andriluka, Mykhaylo, Stefan Roth, and Bernt Schiele. 2008. "People-tracking-by-detection and people-detection-by-tracking." In Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, pp. 1-8.
30. Basharat, Arslan, Matt Turek, Yiliang Xu, Chuck Atkins, David Stoup, Keith Fieldhouse, Paul Tunison, and Anthony Hoogs. 2014. "Real-time multi-target tracking in wide area motion imagery." In Applications of Computer Vision (WACV), IEEE Winter Conference on, pp. 839-846.
31. Leibe, Bastian, Aleš Leonardis, and Bernt Schiele. 2008. "Robust object detection with interleaved categorization and segmentation." International Journal of Computer Vision 77, no. 1-3: 259-289.
32. Lowe, David G. 2004. "Distinctive image features from scale-invariant keypoints." International Journal of Computer Vision 60, no. 2: 91-110.
33. OSU-SVM Toolbox for MATLAB. Last update: 2009-07-17. http://sourceforge.net/projects/svm.
34. Chang, Chih-Chung, and Chih-Jen Lin. 2013. "LIBSVM: a library for support vector machines." ACM Transactions on Intelligent Systems and Technology (TIST) 2, no. 3: 27.
35. Oh, Sangmin, Anthony Hoogs, Amitha Perera, Naresh Cuntoor, Chia-Chih Chen, Jong Taek Lee, Saurajit Mukherjee et al. 2011. "A large-scale benchmark dataset for event recognition in surveillance video." In Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, pp. 3153-3160.
36. UCF Lockheed-Martin UAV Dataset. 2009. Retrieved from http://vision.eecs.ucf.edu/aerial/index.html.
37. Sager, Hisham, and William Hoff. 2014. "Pedestrian detection in low resolution videos." In IEEE Winter Conference on Applications of Computer Vision, pp. 668-673. IEEE.
38. Wang, Shiguang, Jian Cheng, Haijun Liu, and Ming Tang. 2018. "PCN: Part and context information for pedestrian detection with CNNs." arXiv:1804.04483.
39. Chen, Zhichang, Li Zhang, Abdul Mateen Khattak, Wanlin Gao, and Minjuan Wang. 2019. "Deep Feature Fusion by Competitive Attention for Pedestrian Detection." IEEE Access.
40. Li, Zhaoqing, Zhenxue Chen, Q. M. Jonathan Wu, and Chengyun Liu. 2019. "Pedestrian detection via deep segmentation and context network." Neural Computing and Applications: 1-13.
41. Zhang, Liliang, Liang Lin, Xiaodan Liang, and Kaiming He. 2016. "Is Faster R-CNN doing well for pedestrian detection?" In European Conference on Computer Vision, pp. 443-457. Springer, Cham.
42. Li, Chengyang, Dan Song, Ruofeng Tong, and Min Tang. 2019. "Illumination-aware faster R-CNN for robust multispectral pedestrian detection." Pattern Recognition 85: 161-171.