Multiple human tracking based on object detection remains challenging because errors in
object detection propagate into tracking errors. In this paper, we propose a tracking
method that minimizes the errors produced by the object detector. We use RetinaNet as the
object detector and the Hungarian algorithm for tracking. The cost matrix for the
Hungarian algorithm is calculated from the RetinaNet features, bounding box center
distances, and the intersection over union (IoU) of bounding boxes. We interpolate the
missing detections in the last step. The proposed method yields 43.2 MOTA on the MOT16
benchmark.
2. Dina Chahyati, Aniati Murni Arymurthy
http://www.iaeme.com/IJMET/index.asp 466 editor@iaeme.com
There are many deep learning architectures for object detection, such as YOLO, SSD, and
RetinaNet [8]. In our research, we use RetinaNet. RetinaNet is a one-stage object detector
that, when it was released in late 2017, performed better than the state-of-the-art two-stage
methods and outperformed YOLOv2 and SSD513. RetinaNet may use VGG, DenseNet, MobileNet, or
ResNet as its backbone.
Unfortunately, being the state of the art does not mean being perfect. Object detection
results still have limitations that make them difficult to use for object tracking, because
detection errors propagate into the tracks. An object detector commonly suffers from at least
two kinds of error: false positive (FP) and false negative (FN) detections. FP may appear in
three forms, as shown in Fig. 1: overdetection (two persons detected as three), under
detection (two persons detected as one), and inconsistent size.
FN means that the detector fails to find a person at a location where there should be one.
There are two categories of false negatives: a person is detected in some frames but not in
the frames in between, or a person is never detected anywhere in the scene.
Figure 1 Problems in detections: (a) overdetection, (b) under detection, (c) inconsistent size
We should be able to reduce these errors by retraining or fine-tuning the model. However,
deep learning models need high-performance computing resources, or at least a GPU, for the
training process. Moreover, even after fine-tuning, every deep-learning-based detection
method would still yield detection errors. In this paper, we therefore propose a way of
handling the detection errors in the tracking process in a limited computing environment,
where a GPU is not available. We use RetinaNet for object detection with ResNet50 [9] as the
backbone. We then use the combination of bounding box center distance, intersection over
union, and feature similarity from a Siamese network as input to the Hungarian algorithm,
which serves as the object association technique.
2. RELATED WORK
There are four major components in our research: the object detector (RetinaNet), the
method to associate objects between frames (Hungarian algorithm), the method to evaluate the
similarity between two feature vectors (Siamese network), and the evaluation of tracking
results. We explain each of these concepts briefly.
2.1. RetinaNet
We use RetinaNet because it is one of the most cited object detectors that is publicly
available. RetinaNet is claimed to be the first one-stage object detector that matches the
state-of-the-art COCO AP of more complex two-stage detectors such as Feature Pyramid Network
(FPN) or variants of Faster R-CNN [10]. A two-stage detector generates a sparse candidate
set of object locations (bounding boxes) in the first stage, and in
3. Multiple Human Tracking Using Retinanet Features, Siamese Neural Network, and Hungarian Algorithm
the second stage classifies each candidate into one of several classes. A one-stage detector
instead classifies a dense sampling of possible object locations directly; it usually runs
faster but is less accurate than a two-stage detector.
RetinaNet is a single, unified network that consists of a backbone network (which computes a
convolutional feature map of the input image) and two task-specific subnetworks (object
classification and bounding box regression). The key innovation of RetinaNet is a new loss
function called focal loss, which improves the accuracy of the detector while maintaining
its detection speed.
In our research, we use ResNet as the backbone, as suggested by the RetinaNet paper [8]. It
consists of 216 layers in total. The feature that we use for each detected object is a
256-dimensional vector extracted from the 2D convolutional layers P3, P4, P5, P6, and P7.
P3 to P5 are computed from the outputs of the corresponding ResNet residual stages (C3 to C5).
As an illustration, suppose we have a typical 1920 x 1080 MOT image. The image is resized so
that its longer side is 1333 pixels. The resized image is then divided into anchor boxes of
various sizes. In our case, the image is resized to 1333 x 750 pixels, and the output feature
maps are shown in Table 1.
Table 1 RetinaNet Feature Map Size
Layer   Size of Feature Map   Number of Feature Vectors   Anchor Size
P3      167 x 94 x 256        15698                       8 x 8
P4      84 x 47 x 256         3948                        16 x 16
P5      42 x 24 x 256         1008                        32 x 32
P6      21 x 12 x 256         252                         64 x 64
P7      11 x 6 x 256          66                          128 x 128
Total                         20972
Each of those 20972 feature vectors of size 256 is then associated with three scales
($2^0$, $2^{1/3}$, $2^{2/3}$) and three aspect ratios (1:2, 1:1, 2:1) of the original anchor
size. All of the outputs are then concatenated as input for the classification and
regression layers of RetinaNet. The regression layer is responsible for predicting the
appropriate bounding box for each feature vector under the 9 corresponding combinations of
scale and ratio. Thus, for our image we have 20972 x 9 = 188748 candidate bounding boxes.
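The arithmetic behind Table 1 can be checked with a short sketch. It assumes that each pyramid level downsamples the resized image by a stride equal to the value in the "Anchor Size" column and that partial cells are rounded up; both are assumptions made for illustration, although they reproduce the reported sizes. The names `STRIDES` and `feature_map_sizes` are hypothetical.

```python
import math

# ASSUMPTION: the stride of each level equals the "Anchor Size" column of Table 1.
STRIDES = {"P3": 8, "P4": 16, "P5": 32, "P6": 64, "P7": 128}
ANCHORS_PER_LOCATION = 3 * 3  # three scales x three aspect ratios


def feature_map_sizes(width, height):
    """Return {layer: (cols, rows)} using ceiling division by the stride."""
    return {
        layer: (math.ceil(width / stride), math.ceil(height / stride))
        for layer, stride in STRIDES.items()
    }


sizes = feature_map_sizes(1333, 750)
vectors_per_layer = {layer: cols * rows for layer, (cols, rows) in sizes.items()}
total_vectors = sum(vectors_per_layer.values())
total_candidates = total_vectors * ANCHORS_PER_LOCATION

print(sizes)
print(total_vectors, total_candidates)
```

Under these assumptions the totals agree with the 20972 feature vectors and 188748 candidate boxes stated in the text.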
2.2. Hungarian Algorithm
Suppose we input two consecutive frames, t and t + 1, to RetinaNet and find the detected
persons in each frame. Since each frame contains many persons, we need to associate each
person in frame t with a person in frame t + 1 in order to track them. One of the best-known
and simplest methods for the assignment problem is the Hungarian algorithm. Given a cost
matrix, the Hungarian algorithm matches each row element to a column element such that the
total cost is minimal, as illustrated in Fig. 2. Person A is associated with person D, B
with E, and C with F because this combination yields the minimum total cost.
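To make the assignment objective concrete, the sketch below finds the minimum-cost pairing for a 3 x 3 matrix. The cost values are hypothetical and are not taken from Fig. 2. The search is exhaustive rather than the Hungarian algorithm itself; for a small square matrix it returns the same optimum, and in practice one would use an existing Hungarian-style solver such as SciPy's `linear_sum_assignment`.

```python
from itertools import permutations

# Hypothetical costs: rows are persons A, B, C in frame t,
# columns are persons D, E, F in frame t + 1.
rows = ["A", "B", "C"]
cols = ["D", "E", "F"]
cost = [
    [0.2, 0.9, 0.8],
    [0.7, 0.1, 0.9],
    [0.9, 0.6, 0.3],
]


def min_cost_assignment(cost):
    """Exhaustively search all one-to-one assignments of a square cost matrix."""
    n = len(cost)

    def total_cost(perm):
        return sum(cost[r][perm[r]] for r in range(n))

    best = min(permutations(range(n)), key=total_cost)
    return list(best), total_cost(best)


assignment, total = min_cost_assignment(cost)
pairs = [(rows[r], cols[c]) for r, c in enumerate(assignment)]
print(pairs, total)
```

Exhaustive search grows factorially with the number of persons, which is why a polynomial-time method such as the Hungarian algorithm is used for real frames.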
To use the Hungarian algorithm for tracking, we have to provide it with an appropriate cost
matrix between the persons in the frames. In addition to the position of the bounding box,
we want to take into account the convolutional feature of each object because it should
represent the visual appearance. Given a pair of feature vectors representing two detected
persons in different frames, we must find a metric that evaluates how similar those vectors
are.
Figure 2 Example of the Hungarian algorithm applied to tracking between frames
The intuition is that similar objects should have similar feature vectors. This holds for
some objects but not for others. We tried a simple Euclidean distance, but the results show
that some objects have a large feature distance even though their appearance looks similar,
as shown in Fig. 3.
Figure 3 Euclidean distance of feature vectors is not always consistent with visual appearance
Table 2
Frame 1   Frame 2   Bounding box distance   Feature vector distance
7001      7001      1.00                    10.71
133151    133151    3.00                    20.30
147751    149263    1.00                    45.67
151549    151549    4.12                    17.96
151828    151828    3.16                    16.47
152520    152520    5.10                    7.37
160127    160127    1.00                    9.23
163763    163763    1.41
2.3. Siamese Neural Network
The unexpectedly large distances between similar-looking objects in consecutive frames may
be caused by noise in the images that is invisible to our eyes. Therefore, we need a more
robust way to define similarity between two RetinaNet feature vectors. We use the Siamese
network introduced in [11]. In that paper, the authors note that a Siamese network is well
suited to classification problems in which the number of classes is large but each class has
only a few instances.
A Siamese Neural Network (SNN), as the name suggests, consists of twin functions $G_W$ that
share the same set of parameters $W$, and a cost module that generates the distance
$\|G_W(x_1) - G_W(x_2)\|$. The input of the SNN is a pair of images $(x_1, x_2)$ and a label
$Y$. The label may be $Y = 0$ (similar) or $Y = 1$ (dissimilar). The output of the Siamese
network is a similarity number; e.g., if it is less than 0.5, the pair is considered similar.
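As a sketch of how such a pair is typically scored during training, the code below implements the contrastive loss defined in [11], using the same label convention (Y = 0 similar, Y = 1 dissimilar). This paper does not state which loss was used to train its SNN, so the choice of loss, the margin of 1.0, and the two-dimensional stand-in embeddings are assumptions for illustration only.

```python
import math


def euclidean_distance(u, v):
    """Distance between two embeddings G_W(x1) and G_W(x2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(distance, y, margin=1.0):
    """Contrastive loss for one pair: y = 0 (similar), y = 1 (dissimilar).

    ASSUMPTION: margin=1.0 is an illustrative value, not taken from the paper.
    """
    similar_term = (1 - y) * 0.5 * distance ** 2
    dissimilar_term = y * 0.5 * max(0.0, margin - distance) ** 2
    return similar_term + dissimilar_term


# Hypothetical embeddings standing in for the outputs of the twin networks.
d = euclidean_distance([0.0, 0.0], [0.3, 0.4])
loss_if_similar = contrastive_loss(d, y=0)
loss_if_dissimilar = contrastive_loss(d, y=1)
print(d, loss_if_similar, loss_if_dissimilar)
```

The loss pulls similar pairs together (it grows with distance when Y = 0) and pushes dissimilar pairs apart until they are at least the margin away (it is zero once the distance exceeds the margin when Y = 1).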
2.4. Evaluation Method
One of the most popular tracking evaluations is the MOT evaluation metric [7]. We use six of
the MOT tracking metrics:
Recall: percentage of correctly detected targets, compared to the ground truth
MT: mostly tracked trajectories, i.e. ground truth trajectories tracked for more than 80% of their length
FP: number of false positives
FN: number of missed targets
IDs: number of ID switches. An ID switch happens when a person is identified as a different
person due to a missed association or because the person was occluded by other objects
MOTA: multi-object tracking accuracy, reported as a percentage with a maximum of 100 (it can
become negative when the errors outnumber the ground truth objects), MOTA = 1 – error, where
error is defined as (FN + FP + IDs) divided by the number of ground truth objects
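The MOTA definition above can be written as a small function. The FP, FN, and IDs values in the example are the overall figures reported in the tracking results of this paper. The ground truth count of 182326 boxes is not stated in this paper; it is an assumption based on the published size of the MOT16 test set.

```python
def mota(fn, fp, id_switches, num_ground_truth):
    """Return MOTA in percent: 100 * (1 - (FN + FP + IDs) / GT)."""
    error = (fn + fp + id_switches) / num_ground_truth
    return 100.0 * (1.0 - error)


# ASSUMPTION: 182326 ground truth boxes in the MOT16 test set (not given in the text).
MOT16_TEST_GROUND_TRUTH = 182326

overall_mota = mota(
    fn=92171, fp=9821, id_switches=1492, num_ground_truth=MOT16_TEST_GROUND_TRUTH
)
print(round(overall_mota, 1))

# MOTA is negative when the errors outnumber the ground truth objects.
print(mota(fn=80, fp=40, id_switches=5, num_ground_truth=100))
```

Because FN dominates the numerator here, reducing missed detections has a much larger effect on MOTA than reducing ID switches.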
3. PROPOSED METHOD
In this section, we give more details about the proposed method and the dataset that we
used. The flowchart of our proposed method for the first three frames is shown in Fig. 4.
Frame 4 and all later frames are processed in the same way as Frame 3. Each frame goes
through RetinaNet to obtain the bounding boxes and convolutional (CNN) features.
Figure 4 Flowchart of our method
The original RetinaNet code only outputs the best 300 filtered bounding boxes, along with
the predicted class of the object in each bounding box (person, car, etc.). We use the
resnet50_coco_best_v2.1.0 model of RetinaNet, which classifies objects into 80 classes, and
keep only the “person” class. In order to get the Conv2D feature, we need to extract the
contents of the RetinaNet intermediate layer that stores the complete 188748 candidate
bounding boxes (the clip_boxes layer) and find the corresponding Conv2D feature vector. These
feature vectors may come from layer P3, P4, P5, P6, or P7. In our figures, the number next to
each bounding box identifies which of the 188748 candidates the box is. Suppose the box
number is 150019; then it came from layer P4, because 150019/9 ≈ 16668 falls within the
range of P4, whose 3948 feature vectors follow the 15698 feature vectors of P3 (Table 1).
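The lookup from a box number to its pyramid layer can be sketched as follows. The per-layer counts come from Table 1. The numbering scheme is an assumption inferred from the example in the text: candidates are taken to be numbered from 0, ordered from P3 to P7, with the 9 anchors of one feature vector stored consecutively. The function name `layer_of_box` is hypothetical.

```python
# Number of feature vectors per layer, taken from Table 1.
VECTORS_PER_LAYER = [("P3", 15698), ("P4", 3948), ("P5", 1008), ("P6", 252), ("P7", 66)]
ANCHORS_PER_VECTOR = 9


def layer_of_box(box_number):
    """Return (layer, index of the feature vector within that layer).

    ASSUMPTION: boxes are numbered from 0, layers are ordered P3..P7, and the
    9 anchors that share one feature vector have consecutive box numbers.
    """
    if box_number < 0:
        raise ValueError("box number must be non-negative")
    vector_index = box_number // ANCHORS_PER_VECTOR
    offset = 0
    for layer, count in VECTORS_PER_LAYER:
        if vector_index < offset + count:
            return layer, vector_index - offset
        offset += count
    raise ValueError("box number out of range")


print(layer_of_box(150019))
print(layer_of_box(7001))
```

With this mapping, box 150019 resolves to P4 as in the text, and the returned index identifies which 256-dimensional vector of that layer to read.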
After collecting the RetinaNet features for every detected person in all frames, we need to
associate the same person across frames using the Hungarian algorithm. The two key
components of the Hungarian algorithm are the sets of tasks (rows) and assignees (columns)
to be associated, and the cost function between each task-assignee pair. The straightforward
way is to consider the persons detected in frame t as tasks and the persons detected in
frame t + 1 as assignees, but this fails when persons are occluded. Instead of pairing the
persons in consecutive frames, we pair the persons in frame t + 1 with the set of all persons
found in frames 1 to t. This handles the case in which a person is undetected in
intermediate frames due to occlusion or missed detection. In Fig. 4, this process is shown in
the processing of Frame 3, where the input of the SNN comes from the RetinaNet output of
Frame 3 as assignees and the list of all tracked persons (not only the output of frame t – 1)
as tasks.
Another important component of the Hungarian algorithm is the definition of the cost between
any pair of detected persons in different frames. In our study, we define the cost as a
combination of three metrics: the distance between bounding box centers, the intersection
over union (IoU) of the bounding boxes, and the output of the Siamese network. The distance
between bounding box centers is used for its simplicity, because in most cases the same
person in consecutive frames is much closer than any other person.
The intersection over union of a candidate pair is useful when the bounding box sizes of the
same person differ between consecutive frames, as shown in Fig. 1c. There, the height of the
same person is detected quite differently. If we only used the distance between bounding box
centers, the two detections would be considered different persons because the distance is
too large. By using the intersection over union of the two bounding boxes, we can consider
these detections as the same person.
The SNN that we use to calculate feature similarity consists of three 128-node dense layers
with ReLU activation, alternated with two 10% dropout layers. We trained the SNN with the
first 300 frames of each MOT16 training scene. We combine the data from all scenes to build
a more general SNN that can be used for all MOT16 test scenes.
IoU is defined as the ratio between the intersection and the union of a pair of bounding
boxes, so its range is [0,1]. The output of the SNN is a number such that the two inputs are
considered similar if the output is less than 0.5. To match these two metrics, the distance
between bounding box centers needs to be normalized. We normalize it by dividing by a
constant chosen such that two boxes are considered close if the normalized distance is less
than 0.5. In our case, we use 60 as the normalizer, which comes from observation of scene 04
of the MOT16 dataset.
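A minimal sketch of the combined cost for one pair of boxes is given below. The normalizer of 60 is the value reported above. Two points are assumptions: boxes are represented as (x1, y1, x2, y2) corner coordinates, and IoU enters the sum as 1 - IoU so that, like the other two terms, a lower value means a better match (the text does not spell out how IoU is turned into a cost). The SNN output is passed in as a plain number, and all function names are hypothetical.

```python
import math

CENTER_NORMALIZER = 60.0  # normalizer reported in the text


def center_distance(box_a, box_b):
    """Euclidean distance between the centers of two (x1, y1, x2, y2) boxes."""
    ax, ay = (box_a[0] + box_a[2]) / 2.0, (box_a[1] + box_a[3]) / 2.0
    bx, by = (box_b[0] + box_b[2]) / 2.0, (box_b[1] + box_b[3]) / 2.0
    return math.hypot(ax - bx, ay - by)


def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes, in [0, 1]."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def pair_cost(box_a, box_b, snn_output):
    """Sum of the three terms; ASSUMPTION: IoU is used as (1 - IoU)."""
    distance_term = center_distance(box_a, box_b) / CENTER_NORMALIZER
    iou_term = 1.0 - iou(box_a, box_b)
    return distance_term + iou_term + snn_output


# Hypothetical boxes of the same person in two frames, and a hypothetical SNN output.
tracked_box = (100.0, 100.0, 150.0, 250.0)
detected_box = (103.0, 104.0, 153.0, 254.0)
example_cost = pair_cost(tracked_box, detected_box, snn_output=0.2)
print(example_cost)
```

Keeping every term on a comparable scale is what allows a single threshold on the sum to be meaningful.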
The original Hungarian algorithm accepts a square cost matrix in which the rows and columns
represent tasks and assignees. A later modification allows a non-square matrix as input, in
which case the number of assignments equals the smaller of the row and column counts. In our
case, this is a problem when a new person appears in the current frame, since that person
would be forced to match one of the existing persons found so far. That is why we need to
apply a threshold to
the cost matrix: if a detected person in the current frame is far from every person in our
existing list, that person is considered a new person and is omitted from the input cost
matrix of the Hungarian algorithm. In our experiment, a detection is considered a new person
if the sum of all three metrics is greater than 1.8.
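The gating step can be sketched as a filter on the columns of the cost matrix before it is passed to the assignment solver. The threshold of 1.8 is the value reported above. Reading "far from every person in our existing list" as "the smallest cost in the column exceeds the threshold" is an interpretation, and the function name, the matrix layout (rows are tracked persons, columns are current detections), and the cost values are hypothetical.

```python
NEW_PERSON_THRESHOLD = 1.8  # threshold reported in the text


def split_new_detections(cost, threshold=NEW_PERSON_THRESHOLD):
    """Separate detections that should start a new track.

    cost[i][j] is the cost between tracked person i and detection j; the
    matrix must have at least one row. Returns (kept, new, reduced), where
    reduced contains only the kept columns.
    """
    num_detections = len(cost[0])
    kept, new = [], []
    for j in range(num_detections):
        best = min(row[j] for row in cost)
        if best > threshold:
            new.append(j)  # far from every tracked person
        else:
            kept.append(j)
    reduced = [[row[j] for j in kept] for row in cost]
    return kept, new, reduced


# Hypothetical costs: two tracked persons, three detections in the current frame.
cost = [
    [0.4, 2.5, 1.9],
    [2.2, 0.6, 2.4],
]
kept, new, reduced = split_new_detections(cost)
print(kept, new, reduced)
```

Only the reduced matrix would be handed to the Hungarian algorithm; each index in `new` would start a track of its own.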
After running the process for all frames in the scene, we apply two further adjustments. The
first is to delete all tracks whose length is less than five. This step is important for
deleting tracks caused by false positives (overdetection, as shown in Fig. 1). We chose five
as the threshold because we noticed that, in the MOT16 training set, RetinaNet overdetections
do not persist over many consecutive frames.
The second adjustment is to apply interpolation to the resulting tracks. Some objects may be
missed in certain frames but detected before and after. We add the bounding boxes of those
missed objects to decrease the number of false negatives.
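The two adjustments can be sketched as below. The minimum track length of five is the value reported above. The text does not say how the missing boxes are interpolated, so linear interpolation between the nearest detected frames is an assumption, as are the data layout (a track is a dictionary from frame number to an (x1, y1, x2, y2) box), the reading of track length as the number of detections, and the function names.

```python
MIN_TRACK_LENGTH = 5  # threshold reported in the text


def drop_short_tracks(tracks, min_length=MIN_TRACK_LENGTH):
    """Keep only tracks with at least min_length detections."""
    return {tid: boxes for tid, boxes in tracks.items() if len(boxes) >= min_length}


def interpolate_track(boxes):
    """Fill frame gaps in one track. ASSUMPTION: linear interpolation."""
    frames = sorted(boxes)
    filled = dict(boxes)
    for start, end in zip(frames, frames[1:]):
        gap = end - start
        for frame in range(start + 1, end):
            w = (frame - start) / gap
            filled[frame] = tuple(
                (1 - w) * a + w * b for a, b in zip(boxes[start], boxes[end])
            )
    return filled


# Hypothetical tracks: track 1 misses frame 3, track 2 is a short false positive.
tracks = {
    1: {
        1: (0.0, 0.0, 10.0, 20.0),
        2: (2.0, 1.0, 12.0, 21.0),
        4: (6.0, 3.0, 16.0, 23.0),
        5: (8.0, 4.0, 18.0, 24.0),
        6: (10.0, 5.0, 20.0, 25.0),
    },
    2: {7: (50.0, 50.0, 60.0, 70.0), 8: (51.0, 50.0, 61.0, 70.0)},
}
kept_tracks = drop_short_tracks(tracks)
final_tracks = {tid: interpolate_track(boxes) for tid, boxes in kept_tracks.items()}
print(sorted(final_tracks[1]))
```

Filtering before interpolating matters in this sketch: interpolated boxes would otherwise lengthen a short false-positive track past the threshold.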
4. RESULT AND DISCUSSION
We used the MOT16 dataset as our benchmark. We ran the experiments on a computer with an
Intel Core i7-7700 CPU @3.69 GHz and 8 GB RAM. Since we only use RetinaNet for testing and
not for training or fine-tuning, we did not use a GPU. SNN training is also done in CPU
mode. The results for the MOT16 test scenes are shown in Table 2. The overall recall is 49.4,
which means that less than half of all persons are detected.
Table 2 Tracking Result of Proposed Method
Scene Recall MT FP FN IDs MOTA
01 44.0 26.1% 140 3579 38 41.3
03 51.1 16.9% 5271 51151 495 45.6
06 69.2 30.8% 1644 3554 281 52.5
07 49.4 14.8% 731 8267 138 44.0
08 39.6 20.6% 937 10108 134 33.2
12 55.3 23.2% 381 3707 47 50.2
14 36.1 6.1% 717 11805 359 30.3
Overall 49.4 19.8% 9821 92171 1492 43.2
Compared to other methods in the MOT16 benchmark, we have not beaten the state-of-the-art
tracker HCC [12], which has an overall MOTA of 49.3 in the public detector category. This is
because we did not do any training or fine-tuning of the detector, due to the limitations of
our computing environment. We use an existing, publicly available model for object detection
and methods that do not require high computational cost.
Table 3 Comparison with Other Methods
Method MT FP FN IDs MOTA
HISP_T [13] 7.8% 6412 107918 2594 35.9
AM_ADM [14] 7.1% 8503 99891 789 40.1
Ours 19.8% 9821 92171 1492 43.2
LMP [6] 18.2% 6654 86245 481 48.8
HCC [12] 17.8% 5333 86795 391 49.3
Our low MOTA is caused by the high numbers of FP, FN, and IDs. Since scene 3 contributes the
most errors, and the corresponding training scene for that scene is scene 4, we show
examples of FP and FN in scene 4 in Fig. 5. The blue boxes are the output of RetinaNet. The
red boxes surrounding blue boxes mark the detections that are considered FP. The green boxes
are the FNs, i.e. persons who are supposed to be there but
were not detected by RetinaNet. As we can see, RetinaNet fails to detect persons who are
occluded. Persons at the bottom of the frame, such as A and B, are counted as both FP and FN
because the ground truth bounding box covers the whole body, while the detected bounding box
covers only what is visible in the image, so the heights of the detected and ground truth
boxes differ. This also causes the FP for person C: RetinaNet only sees the upper body, while
the ground truth covers the whole body. The persons in D are not detected because only a
small part of their legs is visible. In E, two persons are detected as one because of severe
occlusion.
The high number of ID switches also remains a problem. We investigated and found that one of
the reasons is the high number of FN and FP. As we can see in Fig. 6, one of the detections
in Frame 15 is considered FP (red box) because its IoU with the ground truth does not reach
the threshold. Detection (blue) 150019 is matched to the lower person in Frame 15, and
detection 151531 is considered FP. In this case, the evaluation program finds only one
detected person in Frame 15 and associates it with the lower person. However, the two persons
in Frame 16 are detected correctly, since neither is considered FN or FP. In Frame 16,
detection 150019 is matched to the upper person. This mistake is propagated into the tracking
evaluation, as shown in Fig. 7. Our method actually tracked the upper (pale yellow track) and
lower (pale green track) persons correctly, but the evaluation program counted an ID switch.
As we can see, the track is correct when it is extended to frame 28. Therefore, even though
the number of ID switches is high for our method, an ID switch does not always represent a
mistake.
Figure 5 Examples of FP and FN in scene 4 frame 50
Figure 6 Example of detection (blue), ground truth (yellow), FP (red), FN (green) bounding boxes
Figure 7 Example of a case that is counted as an ID switch (cyan box) although it is not
As mentioned in the introduction, there are five problems that object detectors commonly
suffer from. We discuss below whether our tracking method has overcome each of these
problems.
Overdetection
Overdetection is the situation in which two persons are detected as three or more.
Fortunately, RetinaNet does not overdetect consistently in consecutive frames. For example,
in scene 3, the overdetection of two persons in Fig. 8 occurs in frames 7, 8, 9, and 33. We
overcome this problem by putting a threshold on track length: if a track is shorter than 5,
it is deleted.
Under detection
Under detection is when two persons are detected as one. In our approach, the detection is
assigned to whichever of the two persons it is more similar to.
Inconsistent size
Since we use the IoU information, our method is still able to track correctly even when the
size is not consistent.
Undetected in some frames
Since we keep the list of all persons detected from frame 1 to frame N and compare the
detections of frame N + 1 with that list, a person who was undetected in some frames is still
recognized after those frames are skipped. Interpolation in the last step also helps to
recover the missing detections. If we compare the tracking result with interpolation
(Table 2) and without interpolation (Table 4), we can see that interpolation increases the
overall MOTA from 41.3 to 43.2. Although it adds about 2,700 unexpected FP (from 7083 to
9821), it also decreases the FN by about 6,200 (from 98405 to 92171), as expected.
Undetected in all frames
This problem cannot be solved without improving the accuracy of the detector. RetinaNet
would need to be fine-tuned with new examples specific to the dataset. Fine-tuning the
detector (RetinaNet) is outside the scope of this paper.
Table 4 Tracking Result of Proposed Method Without Interpolation
Scene Recall MT FP FN IDs MOTA
01 40.9 17.4% 95 3779 43 38.7
03 46.9 10.8% 4193 55516 578 42.3
06 66.4 26.2% 992 3878 296 55.6
07 47.0 9.3% 462 8646 145 43.3
08 38.6 19.0% 725 10283 143 33.4
12 53.5 20.9% 259 3859 59 49.6
14 32.7 4.9% 357 12444 345 28.9
Overall 46.0 15.9% 7083 98405 1609 41.3
An example of each problem is shown in Fig. 8. The white boxes represent the detections that
are tracked. The blue box shows a detection that is not included in the final track because
it was suspected to be FP (track length less than 5). The orange boxes are the interpolated
boxes. RetinaNet originally did not detect these bounding boxes; they were obtained by
interpolating bounding boxes in the last step.
The drawback of our method is its strong dependency on the chosen threshold, which is 1.8 in
our experiment. If the sum of the bounding box center distance, the IoU term, and the feature
similarity is greater than 1.8, we consider the detection a new person that starts a new
track. The moving speed of the person or the camera greatly affects the bounding box center
distance, and thus the threshold should be adjusted accordingly. This will be the focus of
our future research.
Figure 8 Example of tracking result
5. CONCLUSION
The publicly available RetinaNet model, without fine-tuning, is able to detect persons in the
MOT16 dataset with about 46% recall. This means that there are still problems with the
detection results. Our proposed method shows that, without a high-performance computing
environment, we are able to solve some of the detection problems for tracking. Even though
the MOTA of our method has not beaten the state of the art on the MOT16 benchmark, we hope
that our result can serve as a baseline for other methods in similar circumstances, i.e.
making use of convolutional features from deep learning object detectors but working in a
non-GPU computing environment.
REFERENCES
[1] X. Li, K. Wang, W. Wang, and Y. Li, “A multiple object tracking method using Kalman
filter,” 2010 IEEE Int. Conf. Inf. Autom. ICIA 2010, vol. 1, no. 1, pp. 1862–1866, 2010.
[2] S. Shantaiya, K. Verma, and K. Mehta, “Multiple Object Tracking using Kalman Filter
and Optical Flow,” Eur. J. Adv. Eng. Technol., vol. 2, no. 2, pp. 34–39, 2015.
[3] A. Kulkarni and E. Rani, “KALMAN Filter Based Multiple Object Tracking System,” Int.
J. Electron. Commun. Instrum. Eng. Res. Dev., vol. 8, no. 2, pp. 1–6, 2018.
[4] L. Leal-Taixé, C. Canton-Ferrer, and K. Schindler, “Learning by tracking: Siamese CNN
for robust target association,” in CVPR, 2016, pp. 33–40.
[5] K. Zhang, Q. Liu, Y. Wu, and M. Yang, “Robust Visual Tracking via Convolutional
Networks Without Training,” IEEE Trans. Image Process., vol. 25, no. 4, pp. 1779–
1792, 2016.
[6] S. Tang, M. Andriluka, B. Andres, and B. Schiele, “Multiple People Tracking by Lifted
Multicut and Person Re-identification,” in CVPR, 2017.
[7] A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler, “MOT16: A Benchmark for
Multi-Object Tracking,” pp. 1–12, 2016.
[8] T. Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal Loss for Dense Object
Detection,” Proc. IEEE Int. Conf. Comput. Vis., vol. 2017-October, pp. 2999–3007, 2017.
[9] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in
CVPR, 2016, pp. 770–778.
[10] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards Real-Time Object
Detection with Region Proposal Networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol.
39, no. 6, pp. 1–14, 2017.
[11] R. Hadsell, S. Chopra, and Y. LeCun, “Dimensionality Reduction by Learning an
Invariant Mapping,” in CVPR, 2006.
[12] L. Ma and S. Tang, “Customized Multi-Person Tracker,” in ACCV, 2018, pp. 1–16.
[13] N. L. Baisa, “Online Multi-target Visual Tracking using a HISP Filter,” in International
Conference on Computer Vision Theory and Applications, 2018, no. March, pp. 429–438.
[14] S. H. Lee, M. Y. Kim, and S. H. Bae, “Learning discriminative appearance models for
online multi-object tracking with appearance discriminability measures,” IEEE Access,
vol. 6, pp. 67316–67328, 2018.