This paper presents a real-time, deep-learning-based traffic volume survey method that counts and classifies vehicles passing through sections of high-traffic-volume urban arterial roads. The method is evaluated at different times of day (morning and afternoon), at different video resolutions and frame rates (10 fps and 25 fps) and with different clip lengths, with the total number of vehicles at each location varying from around 5,000 to 10,000 vehicles per hour.
This paper is arranged as follows: Section II describes related work on state-of-the-art algorithms for traffic volume counting based on image processing and deep learning approaches. Section III discusses the proposed traffic volume count method in detail, followed by experimental results and discussion in Section IV. Finally, conclusions are drawn in Section V.
II. RELATED WORKS
In recent years, Intelligent Traffic Systems (ITS) have mainly been designed using two major techniques: image processing and deep learning. Systems based on image processing have implemented foreground-background subtraction, image segmentation, feature extraction, feature tracking, blob classification, etc. Daigavane and Bajaj [9] presented a method based on background subtraction and image segmentation with morphological transformation for tracking and counting vehicles on highways. This system combines simple domain knowledge about object classes with time-domain statistical measures to identify target objects in the presence of partial occlusions and ambiguous poses. Xiang et al. [10] presented a vehicle detection system for an unmanned aerial platform using a pixel-level video foreground detector and an online-learning tracker; the system achieved satisfactory results in brightly lit scenes with uncongested traffic.
Gupte et al. [12] demonstrated vision-based vehicle detection and classification using background subtraction and the size of the detected region. Ruimin Ke [12] proposed a vehicle speed detection system that extracts feature points from aerial image frames and performs interest-point tracking using the Kanade–Lucas optical flow algorithm. Ma and Grimson [13] presented a vehicle classification system using repeatable and discriminative features from edge points and modified SIFT descriptors, and achieved satisfactory performance on cars-versus-minivans and sedans-versus-taxis classification tasks. Memon et al. [14] presented a traffic intelligence system that performs vehicle detection using Gaussian Mixture Model (GMM) background subtraction and vehicle classification using Bag of Features (BoF) and a Support Vector Machine (SVM) with contour area information. These pixel-oriented image processing approaches have the advantage that each procedure in the algorithm can be crafted and designed in a justifiable manner. However, they may struggle in complex and rapidly changing environments such as night or rainy scenes, because such scenes require a high level of mathematical computation, which can lead to poor detection accuracy. In addition, image segmentation generally faces great challenges in congested traffic due to occlusion, which also affects the end result [14].
Therefore, detecting and counting vehicles using deep learning has become a leading trend [11, 15-17]. With training and testing procedures similar to those of other machine learning approaches, deep learning collects a large amount of data to train multiple layers that progressively extract higher-level information or features. Well-known object detectors such as Fast R-CNN [18], Faster R-CNN [19], SSD [20] and YOLOv3 [21] can detect multiple objects and are widely adopted for the vehicle detection stage in ITS.
Biswas et al. [11] presented a car counting system built on the OverFeat framework, which combines a Convolutional Neural Network (CNN) with a machine learning classifier (such as a Support Vector Machine (SVM) or logistic regression): the CNN extracts features and the classifier decides whether a vehicle is present in the ROI. Gomaa et al. [15] presented an algorithm that combines vehicle detection using CNN-based background subtraction with vehicle counting using a KLT tracker and K-means clustering in complex traffic scenes. They achieved good results on uncongested highway traffic video under various dynamic lighting changes. Dai et al. [16] proposed a video-based vehicle counting framework with a three-component process of YOLOv3-based object detection [21], multi-object tracking based on a matching algorithm, and trajectory counting to obtain traffic flow information. As discussed in that work, camera angle plays an important role: the algorithm performs well in less congested environments (as shown in the T-junction results), but when a vehicle occupies a large proportion of the image and occlusion is severe, the detection result is significantly affected, as shown in the straight-road results. Instead of using a convolutional neural network approach, Zhang et al. [22] developed deep spatio-temporal neural networks to sequentially count vehicles and measure density from low-quality videos captured by city cameras. Similar to [16], Song et al. [17] demonstrated a vehicle detection and counting system that uses YOLOv3 [21] to detect vehicles and then applies ORB feature matching to identify vehicle trajectories. Their work showed improved detection of small vehicle objects, but also pointed out that small vehicles blocked by large vehicles and multiple vehicles moving in parallel affected the tracking result. In general, these deep-learning-based approaches show high accuracy in detecting vehicles in low-illumination (night) environments and under partial occlusion. However, because deep learning is sensitive to pixel-level changes, pixel distortions such as shadows, vehicle lights and even raindrops can easily degrade detection performance and produce false results. In addition, severe occlusion remains difficult for deep learning approaches to handle.
III. DEEP-LEARNING BASED TRAFFIC VOLUME SURVEY
METHOD
The proposed algorithm is illustrated in Fig. 1. First, vehicles are localized and classified in the input image frame. All detected vehicles are then filtered based on a minimum confidence level, a minimum object size and their appearance within the specified region-of-interest (ROI). Only detected vehicles that satisfy these conditions proceed to the next step, vehicle tracking. The class of each tracked vehicle is accumulated in every frame within the tracking period and finally smoothed before the vehicle is added to the final counter.
Authorized licensed use limited to: University of Exeter. Downloaded on June 09,2020 at 10:38:04 UTC from IEEE Xplore. Restrictions apply.
Fig. 1. Algorithm flow for proposed traffic volume count.
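The filtering stage of this pipeline can be illustrated with a minimal sketch. The thresholds `min_conf` and `min_area`, the rectangular ROI and the `Detection` type are assumptions made for illustration only; the paper does not state the values or the ROI shape it uses.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str         # class label reported by the detector
    confidence: float  # detector confidence in [0, 1]
    x: float           # top-left corner of the box, pixels
    y: float
    w: float           # box width and height, pixels
    h: float

def in_roi(det, roi):
    """True if the box centre lies inside a rectangular ROI (x0, y0, x1, y1)."""
    cx, cy = det.x + det.w / 2, det.y + det.h / 2
    x0, y0, x1, y1 = roi
    return x0 <= cx <= x1 and y0 <= cy <= y1

def filter_detections(dets, roi, min_conf=0.5, min_area=400.0):
    """Keep detections that pass the confidence, size and ROI checks.

    min_conf and min_area are hypothetical values, not taken from the paper.
    """
    return [d for d in dets
            if d.confidence >= min_conf
            and d.w * d.h >= min_area
            and in_roi(d, roi)]

dets = [
    Detection("car", 0.91, 100, 120, 60, 40),    # passes all three checks
    Detection("car", 0.30, 110, 130, 60, 40),    # confidence too low
    Detection("truck", 0.88, 105, 125, 10, 10),  # box too small
    Detection("bus", 0.95, 900, 50, 80, 60),     # centre outside the ROI
]
kept = filter_detections(dets, roi=(0, 100, 640, 352))
print([d.label for d in kept])
```

Only the first detection survives, since each of the others fails exactly one of the three conditions.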
A. Vehicle Localization and Classification
As discussed in Section II, YOLOv3 has shown excellent performance in detecting objects in low-light environments and under mild to medium occlusion. Thus, in this work, YOLOv3 is applied to localize objects in the image and classify them. The YOLO network consists of 24 convolutional neural network (CNN) layers and 2 fully connected (FC) layers. Weights for the first 20 layers are copied from the darknet53 model, which has been trained on the ImageNet classification dataset. Another 4 CNN layers and 2 FC layers are then appended to the model, which is further trained on the COCO detection dataset. The outputs of the YOLO model are the object location in the image and the object class label (one of 80 possible object labels based on COCO). We are interested in only 6 object class labels, which are mapped to our 4 vehicle classes of interest as in Table I. Sample vehicle detection output from YOLO is shown in blue boxes in Fig. 2.
TABLE I. MAPPING OF COCO CLASS LABELS TO OUR VEHICLE CLASS LABELS
COCO class label    Vehicle class label
Person              Motorcycle
Bicycle             Motorcycle
Motorcycle          Motorcycle
Car                 Car
Truck               Truck
Bus                 Bus
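The mapping in Table I amounts to a lookup table. The sketch below is one possible encoding; the lowercase label strings and the function name `to_vehicle_class` are assumptions for illustration, and any COCO label outside the six of interest is simply discarded.

```python
# Table I as a dictionary: six COCO labels mapped to four vehicle classes.
COCO_TO_VEHICLE = {
    "person": "motorcycle",
    "bicycle": "motorcycle",
    "motorcycle": "motorcycle",
    "car": "car",
    "truck": "truck",
    "bus": "bus",
}

def to_vehicle_class(coco_label):
    """Return the vehicle class for a COCO label, or None if it is not of interest."""
    return COCO_TO_VEHICLE.get(coco_label.lower())

print(to_vehicle_class("Person"))  # motorcycle
print(to_vehicle_class("dog"))     # None
```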
Fig. 2. Sample vehicle localization and classification output using YOLO;
blue box is the detected object with class label and confidence value
displayed on top of the object box. White area is the region-of-interest.
B. Multiple Vehicle Tracking
All detected vehicles within the region of interest (ROI) are tracked to ensure that they are counted only once. The aim of the tracker is to keep a consistent track label for each vehicle over successive frames. Single Kalman filter models (constant velocity) are used as the base tracker for this multiple-object tracking, as most vehicles move at constant velocity over the short stretch of road under survey. The flow of the tracking algorithm is illustrated in Fig. 3.
Fig. 3. Algorithm flow of proposed multiple vehicle tracking.
Referring to Fig. 3, the prediction routine is first run on all previously created trackers. Next, a gating region is constructed around the predicted location of each tracker. In this work, a rectangular gating region is used, whose width and height depend on the tracker’s predicted width and height.
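The prediction and gating steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the track layout (`cx`, `cy`, `vx`, `vy`, `w`, `h`) and the gate scale factor of 1.5 are hypothetical, and only the mean of a constant-velocity prediction is shown (a full Kalman filter would also propagate the state covariance).

```python
from dataclasses import dataclass

@dataclass
class Track:
    cx: float  # box centre, pixels
    cy: float
    vx: float  # velocity, pixels per frame
    vy: float
    w: float   # box width and height, pixels
    h: float

def predict(t, dt=1.0):
    """Constant-velocity prediction: the centre advances, velocity and size are kept."""
    return Track(t.cx + t.vx * dt, t.cy + t.vy * dt, t.vx, t.vy, t.w, t.h)

def gating_region(t, scale=1.5):
    """Rectangle (x0, y0, x1, y1) centred on the prediction and sized from the box.

    The scale factor is an assumed value chosen for illustration.
    """
    half_w, half_h = scale * t.w / 2, scale * t.h / 2
    return (t.cx - half_w, t.cy - half_h, t.cx + half_w, t.cy + half_h)

def in_gate(t, obs_centre, scale=1.5):
    """True if an observation centre falls inside the tracker's gating region."""
    x0, y0, x1, y1 = gating_region(t, scale)
    ox, oy = obs_centre
    return x0 <= ox <= x1 and y0 <= oy <= y1

track = Track(cx=100, cy=200, vx=4, vy=-2, w=40, h=20)
pred = predict(track)
print(pred.cx, pred.cy)           # 104.0 198.0
print(gating_region(pred))        # (74.0, 183.0, 134.0, 213.0)
print(in_gate(pred, (120, 205)))  # True
print(in_gate(pred, (140, 205)))  # False
```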
In the observation-tracker association step, all observations (the filtered vehicles in the current frame) are compared with all trackers to find the best match. Observations that fall within a tracker’s gating region are considered potential associations for that tracker. The final data association problem is then solved using global nearest neighbor, under the assumption that each tracker matches only one observation. In this step, observation-observation association is also performed to check for fragmented observations, in particular when a person on a motorbike is detected as a separate object from the motorbike itself. In this case, both objects need to be grouped together as a single vehicle.
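Global nearest neighbor differs from a per-tracker nearest-neighbor search in that it minimizes the total association cost over all trackers at once. The sketch below illustrates the idea with an exhaustive search, which is practical only for a handful of objects; real systems typically use an assignment solver such as the Hungarian algorithm. The Euclidean cost, the circular gate and the unmatched penalty equal to the gate size are assumptions for illustration.

```python
import itertools
import math

def gnn_associate(tracks, observations, gate=50.0):
    """Globally optimal one-to-one matching of tracks to observations.

    tracks, observations: lists of (x, y) centres.
    Returns {track_index: observation_index} for matched tracks only.
    """
    n_t = len(tracks)
    # Each track takes either a real observation or a None slot (unmatched).
    slots = list(range(len(observations))) + [None] * n_t
    best, best_cost = {}, math.inf
    for perm in set(itertools.permutations(slots, n_t)):
        cost, feasible = 0.0, True
        for t_idx, o_idx in enumerate(perm):
            if o_idx is None:
                cost += gate   # assumed penalty for leaving a track unmatched
                continue
            d = math.dist(tracks[t_idx], observations[o_idx])
            if d > gate:       # observation lies outside the gate
                feasible = False
                break
            cost += d
        if feasible and cost < best_cost:
            best_cost = cost
            best = {t: o for t, o in enumerate(perm) if o is not None}
    return best

tracks = [(0, 0), (3, 0)]
observations = [(2, 0), (5, 0)]
print(gnn_associate(tracks, observations, gate=4.0))  # {0: 0, 1: 1}
```

In this example both trackers are individually closest to the first observation, so a greedy search would create a conflict; the global solution assigns one observation to each tracker.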
In track smoothing, the state of each tracker is updated based on the current observation-tracker association results. The tracker state vector is represented as
(1)
Next, in the track management step, a tracker whose observation has been missing for more than a certain number of frames is deleted. New trackers are also initialized here for observations that are not associated with any existing tracker. Sample output from the vehicle tracking step is shown in Fig. 4 (frames #219, #246 and #276). Green boxes are the detected vehicles, while blue boxes are the corresponding tracker boxes. Tracked vehicles that have been counted are colored red. The tracker label is printed on top of each tracked vehicle box. Notice that the tracker is able to track vehicles in cluttered road conditions.
Fig. 4. Sample output from the multiple vehicle tracking step for traffic volume count. Green, blue and red boxes are the detected, tracked and counted vehicles, respectively. The tracker’s label is displayed on top of each tracker box.
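The track management rules (delete stale trackers, create trackers for unassociated observations) can be sketched as below. The threshold `max_missed=5`, the `Track` fields and the function signature are hypothetical; the paper does not give the number of frames it uses, and the position update here is a plain overwrite standing in for the Kalman update.

```python
from dataclasses import dataclass

@dataclass
class Track:
    track_id: int
    position: tuple
    missed: int = 0  # consecutive frames without an associated observation

def manage_tracks(tracks, matches, observations, next_id, max_missed=5):
    """Update, delete and create tracks for one frame.

    matches: {track_id: observation_index} from the association step.
    Returns the surviving tracks and the next unused track id.
    """
    survivors = []
    for t in tracks:
        if t.track_id in matches:
            t.position = observations[matches[t.track_id]]
            t.missed = 0
        else:
            t.missed += 1
        if t.missed <= max_missed:   # drop tracks missing for too long
            survivors.append(t)
    matched_obs = set(matches.values())
    for i, obs in enumerate(observations):
        if i not in matched_obs:     # unassociated observation: start a new track
            survivors.append(Track(next_id, obs))
            next_id += 1
    return survivors, next_id

tracks = [Track(1, (10, 10)), Track(2, (50, 50), missed=5)]
observations = [(12, 11), (200, 80)]
tracks, next_id = manage_tracks(tracks, {1: 0}, observations, next_id=3)
print([(t.track_id, t.position) for t in tracks], next_id)
```

Here track 1 is updated, track 2 exceeds the missed-frame limit and is deleted, and the second observation starts a new track with label 3.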
C. Vehicle Class Smoothing
Each tracked vehicle may be classified into different vehicle classes in different image frames. Thus, the final vehicle class is smoothed and determined as the class with the highest frequency for the tracked vehicle. Class frequencies are updated in each frame throughout the tracking period by accumulating the confidence value of the detected class, starting from the frame in which the tracker is initialized.
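A minimal sketch of this confidence-weighted voting is given below. The class name `ClassSmoother` and the sample confidence values are invented for illustration.

```python
from collections import defaultdict

class ClassSmoother:
    """Accumulates per-class confidence over the life of one tracker."""

    def __init__(self):
        self.scores = defaultdict(float)

    def update(self, vehicle_class, confidence):
        # Called once per frame with the class assigned in that frame.
        self.scores[vehicle_class] += confidence

    def final_class(self):
        # Class with the highest accumulated confidence, or None if no updates.
        return max(self.scores, key=self.scores.get) if self.scores else None

smoother = ClassSmoother()
frames = [("car", 0.80), ("truck", 0.55), ("car", 0.75), ("truck", 0.60), ("car", 0.90)]
for vehicle_class, confidence in frames:
    smoother.update(vehicle_class, confidence)
print(smoother.final_class())  # car
```

Although the per-frame label flips between car and truck, the accumulated score for car (2.45) exceeds that for truck (1.15), so the tracker is finally counted as a car.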
D. Vehicle Counting
In this work, counting is based on a user-defined virtual ROI and virtual line-of-interest (LOI). The ROI defines the valid area for vehicle localization and tracking; vehicles detected outside the ROI are filtered out. A confirmed tracked vehicle that passes the LOI is added to the final counter according to its final smoothed class. Sample counting output is shown in Fig. 4, where a red box indicates that the tracked vehicle has been counted. Once a vehicle has been counted, its tracker’s flag is set so that the same vehicle is not counted more than once.
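The LOI test and the counted flag can be sketched as follows. As simplifying assumptions, the LOI is treated as an infinite line through two user-defined points, the crossing is detected from the sign change of a cross product between consecutive track positions, and the `CountedTrack` type is hypothetical.

```python
from collections import Counter

def side_of_line(p, a, b):
    """Return -1, 0 or +1 depending on which side of line a->b the point p lies."""
    cross = (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
    return (cross > 0) - (cross < 0)

class CountedTrack:
    def __init__(self, vehicle_class, position):
        self.vehicle_class = vehicle_class  # final smoothed class
        self.position = position
        self.counted = False                # set once the vehicle has been counted

def update_and_count(track, new_position, loi, counter):
    """Move the track and count it the first time it leaves its side of the LOI."""
    a, b = loi
    before = side_of_line(track.position, a, b)
    after = side_of_line(new_position, a, b)
    track.position = new_position
    if not track.counted and before != 0 and after != before:
        counter[track.vehicle_class] += 1
        track.counted = True

loi = ((0, 300), (640, 300))
counter = Counter()
car = CountedTrack("car", (320, 280))
for position in [(320, 295), (320, 310), (320, 290)]:
    update_and_count(car, position, loi, counter)
print(dict(counter))  # {'car': 1}
```

The vehicle crosses the line twice in this example (forward, then back), but the flag ensures it contributes to the counter only once.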
IV. RESULTS AND DISCUSSIONS
A. Experimental Setup
To evaluate the performance of the proposed method, videos were recorded from cameras temporarily installed at the roadside of urban arterials. Four different views with multiple lanes were selected, as shown in Fig. 5. The properties of the test videos are listed in Table II, covering different video clip durations, resolutions, evaluation time frames and frame rates. Each video was manually annotated by an expert in computer vision. In each manual annotation pass, only one class of vehicle was counted to minimize annotation error. The total number of vehicles in the camera views varies from about 9,000 to 20,000 for 2 hours of test data. In Table II, the estimated AADT is calculated from the traffic trends of the selected camera views, as shown in Fig. 6. The trend lines in Fig. 6 are obtained from the manual annotation (ground truth) of each location over 16 hours. The AADT is estimated by extrapolating the minimum volume per 15-min interval to 16 hours (assuming that traffic is active only within this 16-hour period). The red boxes in Fig. 6 indicate the video time frames evaluated in this paper.
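The extrapolation described above is simple arithmetic: a 15-min volume is scaled by 4 intervals per hour and 16 active hours. The sketch below shows the calculation; the input volumes of 1,000 and 800 vehicles per 15 min are illustrative values chosen because they reproduce the 64,000 and 51,200 bounds listed for view #2 and view #4, and are not figures reported in the paper.

```python
def estimate_aadt(min_volume_per_15min, active_hours=16):
    """Lower-bound AADT from the minimum 15-min volume over the active hours."""
    intervals_per_hour = 4
    return min_volume_per_15min * intervals_per_hour * active_hours

print(estimate_aadt(1000))  # 64000
print(estimate_aadt(800))   # 51200
```

Because the minimum 15-min volume is used, the result is a conservative estimate, which is consistent with the AADT values being quoted as lower bounds.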
Fig. 5. Camera views of four selected urban arterial road segments for
testing.
TABLE II. TESTING VIDEOS PROPERTIES
Camera view                           View #1      View #2      View #3      View #4
Number of clips / duration per clip   24 / 5 min   8 / 15 min   8 / 15 min   8 / 15 min
Video resolution                      640x352      704x576      704x576      704x576
Frame rate                            10-11 fps    25 fps       25 fps       25 fps
Number of lanes                       3+1          3            5            4
Evaluation time frame (morning)       8-9 a.m.     8-9 a.m.     8-9 a.m.     8-9 a.m.
Evaluation time frame (afternoon)     2-3 p.m.     1-2 p.m.     5-6 p.m.     5-6 p.m.
Total number of vehicles (2 hours)    9059         9305         20543        10194
Estimated AADT                        >50,000      >64,000      >128,000     >51,200
Fig. 6. Traffic trend for arterial roads of view #2, view #3 and view #4 over 16 hours. Red boxes indicate the periods of evaluation in this paper.
B. Performance Measures
Counting results are evaluated only after the completion of each video clip, so over-counts and under-counts within a clip are expected to cancel each other, resulting in higher counting accuracy. Two video clip durations are tested: 5 min and 15 min. Accuracy is calculated by first finding the error between the system output and the ground truth as in (4). The accuracy is then the percentage of (1 - error) as in (3).
Accuracy of counting,
$A_c = (1 - E_c) \times 100\%$ (3)
where the error is
$E_c = |GT - actual| / GT$ (4)
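As a worked example of (3) and (4), the short sketch below recomputes the first entry of Table III (ground truth 1409, system count 1540). The function names are chosen for illustration.

```python
def counting_error(gt, actual):
    """Relative counting error, as in (4)."""
    return abs(gt - actual) / gt

def counting_accuracy(gt, actual):
    """Counting accuracy in percent, as in (3)."""
    return (1 - counting_error(gt, actual)) * 100

print(round(counting_error(1409, 1540) * 100, 2))  # 9.3
print(round(counting_accuracy(1409, 1540), 1))     # 90.7
```

Because of the absolute value in (4), an over-count and an under-count of the same size give the same accuracy.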
C. Results
a) Traffic Volume Count
Counting accuracy for the twelve 5-min clips recorded in each of the morning and afternoon periods of view #1 is shown in Fig. 7. The blue and red lines indicate the ground truth and the actual count from the proposed system, respectively, while the yellow bars show the calculated accuracy. Overall, traffic volume is much higher in the morning between 8 and 9 a.m. than in the afternoon between 1 and 2 p.m., which explains the higher average accuracy in the afternoon compared with the morning.
Fig. 7. Ground-truth (GT), actual counting from proposed system and
accuracy results for view#1.
Accuracy results for the 15-min clips of view #2, view #3 and view #4 are presented in Table III, Table IV and Table V, respectively. A total of 8 video clips from each camera view are evaluated. Table III shows that the counting accuracy for view #2 is consistently above 90% for all time frames, with the highest accuracy recorded at 96.2%.
Despite having the highest traffic volume, view #3 shows a higher average accuracy than the other views, with an average of 97.7% as in Table IV. Referring to Table V, the average accuracy of view #4 is 94.23%. However, the minimum accuracy of 88.8% is recorded for the clip from 8.45 to 9.00 a.m. The lower accuracy may be due to some vehicles stopping at the side of the road before moving again and crossing the virtual counting line. This move-stop-move scenario may cause problems for the tracker, affecting the overall accuracy.
TABLE III. TRAFFIC VOLUME COUNT RESULTS OF VIEW#2
Evaluation period   Morning                       Afternoon
                    0800   0815   0830   0845     1300   1315   1330   1345
GT                  1409   1527   1419   1344     922    880    906    898
Actual              1540   1626   1508   1395     980    944    955    938
Err %               9.30   6.4    6.27   3.80     6.29   7.27   5.41   4.45
Acc %               90.7   93.5   93.7   96.2     93.7   92.7   94.6   95.6
Average Acc         93.84%
TABLE IV. TRAFFIC VOLUME COUNT RESULTS OF VIEW#3
Evaluation period   Morning                       Afternoon
                    0800   0815   0830   0845     1700   1715   1730   1745
GT                  2842   2751   2750   2540     2224   2481   2548   2407
Actual              2849   2776   2714   2574     2345   2412   2480   2447
Err %               0.25   0.91   1.31   1.34     5.44   2.78   2.67   1.66
Acc %               99.8   99.1   98.7   98.7     94.6   97.2   97.3   98.3
Average Acc         97.7%
TABLE V. TRAFFIC VOLUME COUNT RESULTS OF VIEW#4
b) Processing Time
The proposed system is implemented in C++, using TensorRT as the inference engine. System performance was tested on a machine with an Nvidia GeForce RTX 2070 GPU and an Intel® Core™ i5-8400 CPU @ 2.80 GHz. The processing time for each of the main steps is shown in Table VI. The results show that the proposed system can run in real time, with an average processing time of 37.27 ms per frame, equivalent to 26.8 frames per second (fps).
TABLE VI. PROCESSING TIME RESULTS
Localization & classification   Tracking   Counting    Overall time   Overall fps
27.37 ms                        9.90 ms    0.0009 ms   37.27 ms       26.8
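The overall figures in Table VI follow directly from the per-stage times. The sketch below reproduces the arithmetic; the dictionary keys are labels chosen for illustration, and comparing the result with the 25 fps video frame rate from Table II is one reasonable reading of "real time".

```python
# Per-stage processing times from Table VI, in milliseconds per frame.
stage_ms = {
    "localization_classification": 27.37,
    "tracking": 9.90,
    "counting": 0.0009,
}

total_ms = sum(stage_ms.values())
fps = 1000.0 / total_ms  # frames processed per second

print(f"{total_ms:.2f} ms per frame -> {fps:.1f} fps")
print(fps >= 25)  # faster than the 25 fps source video
```

Detection dominates the budget at roughly 73% of the per-frame time, so any speed-up would have to come mainly from the localization and classification stage.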
V. CONCLUSION
In this paper, we have proposed a deep-learning-based traffic volume count method for high-traffic urban arterial roads. The system has been extensively tested with videos captured from four selected urban arterials with high traffic volume (estimated AADT ranging from more than 50,000 to more than 100,000). The average vehicle counting accuracies for the four camera views are 97.68%, 93.84%, 97.7% and 94.23%, respectively. The system is also able to run in real time, with an average processing time of 37.27 ms per frame. In conclusion, the proposed system is suitable for use in traffic volume surveys.
REFERENCES
[1] Zawad Khalil, “Traffic volume study,”
https://www.slideshare.net/zawadkhalil/traffic-volume-study-
31672066
[2] NCHRP Web-Only Document 205: Methods and Technologies for
Pedestrian and Bicycle Volume Data Collection.
DOI: https://doi.org/10.17226/22223
[3] K. O. Kusimo and F. O. Okafo, “Comparative analysis of mechanical and manual modes of traffic survey for traffic load determination,” Nigerian Journal of Technology (NIJOTECH), vol. 35, no. 2, April 2016, pp. 226–233.
[4] Pengjun Zheng and Mike McDonald, “An investigation on the manual traffic count accuracy,” Procedia - Social and Behavioral Sciences, vol. 43, 2012, pp. 226–231.
[5] Gowen and Sanderson, “Accuracy of pneumatic road tube counters,” a report prepared for the 2011 Western District Annual Meeting, Institute of Transportation Engineers, Anchorage, AK, May 2011.
[6] Road Classification and Design Standard in Malaysia, Highway &
Traffic Engineering
[7] https://www.caliper.com/mapping-software-data/aadt-traffic-count-
data.htm
[8] W. Marshall and C. McAndrews, “Does the Livability of a Residential Street Depend on the Characteristics of the Neighboring Street Network?,” Mountain-Plains Consortium report (MPC 16-309).
[9] Daigavane, P.M.; Bajaj, P.R. Real Time Vehicle Detection and
Counting Method for Unsupervised Traffic Video on Highways. Int.
J. Comput. Sci. Netw. Secur. 2010, 10, 112–117.
[10] Xiang, Xuezhi & Zhai, Mingliang & Lv, Ning & El Saddik,
Abdulmotaleb. (2018). Vehicle Counting Based on Vehicle Detection
and Tracking from Aerial Videos. Sensors. 18. 2560.
10.3390/s18082560.
[11] Biswas, D.; Su, H.; Wang, C.; Blankenship, J.; Stevanovic, A. An
Automatic Car Counting System Using OverFeat Framework.
Sensors 2017, 17, 1535.
[12] Gupte, S.; Masoud, O.; Martin, R.F.K.; Papanikolopoulos, N.P.
Detection and Classification of Vehicles. IEEE Trans. Intell. Transp.
Syst. 2002, 3, 37–47.
[13] Ma, X.; Grimson, W.E.L. Edge-based rich representation for vehicle
classification. In Proceedings of the 10th IEEE International
Conference on Computer Vision, Beijing, China, 17–21 October
2005; Volume 2, pp. 1185–1192.
[14] Memon, Sheeraz & Bhatti, Sania & Ali, Liaquat & Talpur, Mir &
Memon, Mohsin. (2018). A Video based Vehicle Detection, Counting
and Classification System. International Journal of Image, Graphics
and Signal Processing. 10. 34-41. 10.5815/ijigsp.2018.09.05.
[15] Gomaa, Ahmed & Abdelwahab, Moataz & Abo-Zahhad, M. &
Minematsu, Tsubasa & Taniguchi, Rin-ichiro. (2019). Robust Vehicle
Detection and Counting Algorithm Employing a Convolution Neural
Network and Optical Flow. Sensors. 19. 4588. 10.3390/s19204588.
[16] Z. Dai et al., "Video-Based Vehicle Counting Framework," in IEEE
Access, vol. 7, pp. 64460-64470, 2019.
[17] Song, H., Liang, H., Li, H. et al. Vision-based vehicle detection and
counting system using deep learning in highway scenes. Eur. Transp.
Res. Rev. 11, 51 (2019).
[18] R. Girshick, “Fast R-CNN,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1440–1448.
[19] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, Jun. 2017.
[20] W. Liu et al., “SSD: Single shot multibox detector,” in Proc. Eur. Conf. Comput. Vis., 2016, pp. 21–37.
[21] J. Redmon and A. Farhadi, “YOLOv3: An incremental improvement,” 2018, arXiv:1
[22] Zhang, Shanghang et al. “FCN-rLSTM: Deep Spatio-Temporal
Neural Networks for Vehicle Counting in City Cameras.” 2017 IEEE
International Conference on Computer Vision (ICCV) (2017): 3687-
3696.
TABLE V. (CONTINUED)
Evaluation period   Morning                       Afternoon
                    0800   0815   0830   0845     1700   1715   1730   1745
GT                  1412   1309   1236   1224     1162   1283   1282   1286
Actual              1332   1361   1328   1361     1125   1331   1319   1390
Err %               5.67   3.97   7.44   11.2     3.18   3.74   2.89   8.09
Acc %               94.3   96.0   92.6   88.8     96.8   96.2   97.1   91.9
Average Acc         94.23%