Direct Perception for Congestion Scene Detection Using TensorFlow™

Artur Filipowicz
Princeton University
229 Sherrerd Hall, Princeton, NJ 08544
T: +01 732-593-9067
Email: arturf@princeton.edu
(corresponding author)

Jeremiah Liu
New Jersey Institute of Technology
607 Oak Hall, Newark, NJ 07103
T: +01 202-615-5905
Email: jeremiahliu@live.com

Joyoung Lee, Ph.D.
Assistant Professor
New Jersey Institute of Technology
264 Tiernan Hall, Newark, NJ 07103
T: +01 973-596-2475
Email: jo.y.lee@njit.edu

Word Count: 6,641 words = 3,641 + 3,000 (8 Figures + 4 Tables)

Paper submitted for consideration for presentation at the 96th TRB Annual Meeting in January 2017 and for publication in the Transportation Research Record
1 ABSTRACT
In this research, we examine a new approach to the problem of real-time traffic congestion detection based on single image analysis. We demonstrate the use of a convolutional neural network in this domain. With this learning model and the direct perception approach, transforming an image directly to a congestion indicator, we design a system which can detect congestion independently of location, time, and weather. We further demonstrate that the use of the Fast Fourier Transform and the wavelet transform can improve the accuracy of a convolutional neural network across multiple conditions in new locations.
2 INTRODUCTION
Traffic flow information is widely used in intelligent transportation systems to detect and manage traffic congestion. Collecting traffic data such as speed, count, and occupancy is a crucial part of estimating the flow of traffic. Loop detectors, traffic radars, and surveillance cameras are used to collect such data. Due to the inflexibility and high cost of deploying loop detectors and traffic radars, video-content-understanding techniques are gaining popularity for detecting the flow of traffic. However, video data is often affected by the external environment, bad weather (e.g., rain, snow, and fog), and undesirable illumination conditions (e.g., sun glare, darkness). As a result, a challenge in video-based detection is to properly interpret video images under such conditions.

Recently, new advances in the field of machine learning have allowed for the creation of more reliable computer vision algorithms. These advances include the development of larger models, known as Deep Learning (1), and the use of graphics processing units to speed up optimization of these models (2, 3). Deep Learning led to the successful use of convolutional neural networks to achieve high accuracy on many image (2, 4) and video recognition tasks (5). Furthermore, these models have been shown to generalize to new environments and conditions (6, 7).

In this research, we attempt to address the problem of real-time traffic congestion detection. Unlike most previous approaches, we apply ideas from deep learning, computer vision, and signal processing to explore the potential of using convolutional neural networks to detect congestion. By using a direct perception (8) approach, mapping a single image directly to a congestion indicator, we design a system which can detect congestion independently of location, time, and weather. We also demonstrate the benefit of using signal processing to pre-process images. To the best of our knowledge, this is the first study which looks at the performance of convolutional neural networks in this domain, and the first study which examines the applicability of a single model across multiple locations and conditions.
3 RELATED WORK
Research efforts in video processing for traffic surveillance and control date back to the mid-1970s. We find it useful to categorize these systems by the direct perception (8) and mediated perception (9) concepts discussed in (6).

The mediated perception approach involves multiple sub-components for scene classification. These components could include a vehicle counter and a vehicle speed detector. With mediated perception, the estimates of both of these detectors are combined to determine if congestion is present. Systems based on mediated perception are convenient to debug and fine-tune: we can trace errors back to individual components and attempt to improve the system component by component. However, mediated perception also adds unnecessary complexity to an already difficult task. There is no clear reason to believe that vehicle detection in images is any easier than directly perceiving congestion. Additionally, combining the outputs may introduce more parameters, making the system's performance harder to optimize. In contrast, the direct perception approach focuses on finding a transformation from a single image directly to a congestion indicator. This approach is potentially simpler, more generalizable, and computationally more efficient. The drawback is that it is more difficult to identify reasons for failure. Direct perception appears to be gaining some traction in more recent literature.
3.1 Mediated Perception
Most congestion scene classification techniques are based on mediated perception. One popular method used for congestion detection is traffic parameter extraction. Examples date back to (10), which describes the Autoscope system. The system involves two steps. The first step is to detect vehicles by defining an area of interest, placing detection lines along or across the roadway lanes, and then using a segmentor to estimate the pixel changes on the lines when a vehicle is present. The next step is to derive traffic parameters such as speed by analyzing sequential images. The average accuracy is 95% under ideal conditions. However, the false alarm rate increases significantly in certain situations, such as overlapping vehicles. Shadows also negatively affect the performance of the system. Cucchiara et al. (11) improved the algorithm presented in (10) by separating analysis techniques between day and night, achieving a vehicle tracking accuracy of 96.9% with 5.8% false negatives and 13.2% false positives during daytime, and an accuracy of 96% with 4.5% false negatives and 4.9% false positives at night.

Along a different line of research, (12) presents a way to measure traffic queue parameters by applying a motion detector based on the Fast Fourier Transform (FFT), followed by a vehicle detector based on edge detection. The method determines the length of the queue with 95% accuracy under lighting conditions that are constant between two frames. Similarly, (13) classified congestion scenes by extracting the overall crowd density and crowd speed, reporting 94.5% accuracy at a single location under daylight conditions. No evaluation at other locations or under night conditions was discussed. Background subtraction is convenient to implement; however, it is hard to extract a background model in conditions such as sudden light changes and heavy congestion.
Another approach to detecting congestion scenes was explained in (14), where Li et al. proposed a time-spatial image-based method and achieved a 93% detection rate for the congestion condition. According to the authors, the false estimations are due to low-contrast images, small vehicle blocks, and irregular lane conditions. There was no evaluation under other weather conditions such as rain. Palubinskas et al. (15) approach traffic congestion detection in sequences of optical images based on change detection, image processing, and the incorporation of a priori information such as a traffic model and road network. The accuracy of congestion detection was not reported.
3.2 Direct Perception
Many systems developed in the past depend on sequential images, edge detection, and thresholding. In most cases the objective is to determine more complicated traffic descriptors than a single indicator. Both false positive and false negative rates have to be very low to obtain a practical system, and most of this work is still carried out by human operators due to high false alarm rates. Recently, the direct perception approach has been gaining popularity due to better performance in adapting to various environments. For example, (16) utilized an efficient unsupervised feature learning method with density information encoded. Spherical k-means was employed to learn features, followed by a feature selection procedure to remove bad features. Locality-constrained linear coding (LLC) was used to map raw image patches to a new feature space, and an SVM classifier was trained for the classification. The proposed algorithm achieved 85% average accuracy. Evaluation of accuracy under different weather and light conditions was not performed. Recent work by Valipour et al. (7) is very relevant to our approach: they presented a parking stall vacancy detection algorithm based on deep convolutional neural networks and reported a misclassification rate as low as 5% under varying weather conditions.
4 METHODOLOGY
For this study, we collected 27,833 images from traffic surveillance cameras, labeled the data based on traffic, time, and weather conditions, applied four image processing transformations, and used the new TensorFlow library (17) to construct and train convolutional neural networks.
4.1 Dataset
We collected and labeled 27,833 images from surveillance cameras at 24 locations in New Jersey. The dataset contains a variety of traffic patterns and weather conditions grouped into four classes: day-clear, day-rain, night-clear, and night-rain. Figure 1 shows a sample image from each class. We divided the dataset randomly into training and testing sets by location, such that images from the same location do not appear in both sets; Tables 1 and 2 detail the exact division of examples, and a sketch of the split follows the tables. We used the training set for training and tuning, and the testing set to obtain the final accuracy.
TABLE 1 : Percent of examples of free flow and congestion in the test and training set
Condition Training Examples Test Examples
Free flow 59% 44%
Congestion 41% 56%
TABLE 2 : Number of examples of each condition in the dataset
Condition Training Examples Test Examples
Free flow Day Clear 3,098 1,100
Free flow Day Rain 3,131 1,057
Free flow Night Clear 2,960 1,200
Free flow Night Rain 2,122 599
Congestion Day Clear 2,443 1,852
Congestion Day Rain 1,395 2,287
Congestion Night Clear 2,460 562
Congestion Night Rain 1,326 241
Total 18,935 8,898
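
The location-based split can be sketched as follows. This is a minimal illustration under our own assumptions; the record layout and variable names are hypothetical and not from the original pipeline:

```python
import random

def split_by_location(records, test_fraction=0.3, seed=0):
    """Split (image_path, location_id, label) records so that no location
    contributes images to both the training and the test set."""
    locations = sorted({loc for _, loc, _ in records})
    random.Random(seed).shuffle(locations)
    n_test = max(1, int(len(locations) * test_fraction))
    test_locations = set(locations[:n_test])
    train = [r for r in records if r[1] not in test_locations]
    test = [r for r in records if r[1] in test_locations]
    return train, test
```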
FIGURE 1 : Sample images of all eight conditions in our dataset: free flow and congestion under day-clear, day-rain, night-clear, and night-rain conditions.
4.2 Image Processing
Before training, we scaled the images to 140x100 pixels and applied four different image processing transformations: grayscale, the Fast Fourier Transform (FFT), the wavelet transform (WT), and our mixture of the wavelet and Fast Fourier transforms. For this research, it is important to understand the output of these transformations at a high level; the low-level details and mathematical equations are not illustrative, and we therefore omit them.

In a grayscale image, as shown in Figure 2(a), each pixel represents the intensity of light at that location. The basic gray scaling of the images serves as a baseline, and it is necessary since most images are already black and white. With the FFT, in Figure 2(b), the image is represented as a collection of amplitudes and phases. For our study, we discard the phases and focus on the amplitudes. In this representation, amplitudes of high frequency components correspond to sharp changes in the image; when they are removed, the image becomes blurred. In Figure 2(b) the high frequency components are away from the center of the image. The wavelet transform (18, 19), at a high level, produces four representations of the image. One of the four is a compressed version of the original image, which we discard as it carries information similar to the grayscale image. The other three highlight vertical, horizontal, and diagonal edges in the image, as shown in Figure 2(c). Our WT+FFT transformation appends the WT output to the FFT output, as shown in Figure 2(d); the FFT output is also scaled such that the range (max - min) of the FFT matches the range of the WT. For a general introduction to the FFT, one can refer to (20), and for a deeper treatment to (21). In the discussion section we analyze how the properties of these transforms may be useful for congestion detection. After applying these transformations, we standardized the training set by subtracting the mean and dividing by the standard deviation of each pixel.
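
The pipeline can be sketched as below; this is our own minimal reading, not the original code. The choice of wavelet, the fftshift layout, and the zero-padding used to stack the WT output onto the FFT output are all assumptions the paper does not specify:

```python
import numpy as np
import pywt                      # PyWavelets
from PIL import Image

def grayscale(path):
    """Load an image, convert it to grayscale, and scale it to 140x100 pixels."""
    img = Image.open(path).convert("L").resize((140, 100))
    return np.asarray(img, dtype=np.float32)

def fft_amplitudes(gray):
    """FFT amplitude spectrum; the phases are discarded. fftshift places low
    frequencies at the center, so high frequencies lie away from it."""
    return np.abs(np.fft.fftshift(np.fft.fft2(gray)))

def wavelet_details(gray):
    """Single-level 2-D wavelet transform. The compressed approximation (cA)
    is discarded; the horizontal, vertical, and diagonal detail subbands are
    kept, tiled side by side (the wavelet and the layout are assumptions)."""
    cA, (cH, cV, cD) = pywt.dwt2(gray, "haar")
    return np.concatenate([cH, cV, cD], axis=1)

def wt_plus_fft(gray):
    """Append the WT output to the FFT output, scaling the FFT so that its
    range (max - min) matches the range of the WT output."""
    fft, wt = fft_amplitudes(gray), wavelet_details(gray)
    fft = fft * (np.ptp(wt) / np.ptp(fft))
    width = max(fft.shape[1], wt.shape[1])      # zero-pad to a common width
    pad = lambda a: np.pad(a, ((0, 0), (0, width - a.shape[1])))
    return np.concatenate([pad(fft), pad(wt)], axis=0)

def standardize(batch):
    """Standardize a set of transformed images by subtracting the mean and
    dividing by the standard deviation of each pixel position."""
    return (batch - batch.mean(axis=0)) / (batch.std(axis=0) + 1e-8)
```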
FIGURE 2 : Outputs of the four transformations: (a) grayscale, (b) FFT, (c) WT, (d) WT+FFT.
4.3 TensorFlow™
TensorFlow is a library for building and training gradient-based machine learning models. It provides a convenient interface for expressing machine learning algorithms by converting the described computations into a data flow graph (17). A data flow graph describes mathematical computation as a directed graph of nodes and edges. Nodes represent mathematical operations as well as data inputs and persistent variables. Edges describe the input/output relationships between nodes in the form of multidimensional data arrays. Computation can be distributed across multiple CPUs and GPUs (17).
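
A minimal illustration of the graph model (not code from this study), written against the TensorFlow 1.x-style graph API that was current at the time; under TensorFlow 2.x the same calls live under tf.compat.v1:

```python
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

# Nodes: a data input (placeholder), a persistent variable, and an operation.
x = tf.placeholder(tf.float32, shape=[None, 3], name="input")
W = tf.Variable(tf.ones([3, 2]), name="weights")
y = tf.matmul(x, W, name="output")   # edges carry multidimensional arrays

# The graph only executes inside a session, which can place operations
# across the available CPUs and GPUs.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(y, feed_dict={x: [[1.0, 2.0, 3.0]]}))
```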
4.4 Convolutional Neural Network
For our experiments we selected a convolutional neural network model. A convolutional neural network is a machine learning model which attempts to mimic the function of the human vision system. These networks tend to perform well on vision tasks because some of the parameters the network learns are arranged into filters. These filters are convolved with the input image; each filter is moved across the image and, at regular intervals, multiplied with the pixel values in that region. The effective result is that the network can match a particular pattern, represented by a filter, to all areas in an image. This allows the network to detect objects in different locations. With multiple layers of such filters, networks can detect objects even when the size of the object varies. These filters can serve as simple edge detectors or detect more complicated patterns such as facial features. Visualizations of filters can be found in (22).

We constructed a network with one convolution layer of eight 5x5 filters. This layer is followed by a max pooling layer and a normalization layer. Our model then has three fully connected layers with output sizes 389, 192, and 1. All layers use the ReLU activation function (2, 23), except for the last layer, which uses a softmax function. During training, we employ dropout, weight decay, and learning rate decay to avoid overfitting and improve results. Additionally, for grayscale images we apply single image whitening and the data augmentation techniques of randomly changing contrast and brightness.
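
One possible reading of this architecture in modern tf.keras is sketched below (the original used the 2016 graph API). The pooling size, normalization choice, optimizer, regularization rates, and augmentation ranges are assumptions; and since a softmax over a single output is degenerate, the sketch uses a sigmoid over the one congestion indicator:

```python
import tensorflow as tf

def build_model(input_shape=(100, 140, 1)):
    """One convolution layer of eight 5x5 filters, max pooling and
    normalization, then fully connected layers of sizes 389, 192, and 1."""
    inputs = tf.keras.Input(shape=input_shape)
    x = tf.keras.layers.Conv2D(8, 5, activation="relu", padding="same")(inputs)
    x = tf.keras.layers.MaxPooling2D(pool_size=2)(x)   # pool size assumed
    # Local response normalization, as in contemporary TensorFlow examples;
    # the paper does not specify which normalization layer was used.
    x = tf.keras.layers.Lambda(tf.nn.local_response_normalization)(x)
    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Dense(389, activation="relu",
                              kernel_regularizer=tf.keras.regularizers.l2(1e-4))(x)
    x = tf.keras.layers.Dropout(0.5)(x)                # dropout rate assumed
    x = tf.keras.layers.Dense(192, activation="relu",
                              kernel_regularizer=tf.keras.regularizers.l2(1e-4))(x)
    out = tf.keras.layers.Dense(1, activation="sigmoid")(x)
    return tf.keras.Model(inputs, out)

def augment_grayscale(img):
    """Grayscale-only augmentation from the text: whitening plus random
    contrast and brightness changes (parameter ranges assumed)."""
    img = tf.image.random_brightness(img, max_delta=0.2)
    img = tf.image.random_contrast(img, 0.8, 1.2)
    return tf.image.per_image_standardization(img)

model = build_model()
model.compile(
    optimizer=tf.keras.optimizers.SGD(                 # optimizer assumed
        learning_rate=tf.keras.optimizers.schedules.ExponentialDecay(
            0.01, decay_steps=1000, decay_rate=0.96)),
    loss="binary_crossentropy", metrics=["accuracy"])
```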
5 RESULTS
We trained the above model on each of the four transformations. Table 3 lists the accuracy on the test set for each transformation. While no transformation is superior in all conditions, FFT, WT, and WT+FFT outperform grayscale in most conditions. Since the accuracies for free flow and congestion in day-clear appear low for what seems to be an easy condition, we also trained a network on just those conditions. The performance of that network on the test set in those conditions is about 20% better, as summarized in Table 4.
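
Per-condition accuracies such as those in Table 3 can be computed by grouping test predictions by condition label; a small sketch with hypothetical array names:

```python
import numpy as np

def accuracy_by_condition(y_true, y_pred, conditions):
    """Mean accuracy per condition label (e.g., 'Free Flow Day Clear').
    y_true and y_pred are 0/1 arrays; conditions is a parallel string array."""
    correct = np.asarray(y_true) == np.asarray(y_pred)
    return {c: float(correct[np.asarray(conditions) == c].mean())
            for c in np.unique(conditions)}
```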
FIGURE 3 : Accuracy on the test set for each transformation: (a) grayscale, (b) FFT, (c) WT, (d) WT+FFT.
TABLE 3 : Accuracy on the test set
Condition Grayscale FFT WT WT+FFT
Free Flow Day Clear 0.637 0.970 0.650 0.650
Free Flow Day Rain 0.996 0.995 0.998 0.986
Free Flow Night Clear 0.678 0.965 1.000 0.820
Free Flow Night Rain 0.534 0.998 0.973 0.947
Congestion Day Clear 0.457 0.604 0.455 0.645
Congestion Day Rain 0.724 0.438 0.835 0.846
Congestion Night Clear 0.558 0.538 0.937 0.993
Congestion Night Rain 0.091 0.842 0.062 0.008
TABLE 4 : Accuracy for training and testing under day-clear condition
Condition Grayscale
Free Flow Day Clear 0.85
Congestion Day Clear 0.98
6 DISCUSSION
The accuracies in our results vary from very low to very high. This shows the general difficulty of the congestion recognition problem. A detection system may perform extremely well in one condition and location but not another. It is also apparent that a raw image, which is easy for a human to understand, may not be the best representation for a machine to detect congestion. Let us better understand these results in the context of what each transformation represents.

In a grayscale image, each pixel represents the intensity of light at that location. We could speculate that the amount of light in an image corresponds to the number of vehicle headlights or taillights. Since vehicles are often brighter than the pavement, their number could also correlate with the brightness of the image. However, as seen in Figure 5, under certain conditions such as rain at night, even free flow traffic can appear very bright.

In our test, a network trained on grayscale images functions better in rain and at night than during the day, suggesting that the network correlated its output with lights in the image. The network is confused by free flow day-clear images and by congestion at night in the rain. A possible explanation is that the free flow day images in the test set had a very light road surface and very few vehicles; they are bright for a different reason, yet bright enough for the network to think the image shows congestion. A similar reason might apply to the poor performance on night congestion during rain. A difficulty with this analysis is that the network uses a very complicated nonlinear decision function; thus, our idea of brightness is a qualitative observation.
FIGURE 4 : Filters learned from all conditions (a) and filters learned under the day-clear condition (b).
FIGURE 5 : Free flow and congestion conditions which both appear very bright.
To further explore this question, we trained our network on images from the training set which were taken on clear days. We then tested this network on day-clear images in the test set. The results improved dramatically from 65% and 45% to 85% and 98% accuracy on free flow and congestion respectively, as shown in Table 4. The filters of these two networks, shown in Figure 4, also appear very different. The filters for the network trained on all conditions look more like they detect sharp transitions between dark and light areas, while the filters for the network trained on just the day-clear condition look more uniform; they are mostly white or black. This suggests that for day conditions the network learned to look for open road, whereas when other conditions are present, a better solution is to look for sharp transitions such as headlights.

Since brightness is not a clear indicator of congestion, we decided to use transformations which depend more on edges and textures in an image. The idea to use other transformations comes from simulation results in (24) and the properties of the transforms.

The Fast Fourier Transform and the wavelet transform produce frequency representations of an image. In the FFT, high frequency components correspond to sharp changes in the image. There may be some correlation, which a network could learn, between high frequencies and the number of cars in an image. For example, an empty road has fewer sharp transitions than a road with many cars on it. Figure 6 shows the FFT output for a free flow and a congested image; the FFT of the congested image has vertical stripes in the high frequency regions. The network trained on FFT images performs much better than the network trained on grayscale images across all conditions except for congestion during day-rain and night-clear. Additionally, congestion day-clear and congestion night-clear have low accuracy for unknown reasons.
FIGURE 6 : The FFT output for a free flow image (left) and a congested image (right).
The wavelet transform, at a high level, produces four representations of the image. One of the four is a compressed version of the original image; the other three highlight vertical, horizontal, and diagonal edges, as in Figure 2(c). The wavelet transform can highlight vehicles as vertical edges. With this transformation, the network improved in two congestion conditions but suffered the same poor performance in free flow day-clear and congestion night-rain as the grayscale version. With respect to WT, we also discovered that horizontal shadows, such as the example in Figure 7, can look like vehicles.

In an attempt to get the best of FFT and WT, we combined the output of both transformations into a single input image, shown in Figure 2(d). This combination failed in the congestion night-rain condition. However, it performed better than grayscale images in all other conditions. It also overcame some of the shortcomings of FFT and WT alone while keeping high accuracy. This suggests that FFT and WT capture different and very important features for congestion detection.

It is important to note that some examples are difficult even for a human to distinguish. In Figure 8, raindrops on the camera make it difficult to see the traffic. The scattered light might make it appear that there are many vehicles in grayscale, or very few in WT or FFT, as the textures and edges get blurred. In many examples the images show congestion in one direction and free flow in the other, as in Figure 9. In our approach we label any image which has congestion anywhere as an example of congestion; however, a network could still become confused.
FIGURE 7 : Empty road with a horizontal shadow, which can look like a vehicle in the WT output.
FIGURE 8 : Raindrops on the camera can blur the detail of the image.
FIGURE 9 : In some images both congested and free flow lanes are visible, which can confuse a network.
7 CONCLUDING REMARKS
In this research, we attempt to address the problem of real-time traffic congestion detection. By using a direct perception approach, mapping a single image directly to a congestion indicator, we design a system which can detect congestion independently of location, time, and weather. We demonstrate that the use of FFT and WT with a convolutional neural network can produce high accuracy across multiple conditions in new locations.

These results are promising but still exploratory. Future research steps in this area include the creation of a larger dataset and the training of larger networks. This dataset should include tens of thousands to hundreds of thousands of images from 50 or more locations, which would allow for more thorough testing. Additionally, a convolutional neural network with more layers may improve results. Such a model would require more data and computational power than was available for this research. Considering the success of such models in other domains and the promising results in this study, the performance of a larger convolutional neural network would be interesting to investigate.
References

[1] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

[2] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[3] Dave Steinkrau, Patrice Y. Simard, and Ian Buck. Using GPUs for machine learning algorithms. In Proceedings of the Eighth International Conference on Document Analysis and Recognition, pages 1115–1119. IEEE Computer Society, 2005.

[4] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):221–231, 2013.

[5] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014.

[6] Chenyi Chen, Ari Seff, Alain Kornhauser, and Jianxiong Xiao. DeepDriving: Learning affordance for direct perception in autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision, pages 2722–2730, 2015.

[7] Sepehr Valipour, Mennatullah Siam, Eleni Stroulia, and Martin Jagersand. Parking stall vacancy indicator system based on deep convolutional neural networks. arXiv preprint arXiv:1606.09367, 2016.

[8] James J. Gibson. The Ecological Approach to Visual Perception: Classic Edition. Psychology Press, 2014.

[9] Shimon Ullman. Against direct perception. Behavioral and Brain Sciences, 3(03):373–381, 1980.

[10] Panos G. Michalopoulos. Vehicle detection video through image processing: the Autoscope system. IEEE Transactions on Vehicular Technology, 40(1):21–29, 1991.

[11] Rita Cucchiara, Massimo Piccardi, and Paola Mello. Image analysis and rule-based reasoning for a traffic monitoring system. IEEE Transactions on Intelligent Transportation Systems, 1(2):119–130, 2000.

[12] M. Fathy and M. Y. Siyal. Real-time image processing approach to measure traffic queue parameters. IEE Proceedings - Vision, Image and Signal Processing, 142(5):297–303, 1995.

[13] Andrews Sobral, Luciano Oliveira, Leizer Schnitman, and Felippe De Souza. Highway traffic congestion classification using holistic properties.

[14] Li Li, Long Chen, Xiaofei Huang, and Jian Huang. A traffic congestion estimation approach from video using time-spatial imagery. In First International Conference on Intelligent Networks and Intelligent Systems (ICINIS '08), pages 465–469. IEEE, 2008.

[15] Gintautas Palubinskas, Franz Kurz, and Peter Reinartz. Detection of traffic congestion in optical remote sensing imagery. In IGARSS 2008 - 2008 IEEE International Geoscience and Remote Sensing Symposium, volume 2, pages II-426. IEEE, 2008.

[16] Yuan Yuan, Jia Wan, and Qi Wang. Congested scene classification via efficient unsupervised feature learning and density estimation. Pattern Recognition, 56:159–169, 2016.

[17] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

[18] Arne Jensen and Anders la Cour-Harbo. Ripples in Mathematics: The Discrete Wavelet Transform. Springer Science & Business Media, 2001.

[19] Adrian S. Lewis and G. Knowles. Image compression using the 2-D wavelet transform. IEEE Transactions on Image Processing, 1(2):244–250, 1992.

[20] G. D. Bergland. A guided tour of the fast Fourier transform. IEEE Spectrum, 6(7):41–52, 1969.

[21] Henri J. Nussbaumer. Fast Fourier Transform and Convolution Algorithms, volume 2. Springer Science & Business Media, 2012.

[22] Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 609–616. ACM, 2009.

[23] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.

[24] Artur Filipowicz, Thee Chanyaswad, and S. Y. Kung. Filtering of frequency-transformed images for privacy-preserving face recognition. 2016. Submitted to MLSP 2016.

More Related Content

What's hot

Vehicle detection using background subtraction and clustering algorithms
Vehicle detection using background subtraction and clustering algorithmsVehicle detection using background subtraction and clustering algorithms
Vehicle detection using background subtraction and clustering algorithms
TELKOMNIKA JOURNAL
 
A multi sensor-information_fusion_method_based_on_factor_graph_for_integrated...
A multi sensor-information_fusion_method_based_on_factor_graph_for_integrated...A multi sensor-information_fusion_method_based_on_factor_graph_for_integrated...
A multi sensor-information_fusion_method_based_on_factor_graph_for_integrated...
Ashish Sharma
 
New approach to the identification of the easy expression recognition system ...
New approach to the identification of the easy expression recognition system ...New approach to the identification of the easy expression recognition system ...
New approach to the identification of the easy expression recognition system ...
TELKOMNIKA JOURNAL
 
Effective Object Detection and Background Subtraction by using M.O.I
Effective Object Detection and Background Subtraction by using M.O.IEffective Object Detection and Background Subtraction by using M.O.I
Effective Object Detection and Background Subtraction by using M.O.I
IJMTST Journal
 

What's hot (20)

Cnn acuracia remotesensing-08-00329
Cnn acuracia remotesensing-08-00329Cnn acuracia remotesensing-08-00329
Cnn acuracia remotesensing-08-00329
 
Vehicle detection using background subtraction and clustering algorithms
Vehicle detection using background subtraction and clustering algorithmsVehicle detection using background subtraction and clustering algorithms
Vehicle detection using background subtraction and clustering algorithms
 
Applying convolutional neural networks for limited-memory application
Applying convolutional neural networks for limited-memory applicationApplying convolutional neural networks for limited-memory application
Applying convolutional neural networks for limited-memory application
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and Development
 
A New Chaotic Map for Secure Transmission
A New Chaotic Map for Secure TransmissionA New Chaotic Map for Secure Transmission
A New Chaotic Map for Secure Transmission
 
A ROS IMPLEMENTATION OF THE MONO-SLAM ALGORITHM
A ROS IMPLEMENTATION OF THE MONO-SLAM ALGORITHMA ROS IMPLEMENTATION OF THE MONO-SLAM ALGORITHM
A ROS IMPLEMENTATION OF THE MONO-SLAM ALGORITHM
 
IEEE 2014 Matlab Projects
IEEE 2014 Matlab ProjectsIEEE 2014 Matlab Projects
IEEE 2014 Matlab Projects
 
IEEE 2014 Matlab Projects
IEEE 2014 Matlab ProjectsIEEE 2014 Matlab Projects
IEEE 2014 Matlab Projects
 
A multi sensor-information_fusion_method_based_on_factor_graph_for_integrated...
A multi sensor-information_fusion_method_based_on_factor_graph_for_integrated...A multi sensor-information_fusion_method_based_on_factor_graph_for_integrated...
A multi sensor-information_fusion_method_based_on_factor_graph_for_integrated...
 
IRJET Autonomous Simultaneous Localization and Mapping
IRJET  	  Autonomous Simultaneous Localization and MappingIRJET  	  Autonomous Simultaneous Localization and Mapping
IRJET Autonomous Simultaneous Localization and Mapping
 
AN ENHANCED CHAOTIC IMAGE ENCRYPTION
AN ENHANCED CHAOTIC IMAGE ENCRYPTIONAN ENHANCED CHAOTIC IMAGE ENCRYPTION
AN ENHANCED CHAOTIC IMAGE ENCRYPTION
 
IRJET- Design the Surveillance Algorithm and Motion Detection of Objects for ...
IRJET- Design the Surveillance Algorithm and Motion Detection of Objects for ...IRJET- Design the Surveillance Algorithm and Motion Detection of Objects for ...
IRJET- Design the Surveillance Algorithm and Motion Detection of Objects for ...
 
New approach to the identification of the easy expression recognition system ...
New approach to the identification of the easy expression recognition system ...New approach to the identification of the easy expression recognition system ...
New approach to the identification of the easy expression recognition system ...
 
CLUSTERING HYPERSPECTRAL DATA
CLUSTERING HYPERSPECTRAL DATACLUSTERING HYPERSPECTRAL DATA
CLUSTERING HYPERSPECTRAL DATA
 
Robust foreground modelling to segment and detect multiple moving objects in ...
Robust foreground modelling to segment and detect multiple moving objects in ...Robust foreground modelling to segment and detect multiple moving objects in ...
Robust foreground modelling to segment and detect multiple moving objects in ...
 
SkyStitch: a Cooperative Multi-UAV-based Real-time Video Surveillance System ...
SkyStitch: a Cooperative Multi-UAV-based Real-time Video Surveillance System ...SkyStitch: a Cooperative Multi-UAV-based Real-time Video Surveillance System ...
SkyStitch: a Cooperative Multi-UAV-based Real-time Video Surveillance System ...
 
Minimum image disortion of reversible data hiding
Minimum image disortion of reversible data hidingMinimum image disortion of reversible data hiding
Minimum image disortion of reversible data hiding
 
Function projective synchronization
Function projective synchronizationFunction projective synchronization
Function projective synchronization
 
An Efficient Approach for Multi-Target Tracking in Sensor Networks using Ant ...
An Efficient Approach for Multi-Target Tracking in Sensor Networks using Ant ...An Efficient Approach for Multi-Target Tracking in Sensor Networks using Ant ...
An Efficient Approach for Multi-Target Tracking in Sensor Networks using Ant ...
 
Effective Object Detection and Background Subtraction by using M.O.I
Effective Object Detection and Background Subtraction by using M.O.IEffective Object Detection and Background Subtraction by using M.O.I
Effective Object Detection and Background Subtraction by using M.O.I
 

Similar to Direct Perception for Congestion Scene Detection Using TensorFlow

Online video-based abnormal detection using highly motion techniques and stat...
Online video-based abnormal detection using highly motion techniques and stat...Online video-based abnormal detection using highly motion techniques and stat...
Online video-based abnormal detection using highly motion techniques and stat...
TELKOMNIKA JOURNAL
 
Hyperspectral Data Compression Using Spatial-Spectral Lossless Coding Technique
Hyperspectral Data Compression Using Spatial-Spectral Lossless Coding TechniqueHyperspectral Data Compression Using Spatial-Spectral Lossless Coding Technique
Hyperspectral Data Compression Using Spatial-Spectral Lossless Coding Technique
CSCJournals
 
Volkova_DICTA_robust_feature_based_visual_navigation
Volkova_DICTA_robust_feature_based_visual_navigationVolkova_DICTA_robust_feature_based_visual_navigation
Volkova_DICTA_robust_feature_based_visual_navigation
Anastasiia Volkova
 
Threshold adaptation and XOR accumulation algorithm for objects detection
Threshold adaptation and XOR accumulation algorithm for  objects detectionThreshold adaptation and XOR accumulation algorithm for  objects detection
Threshold adaptation and XOR accumulation algorithm for objects detection
IJECEIAES
 
Robust techniques for background subtraction in urban
Robust techniques for background subtraction in urbanRobust techniques for background subtraction in urban
Robust techniques for background subtraction in urban
taylor_1313
 
2013APRU_NO40-abstract-mobilePIV_YangYaoYu
2013APRU_NO40-abstract-mobilePIV_YangYaoYu2013APRU_NO40-abstract-mobilePIV_YangYaoYu
2013APRU_NO40-abstract-mobilePIV_YangYaoYu
Yao-Yu Yang
 

Similar to Direct Perception for Congestion Scene Detection Using TensorFlow (20)

Learning to Recognize Distance to Stop Signs Using the Virtual World of Grand...
Learning to Recognize Distance to Stop Signs Using the Virtual World of Grand...Learning to Recognize Distance to Stop Signs Using the Virtual World of Grand...
Learning to Recognize Distance to Stop Signs Using the Virtual World of Grand...
 
Online video-based abnormal detection using highly motion techniques and stat...
Online video-based abnormal detection using highly motion techniques and stat...Online video-based abnormal detection using highly motion techniques and stat...
Online video-based abnormal detection using highly motion techniques and stat...
 
A study on data fusion techniques used in multiple radar tracking
A study on data fusion techniques used in multiple radar trackingA study on data fusion techniques used in multiple radar tracking
A study on data fusion techniques used in multiple radar tracking
 
Hyperspectral Data Compression Using Spatial-Spectral Lossless Coding Technique
Hyperspectral Data Compression Using Spatial-Spectral Lossless Coding TechniqueHyperspectral Data Compression Using Spatial-Spectral Lossless Coding Technique
Hyperspectral Data Compression Using Spatial-Spectral Lossless Coding Technique
 
Volkova_DICTA_robust_feature_based_visual_navigation
Volkova_DICTA_robust_feature_based_visual_navigationVolkova_DICTA_robust_feature_based_visual_navigation
Volkova_DICTA_robust_feature_based_visual_navigation
 
Autonomous Abnormal Behaviour Detection Using Trajectory Analysis
Autonomous Abnormal Behaviour Detection Using Trajectory AnalysisAutonomous Abnormal Behaviour Detection Using Trajectory Analysis
Autonomous Abnormal Behaviour Detection Using Trajectory Analysis
 
Visual odometry _report
Visual odometry _reportVisual odometry _report
Visual odometry _report
 
IRJET- Real Time Implementation of Air Writing
IRJET- Real Time Implementation of  Air WritingIRJET- Real Time Implementation of  Air Writing
IRJET- Real Time Implementation of Air Writing
 
information-11-00583-v3.pdf
information-11-00583-v3.pdfinformation-11-00583-v3.pdf
information-11-00583-v3.pdf
 
CLEARMiner: Mining of Multitemporal Remote Sensing Images
CLEARMiner: Mining of Multitemporal Remote Sensing ImagesCLEARMiner: Mining of Multitemporal Remote Sensing Images
CLEARMiner: Mining of Multitemporal Remote Sensing Images
 
Ijcatr02011007
Ijcatr02011007Ijcatr02011007
Ijcatr02011007
 
Speed Determination of Moving Vehicles using Lucas- Kanade Algorithm
Speed Determination of Moving Vehicles using Lucas- Kanade AlgorithmSpeed Determination of Moving Vehicles using Lucas- Kanade Algorithm
Speed Determination of Moving Vehicles using Lucas- Kanade Algorithm
 
Automatism System Using Faster R-CNN and SVM
Automatism System Using Faster R-CNN and SVMAutomatism System Using Faster R-CNN and SVM
Automatism System Using Faster R-CNN and SVM
 
Vehicle counting without background modeling
Vehicle counting without background modelingVehicle counting without background modeling
Vehicle counting without background modeling
 
Threshold adaptation and XOR accumulation algorithm for objects detection
Threshold adaptation and XOR accumulation algorithm for  objects detectionThreshold adaptation and XOR accumulation algorithm for  objects detection
Threshold adaptation and XOR accumulation algorithm for objects detection
 
Robust techniques for background subtraction in urban
Robust techniques for background subtraction in urbanRobust techniques for background subtraction in urban
Robust techniques for background subtraction in urban
 
D018112429
D018112429D018112429
D018112429
 
Multi-channel microseismic signals classification with convolutional neural n...
Multi-channel microseismic signals classification with convolutional neural n...Multi-channel microseismic signals classification with convolutional neural n...
Multi-channel microseismic signals classification with convolutional neural n...
 
DEEP LEARNING BASED TARGET TRACKING AND CLASSIFICATION DIRECTLY IN COMPRESSIV...
DEEP LEARNING BASED TARGET TRACKING AND CLASSIFICATION DIRECTLY IN COMPRESSIV...DEEP LEARNING BASED TARGET TRACKING AND CLASSIFICATION DIRECTLY IN COMPRESSIV...
DEEP LEARNING BASED TARGET TRACKING AND CLASSIFICATION DIRECTLY IN COMPRESSIV...
 
2013APRU_NO40-abstract-mobilePIV_YangYaoYu
2013APRU_NO40-abstract-mobilePIV_YangYaoYu2013APRU_NO40-abstract-mobilePIV_YangYaoYu
2013APRU_NO40-abstract-mobilePIV_YangYaoYu
 

More from Artur Filipowicz

Incorporating Learning Strategies in Training of Deep Neural Networks for Au...
Incorporating Learning Strategies in Training of Deep Neural  Networks for Au...Incorporating Learning Strategies in Training of Deep Neural  Networks for Au...
Incorporating Learning Strategies in Training of Deep Neural Networks for Au...
Artur Filipowicz
 
Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...
Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...
Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...
Artur Filipowicz
 
Video Games for Autonomous Driving
Video Games for Autonomous DrivingVideo Games for Autonomous Driving
Video Games for Autonomous Driving
Artur Filipowicz
 

More from Artur Filipowicz (9)

Smart Safety for Commercial Vehicles (ENG)
Smart Safety for Commercial Vehicles (ENG)Smart Safety for Commercial Vehicles (ENG)
Smart Safety for Commercial Vehicles (ENG)
 
Smart Safety for Commercial Vehicles (中文)
Smart Safety for Commercial Vehicles (中文)Smart Safety for Commercial Vehicles (中文)
Smart Safety for Commercial Vehicles (中文)
 
Incorporating Learning Strategies in Training of Deep Neural Networks for Au...
Incorporating Learning Strategies in Training of Deep Neural  Networks for Au...Incorporating Learning Strategies in Training of Deep Neural  Networks for Au...
Incorporating Learning Strategies in Training of Deep Neural Networks for Au...
 
Learning to Recognize Distance to Stop Signs Using the Virtual World of Grand...
Learning to Recognize Distance to Stop Signs Using the Virtual World of Grand...Learning to Recognize Distance to Stop Signs Using the Virtual World of Grand...
Learning to Recognize Distance to Stop Signs Using the Virtual World of Grand...
 
Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...
Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...
Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...
 
Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...
Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...
Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...
 
Filtering of Frequency Components for Privacy Preserving Facial Recognition
Filtering of Frequency Components for Privacy Preserving Facial RecognitionFiltering of Frequency Components for Privacy Preserving Facial Recognition
Filtering of Frequency Components for Privacy Preserving Facial Recognition
 
Desensitized RDCA Subspaces for Compressive Privacy in Machine Learning
Desensitized RDCA Subspaces for Compressive Privacy in Machine LearningDesensitized RDCA Subspaces for Compressive Privacy in Machine Learning
Desensitized RDCA Subspaces for Compressive Privacy in Machine Learning
 
Video Games for Autonomous Driving
Video Games for Autonomous DrivingVideo Games for Autonomous Driving
Video Games for Autonomous Driving
 

Recently uploaded

Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 

Recently uploaded (20)

MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Choreo: Empowering the Future of Enterprise Software Engineering
Choreo: Empowering the Future of Enterprise Software EngineeringChoreo: Empowering the Future of Enterprise Software Engineering
Choreo: Empowering the Future of Enterprise Software Engineering
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Navigating Identity and Access Management in the Modern Enterprise
Navigating Identity and Access Management in the Modern EnterpriseNavigating Identity and Access Management in the Modern Enterprise
Navigating Identity and Access Management in the Modern Enterprise
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Simplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptxSimplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptx
 
ChatGPT and Beyond - Elevating DevOps Productivity
ChatGPT and Beyond - Elevating DevOps ProductivityChatGPT and Beyond - Elevating DevOps Productivity
ChatGPT and Beyond - Elevating DevOps Productivity
 
The Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightThe Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and Insight
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Less Is More: Utilizing Ballerina to Architect a Cloud Data Platform
Less Is More: Utilizing Ballerina to Architect a Cloud Data PlatformLess Is More: Utilizing Ballerina to Architect a Cloud Data Platform
Less Is More: Utilizing Ballerina to Architect a Cloud Data Platform
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Introduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMIntroduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDM
 
How to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cfHow to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cf
 
Modernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using BallerinaModernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using Ballerina
 

Direct Perception for Congestion Scene Detection Using TensorFlow

  • 1. Direct Perception for Congestion Scene Detection Using1 TensorFlow™2 Artur Filipowicz Princeton University 229 Sherrerd Hall, Princeton, NJ 08544 T: +01 732-593-9067 Email: arturf@princeton.edu (corresponding author) Jeremiah Liu New Jersery Institute of Technology 607 Oak Hall, Newark, NJ 07103 T: +01 202-615-5905 Email: jeremiahliu@live.com Joyoung Lee, Ph.D Assistant Professor New Jersery Institute of Technology 264 Tiernan Hall, Newark, NJ 07103 T: +01 973-596-2475 Email: jo.y.lee@njit.edu 3 Word Count: 6,641 words = 3,641 + 3,000 (8 Figures + 4 Tables)4 Paper submitted for consideration for presentation for the 96th TRB Annual Meeting in January5 2017 and for publication in the Transportation Research Record6 1
  • 2. A. Filipowicz, J. Liu, J. Lee 2 1 ABSTRACT1 In this research, we examine a new approach to the problem of real-time traffic congestion2 detection based on single image analysis. We demonstrate the of a convolutional neural network in3 this domain. With this learning model and the direct perception approach, transforming an image4 directly to a congestion indicator, we design a system which can detect congestion independently5 of location, time and weather. We further demonstrate that the use of the Fast Fourier Transform6 and wavelet transform can improve the accuracy of a convolutional neural network across multiple7 conditions in new locations.8
  • 3. A. Filipowicz, J. Liu, J. Lee 3 2 INTRODUCTION1 Traffic flow information is widely used in intelligent transportation systems to detect and2 manage traffic congestion. Collecting traffic data such as speed, count, and occupancy is one of3 the crucial parts in estimating the flow of traffic. Loop detectors, traffic radars, and surveillance4 cameras are used to collect such traffic data. However, due to the inflexibility and high cost of5 deploying loop detectors and traffic radars, video-content-understanding techniques are gaining6 popularity in detecting the flow of traffic. However, video data is often affected by the external7 environment, bad weather (e.g., rain, snow, and fog) and undesirable illumination (e.g., sun glare,8 darkness) conditions. As a result, a challenge in video-based detection is to properly interpret9 video images under such conditions.10 Recently, new advances in the field of machine learning allowed for the creation of more11 reliable computer vision algorithms. These advances include the development of larger models,12 known as Deep Learning (1), and the use of graphical processing units to speed up optimization of13 these models (2, 3). Deep Learning led to successful the use of convolutional neural networks to14 achieve high accuracy on many image (2, 4) and video recognition tasks (5). Furthermore, these15 models have been shown to generalize to new environments and conditions(6, 7).16 In this research, we attempt to address the problem of real-time traffic congestion detection.17 Unlike most previous approaches, we apply ideas from deep learning, computer vision, and signal18 processing to explore the potential of using convolutional neural networks to detect congestion. By19 using a direct perception (8) approach, mapping a single image directly to a congestion indicator,20 we design a system which can detect congestion independently of location, time and weather. We21 also demonstrate the benefit of using signal processing to pre-process images. To the best of our22 knowledge, this is the first study which looks at the performance of convolutional neural networks23 in this domain and the first study which examines the applicability of a single model across multiple24 locations and conditions.25 3 RELATED WORK26 Research efforts in video processing for traffic surveillance and control date back to the27 mid- 1970’s. We find it useful to categorize these systems into direct perception (8) and mediated28 perception (9) concepts discussed in (6).29 Mediated perception approach involves multiple sub-components for scene classification.30 These components could include a vehicles counter and a vehicle speed detector. With mediated31 perception, the estimates of both of these detectors are combined to determine if congestion is32 present. Systems based on mediated perception approaches are convenient for debugging and fine33 tuning. In such systems, we can trace errors back to individual components, and attempt to im-34 prove the system component by component. However, mediated perception also adds unnecessary35 complexity to an already difficult task. There is no clear reason to believe that vehicle detection in36 images is any easier than directly perceiving congestion. Additionally, the combining of outputs37 may introduce more parameters making the system’s performance harder to optimize. In contrast,38 direct perception approach focuses on finding a transformation from a single image directly to a39 congestion indicator. 
This approach is potentially simpler, more generalizable, and computation-40 ally more efficient. The drawback is that it is more difficult to identify reasons for failure. Direct41
  • 4. A. Filipowicz, J. Liu, J. Lee 4 perception appears to be gaining some traction in more recent literature.1 3.1 Mediated Perception2 Most of congestion scene classification techniques are based on mediated perception. One3 popular method used for congestion detection is traffic parameter extraction. Examples can be4 dated back to (10) describing the Autoscope system. The system involves two steps. First step is5 to detect vehicles. This is done by defining an area of interest and placing detection lines along6 or across the roadway lanes, then using a segmentor to estimate the pixel changes on the line7 when a vehicle is present. Next step is to derive traffic parameters such as speed by analyzing8 sequential images. The average accuracy is 95% under ideal conditions. However, the false alarm9 significantly increases under certain situations such as overlap of vehicles. Shadows also negatively10 affect the performance of the system. Cucchiara et al. (11) improved the algorithm presented in11 (10) by separating analysis techniques between day and night. A vehicle tracking accuracy of12 96.9% with 5.8% false negatives, 13.2% false positive during day time, and an accuracy of 96%13 with 4.5% false negative and 4.9% false positive during night time is achieved.14 Along a different line of research, (12) presents a way to measure traffic queue parameter15 by applying a motion detector based on Fast Fourier Transform (FFT) and followed with a vehicle16 detector based on edge detection. The result has a 95% accuracy on determining the length of the17 queue under constant lighting conditions between two frames. Similarly, (13) classified congestion18 scene by extracting the overall crowd density and crowd speed. The result shows 94.5% accuracy in19 single location under daylight condition. No evaluation at other locations and night light condition20 was discussed in this experiment. Background subtraction is convenient to implement. However,21 it is hard to extract a background model in conditions such as sudden light change and heavy22 congestion.23 Another approach to detect congestion scene was explained in (14), where Li et al. pro-24 posed a time-spatial image based method and achieved 93% detection rate for congestion condi-25 tion. According to the author, the false estimation results are due to low contrast image, small26 vehicle block and irregular lane condition. There was no evaluation under other weather condition27 such as rain. Palubinskas et al. (15) approaches traffic congestion detection in sequences of optical28 images based on change detection, image processing and incorporation of apriori information such29 as a traffic model and road network. The accuracy of congestion detection was not reported.30 3.2 Direct Perception31 Many systems developed in the past depend on sequential images, edge detection, and32 thresholding. In most cases the objective is to determine more complicated traffic descriptors than33 a single indicator. Both false positive and false negative rates have to be very low to acquire a34 practical system. Currently, most of the work is still carried out by human operators due to a35 high false alarm rate. Recently, a new direct perception approach is gaining popularity due to36 better performance in adapting to various environments. For example, (16) utilized an efficient37 unsupervised feature learning method with density information encoded. 
Spherical k-means was employed to learn features, followed by a feature selection procedure to remove bad features. Locality-constrained linear coding (LLC) was used to map raw image patches to a new feature space, and an SVM classifier was trained for classification. The proposed algorithm achieved 85% average accuracy.
Evaluation of accuracy under different weather and light conditions was not performed. Recent work done by (7) is very relevant to our approach. Valipour et al. (7) presented a parking stall vacancy detection algorithm based on deep convolutional neural networks. The result showed a misclassification rate as low as 5% under varying weather conditions.

4 METHODOLOGY

For this study, we collected over 26 thousand images from traffic surveillance cameras, labeled the data based on traffic, time, and weather conditions, applied 4 image processing transformations, and used the new TensorFlow library (17) to construct and train convolutional neural networks.

4.1 Dataset

We collected and labeled 27,833 images from surveillance cameras at 24 locations in New Jersey. The dataset contains a variety of traffic patterns and weather conditions grouped into four classes: day-clear, day-rain, night-clear, and night-rain. Figure 1 shows a sample image from each class. We divided the dataset randomly into training and testing sets by location, such that images from the same location do not appear in both sets; a sketch of this split appears after Table 2. Tables 1 and 2 detail the exact division of examples. We used the training set for training and tuning and the testing set to obtain the final accuracy.

TABLE 1 : Percent of examples of free flow and congestion in the training and test sets

  Condition    Training Examples   Test Examples
  Free flow    59%                 44%
  Congestion   41%                 56%

TABLE 2 : Number of examples of each condition in the dataset

  Condition                Training Examples   Test Examples
  Free Flow Day Clear      3,098               1,100
  Free Flow Day Rain       3,131               1,057
  Free Flow Night Clear    2,960               1,200
  Free Flow Night Rain     2,122               599
  Congestion Day Clear     2,443               1,852
  Congestion Day Rain      1,395               2,287
  Congestion Night Clear   2,460               562
  Congestion Night Rain    1,326               241
  Total                    18,935              8,898
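The location-disjoint split can be made concrete with a short sketch. This is a minimal illustration, not our actual pipeline; the field name location_id, the test fraction, and the fixed seed are assumptions introduced here.

```python
import random

def split_by_location(records, test_fraction=0.3, seed=42):
    """Assign whole camera locations to either the training or the test set,
    so that no location contributes images to both."""
    locations = sorted({r["location_id"] for r in records})
    random.Random(seed).shuffle(locations)
    n_test = max(1, int(len(locations) * test_fraction))
    test_locations = set(locations[:n_test])
    train = [r for r in records if r["location_id"] not in test_locations]
    test = [r for r in records if r["location_id"] in test_locations]
    return train, test
```

Splitting by location, rather than by individual image, is what allows the test accuracies below to be read as generalization to new camera sites.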
FIGURE 1 : Sample images of all 8 conditions in our dataset (free flow and congestion under day-clear, day-rain, night-clear, and night-rain).

4.2 Image Processing

Before training, we scaled the images to 140x100 pixels and applied 4 different image processing transformations: grayscale, the Fast Fourier Transform (FFT), the wavelet transform (WT), and our mixture of the wavelet and Fast Fourier Transforms (WT+FFT). For this research, it is important to understand the output of these transformations at a high level. The low-level details and mathematical equations are not illustrative, and we therefore omit them.

In a grayscale image, as shown in Figure 2(a), each pixel represents the intensity of light at that location. The basic gray scaling of the images serves as a baseline, and it is necessary because most images are already black and white. In the FFT, shown in Figure 2(b), the image is represented as a collection of amplitudes and phases. For our study, we discard the phases and focus on the amplitudes. In this representation, amplitudes of high-frequency components correspond to sharp changes in the image; when they are removed, the image becomes blurred. In Figure 2(b) the high-frequency components are away from the center of the image. The wavelet transform (18, 19), at a high level, produces four representations of the image. One of the four is a compressed version of the original image, which we discard as it carries information similar to the grayscale image. The other three highlight vertical, horizontal, and diagonal edges in the image, as shown in Figure 2(c). Our WT+FFT transformation appends the WT output to the FFT output, as shown in Figure 2(d). The FFT output is also scaled such that the range (max - min) of the FFT matches the range of the WT. For a general introduction to the FFT, one can refer to (20); for a deeper treatment, refer to (21). In the discussion section we analyze how the properties of these transforms may be useful for congestion detection. After applying these transformations, we standardized the training set by subtracting the mean and dividing by the standard deviation of each pixel.
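As an illustration, the transformations can be sketched in a few lines with NumPy and PyWavelets. The choice of these libraries and of the Haar wavelet is an assumption made for this sketch; the exact implementation details are omitted above.

```python
import numpy as np
import pywt

def fft_amplitude(gray):
    """Amplitude spectrum with low frequencies shifted to the center;
    the phases are discarded, as described above."""
    return np.abs(np.fft.fftshift(np.fft.fft2(gray)))

def wavelet_details(gray):
    """Single-level 2-D wavelet transform; the compressed approximation
    (cA) is discarded, keeping horizontal, vertical, and diagonal detail."""
    _cA, (cH, cV, cD) = pywt.dwt2(gray, "haar")
    return np.concatenate([cH, cV, cD], axis=1)

def wt_plus_fft(gray):
    """Append the WT detail bands to the FFT amplitudes after rescaling
    the FFT so its range (max - min) matches the range of the WT output."""
    wt = wavelet_details(gray)
    fft = fft_amplitude(gray)
    fft = fft * (np.ptp(wt) / np.ptp(fft))
    return np.concatenate([wt.ravel(), fft.ravel()])

def standardize(batch):
    """Per-pixel standardization over a stack of transformed images."""
    return (batch - batch.mean(axis=0)) / (batch.std(axis=0) + 1e-8)
```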
FIGURE 2 : Outputs of the four transformations: (a) grayscale, (b) FFT, (c) WT, (d) WT+FFT.
4.3 TensorFlow™

TensorFlow is a library used to build and train gradient-based machine learning models; it provides a convenient interface for expressing machine learning algorithms by converting the described computations into a data flow graph (17). Data flow graphs describe mathematical computation with a directed graph of nodes and edges. Nodes represent mathematical operations as well as data inputs and persistent variables. Edges describe the input/output relationships between nodes in the form of multidimensional data arrays. Computation can be distributed across multiple CPUs and GPUs (17).

4.4 Convolutional Neural Network

For our experiments we selected a convolutional neural network model. A convolutional neural network is a machine learning model which attempts to mimic the function of the human vision system. These networks tend to perform well on vision tasks because some of the parameters the network learns are arranged into filters. These filters are convolved with the input image: each filter is moved across the image and, at regular intervals, multiplied with the pixel values in that region. The effective result is that the network can match a particular pattern, represented by a filter, to all areas in an image. This allows the network to detect objects in different locations. With multiple layers of such filters, networks can detect objects even when the size of the object varies. These filters can serve as simple edge detectors or detect more complicated patterns such as facial features. Visualizations of filters can be found in (22).

We constructed a network with one convolution layer of 8 5x5 filters. This layer is followed by max pooling and normalization layers. Our model then has 3 fully connected layers with output sizes 389, 192, and 1. All layers use the ReLU activation function (2, 23), except for the last layer, which uses a softmax function. During training, we employ dropout, weight decay, and learning rate decay to avoid overfitting and improve results. Additionally, for grayscale images we apply single-image whitening and the data augmentation techniques of randomly changing contrast and brightness.
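A minimal sketch of this architecture expressed as a TensorFlow data flow graph, written against the TensorFlow 1.x-era API contemporary with (17), is shown below. The pooling and normalization parameters, the decay constants, and the use of a sigmoid cross-entropy loss over the single output (in place of the softmax mentioned above) are assumptions made for this sketch, not our exact training configuration.

```python
import tensorflow as tf

# Inputs: pre-processed images (here grayscale 140x100) and binary labels.
images = tf.placeholder(tf.float32, [None, 100, 140, 1])
labels = tf.placeholder(tf.float32, [None, 1])  # 1 = congestion, 0 = free flow
keep_prob = tf.placeholder(tf.float32)          # dropout keep probability

# One convolutional layer of 8 5x5 filters with ReLU activation.
conv = tf.layers.conv2d(images, filters=8, kernel_size=5,
                        padding="same", activation=tf.nn.relu)
# Max pooling and normalization layers (parameters assumed).
pool = tf.layers.max_pooling2d(conv, pool_size=2, strides=2)
norm = tf.nn.local_response_normalization(pool)

# Three fully connected layers with output sizes 389, 192, and 1.
flat = tf.layers.flatten(norm)
fc1 = tf.nn.dropout(tf.layers.dense(flat, 389, activation=tf.nn.relu), keep_prob)
fc2 = tf.nn.dropout(tf.layers.dense(fc1, 192, activation=tf.nn.relu), keep_prob)
logit = tf.layers.dense(fc2, 1)                 # congestion indicator

# Binary cross-entropy loss trained with a decaying learning rate.
loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logit))
step = tf.train.get_or_create_global_step()
lr = tf.train.exponential_decay(0.1, step, decay_steps=1000, decay_rate=0.96)
train_op = tf.train.GradientDescentOptimizer(lr).minimize(loss, global_step=step)
```

Weight decay can be added as an L2 penalty on the layer kernels, and the grayscale augmentations correspond to tf.image.random_brightness and tf.image.random_contrast applied to training batches.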
5 RESULTS

We trained the above model on each of the four transformations. Table 3 lists the accuracy on the test set for each transformation; Figure 3 plots the same accuracies. While no transformation is superior in all conditions, FFT, WT, and WT+FFT outperform grayscale in most conditions. Since the accuracies for free flow and congestion in the day-clear condition appear low for what seems to be an easy condition, we also trained a network on just those conditions. The performance of that network on the test set in those conditions is about 20% better, as summarized in Table 4.

FIGURE 3 : Accuracy on the test set for (a) grayscale, (b) FFT, (c) WT, and (d) WT+FFT.
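The per-condition figures in Table 3 and Figure 3 are simple grouped accuracies over the test set; a minimal sketch, with hypothetical field names, is:

```python
from collections import defaultdict

def accuracy_by_condition(examples):
    """Accuracy per condition label (e.g. 'Congestion Night Rain');
    the field names (condition, label, prediction) are illustrative."""
    correct, total = defaultdict(int), defaultdict(int)
    for ex in examples:
        total[ex["condition"]] += 1
        correct[ex["condition"]] += int(ex["prediction"] == ex["label"])
    return {c: correct[c] / total[c] for c in total}
```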
TABLE 3 : Accuracy on the test set

  Condition                Grayscale   FFT     WT      WT+FFT
  Free Flow Day Clear      0.637       0.970   0.650   0.650
  Free Flow Day Rain       0.996       0.995   0.998   0.986
  Free Flow Night Clear    0.678       0.965   1.000   0.820
  Free Flow Night Rain     0.534       0.998   0.973   0.947
  Congestion Day Clear     0.457       0.604   0.455   0.645
  Congestion Day Rain      0.724       0.438   0.835   0.846
  Congestion Night Clear   0.558       0.538   0.937   0.993
  Congestion Night Rain    0.091       0.842   0.062   0.008

TABLE 4 : Accuracy for training and testing under the day-clear condition

  Condition                Grayscale
  Free Flow Day Clear      0.85
  Congestion Day Clear     0.98

6 DISCUSSION

The accuracies in our results vary from very low to very high. This shows the general difficulty of the congestion recognition problem: a detection system may perform extremely well in one condition and location but not another. It is also apparent that a raw image, which is easy for a human to understand, may not be the best representation for a machine to detect congestion. Let us examine these results in the context of what each transformation represents.

In a grayscale image, each pixel represents the intensity of light at that location. We could speculate that the amount of light in an image corresponds to the number of vehicle headlights or taillights. Perhaps vehicles being brighter than the pavement could also correlate their number with the brightness of the image. However, as seen in Figure 5, under certain conditions, such as rain at night, even free flow traffic can appear very bright.

In our tests, a network trained on grayscale images functions better in rain and at night than during the day, suggesting that the network correlated its output with lights in the image. The network is confused by free flow day-clear images and by congestion at night in rain. A possible explanation is that the free flow day images in the test set had very light road surfaces and very few vehicles: they are bright, but for a different reason, and bright enough for the network to conclude the image shows congestion. A similar reason might apply to the poor performance on night congestion during rain. A difficulty with this analysis is that the network uses a very complicated nonlinear decision function; thus, our notion of brightness is a qualitative observation.
FIGURE 4 : Filters learned from all conditions (a) and filters learned under the day-clear condition (b).

FIGURE 5 : Free flow and congestion conditions with a lot of brightness in the image.

To further explore this question, we trained our network on images from the training set which were taken on clear days. We then tested this network on day-clear images in the test set. The results improved dramatically, from 65% and 45% to 85% and 98% accuracy on free flow and congestion respectively, as shown in Table 4. The filters of these two networks, shown in Figure 4, appear very different as well. The filters for the network trained in all conditions look more like detectors of sharp transitions between dark and light areas, while the filters for the network trained in just the day-clear condition look more uniform: they are mostly white or black. This suggests that for day conditions, the network learned to look for open road. When other conditions are present, a more optimal solution is to look for sharp transitions such as headlights.

Since brightness is not a clear indicator of congestion, we decided to use transformations which depend more on the edges and textures in an image. The idea to use other transformations comes from simulation results in (24) and from the properties of the transforms.

The Fast Fourier Transform and wavelet transform produce a frequency representation of an image. In the FFT, high-frequency components correspond to sharp changes in the image.
There may be some correlation between high frequencies and the number of cars in an image which a network could learn. For example, an empty road has fewer sharp transitions than a road with many cars on it. Figure 6 shows the FFT output for a free flow and a congested image. The FFT of the congested image has vertical stripes in the high-frequency regions. The network trained on FFT images performs much better than the network trained on grayscale images across all conditions except for congestion during day-rain and night-clear. Additionally, congestion day-clear and congestion night-clear have low accuracy for unknown reasons.

FIGURE 6 : The FFT output for an image with congestion and an image with free flow.

The wavelet transform, at a high level, produces four representations of the image. One of the four is a compressed version of the original image. The other three highlight vertical, horizontal, and diagonal edges in the image, as in Figure 2(c). The wavelet transform can highlight vehicles as vertical edges. With this transformation, the network improved in two congestion conditions but suffered the same poor performance in free flow day-clear and congestion night-rain as the grayscale version. With respect to the WT, we also discovered that horizontal shadows, as in the example in Figure 7, can look like vehicles.

In an attempt to get the best of both the FFT and the WT, we combined the outputs of both transformations into a single input image, shown in Figure 2(d). This combination failed in the congestion night-rain condition. However, it performed better than grayscale images in all other conditions. It also overcame some of the shortcomings of the FFT and WT alone while keeping high accuracy. This suggests that the FFT and WT capture different and very important features for congestion detection.

It is important to note that some examples are difficult even for a human to distinguish. In Figure 8, rain drops on the camera make it difficult to see the traffic. The scattered light might make it appear that there are many vehicles in grayscale, or very few in the WT or FFT, as the textures and edges get blurred. In many examples the images show congestion in one direction and free flow in the other, as in Figure 9. In our approach we label any image which has congestion anywhere as an example of congestion. However, a network could still become confused.
FIGURE 7 : Empty road with a horizontal shadow which can look like a vehicle in the WT output.

FIGURE 8 : Rain drops on a camera can blur the detail of the image.

FIGURE 9 : In some images congested and free flow lanes are visible, which can confuse a network.
7 CONCLUDING REMARKS

In this research, we attempt to address the problem of real-time traffic congestion detection. By using a direct perception approach, mapping a single image directly to a congestion indicator, we design a system which can detect congestion independently of location, time, and weather. We demonstrate that the use of the FFT and WT with a convolutional neural network can produce high accuracy across multiple conditions in new locations.

These results are promising but still exploratory. Future research steps in this area include the creation of a larger dataset and the training of larger networks. This dataset should include tens of thousands to hundreds of thousands of images from 50 or more locations, which would allow for more thorough testing. Additionally, a convolutional neural network with more layers may improve results. Such a model would require more data and computational power than was available for this research. Considering the success of such models in other domains and the promising results in this study, the performance of a larger convolutional neural network would be interesting to investigate.

References

[1] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

[2] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[3] Dave Steinkrau, Patrice Y. Simard, and Ian Buck. Using GPUs for machine learning algorithms. In Proceedings of the Eighth International Conference on Document Analysis and Recognition, pages 1115–1119. IEEE Computer Society, 2005.

[4] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):221–231, 2013.

[5] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014.

[6] Chenyi Chen, Ari Seff, Alain Kornhauser, and Jianxiong Xiao. DeepDriving: Learning affordance for direct perception in autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision, pages 2722–2730, 2015.

[7] Sepehr Valipour, Mennatullah Siam, Eleni Stroulia, and Martin Jagersand. Parking stall vacancy indicator system based on deep convolutional neural networks. arXiv preprint arXiv:1606.09367, 2016.

[8] James J. Gibson. The Ecological Approach to Visual Perception: Classic Edition. Psychology Press, 2014.
[9] Shimon Ullman. Against direct perception. Behavioral and Brain Sciences, 3(03):373–381, 1980.

[10] Panos G. Michalopoulos. Vehicle detection video through image processing: the Autoscope system. IEEE Transactions on Vehicular Technology, 40(1):21–29, 1991.

[11] Rita Cucchiara, Massimo Piccardi, and Paola Mello. Image analysis and rule-based reasoning for a traffic monitoring system. IEEE Transactions on Intelligent Transportation Systems, 1(2):119–130, 2000.

[12] M. Fathy and M. Y. Siyal. Real-time image processing approach to measure traffic queue parameters. IEE Proceedings - Vision, Image and Signal Processing, 142(5):297–303, 1995.

[13] Andrews Sobral, Luciano Oliveira, Leizer Schnitman, and Felippe De Souza. Highway traffic congestion classification using holistic properties.

[14] Li Li, Long Chen, Xiaofei Huang, and Jian Huang. A traffic congestion estimation approach from video using time-spatial imagery. In First International Conference on Intelligent Networks and Intelligent Systems (ICINIS '08), pages 465–469. IEEE, 2008.

[15] Gintautas Palubinskas, Franz Kurz, and Peter Reinartz. Detection of traffic congestion in optical remote sensing imagery. In IGARSS 2008 - IEEE International Geoscience and Remote Sensing Symposium, volume 2, pages II-426. IEEE, 2008.

[16] Yuan Yuan, Jia Wan, and Qi Wang. Congested scene classification via efficient unsupervised feature learning and density estimation. Pattern Recognition, 56:159–169, 2016.

[17] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

[18] Arne Jensen and Anders la Cour-Harbo. Ripples in Mathematics: The Discrete Wavelet Transform. Springer Science & Business Media, 2001.

[19] Adrian S. Lewis and G. Knowles. Image compression using the 2-D wavelet transform. IEEE Transactions on Image Processing, 1(2):244–250, 1992.

[20] G. D. Bergland. A guided tour of the fast Fourier transform. IEEE Spectrum, 6(7):41–52, 1969.

[21] Henri J. Nussbaumer. Fast Fourier Transform and Convolution Algorithms, volume 2. Springer Science & Business Media, 2012.

[22] Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 609–616. ACM, 2009.
[23] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.

[24] Artur Filipowicz, Thee Chanyaswad, and S. Y. Kung. Filtering of frequency-transformed images for privacy-preserving face recognition. 2016. Submitted to MLSP 2016.