2. INTRODUCTION
• Vehicle detection supports applications such as surveillance, traffic management, and
rescue tasks.
• It helps prevent traffic jams and congestion, which in turn reduces
air and noise pollution.
• Surveillance data supports making the right decisions.
3. CHALLENGES
• Small size of the vehicles
• Different types and orientations
• Similarity in visual appearance of vehicles and some other objects
• Detection time in very high-resolution images
5. FIXED-GROUND SENSORS
• Information is collected efficiently using different types of fixed ground sensors,
such as stationary cameras, radar sensors, bridge sensors, and induction loops.
• These give only a partial overview of vehicle density, parking-lot occupancy, and
traffic flow.
• Used for road-network monitoring and planning, traffic statistics, and
optimization.
6. IMAGE-BASED SENSORS
• Two sources: satellites and airplanes or unmanned aerial vehicles
(UAV).
• Give an overall view of the traffic situation in the area of interest.
• Satellites provide images with sub-meter spatial resolution.
• Aerial images provide a higher spatial resolution of 0.1 to 0.5 m.
• Easier data acquisition, low cost, fast acquisition of images, and
environment-friendliness.
• Vehicle detection is posed as a supervised learning problem and solved with a
convolutional neural network (CNN).
7. CONVOLUTIONAL NEURAL NETWORK
• Deep, feed-forward artificial neural networks
• In traditional algorithms, features were hand-engineered.
• Independence from prior knowledge and human effort in feature design
• Consists of an input and an output layer, as well as multiple hidden layers.
• The hidden layers consist of convolutional layers, pooling layers, fully
connected layers and normalization layers
9. CONVOLUTIONAL LAYER
• Core building block of a CNN.
• Consists of a set of learnable filters (kernels).
• The network learns filters that activate when it detects some specific type
of feature at some spatial position in the input.
• Stacking the activation maps for all filters along the depth dimension
forms the full output volume of the convolution layer.
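A minimal sketch of this idea, written in PyTorch (the framework is an assumption, not part of the slides): a single convolutional layer whose learnable filters each produce one activation map, and the maps are stacked along the depth dimension.

```python
# Illustrative sketch (PyTorch assumed): one convolutional layer.
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=12, kernel_size=3, padding=1)

x = torch.randn(1, 3, 32, 32)   # one 32x32 RGB image
out = conv(x)                   # 12 learnable filters -> 12 activation maps
print(out.shape)                # torch.Size([1, 12, 32, 32]) - maps stacked along depth
```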
12. • The value of each filter is learned during the training process.
• Convolutions help extract more meaningful information from images.
• By stacking convolutional layers on top of each other, a CNN extracts progressively
more abstract and in-depth information (see the sketch below).
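A small sketch of stacking convolutions (PyTorch assumed; the layer sizes are illustrative): two 3x3 layers give each output value a 5x5 receptive field on the input, which is why deeper stacks capture more abstract information.

```python
# Illustrative sketch (PyTorch assumed): stacked 3x3 convolutions.
import torch
import torch.nn as nn

stack = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),   # sees 3x3 of the input
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),  # sees 5x5 of the input
)

x = torch.randn(1, 3, 64, 64)
print(stack(x).shape)   # torch.Size([1, 32, 64, 64])
```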
14. CONV2D
• Most common type of convolution layer
• Filters extend through all three channels of an image (Red, Green, and Blue).
• After the convolutions are performed individually for each channel, the results
are added up to get the final convolved output.
• The output of a filter after a convolution operation is called a feature map
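A minimal sketch of this per-channel behaviour (PyTorch assumed; the filter weights are random placeholders): each filter holds one kernel per input channel, and conv2d sums the three per-channel convolutions into a single feature map.

```python
# Illustrative sketch (PyTorch assumed): 2-D convolution over an RGB image.
import torch
import torch.nn.functional as F

image = torch.randn(1, 3, 32, 32)   # RGB input: 3 channels
weight = torch.randn(8, 3, 3, 3)    # 8 filters, each 3 channels deep (3x3 kernels)

feature_maps = F.conv2d(image, weight, padding=1)  # per-channel convs summed per filter
print(feature_maps.shape)           # torch.Size([1, 8, 32, 32]) -> 8 feature maps
```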
16. • Each filter in this layer is randomly initialized from some distribution
(e.g., a Gaussian distribution).
• By having different initialization criteria, each filter gets trained slightly
differently.
• Random initialization ensures that each filter learns to identify different
features.
• Each successive layer can have two to four times the number of filters in
the previous layer. This helps the network learn hierarchical features.
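A minimal sketch of this design (PyTorch assumed; the channel counts are illustrative): each convolutional layer has twice as many filters as the previous one, and the default initialization draws each filter's weights randomly, so every filter starts out, and ends up, slightly different.

```python
# Illustrative sketch (PyTorch assumed): doubling filter counts per layer.
import torch.nn as nn

features = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),  nn.ReLU(), nn.MaxPool2d(2),   # 16 filters
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 32 filters
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 64 filters
)
# Each nn.Conv2d weight tensor is randomly initialized, so the filters
# diverge during training and learn to respond to different features.
```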
17. RELU
• ReLU is the abbreviation of Rectified Linear Unit.
• This layer applies the non-saturating activation function f(x) = max(0, x).
• It increases the nonlinear properties of the
decision function and of the overall network
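A one-line check of f(x) = max(0, x) (PyTorch assumed): negative inputs are zeroed, positive inputs pass through unchanged.

```python
# Illustrative sketch (PyTorch assumed): element-wise ReLU.
import torch
import torch.nn as nn

x = torch.tensor([-2.0, -0.5, 0.0, 1.5, 3.0])
print(nn.ReLU()(x))   # tensor([0.0000, 0.0000, 0.0000, 1.5000, 3.0000])
```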
19. POOLING
• Non-linear down-sampling.
• Max Pooling, Average Pooling, Sum Pooling
• Max pooling partitions the input image into a set of non-overlapping rectangles and,
for each such sub-region, outputs the maximum.
• Exact location of a feature is less important than its rough location relative
to other features.
20. • Reduce the spatial size of the
representation,
• Reduce the number of parameters
and amount of computation in the
network
• A pooling layer is commonly inserted between successive convolutional layers in a
CNN architecture (see the sketch below).
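A minimal sketch of max pooling (PyTorch assumed): 2x2 non-overlapping windows keep only their maximum, halving the spatial size and the downstream computation.

```python
# Illustrative sketch (PyTorch assumed): 2x2 max pooling.
import torch
import torch.nn as nn

x = torch.arange(16.0).reshape(1, 1, 4, 4)    # 4x4 single-channel input
pooled = nn.MaxPool2d(kernel_size=2)(x)       # non-overlapping 2x2 windows
print(pooled.shape)                           # torch.Size([1, 1, 2, 2])
print(pooled)                                 # each value is the max of one 2x2 block
```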
22. FULLY CONNECTED LAYER
• Finally, after several convolutional and max pooling layers, the high-level
reasoning in the neural network is done via fully connected layers.
• Neurons in a fully connected layer have connections to all activations in
the previous layer, as seen in regular neural networks.
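A minimal sketch of the fully connected stage (PyTorch assumed; the sizes are illustrative): the feature volume is flattened and every neuron of the linear layers connects to all activations of the previous layer.

```python
# Illustrative sketch (PyTorch assumed): fully connected classification head.
import torch
import torch.nn as nn

features = torch.randn(1, 12, 16, 16)          # e.g. output of the last pooling layer
classifier = nn.Sequential(
    nn.Flatten(),                              # 12*16*16 = 3072 activations
    nn.Linear(12 * 16 * 16, 128), nn.ReLU(),
    nn.Linear(128, 10),                        # 10 class scores
)
print(classifier(features).shape)              # torch.Size([1, 10])
```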
24. CNN SUMMARY
• INPUT will hold the raw pixel values of the image,
Ex: An image of width 32, height 32, and with three color channels R,G,B.
• CONV layer will compute the output of neurons that are connected to local
regions in the input,
This results in a volume such as [32x32x12] if we decide to use 12 filters.
• RELU layer will apply an elementwise activation function, such as the max(0,x)
thresholding at zero.
This leaves the size of the volume unchanged ([32x32x12]).
25. • POOL layer will perform a downsampling operation along the
spatial dimensions (width, height),
resulting in volume such as [16x16x12].
• FC (i.e. fully connected) layer will compute the class scores,
resulting in a volume of size [1x1x10], where each of the 10
numbers corresponds to a class score.
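The whole summary can be expressed as a tiny model (PyTorch assumed) that reproduces the volumes above: [32x32x3] -> CONV (12 filters) [32x32x12] -> RELU [32x32x12] -> POOL [16x16x12] -> FC [1x1x10].

```python
# Illustrative sketch (PyTorch assumed): INPUT -> CONV -> RELU -> POOL -> FC.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 12, kernel_size=3, padding=1),  # [32x32x12]
    nn.ReLU(),                                   # [32x32x12]
    nn.MaxPool2d(2),                             # [16x16x12]
    nn.Flatten(),
    nn.Linear(12 * 16 * 16, 10),                 # [1x1x10] class scores
)

x = torch.randn(1, 3, 32, 32)                    # 32x32 RGB input
print(model(x).shape)                            # torch.Size([1, 10])
```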
27. VGG16
• Neural network that performed very well in the ImageNet Large Scale
Visual Recognition Challenge (ILSVRC) in 2014.
• Scored first place on the image localization task and second place on the
image classification task.
• It can classify an image into any one of 1000 object classes.
• It takes an input image of size 224 x 224 x 3 (RGB image).
• It has 16 weight layers in total.
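A minimal sketch of using VGG16 (PyTorch/torchvision assumed, with the torchvision >= 0.13 weights API): the pretrained model takes a 224 x 224 x 3 input and returns scores for the 1000 ImageNet classes.

```python
# Illustrative sketch (torchvision assumed): pretrained VGG16 inference.
import torch
from torchvision import models

vgg16 = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
vgg16.eval()

x = torch.randn(1, 3, 224, 224)        # one 224x224 RGB image
with torch.no_grad():
    scores = vgg16(x)
print(scores.shape)                    # torch.Size([1, 1000]) class scores
```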
30. FULLY CONVOLUTIONAL REGRESSION
NETWORK
• Used to solve the vehicle detection and counting problem.
• FCRN has two paths: a down-sampling path and an up-sampling path.
• The down-sampling path is the pre-trained VGG-16 network.
• It consists of repeated padded 3 x 3 convolutions, each followed by a rectified linear
unit (ReLU), and a max pooling operation.
• Only the layers up to 'conv5' of the VGG-16 network are used.
31. • The up-sampling path applies de-convolution (transposed convolution) layers.
• Batch normalization is used for fast convergence.
• The input is an image and the output is a density map.
• This enables accurate vehicle detection and localization (see the sketch below).
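A rough sketch of an FCRN-style model (PyTorch assumed; the up-sampling channel counts are illustrative, not the authors' exact configuration): VGG-16 convolutional layers form the down-sampling path, and transposed convolutions with batch normalization bring the features back to input resolution as a single-channel density map.

```python
# Illustrative FCRN-style sketch (PyTorch assumed): VGG-16 down path + deconv up path.
import torch
import torch.nn as nn
from torchvision import models

class FCRN(nn.Module):
    def __init__(self):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        self.down = vgg.features                 # pre-trained VGG-16 conv layers
        up = []
        in_ch = 512
        for out_ch in (256, 128, 64, 32, 16):    # five 2x up-sampling steps
            up += [nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2),
                   nn.BatchNorm2d(out_ch), nn.ReLU()]
            in_ch = out_ch
        up.append(nn.Conv2d(in_ch, 1, kernel_size=1))  # single-channel density map
        self.up = nn.Sequential(*up)

    def forward(self, x):
        return self.up(self.down(x))

x = torch.randn(1, 3, 224, 224)
print(FCRN()(x).shape)                           # torch.Size([1, 1, 224, 224])
```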
33. SOLUTION
• Using a CNN.
• Learn a mapping function between an image I(x) and a density map D(x):
F : I(x) → D(x), where I ∈ R^(m×n) and D ∈ R^(m×n).
35. GROUND TRUTH PREPARATION
• a, b, and c are the elements of the positive-definite matrix
| a  b |
| b  c |
used for generating the rotated ground truth.
• x and y are inferred from the width and height of the vehicle, and θ is the
orientation of the vehicle.
37. TRAINING THE NETWORK
• During training, an input image and its corresponding ground truth are
given to the FCRN
• The network is trained to minimize the error between the ground truth and the
predicted output.
• During inference, the output of the trained model undergoes an empirical
thresholding.
• A simple connected-component algorithm is then used to return the count and
the locations of the detected vehicles (see the sketch below).
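A minimal sketch of that post-processing (NumPy/SciPy assumed; the threshold value and function name are placeholders): the density map is thresholded and connected-component labelling returns the count and one location per detected vehicle.

```python
# Illustrative sketch (SciPy assumed): threshold + connected components.
import numpy as np
from scipy import ndimage

def count_vehicles(density_map: np.ndarray, threshold: float = 0.5):
    mask = density_map > threshold                    # empirical thresholding
    labels, count = ndimage.label(mask)               # connected components
    centers = ndimage.center_of_mass(mask, labels, range(1, count + 1))
    return count, centers                             # count and (row, col) locations
```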
38. TRAINING THE NETWORK
• During the training phase, 224x224 random patches were selected from the
aerial images.
• The selected patch contains at least one vehicle.
• Thus, patches with no vehicles were not chosen during training.
• To increase the number of training examples, data augmentation
techniques were utilized (see the sketch below).
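A minimal sketch of this sampling and augmentation (NumPy assumed; the annotation format and helper names are hypothetical): a 224x224 patch is drawn at random, kept only if it contains at least one vehicle, and then flipped/rotated.

```python
# Illustrative sketch (NumPy assumed): vehicle-containing patch sampling + augmentation.
import numpy as np

def sample_patch(image, centers, size=224, max_tries=100):
    """image: HxWx3 array; centers: list of (row, col) vehicle centers."""
    h, w = image.shape[:2]
    for _ in range(max_tries):
        top = np.random.randint(0, h - size + 1)
        left = np.random.randint(0, w - size + 1)
        inside = [(r, c) for r, c in centers
                  if top <= r < top + size and left <= c < left + size]
        if inside:                                   # keep only patches with vehicles
            return image[top:top + size, left:left + size], inside
    return None, []

def augment(patch):
    """Simple data augmentation: random flip and 90-degree rotation."""
    if np.random.rand() < 0.5:
        patch = np.fliplr(patch)
    return np.rot90(patch, k=np.random.randint(4))
```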
40. MEAN SQUARE ERROR FUNCTION
• X is the input patch set with M samples, Φ denotes all trainable parameters,
• Y_P is the predicted density map, and Y_T is the ground truth annotation.
• In its standard form, the loss is L(Φ) = (1/M) Σ_{i=1..M} ||Y_P(X_i; Φ) − Y_T(X_i)||².
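A minimal sketch of this loss (PyTorch assumed): the mean squared error between the predicted and ground-truth density maps, averaged over a batch of M patches.

```python
# Illustrative sketch (PyTorch assumed): MSE between density maps.
import torch
import torch.nn as nn

mse = nn.MSELoss()                        # mean over all elements and samples

y_pred = torch.randn(8, 1, 224, 224)      # M = 8 predicted density maps
y_true = torch.randn(8, 1, 224, 224)      # corresponding ground truth
print(mse(y_pred, y_true).item())
```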
42. DATASET
• DLR Munich vehicle dataset provided by Remote Sensing Technology Institute of
the German Aerospace Center and Overhead Imagery Research Data Set (OIRDS)
dataset
• Munich dataset contains 20 images (5616 x 3744 pixels) taken by DLR 3K camera
system at a height of 1000 m above the ground over the area of Munich, Germany.
• This dataset contains 3418 cars and 54 trucks annotated in the training image set,
and 5799 cars and 93 trucks annotated in the testing image set.
• OIRDS dataset contains 907 aerial images with approximately 1800 annotated
vehicles
43. Examples of aerial images in Munich dataset (first row) and OIRDS dataset (second
row).
44. Munich dataset. Green represents true positive cases, yellow represents false negative
cases, and red represents false positive cases.
45. OIRDS dataset. Green represents true positive cases, yellow represents false negative
cases, and red represents false positive cases.
46. Fig: The input patch, the ground truth, the predicted density map, and the result of applying
thresholding and the connected-component algorithm; all vehicles are found successfully.