Unsupervised Object Detection
Object Localization With Classification Networks
Divar - 2nd Interview Mini Project
Mahan Fathi
I. PRELIMINARIES
This is a mini project forming the second phase of the interview
process, with the purpose of object localization on the raw
image dataset of Divar, and by 'raw' I mean there is no
tagging or ground-truth bounding box available. Of course,
running off-the-shelf trained networks on the dataset is the
first thing that comes to mind, but doing so, I infer, is not
exactly in line with the interviewer's expectations. There is not
much to passing image batches through a trained network and
storing the outputs. So I opt for a more demanding method,
one that entails more familiarity with neural nets and
auto-differentiation frameworks. As a result, I present here
an algorithm capable of producing fairly nice bounding boxes
using only good old, readily available classification networks.
IMPORTANT: One other thing I would like to mention is
that my dearest grandfather passed away only a few hours
after I received the project mail and the countdown began. I
could not even make time to report the matter to you, and did
not know if it was the right thing to do, but I would now
appreciate it if you bear in mind that this is the result of work
done over a weekend, and that I had a really tough time getting
through it in such a short time.
II. INTRODUCTION
Guided back-propagation has proven to be a fast and
interpretable tool for visualizing the parts of the image
to which a specific neuron fires the most (not exactly;
I am just waving hands here). This is done by gating the
gradients through ReLU units when back-propagating from a
single neuron all the way back to the input image. Classification
networks store valuable spatial information about the picture,
and the idea is to make the most of these gradients for the
task of object localization when they correspond to the object.
By targeting the neurons with the biggest positive impact on
the classification score, we should be able to spot the parts of
the image that we infer to correspond to the object.
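To make the gating concrete, here is a minimal sketch of a
guided ReLU in TensorFlow 2; the @tf.custom_gradient style and
the name guided_relu are my choices here, not necessarily how
the accompanying code implements its switch.

import tensorflow as tf

@tf.custom_gradient
def guided_relu(x):
    # Forward pass is a plain ReLU.
    y = tf.nn.relu(x)

    def grad(dy):
        # Guided back-propagation: pass a gradient only where BOTH the
        # forward activation and the incoming gradient are positive.
        mask = tf.cast(x > 0, dy.dtype) * tf.cast(dy > 0, dy.dtype)
        return dy * mask

    return y, grad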
III. METHOD OVERVIEW
An introduction to the method was given in the previous
section; here I go into more detail. The first thing we need is
a classification network. I picked VGG16 for its simple
architecture, its availability, and its relative tininess, for the
better of my limited VRAM. As an algorithm hyper-parameter, we
choose the layer from which the top neurons are drawn. After
experimenting with the output, I finally settled on block5_conv2
as the source of neurons. One major perk of back-propagating
from an intermediate layer, rather than from the final feature
vectors (also referred to as net embeddings or fc7 features), is
that it affords the algorithm more generality and flexibility to
work on a wider range of image categories. To dive into the
algorithm: our first step is to pass the image through the
network to calculate the class scores. Assuming that VGGNET
outputs the right class for the image, the aforementioned top
neurons are those with the biggest contribution to the maximum
class score. These neurons are then guided back-propagated to
the original image, and only a handful of them make it to the
mask generation step, which is described further later in this
report. These masks are finally joined, and the bounding box
for the object is simply the smallest rectangle enclosing this
area.
Below an overall representation of the algorithm is provided
in pseudo-code. Each step is later attended to individually.
Algorithm 1 Object Localization with VGGNET
1: procedure LOCALIZATIONWITHVGG
2:   Pass image through VGGNET to obtain the classification
3:   Identify kmax most important neurons via the DAM heuristic
4:   Use Guided Back-propagation to map neurons back into the image
5:   Generate masks from the gradients' saliency maps and apply to the image separately
6:   Pass the resulting images once again to get class scores, pick the final top k neurons
7:   Join the final masks and find the enclosing bounding box
IV. METHOD DESCRIPTION
A. Passing image through net
We need the class scores (the top class in particular)
and the feature layer activations later. Although my narration
makes it look like these steps are performed sequentially,
it is important to note that the computations are not
actually carried out in that way; the network is wired up in
TensorFlow.
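A minimal sketch of this wiring, assuming the stock Keras
VGG16 (the original code builds the graph in plain TensorFlow,
so model and variable names here are my own):

import numpy as np
import tensorflow as tf

vgg = tf.keras.applications.VGG16(weights="imagenet")

# One model, two outputs: the class scores and the activations
# feeding block5_conv2, so a single pass yields both.
feature_layer = vgg.get_layer("block5_conv2")
probe = tf.keras.Model(inputs=vgg.input,
                       outputs=[vgg.output, feature_layer.input])

# Stand-in input; real images would go through
# tf.keras.applications.vgg16.preprocess_input first.
image = np.random.rand(1, 224, 224, 3).astype("float32")
scores, activations = probe(image)
top_class = int(tf.argmax(scores[0]))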
B. Finding kmax neurons
As mentioned in the previous section, we focus
on the input neurons to the layer block5_conv2. We
need a notion of importance to select the kmax neurons. This
selection is necessary because there are around 1000 neurons
in this layer, and back-propagating from all of them is not
computationally practical. So we have to form a subset of
size kmax and back-propagate from its members. To introduce
a notion of importance, I use the DAM heuristic, which is
proportional to the neuron's activation times the derivative of
the top class score with respect to that activation. So I form
this matrix for the input layer of block5_conv2 and take the
kmax neuron indices with the highest values:
$\mathrm{importance} = \mathrm{activations} \odot \dfrac{\partial\,\mathrm{topClassScore}}{\partial\,\mathrm{activations}}$
I have set kmax to be 10. These 10 are then considered for
back-propagation.
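A sketch of this selection with a TensorFlow 2 GradientTape,
reusing the probe model from the previous snippet; treating one
feature-map channel as a "neuron" is my assumption:

image_t = tf.constant(image)
with tf.GradientTape() as tape:
    scores, activations = probe(image_t)
    top_class = int(tf.argmax(scores[0]))
    top_score = scores[0, top_class]

# d(topClassScore)/d(activations), same shape as the activation map.
grads = tape.gradient(top_score, activations)

# DAM importance: activation times its gradient, summed over the
# spatial grid so each channel gets a single score.
dam = tf.reduce_sum(activations * grads, axis=(0, 1, 2))
kmax_indices = tf.argsort(dam, direction="DESCENDING")[:10]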
C. Guided Back-propagating from neurons to image
VGGNET uses ReLU for its non-linearity units, and Guided
Back-propagation makes the differentiation of these units a
tad different: the back-propagating signal on these units
must additionally be thresholded at zero. A handle is
implemented in the code that lets me switch from
normal to guided back-propagation whenever I want; here this
switch is activated. The output of the guided back-propagation
is a matrix the size of the original input image. So now we
have kmax, or 10, different images, and each one of them attends
to different parts of the original image. The negative saliency
map of the guided back-propagated gradients is shown in
Fig. 1.
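A sketch of mapping one selected channel back to the pixels;
guided_probe is assumed to be the probe model rebuilt with
guided_relu in place of every ReLU, i.e. with the switch above
activated:

def neuron_saliency(guided_probe, image, channel):
    # `guided_probe` is an assumption: the probe model with every
    # ReLU swapped for guided_relu from the earlier snippet.
    image_t = tf.constant(image)
    with tf.GradientTape() as tape:
        tape.watch(image_t)
        _, activations = guided_probe(image_t)
        # Sum the channel's spatial map so one scalar drives the pass.
        target = tf.reduce_sum(activations[..., channel])
    return tape.gradient(target, image_t)[0]  # H x W x 3 gradient image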
D. Generating masks
To clean up the back-propagated gradients, only pixel values
that fall above a certain percentile are kept (their value is set
to one for every channel) and the rest are set to zero. This
binary image is sent through the morphological operations of
dilation and erosion, respectively, to ensure that there are no
tiny islands or holes of active pixels on the mask. This
procedure is carried out for every one of the kmax neurons.
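A sketch with NumPy and SciPy (both are in the dependency
list); the percentile and the structuring-element size are my
guesses, not values from the original code:

import numpy as np
from scipy import ndimage

def make_mask(saliency, percentile=95):
    # Keep only the strongest gradient pixels; collapse the RGB
    # channels by taking the per-pixel maximum magnitude.
    magnitude = np.abs(saliency).max(axis=-1)
    mask = magnitude >= np.percentile(magnitude, percentile)
    # Dilate, then erode, to close tiny holes and drop lone islands.
    structure = np.ones((5, 5), dtype=bool)
    mask = ndimage.binary_dilation(mask, structure=structure)
    mask = ndimage.binary_erosion(mask, structure=structure)
    return mask  # H x W boolean mask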
E. Selecting top k neurons
These kmax masks are separately applied to the image,
and kmax masked images are produced. These images are
once again passed through the CNN, and the k masks/images,
corresponding to k different neurons, with the least softmax
classification loss are selected. Here again, the ground-truth
class is taken to be the VGG output for the original image.
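A sketch of the selection step, reusing vgg and make_mask from
the snippets above; the value k = 5 is an assumption, as the
report does not state k:

def select_top_k(vgg, image, masks, top_class, k=5):
    # Keep the k masks whose masked images VGG still classifies as
    # the pseudo ground-truth class with the least softmax loss.
    losses = []
    for mask in masks:
        masked = image * mask[None, :, :, None].astype("float32")
        probs = vgg(masked)[0]
        losses.append(-np.log(float(probs[top_class]) + 1e-12))
    keep = np.argsort(losses)[:k]
    return [masks[i] for i in keep]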
F. Spitting out bounding box
The bounding box is now simply the smallest rectangle
that encloses the union of all k top masks. See the red
bounding box in Fig. 1.
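The box itself is a few lines of NumPy:

def bounding_box(masks):
    # Smallest rectangle enclosing the union of the final masks.
    union = np.any(np.stack(masks), axis=0)
    rows = np.where(union.any(axis=1))[0]
    cols = np.where(union.any(axis=0))[0]
    return rows[0], cols[0], rows[-1], cols[-1]  # top, left, bottom, right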
V. VALIDATION METHOD
I would normally compare the generated bounding box against
the ground truth with the Intersection over Union (IoU) metric.
However, as I have already addressed the issue with the dataset,
the possibilities for the validation procedure here are very
limited. I finally decided to compare the classification score
of the bounding box crop with that of the original image.

Fig. 1. Negative saliency map of guided back-propagated gradients.

This might strike you as a self-fulfilling prophecy, as
I am in some way maximizing this very score by picking the
neurons with maximum contribution to it. To resolve this issue,
one could use a second network for validation, which makes
sense to me. I went with ResNet-50. Since both networks are
trained on ImageNet, it is straightforward to map their classes
together. Table I summarizes the validation results. Mind that
the VGGNET results for the original images are once again
treated as the ground truth here.
TABLE I
VALIDATION RESULTS

Input Images            VGG16      ResNet-50
Original                100.00%    70.20%
Bounding Box Cropped    56.60%     42.20%
I sampled 100 images per category from electronics and
vehicles, 250 from personal, and 50 from for-the-home, to
form a validation dataset of size 500. Then I cropped to
the bounding box by setting the outlying pixel values to zero
and cached the results to disk. Dropping to 70.20% when changing
the network might imply that the dataset is not the healthiest
out there. Nevertheless, the bounding box cropped results
look quite impressive to me!
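A sketch of the agreement check, assuming the stock Keras
ResNet50; both models share ImageNet's 1000 class indices,
which is what makes the class mapping straightforward:

resnet = tf.keras.applications.ResNet50(weights="imagenet")

def top1_agreement(images, pseudo_labels, model, preprocess):
    # Fraction of images whose top-1 prediction matches the pseudo
    # ground truth (the VGG16 label of the original image).
    preds = model.predict(preprocess(images.copy()))
    return float(np.mean(np.argmax(preds, axis=1) == pseudo_labels))

# Usage sketch: pseudo labels come from VGG16 on the originals.
# score = top1_agreement(cropped, vgg_labels, resnet,
#                        tf.keras.applications.resnet50.preprocess_input)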
Fig. 2. t-SNE representation of the personal category.
VI. T-SNE REPRESENTATION
I would like to briefly refer to the t-SNE representation of
the dataset using the fc7 embeddings of VGG16, which is shown
in Fig. 2. Notice how similar photos cluster up regionally. It
is very useful to have a glance at the dataset and infer some
cornerstone facts for designing the algorithm, so this was the
first thing I did. The result is easy on the eyes, by the way.
You can also find a larger t-SNE picture of the personal
category, with a greater number of tiles, in the attachment.
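For reference, the embedding-and-projection step might look
like this with scikit-learn; dataset_images is a placeholder
for the preprocessed image array, and the LAPJV tile placement
is omitted:

from sklearn.manifold import TSNE

# fc7 features: VGG16's second fully connected layer ("fc2" in Keras).
fc7 = tf.keras.Model(vgg.input, vgg.get_layer("fc2").output)
embeddings = fc7.predict(dataset_images)  # (N, 4096)

coords = TSNE(n_components=2, perplexity=30).fit_transform(embeddings)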
VII. ALGORITHM PROS AND CONS
• Pros:
– Collages or photos containing multiple objects are
handled nicely; k is high enough to detect objects
from all over the picture.
– The algorithm outputs the full size of the image as the
bounding box when it encounters a dull/monotone
image. These kinds of bounding boxes are more
frequent in the for-the-home category.
• Cons:
– Multiple neurons might attend to the same specific part
of the image. For instance, it turns out that neurons are
very sensitive to car wheels. One solution is to increase
k.
– Some parts of the algorithm cannot be parallelized,
which might slow the computations down a little.
VIII. ABOUT THE CODE
• Find the code here: https://github.com/MahanFathi/UnsupervisedObjectLocalization
• Dependencies: TensorFlow, NumPy, SciPy, scikit-learn, matplotlib, LAPJV.
Fig. 3. Results.