Malicious Activity Prediction for Public
Surveillance using Real-Time Video
Acquisition
A Project Report
submitted by
Abhilash Dhondalkar (11EC07)
Arjun A (11EC14)
M. Ranga Sai Shreyas (11EC42)
Tawfeeq Ahmad (11EC103)
under the guidance of
Prof. M S Bhat
in partial fulfilment of the requirements
for the award of the degree of
BACHELOR OF TECHNOLOGY
DEPARTMENT OF ELECTRONICS AND COMMUNICATION
ENGINEERING
NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA
SURATHKAL, MANGALORE - 575025
April 15, 2015
ABSTRACT
Criminal activity is on the rise today, from petty crimes like pickpocketing to major terrorist attacks like the 26/11 attack, posing a threat to the safety and well-being of innocent citizens. The aim of this project is to implement a solution that detects and predicts criminal activity in real-time surveillance by sensing irregularities such as suspicious behaviour and the illegal possession of weapons, and by tracking convicted felons. Visual data has been gathered; objects such as faces and weapons have been recognised; and techniques like super-resolution and multi-modal approaches to the semantic description of images have been applied to enhance the video and to categorise any unusual activity detected. A key phrase coherent with the description of the scene inherently signals the occurrence of such activities, and a record of these descriptions is stored in a database indexed by individual. Neural networks are implemented to further associate the activities with actual unlawful behaviour.
CHAPTER 1
Introduction
Criminal activity can easily go unnoticed; more so if the criminal is experienced. This
has led to multiple disasters in the past. With terrorist attacks shaking up the whole
country, it is the need of the hour to deploy technology to aid in prevention of further
tragedies like the Mumbai local train blasts and the 9/11 attack.
1.1 Problem definition
Terrorists usually aim at disrupting a nation's economy, since the economy is the strength of any nation. They typically target high concentrations of people, which provide ample scope for large-scale destruction, and a large number of access points with few or no inspection procedures compounds the security problem. The Mumbai suburban railway alone has suffered 8 blasts, and 368 people are believed to have died as a result so far. The 9/11 attacks were among the most terrifying incidents in world history, so much so that many researchers have since dedicated their careers to the fight against terrorism through the development of stochastic models to counteract such threats.
Besides facilitating travel and mobility, road and transit systems are pillars of a nation's economy. Hence, apart from terrorizing people, sabotaging them serves the ulterior motive of causing economic damage and paralysing the country.
This project aims to identify and predict suspicious activity in public transport systems like trains and buses by acquiring visual data and applying machine learning to classify and identify criminal activity.
1.2 Previous work
Camera software dubbed Cromatica was being developed at London's Kingston University to help improve security on public transport systems, though it could be used on a wider scale. It works by detecting differences in the images shown on the screen. For example, background changes indicate a crowd of people and congestion, while a lot of movement in the images could indicate a fight. The software could detect unattended bags and people who are loitering, and could even detect if someone is about to commit suicide by throwing themselves on the track. The biggest advantage of Cromatica is that it allows the watchers to sift the evidence more efficiently: it alerts the supervisor about suspicious activity and draws his attention to that detail. No successful indigenous attempts have been made to build similar systems in India.
A team led by the University of Alabama looked at computer models to forecast future terrorist attacks. A four-step process was used in this research: researchers reviewed the behavioural signatures of terrorists across 12,000 attacks between 2003 and 2007 to calculate the relative probabilities of future attacks on various target types. The four steps were: create a database of past attacks, identify trends in the attacks, determine correlations among the attacks, and use the analysis to calculate probabilities of future attacks. The purpose was to provide officers with information that could be used in planning, but the system gave no live alarm and was not based on real-time monitoring, which dampened the chances of catching terrorists before an attack.
1.3 Motivation
The main idea for this project was inspired by the hit CBS Network TV series Person of Interest, wherein a central machine receives live feeds from the NSA and sorts relevant from irrelevant information in matters of national security. After the 9/11 attacks, the United States Government gave itself the power to read every e-mail and listen to every cell phone call in numerous attempts to pinpoint terrorists within the general population before they could act. Programs like AbleDanger, SpinNaker and TIA have been redacted, but are assumed to have been failures. Their biggest failure, however, was public perception: the public wanted to be protected, but they just didn't want to know how. Thus, we hope to build on that aspect of ensuring public safety through continuous surveillance.
Furthermore, the 26/11 attacks in Mumbai really shook the country. It was sickening
to watch innocent civilians die for no logical reason. This project provides us with an
opportunity to begin to create something that has the potential to benefit not only the
country, but everyone around the world. And it could be one of the first moves in the war against terrorism, which was one of the critical issues addressed by the Indian Prime Minister in his recent visit to the United States.
We are implementing a project similar to previous attempts to detect criminal activity,
but with more advanced prediction methods. Moreover, no successful attempt at the
hardware level has been made in India so far.
1.4 Overview
We have presented the material below based on the work we have completed in each field over the past three months. Chapter 2 details our work on super-resolution using the simplest mathematical models: the forward and inverse models. Results are included in the chapter for each step, and the entire model was written and tested in software. Chapter 3 focuses on our work in face detection and recognition, using the classical Viola-Jones algorithm and Principal Component Analysis using eigenfaces. Multiple faces have been recognised in an image, and the emphasis now shifts towards a larger set of images and video-based recognition. Chapter 4 describes our work on the detection and recognition of objects based on the Histogram of Oriented Gradients. Chapter 5 discusses our current work on a deep visual-semantic alignment of images and sentences to describe scenes in a video using a multi-modal neural network approach. Chapter 6 covers the database management system that we have developed for this project using MongoDB. Finally, Chapter 7 outlines our results at the end of the past 7 months of work, the problems we faced, and how our prototype can be improved upon.
CHAPTER 2
Super Resolution
2.1 Introduction
The goal of super-resolution, as the name suggests, is to increase the resolution of an
image. Resolution is a measure of frequency content in an image: high-resolution (HR)
images are band-limited to a larger frequency range than low-resolution (LR) images. In
the case of this project, we need to extract as much information as possible from the
image and as a result, we look at this technique. However, the hardware for HR images
is expensive and can be hard to obtain. The resolution of digital photographs is limited
by the optics of the imaging device. In conventional cameras, for example, the resolution
depends on CCD sensor density, which may not be sufficiently high. Infra-red (IR) and
X-ray devices have their own limitations.
Figure 2.1: Image Clarity Improvement using Super Resolution
Super-resolution is an approach that attempts to resolve this problem with software
rather than hardware. The concept behind this is time-frequency resolution. Wavelets,
filter banks, and the short-time Fourier transform (STFT) all rely on the relationship
between time (or space) and frequency and the fact that there is always a trade-off in
resolution between the two.
In the context of super-resolution for images, it is assumed that several LR images
(e.g. from a video sequence) can be combined into a single HR image: we are decreasing
the time resolution, and increasing the spatial frequency content. The LR images cannot
all be identical, of course. Rather, there must be some variation between them, such
as translational motion parallel to the image plane (most common), some other type of
motion (rotation, moving away or toward the camera), or different viewing angles. In
theory, the information contained about the object in multiple frames, and the knowledge
of transformations between the frames, can enable us to obtain a much better image of
the object. In practice, there are certain limitations: it might sometimes be difficult or
impossible to deduce the transformation. For example, the image of a cube viewed from a
different angle will appear distorted or deformed in shape from the original one, because
the camera is projecting a 3-D object onto a plane, and without a-priori knowledge of
the transformation, it is impossible to tell whether the object was actually deformed. In
general, however, super-resolution can be broken down into two broad parts: 1) registration of the changes between the LR images, and 2) restoration, or synthesis, of the LR images into a HR image. This is a conceptual classification only, as sometimes the two steps are performed simultaneously.
2.2 Certain Formulations
Tsai and Huang were the first to consider the problem of obtaining a high-quality image from several down-sampled and translationally displaced images, in 1984. Their data set consisted of terrestrial photographs taken by Landsat satellites. They modelled the photographs as aliased, translationally displaced versions of a constant scene. Their approach consisted in formulating a set of equations in the frequency domain using the shift property of the Fourier transform; optical blur and noise were not considered. Tekalp, Ozkan and Sezan extended the Tsai-Huang formulation by including the point spread function of the imaging system and observation noise.
2.2.1 Recursive Least Squares
Kim, Bose, and Valenzuela use the same model as Tsai and Huang (frequency domain, global translation), but incorporate noise and blur. Their work proposes a more computationally efficient way to solve the system of equations in the frequency domain in the presence of noise, using a recursive least-squares technique. However, they do not address motion estimation (the displacements are assumed to be known). The authors later extended their work to make the model less sensitive to errors, such as those due to the presence of zeroes in the Point Spread Function, by using the total least squares approach, which can be formulated as a constrained minimization problem. This made the solution more robust with respect to uncertainty in the motion parameters.
2.2.2 Spatial Domain Methods
Most of the research done on super-resolution today is done on spatial domain methods.
Their advantages include a great flexibility in the choice of motion model, motion blur and
optical blur, and the sampling process. Another important factor is that the constraints
are much easier to formulate, for example, Markov random fields or projection onto
convex sets (POCS).
2.2.3 Projection and Interpolation
If we assume ideal sampling by the optical system, then the spatial domain formulation
reduces essentially to projection on a HR grid and interpolation of non-uniformly spaced
samples (provided motion estimation has already been done). A comparison of HR reconstruction results with different interpolation techniques can be found in the literature. Several techniques are given: nearest-neighbour, weighted average, least-squares plane fitting, normalized convolution using a Gaussian kernel, the Papoulis-Gerchberg algorithm, and iterative reconstruction. It should be noted, however, that most optical systems cannot be modelled as ideal impulse samplers.
2.2.4 Iterative Methods
Since super-resolution is a computationally intensive process, it makes sense to approach it by starting with a "rough guess" and obtaining successively more refined estimates. For example, Elad and Feuer use different approximations to the Kalman filter and analyse their performance; in particular, recursive least squares (RLS), least mean squares (LMS), and steepest descent (SD) are considered. Irani and Peleg describe a straightforward iterative scheme for both image registration and restoration, which uses a back-projection kernel. In their later work, the authors modify their method to deal with more complicated motion types, which can include local motion, partial occlusion, and transparency. The basic back-projection approach remains the same, which is not very flexible in terms of incorporating a-priori constraints on the solution space. Shah and Zakhor use a reconstruction method similar to that of Irani and Peleg. They also propose a novel approach to motion estimation that considers a set of possible motion vectors for each pixel and eliminates those that are inconsistent with the surrounding pixels.
2.3 Mathematical Model
We have created a unified framework from the material developed above, which allows us to formulate HR image restoration as essentially a matrix inversion, regardless of how it is implemented numerically. Super-resolution is treated as an inverse problem, where we assume that the LR images are degraded versions of a HR image, even though that HR image may not exist as such. This allows us to collect the building blocks of the degradation model into a single matrix, and the available LR data into a single vector. The formation of LR images then becomes a simple matrix-vector multiplication, and the restoration of the HR image a matrix inversion. Constraining of the solution space is accomplished with Tikhonov regularization. The resulting model is intuitively simple (relying on linear algebra concepts) and can be easily implemented in almost any programming environment.
In order to apply a super-resolution algorithm, a detailed understanding of how images
are captured and of the transformations they undergo is necessary. In this section, we
have developed a model that converts an image that could be obtained with a high-
resolution video camera to low-resolution images that are typically captured by a lesser-
quality camera. We then attempted to reverse the process to reconstruct the HR image.
Our approach is matrix-based. The forward model is viewed as essentially construction
of operators and matrix multiplication, and the inverse model as a pseudo-inverse of a
matrix.
2.3.1 Forward Model
Let X be a HR gray-scale image of size Nx×Ny. Suppose that this image is translationally
displaced, blurred, and down-sampled, in that order. This process is repeated N times.
The displacements may be different each time, but the down-sampling factors and the
blur remain the same, which is usually true for real-world image acquisition equipment.
Let d1, d2, ..., dN denote the sequence of shifts and r the down-sampling factor, which may differ in the vertical and horizontal directions, i.e. there are factors rx and ry. Thus, we obtain N shifted, blurred, decimated versions (observed images) Y1, Y2, ..., YN of the original image.
The ”original” image, in the case of real data, may not exist, of course. In that case,
it can be thought of as an image that could be obtained with a very high-quality video
camera which has a (rx, ry) times better resolution and does not have blur, i.e. its Point
Spread Function is a delta function.
To be able to represent operations on the image as matrix multiplications, it is necessary to convert the image matrix into a vector. Then we can form matrices which operate on each pixel of the image separately. For this purpose, we introduce the operator vec, which represents the lexicographic ordering of a matrix: a vector is formed from the vertical concatenation of the matrix columns. Let us also define the inverse operator mat, which converts a vector into a matrix. To simplify the notation, the dimensions of the matrix are not explicitly specified, but are assumed to be known.
Let x = vec(X) and yi = vec(Yi), i = 1, ..., N, be the vectorized versions of the original image and the observed images, respectively. We can represent the successive transformations of x (shifting, blurring, and down-sampling) separately from each other.
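The vec and mat operators can be realised in a few lines of NumPy; the sketch below is a minimal illustration (the function names follow the text, and the Fortran-order flatten matches the column-wise lexicographic ordering):

```python
import numpy as np

def vec(M):
    """Lexicographic ordering: stack the columns of M into one vector."""
    return M.flatten(order="F")

def mat(v, shape):
    """Inverse of vec: reshape a vector back into a matrix of the given shape."""
    return v.reshape(shape, order="F")

X = np.array([[1, 4, 7],
              [2, 5, 8],
              [3, 6, 9]])
x = vec(X)          # columns stacked vertically: 1, 2, 3, 4, ..., 9
```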
Shift
A shift operator moves all rows or all columns of a matrix up by one or down by one.
The row shift operator is denoted by Sx and the column shift by Sy. Consider a sample matrix

$$M_{ex} = \begin{pmatrix} 1 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 9 \end{pmatrix}$$

After a row shift in the upward direction, this matrix becomes

$$\mathrm{mat}(S_x\,\mathrm{vec}(M_{ex})) = \begin{pmatrix} 2 & 5 & 8 \\ 3 & 6 & 9 \\ 0 & 0 & 0 \end{pmatrix}$$
Note that the last row of the matrix was replaced by zeros. Actually, this depends on
the boundary conditions. In this case, we assume that the matrix is zero-padded around
the boundaries, which corresponds to an image on a black background. Other boundary
conditions are possible, for example the Dirichlet boundary, when there is no change
along the boundaries, i.e. the image’s derivative on the boundary is zero. Another case
is the Neumann boundary condition, where the entries outside the boundary are replicas
of those inside. Column shift is defined analogously to the row shift.
Most operators of interest in this thesis have block diagonal form: the only non-
zero elements are contained in sub-matrices along the main diagonal. To represent this,
let us use the notation diag(A, B, C, .....) to denote the block-diagonal concatenation of
matrices A,B,C... Furthermore, most operators are composed of the same block repeated
multiple times. Let diag(rep(B, n)) mean that the matrix B is diagonally concatenated
with itself n times. Then the row shift operator can be expressed as a matrix whose
diagonal blocks consist of the same sub-matrix B:
$$B = \begin{pmatrix} 0_{(n_x-1)\times 1} & I_{n_x-1} \\ 0_{1\times 1} & 0_{1\times(n_x-1)} \end{pmatrix}$$

The shift operators have the form:

$$S_x(1) = \mathrm{diag}(\mathrm{rep}(B, n_y)), \qquad S_y(1) = \begin{pmatrix} 0_{n_x(n_y-1)\times n_x} & I_{n_x(n_y-1)} \\ 0_{n_x\times n_x} & 0_{n_x\times n_x(n_y-1)} \end{pmatrix}$$
Here and hereafter, In denotes an identity matrix of size n, and 0nx×ny denotes a zero matrix of size nx × ny. The total size of the shift operator is nxny × nxny. The notation Sx(1), Sy(1) simply means that the shift is by one row or column, to differentiate it from the multi-pixel shift described later.
As an example, consider a 3 × 2 matrix M. Its corresponding row shift operator is:

$$S_x(1) = \begin{pmatrix} 0&1&0&0&0&0 \\ 0&0&1&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&1&0 \\ 0&0&0&0&0&1 \\ 0&0&0&0&0&0 \end{pmatrix}$$

It is apparent that this shift operator consists of the diagonal concatenation of a block B with itself, where

$$B = \begin{pmatrix} 0&1&0 \\ 0&0&1 \\ 0&0&0 \end{pmatrix}$$

For the column shift operator,

$$S_y(1) = \begin{pmatrix} 0&0&0&1&0&0 \\ 0&0&0&0&1&0 \\ 0&0&0&0&0&1 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \end{pmatrix}$$
For a shift in the opposite direction (the shifts above were assumed to be down and to the right), the operators just have to be transposed: $S_x(-1) = S_x(1)^T$ and $S_y(-1) = S_y(1)^T$.
Shift operators for multiple-pixel shifts can be obtained by raising the one-pixel shift operator to the power equal to the size of the desired shift. Thus, the notation Sx(i), Sy(i) denotes the shift operator corresponding to the displacement (dix, diy) between frames i and i − 1, where Si = Sx(dix)Sy(diy). As an example, consider the shift operators for the same matrix as before, but now for a 2-pixel shift:

$$S_x(2) = S_x(1)^2 = \begin{pmatrix} 0&0&1&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&1 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \end{pmatrix}$$
The column shift operator in this case would be an all-zero matrix, since the matrix it is applied to has only two columns. However, it is clear how multiple-shift operators can be constructed from single-shift ones. It should be noted that simply raising a matrix to a power may not work for some complicated boundary conditions, such as the reflexive boundary condition. In such cases, the shift operators need to be modified for every shift individually, depending on what the elements outside the boundary are assumed to be.
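The shift operators above can be assembled directly from the block B, for instance with a Kronecker product. The sketch below assumes the zero-padding boundary condition; np.kron and matrix powers stand in for diag(rep(·)) and repeated application:

```python
import numpy as np

def vec(M): return M.flatten(order="F")          # column-wise lexicographic ordering
def mat(v, shape): return v.reshape(shape, order="F")

def Sx(nx, ny, d=1):
    """Row-shift operator diag(rep(B, ny)); a d-pixel shift is B to the d-th power."""
    B = np.linalg.matrix_power(np.eye(nx, k=1), d)   # one-pixel block, zero-padded boundary
    return np.kron(np.eye(ny), B)

def Sy(nx, ny, d=1):
    """Column-shift operator: the identity displaced by d*nx positions."""
    return np.eye(nx * ny, k=d * nx)

M = np.array([[1, 4, 7],
              [2, 5, 8],
              [3, 6, 9]])

# row shift in the upward direction, as in the text: last row becomes zeros
shifted = mat(Sx(3, 3, 1) @ vec(M), M.shape)
```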
Blur
Blur is a natural property of all image acquisition devices caused by the imperfections
of their optical systems. Blurring can also be caused by other factors, such as motion
(motion blur) or the presence of air (atmospheric blur), which we do not consider here.
Lens blur can be modelled by convolving the image with a mask (matrix) corresponding to the optical system's PSF. Many authors assume that blurring is a simple neighbourhood-averaging operation, i.e. the mask consists of identical entries equal to one divided by the size of the mask. Another common blur model is Gaussian: the image is convolved with a two-dimensional Gaussian of size Gsize × Gsize and variance σ². Since blurring takes place on the vectorized image, convolution is replaced by matrix multiplication. In general, to represent convolution as multiplication, consider a Toeplitz matrix of the form
$$T = \begin{pmatrix} t_0 & t_{-1} & \cdots & t_{2-n} & t_{1-n} \\ t_1 & t_0 & t_{-1} & \cdots & t_{2-n} \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ t_{n-2} & \cdots & t_1 & t_0 & t_{-1} \\ t_{n-1} & t_{n-2} & \cdots & t_1 & t_0 \end{pmatrix}$$

where negative indices are used for convenience of notation. Now define the operation T = toeplitz(t) as converting a vector $t = [t_{1-n}, \ldots, t_{-1}, t_0, t_1, \ldots, t_{n-1}]$ (of length 2n − 1) to the form shown above, with the negative indices of t corresponding to the first row of T and the positive indices to the first column, with t0 as the corner element.
Consider a kxky × kxky matrix T of the form

$$T = \begin{pmatrix} T_0 & T_{-1} & \cdots & T_{1-k_y} \\ T_1 & T_0 & T_{-1} & \vdots \\ \vdots & \vdots & \vdots & T_{-1} \\ T_{k_y-1} & \cdots & T_1 & T_0 \end{pmatrix}$$

where each block Tj is a kx × kx Toeplitz matrix. This matrix is called block Toeplitz with Toeplitz blocks (BTTB). Finally, two-dimensional convolution can be converted to an equivalent matrix multiplication:

$$t * f = \mathrm{mat}(T\,\mathrm{vec}(f))$$

where T is the kxky × kxky BTTB matrix of the form shown above with Tj = toeplitz(t·,j). Here t·,j denotes the jth column of the (2kx − 1) × (2ky − 1) matrix t.
The blur operator is denoted by H. Depending on the image source, the assumption of blur can be omitted in certain cases. The results obtained for the blur model are shown below; blur is treated as Gaussian in this case.
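The equivalence between convolution and multiplication by H can be checked numerically. The sketch below builds H column by column from the action of a zero-padded 2-D convolution on unit images, which for small sizes is equivalent to the BTTB construction; conv2_same and blur_operator are our own helper names, not part of the model:

```python
import numpy as np

def vec(M): return M.flatten(order="F")
def mat(v, shape): return v.reshape(shape, order="F")

def conv2_same(img, kernel):
    """Plain 2-D convolution with zero-padded boundaries, 'same' output size."""
    kx, ky = kernel.shape
    px, py = kx // 2, ky // 2
    padded = np.pad(img, ((px, px), (py, py)))
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            patch = padded[i:i + kx, j:j + ky]
            out[i, j] = np.sum(patch * kernel[::-1, ::-1])   # flip = convolution
    return out

def blur_operator(nx, ny, kernel):
    """Assemble H column by column: column k is the blurred k-th unit image."""
    H = np.zeros((nx * ny, nx * ny))
    for k in range(nx * ny):
        e = np.zeros(nx * ny)
        e[k] = 1.0
        H[:, k] = vec(conv2_same(mat(e, (nx, ny)), kernel))
    return H

# 3x3 uniform (neighbourhood-averaging) mask
t = np.full((3, 3), 1.0 / 9.0)
H = blur_operator(4, 4, t)
img = np.random.default_rng(0).random((4, 4))
```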
Downsampling
The two-dimensional down-sampling operator discards some elements of a matrix while
leaving others unchanged. In the case of the downsampling-by-rows operator, Dx(rx), the first row and all rows whose numbers are one plus a multiple of rx are preserved, while all others are removed. Similarly, the downsampling-by-columns operator Dy(ry) preserves the first column and the columns whose numbers are one plus a multiple of ry, while removing the others. As an example, consider the matrix
$$M_{ex} = \begin{pmatrix} 1 & 5 & 9 & 13 & 17 & 21 & 25 \\ 2 & 6 & 10 & 14 & 18 & 22 & 26 \\ 3 & 7 & 11 & 15 & 19 & 23 & 27 \\ 4 & 8 & 12 & 16 & 20 & 24 & 28 \end{pmatrix}$$

Suppose rx = 2. Then we have the downsampled-by-rows matrix

$$\mathrm{mat}(D_x\,\mathrm{vec}(M_{ex})) = \begin{pmatrix} 1 & 5 & 9 & 13 & 17 & 21 & 25 \\ 3 & 7 & 11 & 15 & 19 & 23 & 27 \end{pmatrix}$$

Suppose ry = 3. Then we have the downsampled-by-columns matrix

$$\mathrm{mat}(D_y\,\mathrm{vec}(M_{ex})) = \begin{pmatrix} 1 & 13 & 25 \\ 2 & 14 & 26 \\ 3 & 15 & 27 \\ 4 & 16 & 28 \end{pmatrix}$$

Matrices can be downsampled by both rows and columns. In the above example,

$$\mathrm{mat}(D_x D_y\,\mathrm{vec}(M_{ex})) = \begin{pmatrix} 1 & 13 & 25 \\ 3 & 15 & 27 \end{pmatrix}$$
It should be noted that the operations of downsampling by rows and columns commute; however, the downsampling operators themselves do not. This is due to the requirement that matrices must be compatible in size for multiplication. If the Dx operator is applied first, its size must be $\frac{S_xS_y}{r_x} \times S_xS_y$. The size of the Dy operator then must be $\frac{S_xS_y}{r_xr_y} \times \frac{S_xS_y}{r_x}$. The order of these operators, once constructed, cannot be reversed; of course, we could choose to construct either operator first. Notice that the downsampling-by-columns operator Dy is much smaller than the downsampling-by-rows operator Dx. This is because Dy is multiplied not with the original vec(Mex), but with the smaller vector Dx vec(Mex), i.e. with Mex already downsampled by rows.
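The downsampling operators, and the matching of their sizes, can be illustrated on the same example matrix; the Kronecker-product construction below is our own sketch:

```python
import numpy as np

def vec(M): return M.flatten(order="F")
def mat(v, shape): return v.reshape(shape, order="F")

def Dx(nx, ny, rx):
    """Keep row 1 and every row whose number is one plus a multiple of rx."""
    keep = np.arange(0, nx, rx)
    P = np.eye(nx)[keep]                 # row-selection block
    return np.kron(np.eye(ny), P)        # act on every column of the image

def Dy(nx, ny, ry):
    """Keep column 1 and every column one plus a multiple of ry.
    Acts AFTER Dx, so it is sized for the already row-downsampled image."""
    keep = np.arange(0, ny, ry)
    return np.kron(np.eye(ny)[keep], np.eye(nx))

Mex = np.arange(1, 29).reshape(7, 4).T   # the 4x7 example matrix from the text
down_rows = mat(Dx(4, 7, 2) @ vec(Mex), (2, 7))

# downsample by rows, then by columns: Dy is sized for the smaller image
both = mat(Dy(2, 7, 3) @ Dx(4, 7, 2) @ vec(Mex), (2, 3))
```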
Data Model: Conclusions and Results
The observed images are given by

$$y_i = DHS_i x, \quad i = 1, \ldots, N$$

where D = DxDy and Si = Sx(dix)Sy(diy). If we define a matrix Ai as the product of the downsampling, blurring, and shift matrices,

$$A_i = DHS_i$$

then the above equation can be written as $y_i = A_i x$, $i = 1, \ldots, N$.
Furthermore, we can obtain all of the observed frames with a single matrix multiplication, rather than N multiplications as above. If all of the vectors yi are vertically concatenated, the result is a vector y that represents all of the LR frames. The mapping from x to y is then given by the vertical concatenation of all the matrices Ai. The resulting matrix A consists of N block matrices, where each block matrix Ai operates on the same vector x. By the properties of block matrices, the product Ax is the same as if all vectors yi were stacked into a single vector. Hence,

$$y = Ax$$
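Putting the pieces together, the stacked system y = Ax can be simulated end to end. This is a minimal sketch with no blur (H = I) and four hypothetical one-pixel shifts of our own choosing:

```python
import numpy as np

def vec(M): return M.flatten(order="F")

def shift_op(nx, ny, dx, dy):
    """Si = Sx(dx) Sy(dy): whole-image shift built from one-pixel blocks."""
    B = np.linalg.matrix_power(np.eye(nx, k=1), dx)
    return np.kron(np.eye(ny), B) @ np.eye(nx * ny, k=dy * nx)

def down_op(nx, ny, rx, ry):
    """D = Dx Dy built directly as one combined row/column selection."""
    Px = np.eye(nx)[np.arange(0, nx, rx)]
    Py = np.eye(ny)[np.arange(0, ny, ry)]
    return np.kron(Py, Px)

nx = ny = 8
x = vec(np.random.default_rng(1).random((nx, ny)))

D = down_op(nx, ny, 2, 2)
H = np.eye(nx * ny)                          # no blur in this sketch
shifts = [(0, 0), (1, 0), (0, 1), (1, 1)]

A_blocks = [D @ H @ shift_op(nx, ny, dx, dy) for dx, dy in shifts]
A = np.vstack(A_blocks)                      # stacked system: y = A x
y = A @ x                                    # all four LR frames at once
```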
The above model assumes that there is a single image that is shifted by different
amounts. In practical applications, however, that is not the case. In the case of our
project, we are interested in some object that is within the field of view of the video
camera. This object is moving while the background remains fixed. If we consider only a
few frames (which can be recorded in a fraction of a second), we can define a ”bounding
box” within which the object will remain for the duration of observation. In this work,
this ”box” is referred to as the region of interest (ROI). All operations need to be done
only with the ROI, which is much more efficient computationally. It also poses the
additional problem of determining the object’s initial location and its movement within
the ROI. These issues will be described in the section dealing with motion estimation.
Results for the complete forward model are presented here. Shown below are three such observations.
Figure 2.2: Forward Model results
Also, although noise is not explicitly included in the model, the inverse model formulation (described next) assumes that additive white Gaussian noise (AWGN), if present, can be attenuated by a regularizer, with the degree of attenuation controlled via the regularization parameter.
2.3.2 Inverse Model
The goal of the inverse model is to reconstruct a single HR frame given several LR frames. Since in the forward model the HR-to-LR transformation is reduced to matrix multiplication, it is logical to formulate the restoration problem as matrix inversion. Indeed, the purpose of vectorizing the image and constructing matrix operators for image transformations was to represent the HR-to-LR mapping as a system of linear equations. First, it should be noted that this system may be under-determined. Typically, the combination of all available LR frames contains only a part of the information in the HR frame. Alternatively, some frames may contain redundant information (the same set of pixels). Hence, a straightforward solution of the form $\hat{x} = A^{-1}y$ is not feasible. Instead, we could define the optimal solution as the one minimizing the discrepancy between the observed and the reconstructed data in the least squares sense. For under-determined systems, we could also define a solution with the minimum norm.
However, it is not practical to do so, because it is not known in advance whether the system will be under-determined. The least-squares solution works in all cases. Let us define a criterion function with respect to $\hat{x}$:

$$J(x) = \lambda\|Qx\|_2^2 + \|y - Ax\|_2^2$$

where Q is the regularizing term and λ its parameter. The solution can then be defined as

$$\hat{x} = \arg\min_x J(x)$$

We can set the derivative of the criterion function equal to the zero vector and solve the resulting equation:
$$\frac{\partial J(x)}{\partial x} = 2\lambda Q^T Q\hat{x} - 2A^T(y - A\hat{x}) = 0$$

$$\hat{x} = (A^T A + \lambda Q^T Q)^{-1} A^T y$$
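The closed-form solution above can be exercised on simulated data. The following is a sketch rather than the project's actual implementation: forward-difference operators build the regularizer Q^T Q, and the four shifts are chosen so that the LR frames tile the HR grid:

```python
import numpy as np

def vec(M): return M.flatten(order="F")

def shift_op(nx, ny, dx, dy):
    B = np.linalg.matrix_power(np.eye(nx, k=1), dx)
    return np.kron(np.eye(ny), B) @ np.eye(nx * ny, k=dy * nx)

def down_op(nx, ny, r):
    P = np.eye(nx)[np.arange(0, nx, r)]
    C = np.eye(ny)[np.arange(0, ny, r)]
    return np.kron(C, P)

def regularizer(nx, ny):
    """Q^T Q built from forward-difference operators in both directions."""
    dx = np.kron(np.eye(ny), np.eye(nx, k=1) - np.eye(nx))
    dy = np.eye(nx * ny, k=nx) - np.eye(nx * ny)
    return dx.T @ dx + dy.T @ dy

nx = ny = 8
x = vec(np.random.default_rng(2).random((nx, ny)))

shifts = [(0, 0), (1, 0), (0, 1), (1, 1)]
D = down_op(nx, ny, 2)
A = np.vstack([D @ shift_op(nx, ny, dx, dy) for dx, dy in shifts])
y = A @ x                                   # simulated LR data (no blur, no noise)

lam = 1e-6                                  # small lambda: the system is well determined here
x_hat = np.linalg.solve(A.T @ A + lam * regularizer(nx, ny), A.T @ y)
err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
```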
We can now see the role of the regularizing term. Without it, the solution would contain the term $(A^TA)^{-1}$. Multiplication by the downsampling matrix may cause A to have zero rows or columns, making $A^TA$ singular. This is intuitively clear, since down-sampling is an irreversible operation. The above expression would be non-invertible without the regularizing term, which "fills in" the missing values.
It is reasonable to choose Q to be a derivative-like term. This will ensure smooth transitions between the known points on the HR grid. If we let ∆x, ∆y be the derivative operators, we can write Q as

$$Q = \begin{pmatrix} \Delta_x \\ \Delta_y \end{pmatrix}$$

Then

$$Q^T Q = \begin{pmatrix} \Delta_x^T & \Delta_y^T \end{pmatrix} \begin{pmatrix} \Delta_x \\ \Delta_y \end{pmatrix} = \Delta_x^T\Delta_x + \Delta_y^T\Delta_y = L$$
where L is the discrete Laplacian operator. The Laplacian is a second-derivative term, but for discrete data it can be approximated by a single convolution with a mask of the form

$$\begin{pmatrix} 0 & -1 & 0 \\ -1 & 4 & -1 \\ 0 & -1 & 0 \end{pmatrix}$$

The operator L performs this convolution as matrix multiplication. It is a banded matrix with 4 on the main diagonal, −1 on the first sub- and super-diagonals (neighbours within an image column), and −1 on the diagonals offset by nx (neighbours in adjacent columns); all other entries are zero. For simplicity, this does not take into account the boundary conditions. This should only affect pixels that are on the image's edges, and if they are relevant, the image can be extended by zero-padding.
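The action of L as matrix multiplication can be checked against the 5-point mask directly. This small sketch assumes zero-padded boundaries, as in the text, and builds L with a Kronecker construction of our own:

```python
import numpy as np

def vec(M): return M.flatten(order="F")
def mat(v, shape): return v.reshape(shape, order="F")

def laplacian_op(nx, ny):
    """L as a matrix: 4 on the diagonal, -1 for each horizontal and
    vertical neighbour, with zero-padded boundaries."""
    Bx = 2 * np.eye(nx) - np.eye(nx, k=1) - np.eye(nx, k=-1)
    By = 2 * np.eye(ny) - np.eye(ny, k=1) - np.eye(ny, k=-1)
    return np.kron(np.eye(ny), Bx) + np.kron(By, np.eye(nx))

n = 5
img = np.random.default_rng(3).random((n, n))
L = laplacian_op(n, n)
out = mat(L @ vec(img), (n, n))

# for an interior pixel, L reproduces the 5-point mask exactly
i, j = 2, 2
expected = 4 * img[i, j] - img[i - 1, j] - img[i + 1, j] - img[i, j - 1] - img[i, j + 1]
```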
Figure 2.3: Under-regularized, Optimally regularized and Over-regularized HR Image
The remaining question is how to choose the parameter λ. There exist formal methods for
choosing the parameter, such as generalized cross-validation (GCV) or the L-curve, but it
is not necessary to use them in all cases: the appropriate value may be selected by trial and
error and visual inspection, for example. A larger λ makes the system better conditioned,
but this new system is farther away from the original system (without regularization).
Under the no blur, no noise condition, any sufficiently small value of λ (that makes the
matrix numerically invertible) will produce almost the same result. In fact, the difference
will probably be lost during round-off, since most gray-scale image formats quantize
intensity levels to a maximum of 256. When blur is added to the model, however, λ
may need to be made much larger, in order to avoid high-frequency oscillations (ringing)
in the restored HR image. Since blurring is low-pass filtering, during HR restoration,
the inverse process, namely, high-pass filtering, occurs, which greatly amplifies noise. In
general, deblurring is an ill-posed problem. Meanwhile, without blurring, restoration is
in effect a simple interleaving and interpolation operation, which is not ill-conditioned.
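The trade-off that λ controls can be seen even on a tiny synthetic system. This is a generic Tikhonov sketch, not tied to the image operators above: as λ grows, the data-fit residual increases while the regularized norm of the solution shrinks:

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 20, 30                          # under-determined: fewer data than unknowns
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
y = A @ x_true + 0.05 * rng.standard_normal(m)   # noisy observations
Q = np.eye(n)                          # simplest regularizer for the sketch

def solve(lam):
    """Tikhonov-regularized least-squares solution for a given lambda."""
    return np.linalg.solve(A.T @ A + lam * Q.T @ Q, A.T @ y)

residuals, norms = [], []
for lam in [1e-3, 1e-1, 1e1]:
    xh = solve(lam)
    residuals.append(np.linalg.norm(y - A @ xh))   # data-fit term
    norms.append(np.linalg.norm(Q @ xh))           # regularization term
```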
Three HR restorations of the same LR sequence are shown above, with different values of the parameter λ. The magnification is by a factor of 2 in both dimensions, and the assumed blur kernel is 3 × 3 uniform. The image on the left was formed with λ = 0.001, and it is apparent that it is under-regularized: noise and motion artefacts have been amplified as a result of de-blurring. For the image on the right, λ = 1 was used; this resulted in an overly smooth image, with few discernible details. The center image is optimal, with λ = 0.11 as found by GCV. The GCV curve is shown in Figure 2.4. With de-blurring, there is an inevitable trade-off between image sharpness and the level of noise.
Figure 2.4: Plot of GCV value as a function of λ
Results of this mathematical approach to super-resolution are presented here. An estimate and the super-resolved image are obtained as shown.
Figure 2.5: Super resolved images using the forward-inverse model
2.4 Advantages of our solution
The expression for $\hat{x}$ produces a vector which, after appropriate reshaping, becomes a HR image. We are interested in how closely the restored image resembles the "original".
As mentioned before, in realistic situations the ”original” does not exist. The properties
of the solution, however, can be investigated with existing HR images and simulated LR
images (formed by shifting, blurring, and down-sampling).
Let us define an error metric ε that formally measures how different the original and the
reconstructed HR images are:

ε = ||x − ˆx||2 / ||x||2

A smaller ε corresponds to a reconstruction that is closer to the original. Clearly, the
quality of reconstruction depends on the number of available LR frames and the relative
motion between these frames. Suppose, for example, that the down-sampling factor in
one direction is 4 and the object moves strictly in that direction at 4 HR pixels per frame.
Then, in the ideal noiseless case, all frames after the first one will contain the same set
of pixels. In fact, each subsequent frame will contain slightly less information, because
at each frame some pixels slide past the edge. Now suppose the object's velocity is 2
HR pixels per frame. Then the first two frames will contain unique information, and the
rest will be duplicated. The reconstruction obtained with only the first two frames
will be as good as that using many frames.
In the proposed solution, if redundant frames are added, the error as defined before will
stay approximately constant. In the case of real imagery, this has the effect of reducing
noise due to averaging. Generally speaking, the best results are obtained when there are
small random movements of the object in both directions (vertically and horizontally).
Even if the object remains in place, such movements can be obtained by slightly moving the
camera.
Under the assumption of no blur and no noise, it can also be shown that there exists a
set of LR frames with which almost perfect reconstruction is possible. LR frames can
be thought of as being mapped onto the HR grid. If all points on the grid are filled,
the image is perfectly reconstructed. Suppose, for example, that the original HR image
is down-sampled by (2,3) (2 by rows and 3 by columns). Suppose the first LR frame is
generated by downsampling the HR image with no motion, i.e. its displacement is (0, 0).
Then the set of LR frames with the following displacements is sufficient for reconstruction:
(0, 0), (0, 1), (0, 2)
(1, 0), (1, 1), (1, 2)
In general, for downsampling by (rx, ry), all combinations of shifts from 0 to rx − 1 and 0 to
ry − 1 are necessary to fully reconstruct the image. If the estimated value is used, the error
defined above will be almost zero. The very small residual is due to the presence of the
regularization term and boundary effects.
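The sufficiency of this shift set can be checked directly in code (a no-blur, no-noise sketch with illustrative helper names):

```python
import numpy as np

def downsample(hr, shift, factors):
    """LR frame obtained by shifting the HR image by `shift` and decimating."""
    (dx, dy), (rx, ry) = shift, factors
    return hr[dx::rx, dy::ry]

def reconstruct(frames, factors):
    """Interleave LR frames back onto the HR grid using their known shifts."""
    rx, ry = factors
    h, w = frames[(0, 0)].shape
    hr = np.zeros((h * rx, w * ry))
    for (dx, dy), lr in frames.items():
        hr[dx::rx, dy::ry] = lr
    return hr

rng = np.random.default_rng(1)
hr = rng.integers(0, 256, size=(6, 9)).astype(float)
factors = (2, 3)          # downsampling by (2, 3), as in the example above
frames = {(dx, dy): downsample(hr, (dx, dy), factors)
          for dx in range(2) for dy in range(3)}
print(np.array_equal(reconstruct(frames, factors), hr))  # True
```

With all six shift combinations present, every HR grid point is filled exactly once, so the reconstruction is perfect.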
2.5 Motion Estimation
Accurate image registration is essential in super-resolution. As seen previously, the matrix
A depends on the relative positions of the frames. It is well-known that motion estimation
is a very difficult problem due to its ill-posedness, the aperture problem, and the presence
of covered and uncovered regions. In fact, the accuracy of registration is in most cases
the limiting factor in HR reconstruction accuracy. The following are common problems
that arise in estimating inter-frame displacements:
• Local vs Global Motion (motion field rather than a single motion vector): If the
camera shifts and the scene is stationary, the relative displacement will be global
(the whole frame shifts). Typically, however, there are individual objects moving
within a frame, from leaves of a tree swaying in the wind to people walking or cars
moving.
• Non-linear motion: Most motion that can be observed under realistic conditions is
non-linear, and the problem is compounded by the fact that the observed 2-D image is
only a projection of the 3-D world. Depending on the relative position of the camera
and the object, the same object can appear drastically different. While simple affine
transformations, such as rotations in a plane, can theoretically be accounted for,
there is no way to deal with changes in the object's shape itself, at least in non-
stereoscopic models.
• The "correspondence problem" and the "aperture problem", described in the image
processing literature: These arise when there are not enough features in the observed
object to uniquely determine motion. The simplest example would be an object
of uniform color moving in front of the camera, so that its edges are not visible.
• The need to estimate motion with sub-pixel accuracy: It is the sub-pixel motion
that provides additional information in every frame, yet it has to be estimated from
LR data. The greater the desired magnification factor, the finer the displacements
that need to be differentiated.
• The presence of noise: Noise is a problem because it changes the gray level values
randomly. To a motion-estimation algorithm, it might appear as though each pixel
in a frame moves on its own, rather than uniformly as a part of a rigid object.
We do not want to delve into the mathematics of the gradient constraint equations (the
constraint here occurs as a result of continuity in optical flow), the Euler - Lagrange
equations, sum of squared differences, spatial cross-correlation and phase correlation, but
rather look at the approach we have taken to estimate motion between adjacent frames.
Figure 2.6: Image pair with a relative displacement of (8/3, 13/3) pixels
2.5.1 Approach used
The approach used in this project is to estimate the integer-pixel displacement using phase
correlation, then align the images with each other using this estimate, and finally compute
the subpixel shift by the gradient constraint equation. The figure below shows two aerial
photographs with a shift of (8, 13), down-sampled by 3 in both directions. The output of
the phase-correlation estimator was (3, 4), which is (8, 13)/3 rounded to whole numbers.
The second image was shifted back by this amount to roughly coincide with the first one.
Note that the images now appear to be aligned, but not identical, as can be seen from the
difference image. Now the relative displacement between them is less than one pixel, and
the gradient equation can be used. It yields (−0.2968, 0.2975). Now, adding the integer
and the fractional estimate, we obtain (3, 4) + (−0.2968, 0.2975) = (2.7032, 4.2975). If
this amount is multiplied by 3 and rounded, we obtain (8, 13). Thus we see that the
estimate is correct.
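The integer-pixel stage of this procedure can be sketched with a minimal phase-correlation estimator (a toy cyclic-shift example; the function name is ours, and the sub-pixel gradient stage is omitted):

```python
import numpy as np

def phase_correlation(a, b):
    """Estimate the integer shift of b relative to a from the cross-power spectrum."""
    Fa, Fb = np.fft.fft2(a), np.fft.fft2(b)
    cross = np.conj(Fa) * Fb
    cross /= np.abs(cross) + 1e-12            # keep only the phase
    corr = np.real(np.fft.ifft2(cross))       # a delta at the displacement
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    # Shifts beyond half the frame wrap around to negative values.
    return tuple(int(p - s) if p > s // 2 else int(p)
                 for p, s in zip(peak, corr.shape))

rng = np.random.default_rng(2)
img = rng.random((64, 64))
shifted = np.roll(img, (3, 4), axis=(0, 1))   # displace by (3, 4) pixels
print(phase_correlation(img, shifted))        # (3, 4)
```

On real imagery the correlation peak is broadened by noise and non-cyclic boundaries, which is why the residual sub-pixel shift is then refined with the gradient constraint.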
2.5.2 Combinatorial Motion Estimation
Registration of LR images is a difficult task, and its accuracy may be affected by many
factors, as stated before. Moreover, it is also known that all motion estimators have
inherent mathematical limitations, and in general, all of them are biased. The idea is to
consider different possibilities for the motion vectors, and pick the best one. Since for real
data we do not know what a good HR image should look like, we define the best possibility
as the one that best fits the LR data in the mean-square sense. So, having computed an
Figure 2.7: Images aligned to the nearest pixel (top) and their difference image (bottom)
Figure 2.8: Block diagram of Combinatorial Motion Estimation for case k
HR image with a given set of motion vectors, we generate synthetic LR images from it
and calculate the discrepancy between them and the real LR images. The same procedure
is repeated, but with different motion vectors, and the motion estimate that yields the
minimum discrepancy is chosen. The schematic for this approach is presented above.
Suppose we have N LR frames and N − 1 corresponding motion vectors - one for each
pair of adjacent frames. The vector for the shift between the first and the second frame is
d1,k, between the second and the third d2,k, etc. The subscript k indicates that the motion
vectors are not unique and we are considering one of the possibilities. Based on these vec-
tors, we can generate both the HR image Xk and the LR images ˆY1,k, ˆY2,k, ..., ˆYN,k, where
the circumflex is used to distinguish them from the real LR images Y1,k, Y2,k, ..., YN,k (it
is assumed that the up-sampling/down-sampling factor is constant for all k). The LR
images can be converted into vector form, yl,k = vec(Yl,k) and ˆyl,k = vec(ˆYl,k). The error
(discrepancy) between the real and synthetic data is defined as

εk = Σ(l = 1 to N−1) ||yl,k − ˆyl,k||2 / ||yl,k||2
Evaluating this equation for several motion estimates, we can choose the one that results
in the smallest ε.
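A minimal sketch of this selection rule (toy frames and hypothetical candidate hypotheses; the step that synthesizes LR frames from the HR estimate is not shown):

```python
import numpy as np

def discrepancy(real_frames, synth_frames):
    """Sum of ||y − ŷ||² / ||y||² over corresponding frame pairs (the εk above)."""
    return sum(np.sum((y - yh) ** 2) / np.sum(y ** 2)
               for y, yh in zip(real_frames, synth_frames))

rng = np.random.default_rng(3)
real = [rng.random((8, 8)) for _ in range(3)]
# Hypothesis k=0 reproduces the data closely; k=1 does not.
candidates = {
    0: [f + 0.01 * rng.random((8, 8)) for f in real],
    1: [rng.random((8, 8)) for _ in range(3)],
}
best_k = min(candidates, key=lambda k: discrepancy(real, candidates[k]))
print(best_k)  # 0
```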
2.5.3 Local Motion
Up until now, it has been assumed that the motion is global for the whole frame. Some-
times this is the case, for example when a camera is shaken randomly and the scene is
static. In most cases, however, we are interested in tracking a moving object or objects.
Even if there is a single object, it is usually moving against a relatively stationary back-
ground. One solution in this case is to extract the part of the frame that contains the
object, and work with that part only. One problem with that approach is the boundary
conditions. As described before, the model assumes that as the object shifts and part
of it goes out of view, the new pixels at the opposite end are filled according to some
predetermined pattern, e.g. all zeroes or the values of the previous pixels. In reality, of
course, the pixels on the object’s boundary do not change to zero when it shifts. This
discrepancy does not cause serious distortions as long as the shifts are small relative to
the object size. If all shifts are strictly subpixel, i.e. none exceeds one LR pixel from
the reference frame, at most the edge pixels will be affected. However, as the shifts get
larger, a progressively larger area around the edges of the HR image is affected.
One solution is to create a ”buffer zone” around the object and process this whole area.
This is the region of interest (ROI). In this case, when the object's movement is modelled
with shift operators, it is the surrounding area that gets replaced with zeroes, not the
object itself. Since only the object moves while the area around it is stationary, and we are
treating all of the ROI as moving globally, the result will be a distortion in the "buffer zone".
However, we can disregard this since we are only interested in the object. In effect, the
”buffer zone” serves as a placeholder for the object’s pixels. It needs to be large enough
to contain the object in all frames if the information about the object is to be preserved
in its entirety. The only problem may be distinguishing between the ”buffer zone” and
the object (i.e. the object’s boundaries) in the HR image, but this is usually apparent
visually.
CHAPTER 3
Face Detection and Recognition
3.1 Introduction
Face detection and recognition forms a very important part of detecting malicious activity
and preventing mishaps. If a registered offender enters the field of view of the camera, the
system should detect and recognise the person as a criminal, and alert the authorities.
This will enable identifying criminals in a public place, tracking convicted felons, and
catching wanted criminals. The camera detects faces, and checks the database of
criminal information available with the system to see whether any of the faces detected
belong to one of the criminals.
3.2 Face detection
Detecting faces in a picture may seem very natural to the human mind, but it is not so for
a computer. Face detection can be regarded as a specific case of object-class detection.
In object-class detection, the task is to find the locations and sizes of all objects in an
image that belong to a given class. Examples include upper torsos, pedestrians, and cars.
There are various algorithms and methodologies available to enable a computer to detect
faces in an image.
Face-detection algorithms focus on the detection of frontal human faces. The task is
analogous to template matching, in which the image of a person is compared against the
images stored in a database; any change in facial features relative to the stored images
will degrade the matching process. Face detection was performed in this project using
the classical Viola-Jones algorithm.
The Viola-Jones algorithm describes a face detection framework that is capable of pro-
cessing images extremely rapidly while achieving high detection rates. There are three
key contributions. The first is the introduction of a new image representation called
the ”Integral Image” which allows the features used by the detector to be computed very
quickly. The second is a simple and efficient classifier which is built using the ”AdaBoost”
learning algorithm to select a small number of critical visual features from a very large
set of potential features. The third contribution is a method for combining classifiers
in a ”cascade” which allows background regions of the image to be quickly discarded
while spending more computation on promising face-like regions. A set of experiments in
the domain of face detection is presented. Implemented on a conventional desktop, face
detection proceeds at 15 frames per second. It achieves high frame rates working only
with the information present in a single gray scale image.
3.2.1 Computation of features
The face detection procedure in the Viola-Jones algorithm classifies images based on the
value of simple features. Features can act to encode ad-hoc domain knowledge that is
difficult to learn using a finite quantity of training data. Thus, the feature-based system
operates much faster than a pixel-based system. Features used in this algorithm are
reminiscent of Haar Basis functions. More specifically, three kinds of features are used.
The value of a two-rectangle feature is the difference between the sums of the pixels within
two rectangular regions. The regions have the same size and shape and are horizontally
or vertically adjacent. A three-rectangle feature computes the sum within two outside
rectangles subtracted from the sum in a center rectangle. Finally a four-rectangle feature
computes the difference between diagonal pairs of rectangles.
Figure 3.1: Example rectangle features shown relative to the enclosing window
Rectangle features can be computed very rapidly using an intermediate representation
for the image which we call the integral image. The integral image can be computed from
an image using a few operations per pixel. The integral image at location x, y contains
the sum of the pixels above and to the left of x, y, inclusive:
inew(x, y) = Σ(x′ ≤ x, y′ ≤ y) i(x′, y′)
Figure 3.2: Value of Integral Image at point (x,y)
where inew(x, y) is the integral image and i(x, y) is the original image.
Using the integral image any rectangular sum can be computed in four array references.
Figure 3.3: Calculation of sum of pixels within rectangle D using four array references
Clearly the difference between two rectangular sums can be computed in eight references.
Since the two-rectangle features defined above involve adjacent rectangular sums they
can be computed in six array references, eight in the case of the three-rectangle features,
and nine for four-rectangle features.
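These access counts can be verified with a direct implementation (illustrative helper names; `rect_sum` makes exactly the four references described above):

```python
import numpy as np

def integral_image(img):
    """ii[x, y] = sum of img above and to the left of (x, y), inclusive."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom+1, left:right+1] from four references into ii."""
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total

img = np.arange(16).reshape(4, 4)
ii = integral_image(img)
print(rect_sum(ii, 1, 1, 2, 2))  # 5 + 6 + 9 + 10 = 30
```

Because each rectangle sum costs a constant four lookups regardless of its size, every Haar-like feature is evaluated in constant time.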
Rectangle features are somewhat primitive when compared with alternatives such as
steerable filters. Steerable filters, and their relatives, are excellent for the detailed analysis
of boundaries, image compression, and texture analysis. While rectangle features are also
sensitive to the presence of edges, bars, and other simple image structure, they are quite
coarse. Unlike steerable filters, the only orientations available are vertical, horizontal and
diagonal. Since orthogonality is not central to this feature set, we choose to generate a
very large and varied set of rectangle features. This over-complete set provides features
of arbitrary aspect ratio and of finely sampled location.
Empirically it appears as though the set of rectangle features provide a rich image rep-
resentation which supports effective learning. The extreme computational efficiency of
rectangle features provides ample compensation for their limitations.
3.2.2 Learning Functions
AdaBoost
Given a feature set and a training set of positive and negative images, any number of
machine learning approaches could be used to learn a classification function.
Boosting refers to a general and provably effective method of producing a very accurate
prediction rule by combining rough and moderately inaccurate rules of thumb. The
”AdaBoost” algorithm, introduced in 1995 by Freund and Schapire is one such boosting
algorithm. It does this by combining a collection of weak classification functions to form
a stronger classifier. The simple learning algorithm, with performance lower than what
is required, is called a weak learner. In order for the weak learner to be boosted, it is
called upon to solve a sequence of learning problems. After the first round of learning, the
examples are re-weighted in order to emphasize those which were incorrectly classified by
the previous weak classifier. The ”AdaBoost” procedure can be interpreted as a greedy
feature selection process.
In the general problem of boosting, in which a large set of classification functions are com-
bined using a weighted majority vote, the challenge is to associate a large weight with
each good classification function and a smaller weight with poor functions. ”AdaBoost”
is an aggressive mechanism for selecting a small set of good classification functions which
nevertheless have significant variety. Drawing an analogy between weak classifiers and
features, "AdaBoost" re-weights the data to increase the importance of misclassified
samples. This process continues, and at each step the weight of each weak learner among
the other learners is determined.
We assume that our weak learning algorithm (weak learner) can consistently find weak
classifiers (rules of thumb which classify the data correctly better than 50% of the time).
Given this assumption, we can use AdaBoost to generate a single weighted classifier which
correctly classifies our data at 99%-100%. The AdaBoost procedure focuses on difficult data
points which have been misclassified by the previous weak classifier. It uses an optimally
weighted majority vote of weak classifiers. The data is re-weighted to increase the impor-
tance of misclassified samples. This process continues, and at each step the weight of each
weak learner among the other learners is determined.
The algorithm is given below with an example. Let H1 and H2 be two weak learners,
where neither H1 nor H2 is a perfect learner, but AdaBoost combines them to
make a good learner. The algorithm steps are given below -
1. Set all sample weights equal, and find H1 maximizing the sum of yih(xi).
2. Perform re-weighting to increase the weight of the misclassified samples.
3. Find the next H maximizing the sum of yih(xi). Find the weight of this classifier;
let it be α.
4. Go to step 2.
The final classifier will be sgn(Σ(i = 1 to t) αiHi(x)).
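The steps above can be sketched with one-dimensional threshold "stumps" as the weak learners (the data, thresholds, and helper names are illustrative, not from the report):

```python
import numpy as np

X = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([-1, -1, 1, -1, 1, 1])            # not separable by one threshold

def stump_predict(x, thresh, sign):
    return sign * np.where(x > thresh, 1, -1)

def best_stump(X, y, w):
    """Weak learner: the (threshold, sign) pair with minimum weighted error."""
    best = None
    for t in X:
        for s in (1, -1):
            err = np.sum(w[stump_predict(X, t, s) != y])
            if best is None or err < best[0]:
                best = (err, t, s)
    return best

w = np.ones(len(X)) / len(X)                   # step 1: equal sample weights
ensemble = []
for _ in range(5):
    err, t, s = best_stump(X, y, w)
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))   # classifier weight
    w *= np.exp(-alpha * y * stump_predict(X, t, s))    # step 2: re-weight
    w /= w.sum()                                        # steps 3-4: repeat
    ensemble.append((alpha, t, s))

final = np.sign(sum(a * stump_predict(X, t, s) for a, t, s in ensemble))
print((final == y).mean())  # 1.0 — the weighted vote classifies every sample
```

No single stump separates this data, but the weighted majority vote of a few stumps does, which is exactly the boosting effect described above.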
Cascading
A cascade of classifiers is constructed which achieves increased detection performance
while radically reducing computation time. Smaller, and therefore more efficient, boosted
classifiers can be constructed which reject many of the negative sub-windows while de-
tecting almost all positive instances. Simpler classifiers are used to reject the majority of
sub-windows before more complex classifiers are called upon to achieve low false positive
rates.
Stages in the cascade are constructed by training classifiers using ”AdaBoost”. Starting
with a two-feature strong classifier, an effective face filter can be obtained by adjusting the
strong classifier threshold to minimize false negatives. Based on performance measured
using a validation training set, the two-feature classifier can be adjusted to detect 100% of
the faces with a false positive rate of 50%. The performance can be increased significantly
by adding more layers to the cascade structure. The classifier can significantly reduce
the number of sub-windows that need further processing with very few operations.
The overall form of the detection process is that of a degenerate decision tree, what we
call a ”cascade”. A positive result from the first classifier triggers the evaluation of a
second classifier which has also been adjusted to achieve very high detection rates. A
positive result from the second classifier triggers a third classifier, and so on. A negative
Figure 3.4: First and Second Features selected by AdaBoost
outcome at any point leads to the immediate rejection of the sub-window. The structure
of the cascade reflects the fact that within any single image an overwhelming majority of
sub-windows are negative. As such, the cascade attempts to reject as many negatives as
possible at the earliest stage possible.
Figure 3.5: Schematic Depiction of a Detection cascade
The user selects the maximum acceptable rate for ’false positives’ and the minimum
acceptable rate for ’detections’. Each layer of the cascade is trained by ’AdaBoost’ with
the number of features used being increased until the target detection and false positive
rates are met for this level. If the overall target false positive rate is not yet met then
another layer is added to the cascade.
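The early-rejection control flow can be sketched as follows (the stage score functions and thresholds here are purely illustrative, not trained classifiers):

```python
def cascade_detect(window, stages):
    """Pass the window through each stage; reject at the first failure."""
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            return False        # most sub-windows exit here, cheaply
    return True                 # survived every stage: report a detection

# A cheap test first, a stricter one later, mimicking the cascade ordering.
stages = [
    (lambda w: sum(w) / len(w), 0.2),
    (lambda w: max(w), 0.8),
]
print(cascade_detect([0.9, 0.7, 0.8], stages))  # True
print(cascade_detect([0.1, 0.1, 0.1], stages))  # False, rejected by stage 1
```

Since the overwhelming majority of sub-windows fail an early stage, average per-window cost stays close to the cost of the first, cheapest classifier.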
3.3 Recognition using PCA on Eigen faces
A facial recognition system is a computer application for automatically identifying or
verifying a person from a digital image or a video frame from a video source. One of
Figure 3.6: ROC curves comparing a 200-feature classifier with a cascaded classifier containing ten 20-feature classifiers
the ways to do this is by comparing selected facial features from the image and a facial
database. Traditionally, some facial recognition algorithms identify facial features by
extracting landmarks, or features, from an image of the subject’s face. For example,
an algorithm may analyse the relative position, size, and/or shape of the eyes, nose,
cheekbones, and jaw. These features are then used to search for other images with
matching features. Other algorithms normalize a gallery of face images and then compress
the face data, only saving the data in the image that is useful for face recognition. A probe
image is then compared with the face data. One of the earliest successful systems is based
on template matching techniques applied to a set of salient facial features, providing a
sort of compressed face representation.
Popular recognition algorithms include Principal Component Analysis using Eigen-Faces,
Linear Discriminant Analysis, Elastic Bunch Graph Matching using the Fisher-face al-
gorithm, the Hidden Markov model, Multi-linear Subspace Learning using tensor
representation, and the neuronally motivated dynamic link matching. We have chosen the
most basic algorithm, PCA using Eigen-Faces.
3.3.1 Introduction to Principal Component Analysis
Principal Component Analysis is a widely used technique in the fields of signal process-
ing, communications, control theory and image processing. In the PCA approach, the
component matching relies on the original data to build Eigen Faces. In other words, it builds
M eigenvectors for an N × M matrix. They are ordered from the largest eigenvalue to the
lowest, where the largest eigenvalue is associated with the vector that captures the most
variance in the image. An advantage of PCA over other methods is that 90% of the total
variance is contained in 5-10% of the dimensions. To classify an image, we find the
eigenface with the smallest Euclidean distance from the input face.
Principal component analysis aims to capture the total variation in the set of training faces,
and to explain this variation with a few variables. An observation described by a few
variables is easier to understand than one defined by a huge number of variables, and
when many faces have to be recognized, the dimensionality reduction is important. The
other main advantage of PCA is that, once these patterns in the data have been found, the
data can be compressed by reducing the number of dimensions without much loss of information.
3.3.2 Eigen Face Approach
Calculation of Eigen Values and Eigen Vectors
The eigen vectors of a linear operator are non-zero vectors which, when operated on by
the operator, result in a scalar multiple of them. The scalar is then called the eigenvalue
(λ) which is associated with the eigenvector (X). An eigen vector is a vector that is scaled by
a linear transformation; it is a property of a matrix. When the matrix acts on it, only the
vector's magnitude is changed, not its direction.
AX = λX
where A is the matrix representing the linear transformation.
From the above equation, we arrive at the following equation
(A − λI)X = 0
where I is an n × n identity matrix. This is a homogeneous system of equations, and
from fundamental linear algebra, we know that a non-trivial solution exists if and only if
|(A − λI)| = 0
When evaluated, the determinant becomes a polynomial of degree n in λ. This is known as
the characteristic equation of A, and the corresponding polynomial is the characteristic
polynomial. If A is an n × n matrix, then
there are n solutions, or n roots, of the characteristic polynomial. Thus there are n
eigenvalues of A satisfying the following equation.
AXi = λiXi
where i = 1, 2, ..., n
If the eigenvalues are all distinct, there are n associated linearly independent eigenvectors,
whose directions are unique, and which span an n-dimensional Euclidean space. In the case
where there are r repeated eigenvalues, a linearly independent set of n eigenvectors still
exists, provided the rank of the matrix (A − λI) is n − r. In that case, the directions of the
r eigenvectors associated with the repeated eigenvalues are not unique.
3.3.3 Procedure incorporated for Face Recognition
Creation Of Face Space
From the given set of M images we reduce the dimensionality to M′. This is done by
selecting the M′ eigen faces which have the largest associated eigen values. These eigen
faces now span an M′-dimensional subspace, which reduces computational time. To reconstruct the
original image from the eigen faces, we would have to build a kind of weighted sum of
all eigen faces (Face Space) with each eigen face having a certain weight. This weight
specifies to what degree the specific feature (eigen face) is present in the original image.
If we use all the eigen faces extracted from original images, we can reconstruct the original
images from the eigen faces exactly. But we can also use only a part of the eigenfaces.
Then the reconstructed image is an approximation of the original image. By considering
the important or more prominent eigen faces, we can be assured that there is not much
loss of information in the rebuilt image.
Calculation of Eigen Values
The training set of images is given as input to find the eigenspace. The variation among these
images is represented by the covariance matrix, which is centred around the mean. The eigen
values of all the vectors are found using the co-variance matrix, and the eigen vectors of
the co-variance matrix are calculated using a built-in MATLAB function. The eigen values
are then sorted and stored and the most dominant eigen vectors are extracted. Based on
the dimensionality we give, the number of eigen faces is decided.
Training of Eigen Faces
A database of all training and testing images is created. We give the number of training
samples and all those images are then projected over our eigen faces, where the difference
between the image and the centred image is calculated. The new image T is transformed
into its eigenface components (projected into "face space") by a simple operation,

wk = µkᵀ(T − ψ), k = 1, 2, ..., M′

where µk is the k-th eigen face and ψ is the mean face. The weights obtained above form a
vector Ωᵀ = [w1, w2, w3, ..., wM′] that describes the
contribution of each eigen face in representing the input face image. The vector may
then be used in a standard pattern recognition algorithm to find out which of a number
of predefined face classes, if any, best describes the face.
Face Recognition Process
The above process is applied to the test image and all the images in the training set. The
test image and the training images are projected over the eigen faces. The differences
along the various axes between the projected test image and the projected training images
are found, and from these the Euclidean distance is calculated. Among the various
Euclidean distances calculated, we find the least of them all, and the corresponding
class of the training images is returned. The recognition index is then divided by the total
number of trained images to give the recognized "class" of the image.
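The whole pipeline can be sketched on synthetic data (the report's implementation uses MATLAB built-ins; this is an illustrative NumPy translation with arbitrary image sizes):

```python
import numpy as np

rng = np.random.default_rng(4)
train = rng.random((14, 64))              # 14 vectorized training "faces"
mean_face = train.mean(axis=0)
A = train - mean_face                     # centred images

# Dominant eigenvectors of the covariance matrix via SVD: the eigen faces.
U, S, Vt = np.linalg.svd(A, full_matrices=False)
eigenfaces = Vt[:10]                      # keep M' = 10 eigen faces

train_w = A @ eigenfaces.T                # project training set into face space

def recognise(img):
    w = (img - mean_face) @ eigenfaces.T  # weight vector of the probe image
    dists = np.linalg.norm(train_w - w, axis=1)
    return int(np.argmin(dists))          # class of the nearest training face

probe = train[5] + 0.01 * rng.random(64)  # a slightly perturbed known face
print(recognise(probe))  # 5
```

The SVD of the centred data matrix yields the same eigen faces as diagonalizing the covariance matrix, but avoids forming that matrix explicitly.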
3.3.4 Significance of PCA approach
In the PCA approach, we reduce the dimensionality of face images and thereby enhance
the speed of face recognition. We can choose only the M′ Eigenvectors with the highest
Eigenvalues. Since the lower Eigenvalues do not provide much information about face
variations in the corresponding Eigenvector directions, such small Eigenvalues can be neglected
to further reduce the dimension of face space. This does not affect the success rate much
to further reduce the dimension of face space. This does not affect the success rate much
and is acceptable depending on the application of face recognition. The approach using
Eigen-faces and PCA is quite robust in the treatment of face images with varied facial
expressions as well as directions. It is also quite efficient and simple in the training and
recognition stages, dispensing with low-level processing to verify the facial geometry or the
distances between the facial organs and their dimensions. However, this approach is sen-
sitive to images with uncontrolled illumination conditions. One of the limitations of the
eigen-face approach is in the treatment of face images with varied facial expressions and
with glasses.
3.4 Results
Figure 3.7: 1st Result on Multiple Face Recognition
Figure 3.8: 2nd Result on Multiple Face Recognition
We were able to recognise multiple faces correctly using this algorithm based on Viola-
Jones detection and Eigen Face based PCA. We restricted ourselves to only a small group
of people, and we will need a large number of faces to judge the
results we obtained effectively. We have trained 14 faces (samples) till now and the corresponding
results are shown in Chapter 7. After facial recognition, we assign a UID tag to each
person (similar to an SSN or Aadhaar number) and can then view the person's entire history
by porting data through queries on a database established using MongoDB.
CHAPTER 4
Object Recognition using Histogram of Oriented
Gradients
4.1 Introduction
Object recognition, in computer vision, is the task of finding and identifying objects in an
image or video sequence. Humans recognize a multitude of objects in images with little
effort, despite the fact that the image of the objects may vary somewhat in different view
points, in many different sizes and scales or even when they are translated or rotated.
Objects can even be recognized when they are partially obstructed from view. This task
is still a challenge for computer vision systems. Many approaches to the task have been
implemented over multiple decades. In the case of our project, we need to recognise
malicious objects, such as guns, bombs, knives, etc., and this clearly calls for a
comprehensive approach in this area of study. We accomplish this task of
object recognition using the Histogram of Oriented Gradients.
Histogram of Oriented Gradients (HOG) is a feature descriptor used in computer vision
and image processing for the purpose of object detection. The technique counts occur-
rences of gradient orientation in localized portions of an image. This method is similar
to edge orientation histograms, scale-invariant feature transform descriptors, and
shape contexts, but differs in that it is computed on a dense grid of uniformly spaced
cells and uses overlapping local contrast normalization for improved accuracy.
4.2 Theory and its inception
Navneet Dalal and Bill Triggs, researchers for the French National Institute for Re-
search in Computer Science and Control (INRIA), first described Histogram of Oriented
Gradient descriptors in their June 2005 CVPR paper. In this work they focused their
algorithm on the problem of pedestrian detection in static images, although since then
they expanded their tests to include human detection in film and video, as well as to a
variety of common animals and vehicles in static imagery.
The essential thought behind the Histogram of Oriented Gradient descriptors is that
local object appearance and shape within an image can be described by the distribution
of intensity gradients or edge directions. The implementation of these descriptors can be
achieved by dividing the image into small connected regions, called cells, and for each cell
compiling a histogram of gradient directions or edge orientations for the pixels within the
cell. The combination of these histograms then represents the descriptor. For improved
accuracy, the local histograms can be contrast-normalized by calculating a measure of
the intensity across a larger region of the image, called a block, and then using this value
to normalize all cells within the block. This normalization results in better invariance to
changes in illumination or shadowing.
The HOG descriptor maintains a few key advantages over other descriptor methods.
Since the HOG descriptor operates on localized cells, the method upholds invariance to
geometric and photometric transformations, except for object orientation. Such changes
would only appear in larger spatial regions. Moreover, as Dalal and Triggs discovered,
coarse spatial sampling, fine orientation sampling, and strong local photometric normal-
ization permits the individual body movement of pedestrians to be ignored so long as
they maintain a roughly upright position. The HOG descriptor is thus particularly suited
for human detection in images.
4.3 Algorithmic Implementation
4.3.1 Gradient Computation
The first step of calculation in many feature detectors in image pre-processing is to ensure
normalized color and gamma values. As Dalal and Triggs point out, however, this step
can be omitted in HOG descriptor computation, as the ensuing descriptor normalization
essentially achieves the same result. Image pre-processing thus provides little impact on
performance. Instead, the first step of calculation is the computation of the gradient
values. The most common method is to simply apply the 1-D centered, point discrete
derivative mask in one or both of the horizontal and vertical directions. Specifically, this
method requires filtering the color or intensity data of the image with the following filter
kernels:
[−1, 0, 1] and [−1, 0, 1]T
Dalal and Triggs tested other, more complex masks, such as 3 × 3 Sobel masks (Sobel
operator) or diagonal masks, but these masks generally exhibited poorer performance in
human image detection experiments. They also experimented with Gaussian smoothing
before applying the derivative mask, but similarly found that omission of any smoothing
performed better in practice.
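The gradient step above can be sketched in a few lines. This is an illustrative Python version (our implementation works in MATLAB), applying the centred [−1, 0, 1] masks and skipping border pixels for brevity:

```python
import math

def gradients(img):
    """Centred [-1, 0, 1] derivatives in x and y, with per-pixel
    magnitude and unsigned orientation (0-180 degrees).
    img is a 2-D list of intensities; borders are skipped."""
    h, w = len(img), len(img[0])
    mag = [[0.0] * w for _ in range(h)]
    ang = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = img[y][x + 1] - img[y][x - 1]   # [-1, 0, 1]
            gy = img[y + 1][x] - img[y - 1][x]   # [-1, 0, 1]^T
            mag[y][x] = math.hypot(gx, gy)
            ang[y][x] = math.degrees(math.atan2(gy, gx)) % 180.0
    return mag, ang

# A vertical edge: intensity steps from 0 to 9 along x.
img = [[0, 0, 9, 9] for _ in range(4)]
mag, ang = gradients(img)    # mag[1][1] = 9.0, ang[1][1] = 0.0
```

On colour frames the same two masks are applied per channel, with the channel of largest gradient magnitude kept at each pixel, as Dalal and Triggs do.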
4.3.2 Orientation Binning
The second step of calculation involves creating the cell histograms. Each pixel within
the cell casts a weighted vote for an orientation-based histogram channel based on the
values found in the gradient computation. The cells themselves can either be rectangular
or radial in shape, and the histogram channels are evenly spread over 0 to 180 degrees or 0
to 360 degrees, depending on whether the gradient is unsigned or signed. Dalal and Triggs
found that unsigned gradients used in conjunction with 9 histogram channels performed
best in their human detection experiments. As for the vote weight, pixel contribution
can either be the gradient magnitude itself, or some function of the magnitude; in actual
tests the gradient magnitude itself generally produces the best results. Other options for
the vote weight could include the square root or square of the gradient magnitude, or
some clipped version of the magnitude.
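The voting scheme can be made concrete with a small sketch (illustrative Python): nine unsigned bins of 20 degrees each, with every pixel's magnitude shared linearly between the two nearest bin centres — one common choice of vote interpolation:

```python
import math

def cell_histogram(mags, angs, nbins=9):
    """Magnitude-weighted orientation histogram for one cell, with
    unsigned gradients binned evenly over 0-180 degrees. Each vote
    is shared linearly between the two nearest bin centres, a common
    smoothing choice (centres sit at 10, 30, ..., 170 degrees)."""
    bin_width = 180.0 / nbins            # 20 degrees per bin
    hist = [0.0] * nbins
    for m, a in zip(mags, angs):
        pos = a / bin_width - 0.5        # fractional bin position
        lo = int(math.floor(pos)) % nbins
        hi = (lo + 1) % nbins
        frac = pos - math.floor(pos)
        hist[lo] += m * (1.0 - frac)     # vote split by distance
        hist[hi] += m * frac
    return hist

# One pixel with magnitude 2 at 10 degrees (centre of the first bin):
hist = cell_histogram([2.0], [10.0])     # the whole vote lands in bin 0
```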
4.3.3 Descriptor Blocks
In order to account for changes in illumination and contrast, the gradient strengths must
be locally normalized, which requires grouping the cells together into larger, spatially
connected blocks. The HOG descriptor is then the vector of the components of the
normalized cell histograms from all of the block regions. These blocks typically overlap,
meaning that each cell contributes more than once to the final descriptor. Two main
block geometries exist: rectangular R-HOG blocks and circular C-HOG blocks. R-HOG
blocks are generally square grids, represented by three parameters: the number of cells
per block, the number of pixels per cell, and the number of channels per cell histogram.
In the Dalal and Triggs human detection experiment, the optimal parameters were found
to be 3×3 cell blocks of 6×6 pixel cells with 9 histogram channels. Moreover, they found
that some minor improvement in performance could be gained by applying a Gaussian
spatial window within each block before tabulating histogram votes in order to weight
pixels around the edge of the blocks less. The R-HOG blocks appear quite similar to the
scale-invariant feature transform descriptors; however, despite their similar formation,
R-HOG blocks are computed in dense grids at some single scale without orientation
alignment, whereas SIFT descriptors are computed at sparse, scale-invariant key image
points and are rotated to align orientation. In addition, the R-HOG blocks are used in
conjunction to encode spatial form information, while SIFT descriptors are used singly.
C-HOG blocks can be found in two variants: those with a single, central cell and those
with an angularly divided central cell. In addition, these C-HOG blocks can be described
with four parameters: the number of angular and radial bins, the radius of the center
bin, and the expansion factor for the radius of additional radial bins. Dalal and Triggs
found that the two main variants provided equal performance, and that two radial bins
with four angular bins, a center radius of 4 pixels, and an expansion factor of 2 provided
the best performance in their experimentation. Also, Gaussian weighting provided no
benefit when used in conjunction with the C-HOG blocks. C-HOG blocks appear similar
to Shape Contexts, but differ strongly in that C-HOG blocks contain cells with several
orientation channels, while Shape Contexts only make use of a single edge presence count
in their formulation.
4.3.4 Block Normalization
Dalal and Triggs explore four different methods for block normalization. Let v be the
non-normalized vector containing all histograms in a given block, ||v||k be its k-norm for
k = 1, 2 and e be some small constant (the exact value, hopefully, is unimportant). Then
the normalization factor can be one of the following:
L2-norm: f = v / √(||v||₂² + e²)
L2-hys: the L2-norm, followed by clipping (limiting the maximum values of v to 0.2) and renormalizing
L1-norm: f = v / (||v||₁ + e)
L1-sqrt: f = √( v / (||v||₁ + e) )
In their experiments, Dalal and Triggs found that the L2-hys, L2-norm and L1-sqrt
schemes provide similar performance, while the L1-norm provides slightly less reliable
performance; however, all four methods showed very significant improvement over the
non-normalized data.
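A compact sketch of the four schemes (illustrative Python; v is treated as a flat list of the block's histogram entries):

```python
import math

def normalize_block(v, scheme="L2", e=1e-5):
    """The four Dalal-Triggs block normalization schemes applied to a
    block vector v (the concatenated cell histograms, as a flat list)."""
    if scheme == "L2":
        n = math.sqrt(sum(x * x for x in v) + e * e)
        return [x / n for x in v]
    if scheme == "L2-hys":               # L2, clip at 0.2, then re-L2
        clipped = [min(x, 0.2) for x in normalize_block(v, "L2", e)]
        return normalize_block(clipped, "L2", e)
    if scheme == "L1":
        n = sum(abs(x) for x in v) + e
        return [x / n for x in v]
    if scheme == "L1-sqrt":
        n = sum(abs(x) for x in v) + e
        return [math.sqrt(x / n) for x in v]
    raise ValueError("unknown scheme: " + scheme)

f = normalize_block([3.0, 4.0], "L2")    # close to [0.6, 0.8]
```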
4.3.5 SVM classifier
The final step in object recognition using Histogram of Oriented Gradient descriptors
is to feed the descriptors into some recognition system based on supervised learning.
The Support Vector Machine classifier is a binary classifier which looks for an optimal
hyperplane as a decision function. Once trained on images containing some particular
object, the SVM classifier can make decisions regarding the presence of an object, such
as a human being, in additional test images. In the Dalal and Triggs human recognition
tests, they used the freely available SVMLight software package in conjunction with their
HOG descriptors to find human figures in test images.
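The decision rule can be illustrated with a toy linear classifier (illustrative Python). Note that the training loop below is a simple perceptron update standing in for SVMLight: it also ends with a separating hyperplane w·x + b = 0, but without the SVM's maximum-margin guarantee.

```python
def train_linear(samples, labels, epochs=20, lr=0.1):
    """Perceptron-style trainer standing in for SVMLight: it also ends
    with a separating hyperplane w.x + b = 0, though without the SVM's
    maximum-margin guarantee. Labels are +1 / -1."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:               # misclassified: update
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def classify(w, b, x):
    """Binary decision: which side of the hyperplane x falls on."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Toy "descriptors": positives concentrate energy in the first bin.
pos = [[0.9, 0.1], [0.8, 0.2]]
neg = [[0.1, 0.9], [0.2, 0.7]]
w, b = train_linear(pos + neg, [1, 1, -1, -1])
```

Once trained, each window's HOG descriptor is fed to `classify` to decide presence or absence of the object.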
4.4 Implementation in MATLAB
Figure 4.1: Malicious object under test
Figure 4.2: HOG features of malicious object
The VLFeat toolbox is used to compute the HOG features of an image. In our project, we
wish to recognize malicious weapons such as guns, revolvers and knives. We were able to
train the object detector using a set of 82 images of revolvers with their HOG features;
the images were obtained from the Caltech-101 dataset. We then took a few images from
the Wikipedia pages for different kinds of guns and rifles and trained on them as well. We
also took weapon samples from a few popular TV series and films, such as Person of
Interest, The Wire, The A-Team and Pulp Fiction.
To train the revolver model, the MATLAB function 'trainCascadeObjectDetector' was
used with this set of images as the dataset. The model was trained using HOG features
with a 10-stage cascade classifier, and a set of 50 negative images was also provided to
train the model for revolver detection.
To test the trained model, we used 'vision.CascadeObjectDetector' with the trained model
as an input. This detector is available in the Computer Vision Toolbox in MATLAB.
4.4.1 Cascade Classifiers
Cascading is a particular case of ensemble learning based on the concatenation of sev-
eral classifiers, using all information collected from the output from a given classifier as
additional information for the next classifier in the cascade. Unlike voting or stacking
ensembles, which are multi-expert systems, cascading is a multi-stage one. The first
cascading classifier is the face detector of Viola and Jones (2001).
Cascade classifiers are susceptible to scaling and rotation. Separate cascade classifiers
have to be trained for every rotation that is not in the image plane and will have to
be retrained or run on rotated features for every rotation that is in the image plane.
Cascades are usually trained through cost-aware AdaBoost. The sensitivity threshold can
be adjusted so that there are close to 100% true positives and some false positives. The
procedure can then be repeated, stage after stage, until the desired accuracy/computation
trade-off is reached.
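The early-reject structure that makes cascades fast can be sketched directly (illustrative Python; the stage scores and thresholds below are made up):

```python
def cascade_predict(stages, x):
    """A cascade: each stage is a (score_fn, threshold) pair. A window
    must pass every stage; the first rejection stops evaluation, which
    is what makes cascades cheap on the many negative windows."""
    for score, threshold in stages:
        if score(x) < threshold:
            return False          # rejected early; later stages skipped
    return True                   # survived all stages: a detection

# Two illustrative stages, loose then strict, mirroring the
# near-100%-true-positive threshold tuning described above.
stages = [
    (lambda x: x["edge_energy"], 0.2),   # cheap, permissive stage
    (lambda x: x["hog_score"],   0.8),   # costlier, strict stage
]
hit  = cascade_predict(stages, {"edge_energy": 0.9, "hog_score": 0.95})
miss = cascade_predict(stages, {"edge_energy": 0.1, "hog_score": 0.99})
```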
4.5 Results
We were able to obtain the following results for various revolvers after training them on
the Caltech-101 dataset. Since HOG is an in-plane feature extractor, a test image with a
gun that is out of plane cannot be recognised; the revolver must therefore be trained
separately for each out-of-plane angle.
Figure 4.3: Revolver recognition results
The above results only show the case of revolvers as malicious objects; the directory-based
calls also include negative samples. We have also recognised many other malicious objects
during December, as indicated in the timeline - knives, rifles, shotguns and pistols. The
results for the same are presented below.
Figure 4.4: Results for recognition of other malicious objects
Chapter 7 once again details the prediction of a malicious activity based on the presence
or absence of a malicious object and we shall revisit the results obtained using HOG
features in that chapter.
CHAPTER 5
Neural Network based Semantic Description of
Image Sequences using the Multi-Modal Approach
This chapter describes our attempt at the prediction of a malicious activity by using
Multi-modal Recurrent Neural Networks that describe images with sentences. The idea
is to generate a linguistic model to an image and then compare the sentence thus obtained
with a set of pre-defined words that describe malicious/criminal activity to detect an
illegal activity. If an activity of such malicious intent is detected, we proceed with the
techniques described before to check if the person who is engaged in the physical activity
has a registered weapon under his name and we then check his past criminal records by
checking the appropriate fields in the database.
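The comparison against pre-defined words can be sketched as a simple set intersection (illustrative Python; the term list here is a stand-in, not our actual word set):

```python
# The term list below is an illustrative stand-in for the project's
# pre-defined set of words describing malicious/criminal activity.
MALICIOUS_TERMS = {"gun", "rifle", "knife", "weapon", "shooting", "fighting"}

def flag_caption(caption, terms=MALICIOUS_TERMS):
    """Return the malicious terms present in a generated caption."""
    words = {w.strip(".,!?") for w in caption.lower().split()}
    return sorted(words & terms)

hits = flag_caption("A man holding a gun in a crowded station.")   # ["gun"]
```

A non-empty result triggers the weapon-registration and criminal-record checks described above.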
This line of work was recently featured in a New York Times article and has been the sub-
ject of multiple academic papers from the research community over the last few months.
We are currently implementing the models proposed by Vinyals et al. from Google (CNN
+ LSTM) and by Karpathy and Fei-Fei from Stanford (CNN + RNN). Both models take
an image and predict its sentence description with a Recurrent Neural Network (either
an LSTM or an RNN). To understand what each of these technical terms means, some
background knowledge is needed on what an artificial neural network is and how it has
been incorporated into our project.
This work required developing a strong foundation in artificial neural networks, a subject
we had to study thoroughly from scratch as part of this major project. After going through
a few basic points in the flow of a neural network, we introduce what a Convolutional
Neural Network is and how it is used in image classification, and then look at the
Recurrent Neural Network for semantic description. We then club the two using the
multi-modal approach.
5.1 Artificial Neural Networks
In machine learning, artificial neural networks (ANNs) are a family of statistical learning
algorithms inspired by biological neural networks (the central nervous systems of animals,
in particular the brain) to estimate functions that can depend on a large number of
inputs and are generally unknown. Artificial neural networks are generally presented
as systems of interconnected ”neurons” which can compute values from inputs, and are
capable of machine learning as well as pattern recognition thanks to their adaptive nature.
For example, a neural network for handwriting recognition is defined by a set of input
neurons which may be activated by the pixels of an input image. After being weighted
and transformed by a function (determined by the network’s designer), the activations of
these neurons are then passed on to other neurons. This process is repeated until finally,
an output neuron is activated. This determines which character was read.
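The layer-by-layer activation flow of this example can be sketched as follows (illustrative Python, with hand-picked toy weights rather than learned ones):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, layers):
    """Propagate an input vector through the network: each layer is a
    (weight_matrix, bias_vector) pair, each neuron a weighted sum
    passed through the activation, as in the handwriting example."""
    for W, b in layers:
        x = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bi)
             for row, bi in zip(W, b)]
    return x

# 2 inputs -> 2 hidden neurons -> 1 output neuron, toy weights.
layers = [
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),   # input -> hidden
    ([[2.0, 2.0]], [-2.0]),                      # hidden -> output
]
out = forward([1.0, 0.0], layers)
```

In a real character recognizer, the input vector would hold the pixel intensities and the output layer would have one neuron per character class.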
Like other machine learning methods - systems that learn from data - neural networks
have been used to solve a wide variety of tasks that are hard to solve using ordinary
rule-based programming, including computer vision, one of which is activity recognition.
5.1.1 Introduction
In an Artificial Neural Network, simple artificial nodes, known as ”neurons”, ”neurodes”,
”processing elements” or ”units”, are connected together to form a network which mim-
ics a biological neural network. A class of statistical models may commonly be called
”Neural” if they possess the following characteristics:
• consist of sets of adaptive weights, i.e. numerical parameters that are tuned by a
learning algorithm, and
• are capable of approximating non-linear functions of their inputs
The adaptive weights are conceptually connection strengths between neurons, which are
activated during training and prediction.
Neural networks are similar to biological neural networks in performing functions collec-
tively and in parallel by the units, rather than there being a clear delineation of subtasks
to which various units are assigned. The term ”neural network” usually refers to models
employed in statistics, cognitive psychology and artificial intelligence. Neural network
models which emulate the central nervous system are part of theoretical and computa-
tional neuroscience.
In modern software implementations of artificial neural networks, the approach inspired
by biology has been largely abandoned for a more practical approach based on statis-
tics and signal processing. In some of these systems, neural networks or parts of neural
networks (like artificial neurons) form components in larger systems that combine both
adaptive and non-adaptive elements. While the more general approach of such systems
is more suitable for real-world problem solving, it has little to do with the traditional
Figure 5.1: An Artificial Neural Network consisting of an input layer, hidden layers and
an output layer
artificial intelligence connectionist models. What they do have in common, however, is
the principle of non-linear, distributed, parallel and local processing and adaptation.
Historically, the use of neural network models marked a paradigm shift in the late eighties
from high-level (symbolic) artificial intelligence, characterized by expert systems with
knowledge embodied in if-then rules, to low-level (sub-symbolic) machine learning, char-
acterized by knowledge embodied in the parameters of a dynamical system.
5.1.2 Modelling an Artificial Neuron
Neural network models in AI are usually referred to as artificial neural networks (ANNs);
these are simple mathematical models defining a function f : X → Y or a distribution
over X or both X and Y , but sometimes models are also intimately associated with a
particular learning algorithm or learning rule. A common use of the phrase ANN model
really means the definition of a class of such functions (where members of the class are
obtained by varying parameters, connection weights, or specifics of the architecture such
as the number of neurons or their connectivity).
Figure 5.2: An ANN Dependency Graph
Network Function
The word network in the term ’artificial neural network’ refers to the interconnections
between neurons in different layers of each system. An example system has three layers -
the input neurons which send data via synapses to the second layer of neurons, and then
via more synapses to the third layer of output neurons. More complex systems will have
more layers of neurons with some having increased layers of input neurons and output
neurons. The synapses store parameters called ”weights” that manipulate the data in
the calculations. An ANN is typically defined by three types of parameters -
• The interconnection pattern between the different layers of neurons
• The learning process for updating the weights of the interconnections
• The activation function that converts a neuron’s weighted input to its output acti-
vation
Mathematically, a neuron’s network function f(x) is defined as a composition of other
functions gi(x), which can further be defined as a composition of other functions. This
can be conveniently represented as a network structure, with arrows depicting the depen-
dencies between variables. A widely used type of composition is the non-linear weighted
sum, where f(x) = K( Σᵢ wᵢgᵢ(x) ), where K (commonly referred to as the activation
function) is some predefined function, such as the hyperbolic tangent or the sigmoid
function. It will be convenient in the following to refer to a collection of functions gᵢ as
simply a vector g = (g₁, g₂, . . . , gₙ).
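As a concrete instance of this composition (illustrative Python, with tanh as the activation K):

```python
import math

def K(z):
    """Activation function: here, the hyperbolic tangent."""
    return math.tanh(z)

def f(x, weights, gs):
    """The non-linear weighted sum f(x) = K(sum_i w_i * g_i(x))."""
    return K(sum(w * g(x) for w, g in zip(weights, gs)))

# The g_i are themselves simple functions of the input, as in the text.
gs = [lambda x: x, lambda x: x * x]
y = f(0.5, [1.0, 2.0], gs)        # K(1*0.5 + 2*0.25) = tanh(1.0)
```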
This figure depicts such a decomposition of f, with dependencies between variables indi-
cated by arrows. These can be interpreted in two ways.
The first view is the functional view: the input x is transformed into a 3-dimensional
vector h, which is then transformed into a 2-dimensional vector g, which is finally trans-
formed into f. This view is most commonly encountered in the context of optimization.
The second view is the probabilistic view: the random variable F = f(G) depends upon
the random variable G = g(H), which depends upon H = h(X), which depends upon the
Figure 5.3: Two separate depictions of the recurrent ANN dependency graph
random variable X. This view is most commonly encountered in the context of graphical
models.
The two views are largely equivalent. In either case, for this particular network archi-
tecture, the components of individual layers are independent of each other (e.g. the
components of g are independent of each other given their input h). This naturally
enables a degree of parallelism in the implementation.
Networks such as the previous one are commonly called feedforward, because their graph
is a directed acyclic graph. Networks with cycles are commonly called recurrent. Such
networks are commonly depicted in the manner shown at the top of the figure, where f
is shown as being dependent upon itself. However, an implied temporal dependence is
not shown.
Learning
What has attracted the most interest in neural networks is the possibility of learning.
Given a specific task to solve, and a class of functions F, learning means using a set of
observations to find f∗ in F which solves the task in some optimal sense. This entails
defining a cost function C : F → R such that, for the optimal solution f∗, C(f∗) ≤ C(f)
∀ f ∈ F, i.e., no solution has a cost less than the cost of the optimal solution.
The cost function C is an important concept in learning as it is a measure of how far
away a particular solution is from an optimal solution to the problem to be solved.
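For a squared-error cost C = E[(f(x) − y)²], the observation-based version of such a cost is simply an average over the available sample pairs; a minimal sketch (illustrative Python):

```python
def empirical_cost(f, pairs):
    """Sample-based approximation of C = E[(f(x) - y)^2] using the
    N observation pairs actually available."""
    return sum((f(x) - y) ** 2 for x, y in pairs) / len(pairs)

# Candidate models evaluated against data generated by y = 2x:
pairs = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0)]
good = empirical_cost(lambda x: 2 * x, pairs)   # 0.0: fits perfectly
bad = empirical_cost(lambda x: x, pairs)        # (0 + 1 + 4) / 3
```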
Learning algorithms search through the solution space to find a function that has the
smallest possible cost. For applications where the solution is dependent on some data,
the cost must necessarily be a function of the observations, otherwise we would not be
modelling anything related to the data. It is frequently defined as a statistic to which
only approximations can be made. As a simple example, consider the problem of finding
the model f, which minimizes C = E [(f(x) − y)2
], for data pairs (x, y) drawn from some
distribution D. In practical situations we would only have N samples from D and thus,
48