Autonomous Intelligent System
Prof. Dr. Peter Nauth
Autonomous Behaviour of Drone using
Machine Learning
Master’s Thesis in Machine Learning
written by
Danish Shrestha
1029721
supervised by
Prof. Dr. Peter Nauth (first supervisor)
Prof. Dr. Andreas Pech (second supervisor)
April 2015
Declaration
To the best of my knowledge and belief I hereby declare that this thesis, entitled
Autonomous Behaviour of Drone using Machine Learning
was prepared without aid from any other sources except where indicated. Reference to
material previously published by any other person has been duly acknowledged. This
work contains no material which has been submitted or accepted for the award of any
other degree in any institution. I am aware that my thesis, developed under guidance,
is part of the examination and may not be commercially used or transferred to a third
party without written permission from my supervisory professor.
Frankfurt, 23.04.2015
Danish Shrestha
Abstract
In this thesis, we develop a system that enables a quadcopter to navigate to a set of predefined target locations while avoiding detected obstacles. For obstacle detection, localization of the quadcopter and navigation we use only the data available from the monocular camera and the other onboard sensors of the quadcopter; no external sensors are required.
Our method consists of three main components. First, we estimate the pose of the quadcopter in the environment and compute the control signals required to reach a given target position in the map. To solve this localization and mapping problem we implement the approach of autonomous camera-based navigation of a quadcopter [21]. Second, we use Oriented FAST and Rotated BRIEF (ORB) to extract features from the video frames and match them against known features to classify whether an object is an obstacle. Finally, given the quadcopter pose and the presence of obstacles, we apply the function approximation and feature-based method of reinforcement learning to navigate to all predefined target locations while avoiding obstacles in the path.
To validate our approach we implemented it on an AR.Drone and tested it in different environments. All computation for localization and mapping, object recognition and reinforcement learning is performed on a ground station, which is connected to the drone via wireless LAN. The training required for object classification and the convergence of the feature weights for reinforcement learning are performed beforehand and stored at a specific location on the ground station.
Experiments have validated that the presented methods work in various environments and navigate accurately to the predefined target locations.
Acknowledgement
I take this opportunity to express my gratitude to all who supported me wholeheartedly through many impediments and without whom this thesis would not have been possible.
Foremost, I would like to express my sincere gratitude to Prof. Dr. Peter Nauth for inspiring my interest in the field of machine learning and giving me the opportunity to write my thesis in this field. I would also like to express my sincere appreciation to my supervisor Prof. Dr. Andreas Pech for his valuable support and help.
Finally, I would like to thank all my colleagues and friends for all their support.
Abbreviations
Abbreviation Meaning
MAV Miniature aerial vehicles
GPS Global Positioning System
SLAM Simultaneous Localization and Mapping
API Application Programming Interface
UDP User Datagram Protocol
TCP Transmission Control Protocol
SDK Software Development Kit
ROS Robot Operating System
MDP Markov Decision Process
RRL Relational Reinforcement Learning
FAFBRL Function Approximation and Feature-Based Reinforcement Learning
PTAM Parallel Tracking and Mapping
FAST Features from Accelerated Segment Test
BRIEF Binary Robust Independent Elementary Features
SURF Speeded Up Robust Features
SIFT Scale-Invariant Feature Transform
ORB Oriented FAST and Rotated BRIEF
LoG Laplacian of Gaussian
DoG Difference of Gaussians
PID Proportional Integral Derivative Controller
EKF Extended Kalman Filter
UKF Unscented Kalman Filter
IDL Interface Definition Language
BSD Berkeley Software Distribution
Contents
Declaration
Abstract
Acknowledgement
Abbreviations
1 Introduction
  1.1 Objectives
  1.2 Overview
2 Theoretical background
  2.1 Quadcopter
    2.1.1 Basic Mechanics
    2.1.2 Parrot AR.Drone 2
    2.1.3 AR.Drone Software
    2.1.4 Open Source Software
  2.2 Reinforcement Learning
    2.2.1 Exploration versus Exploitation
    2.2.2 Delayed Reward
    2.2.3 Q-Learning
    2.2.4 Relational Reinforcement Learning
    2.2.5 Function Approximation and Feature-Based Method
  2.3 Simultaneous Localization and Mapping
    2.3.1 Monocular SLAM
    2.3.2 Description of the PTAM Algorithm
    2.3.3 Software
  2.4 Object Recognition
    2.4.1 Feature Detection and Extraction
  2.5 Control
    2.5.1 Proportional Integral Derivative Controller
  2.6 Sensor Integration and Filtering
    2.6.1 Kalman Filter
    2.6.2 Extended Kalman Filter
    2.6.3 Unscented Kalman Filter
    2.6.4 Particle Filters
  2.7 Robot Operating System
    2.7.1 Goals of ROS
    2.7.2 Terminology
3 Methodology
  3.1 Outline
  3.2 Localization and Navigation of the Drone
    3.2.1 Implementation
    3.2.2 Employed Software
  3.3 Obstacle Recognition
  3.4 Autonomous avoidance of obstacles
    3.4.1 Initialization and Training
    3.4.2 Simulation of training
    3.4.3 Software
  3.5 Software Architecture
4 Results
  4.1 Simulation Results
    4.1.1 Reinforcement Learning Accuracy
  4.2 AR.Drone Results
    4.2.1 Obstacle Detection Accuracy
    4.2.2 Localization of the quadcopter
    4.2.3 Collision Avoidance Evaluation
5 Conclusion and Future Work
  5.1 Conclusion
  5.2 Future Work
List of Figures
Bibliography
1. Introduction
In recent years, miniature aerial vehicles (MAVs) have become an important tool in areas such as aerial surveillance, visual inspection, military operations, remote farming and filming. Quadcopters are particularly popular among MAVs because their small size, high maneuverability, stability and agility enable them to fly in small indoor environments.
With these attractive features, new application ideas have been emerging, including package delivery, an ambulance drone [2] and swarms of quadcopters that collectively build structures. For all these applications to operate optimally in real-world situations, the MAVs must be fully autonomous or automated. In an automated system the actions are pre-determined, and if the system encounters an unplanned situation it stops and waits for human help. In an autonomous system, in contrast, actions are taken without human intervention, even when encountering uncertainty or unanticipated events [24].
The ultimate goal of robotics is to develop systems that are fully autonomous in real-world scenarios. Although much progress has been made in the field of autonomous systems in recent years, fully autonomous systems in real-world scenarios have yet to be achieved. One of the hindrances is navigating towards a goal without colliding with the environment. To navigate from one point to another the system must have a map of the environment, be able to localize itself in that map, and be able to detect and avoid obstacles autonomously. The task of creating a map of an unknown environment is known as the mapping problem, the task of localizing the robot in the map is the localization problem, the task of detecting an object and classifying it as an obstacle is object recognition, and the task of avoiding the obstacle is the planning problem. Similar problems of localization and of obstacle detection and avoidance have been solved with considerable accuracy for other autonomous systems, such as autonomous driving [40], with the help of multiple sensors like lasers, radars, the global positioning system (GPS) and stereo cameras. However, the same solution of adding sensors cannot be applied to small-scale quadcopters because of the increased weight, power consumption and space requirements; only lightweight sensors such as a monocular camera are feasible. Hence, we are interested in methods for autonomous navigation using sensors that are feasible for MAVs.
The task of tracking the motion of the robot while incrementally constructing a map of the environment has been widely researched in the field of computer vision and is known as Simultaneous Localization and Mapping (SLAM). A wide range of SLAM algorithms have been presented which provide good localization and mapping with the use of multiple sensors; solving the SLAM problem with a monocular camera, however, is still challenging and is known as the monocular SLAM problem.
Figure 1.1: Various applications of quadcopters. Top left: the DHL quadcopter called Parcelcopter [8], designed to deliver medications and other urgently needed goods. Top right: the Amazon quadcopter called Prime Air [9] for package delivery. Middle: quadcopters programmed by ETH Zurich which are able to lift, transport and assemble 1500 small polystyrene bricks to build a 6-meter tall tower [3]. Bottom: Skycatch, providing aerial mapping on building sites [13].
Once the robot can localize itself in the environment, it must also be able to detect obstacles in its path towards the goal position. Enabling the robot to detect obstacles and their distance relative to the robot with the minimal set of sensors available on an MAV is another challenging task. Object detection and classification are commonly accomplished using one of the many proposed algorithms for object recognition. The general idea behind object recognition is to compute feature points from the images and use these features to classify the objects in them.
Only when the localization of the robot and of the obstacle relative to the robot's position is solved can the robot navigate past the obstacle without collision. Obstacle avoidance is accurate only when the belief about the robot's position relative to the obstacle is accurate. When the obstacles are static, path planning algorithms can provide an optimal path to the target position; avoiding dynamic obstacles, however, remains an open problem because of their changing behaviour.
A well-known learning approach for autonomous systems is reinforcement learning. The learning uses reward and punishment from the environment to learn an optimal policy for the experienced environmental states. This method can be used to gather experience of obstacles and their behaviour and to avoid collisions with them in the future.
1.1 Objectives
The objective of this thesis is to take advantage of the SLAM algorithms developed for monocular cameras to localize the AR.Drone, to detect static and dynamic obstacles, and to avoid collisions with them while navigating to the goal position using reinforcement learning.
• Localization and navigation of the drone
With the help of the localization algorithms, compute an estimate of the quadcopter's pose and calculate appropriate control commands to navigate to and hold a given position.
• Detection of obstacles
Apply object recognition algorithms to classify obstacles using the frontal camera of the AR.Drone.
• Avoidance of obstacles
Apply reinforcement learning to learn an optimal policy for avoiding obstacles.
1.2 Overview
The thesis starts with the introduction and the objectives of this study. In the second chapter all the necessary theoretical background is explained: the chapter begins with a description of the quadcopter, continues with reinforcement learning, followed by the simultaneous localization and mapping problem, object recognition, control algorithms and filters, and ends with a description of ROS. Chapter three details the methodology used to achieve the objectives of the thesis: it starts with an outline of the method and then describes each part, namely localization and navigation of the drone, obstacle recognition and autonomous obstacle avoidance, and finally the software architecture designed to achieve the objectives. Chapter four contains the results obtained when the proposed methodology was implemented, and chapter five presents the conclusion and future work.
2. Theoretical background
This chapter gives a detailed description of the devices and technologies used to achieve the main objective of the thesis. Its subsections cover the quadcopter and an introduction to the AR.Drone, the related theory of reinforcement learning, SLAM, object recognition, control algorithms, filters and ROS.
2.1 Quadcopter
Figure 2.1: Left: AR.Drone with indoor-hull, Right: AR.Drone with outdoor-hull
The AR.Drone is a commercially available electric quadrotor introduced in January 2010, originally designed as a high-tech toy controlled from a mobile phone or tablet running Android or iOS. The AR.Drone is now widely used for research in robotics, artificial intelligence and computer vision because of its low cost, robustness to crashes, safety and reliability for both indoor and outdoor use, and the availability of an open application programming interface (API) which provides onboard sensor values and can also be used to control the drone.
In this chapter we present the working mechanics of a quadcopter in Section 2.1.1, describe the AR.Drone, its available sensors and its communication interfaces in Section 2.1.2, and briefly present the available software development kit for the AR.Drone in Section 2.1.3, along with a brief introduction to the software package tum_ardrone, which we use in this thesis.
Figure 2.2: The AR.Drone with its coordinate axes, the roll, pitch and yaw angles, and the rotation direction of each rotor.
2.1.1 Basic Mechanics
A quadcopter is a multirotor helicopter that is lifted and maneuvered by four rotors. Quadcopters have four fixed-pitch propellers, two rotating clockwise and two counter-clockwise, cancelling out their respective torques. Each rotor produces thrust as well as torque about its center of rotation, and also a drag force opposite to the drone's direction of flight. Ignoring external influences and mechanical inaccuracies, when all rotors spin with the same angular velocity, with two rotating clockwise and two counter-clockwise, the overall torque and hence the angular acceleration about the yaw axis is exactly zero, which allows the quadcopter to hover stably. Even though the control mechanics of a quadcopter appear theoretically simple, stable flight is only made possible by a number of sensors combined with the on-board stabilization algorithms. The different actions a quadcopter can take are listed below; a minimal mixing sketch follows the list.
• A quadcopter adjusts its altitude by applying equal angular velocity to all four rotors: vertical acceleration is achieved by increasing or decreasing the angular velocity of all four rotors equally.
• A quadcopter adjusts its yaw, without changing the upward thrust and balance, by increasing the angular velocity of rotors 1 and 3 while decreasing the angular velocity of rotors 2 and 4, which results in an overall clockwise torque. Similarly, a counter-clockwise torque is generated by increasing the angular velocity of rotors 2 and 4 while decreasing that of rotors 1 and 3.
• A quadcopter adjusts its pitch or roll by increasing the angular velocity of one rotor and decreasing the angular velocity of the diametrically opposite rotor, which changes the roll or pitch angle and hence produces horizontal acceleration.
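The mapping from a desired collective thrust and the three body torques to individual rotor commands can be illustrated with a small mixing routine. The sketch below follows the rotor numbering of the description above (speeding up rotors 1 and 3 while slowing rotors 2 and 4 produces a clockwise yaw torque); the gain structure and command range are assumptions chosen for the example, not the AR.Drone's onboard controller.

```python
# Minimal quadcopter motor-mixing sketch (illustrative only, not the
# AR.Drone firmware). Positive yaw requests a clockwise torque, obtained
# by speeding up rotors 1 and 3 and slowing down rotors 2 and 4.

def mix(thrust, roll, pitch, yaw):
    """Map normalized thrust/roll/pitch/yaw requests to four rotor commands."""
    m1 = thrust + pitch + yaw   # rotor 1 (front)
    m2 = thrust - roll  - yaw   # rotor 2 (right)
    m3 = thrust - pitch + yaw   # rotor 3 (rear)
    m4 = thrust + roll  - yaw   # rotor 4 (left)
    # Clamp to the admissible command range of the motors.
    return [max(0.0, min(1.0, m)) for m in (m1, m2, m3, m4)]

if __name__ == "__main__":
    # Hover with a small clockwise yaw request.
    print(mix(thrust=0.5, roll=0.0, pitch=0.0, yaw=0.1))
```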
2.1.2 Parrot AR.Drone 2
The Parrot AR.Drone measures 52.5 cm × 51.5 cm with and 45 cm × 29 cm without its hull. It has four rotors attached to its body, each with a 20 cm diameter. A removable styrofoam hull protects the drone, and particularly the rotors, during flights, which makes it well suited for experimental flying and development. An alternative outdoor hull does not provide rotor protection and hence offers less protection against collisions, but makes the drone more agile and maneuverable. The drone weighs 380 g with the outdoor hull and 420 g with the indoor hull. Each rotor is powered by its own brushless motor. A range of onboard sensors helps keep the drone stable and makes advanced flight easier. The onboard controller runs a Linux operating system, and communication with the pilot is achieved through a self-generated Wi-Fi hotspot. Energy is provided by a 1000 mAh lithium polymer battery, which allows a flight time of approximately 15 minutes.
Figure 2.3: The AR.Drone with its sensors and hardware parts.
The onboard sensors present on the drone include:
• Ultrasonic altimeter: used for vertical stabilisation. The sensor is attached to the bottom of the AR.Drone and is directed downwards to measure the distance to the floor. It has an effective range of approximately 20 cm to 6 m.
• Pressure sensor: used together with the ultrasonic altimeter for more stable flight and hovering.
• Two cameras: one directed forward horizontally and one directed downwards. The front-facing camera uses a fish-eye lens and delivers 18 frames per second at a resolution of 640 × 480 pixels with a field of view of 73.4° × 58.5°. The video from the front camera suffers from radial distortion due to the fish-eye lens, motion blur due to rapid movement of the drone, and linear distortion due to the camera's rolling shutter. The downward-facing camera delivers 60 frames per second at a resolution of 176 × 144 pixels with a field of view of only 17.5° × 36.5°; it exhibits less radial distortion, motion blur and rolling-shutter effects. Both cameras have automatic brightness and contrast adjustment [23]. The bottom-facing camera is also used by the onboard controller for horizontal stabilization and to estimate the velocity of the drone.
• 3-axis accelerometer: measures the acceleration of the drone.
• 2-axis gyroscope: measures the roll and pitch angles of the drone. The measured angles deviate by only about 0.5° and do not drift over time.
• 1-axis yaw precision gyroscope: measures the yaw angle, which drifts over time [23].
2.1.3 AR.Drone Software
Since the AR.Drone is a commercial product, its control software is neither open source nor documented. Only basic functionality can be accessed through the available communication channels and interfaces. Communication with the AR.Drone takes place over four main channels:
• navigation channel
• command channel
• control channel
• video channel
Navigation Channel
This is a User Datagram Protocol (UDP) channel on port 5554 which broadcasts basic navigational data. In the default mode of operation the data is broadcast every 30 ms; in debug mode it is broadcast every 5 ms. Since the channel uses UDP, packets can be lost. The channel provides various onboard sensor values; the main values used in this thesis are described below.
• orientation of the drone: roll, pitch and yaw angles
• altitude in millimetres
• velocity
• battery status: a value between 0 and 100
• state: the current state of the drone, e.g. 'LANDED', 'TAKEOFF', 'HOVERING'
• timestamp: the drone's internal timestamp, in milliseconds, at which the data was sent
Command Channel
This is also a UDP channel, on port 5556, which receives data for external navigation control of the drone. Normal navigation control is achieved by assigning the desired values to the respective variables; the assignable range is between -1 and 1. Values can be set for the desired vertical speed, roll angle, pitch angle and rotational speed. A flag can be set or cleared to switch between hover mode and manual mode, and similar flags trigger take-off and landing. Since this channel uses UDP and data can be lost, command data must be re-sent continuously. The commands used in this thesis are described below; a sketch of sending them over UDP follows the list.
• AT*FTRIM: called before take-off to calibrate the horizontal plane.
• AT*PCMD: controls the flight motion of the quadcopter. The syntax is AT*PCMD=[Sequence number],[Flag bitfield],[Roll],[Pitch],[Gaz],[Yaw]. The sequence number is 1 for the first command sent to the quadcopter, otherwise the sequence number of the previous command + 1. The flag bitfield is a 32-bit integer: if the bit is 0 the Roll, Pitch, Gaz and Yaw values are ignored and the quadcopter enters hover mode; if the bit is 1 the quadcopter changes its angles as given in the arguments. Roll, Pitch, Gaz and Yaw take values between -1 and 1: the roll value determines the left-right tilt, the pitch value the front-back tilt, the yaw value the rotation of the quadcopter about the vertical axis as shown in Figure 2.2, and the gaz value the flight height of the quadcopter.
• AT*REF: controls basic actions such as take-off, landing, emergency stop and reset. The syntax is AT*REF=[Sequence number],[Input], where the input is a 32-bit integer: bits 0-7 are not used, bit 8 is set when emergency mode is required, and bit 9 triggers take-off when set to 1 and landing when set to 0.
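To make the command syntax concrete, the sketch below sends a flat trim, a take-off, one progressive flight command and a landing command over UDP from a ground station. It is a minimal illustration of the AT command format described above, not a full client: the drone address 192.168.1.1, the constant reference bits 0x11540000 (used in addition to bit 9 by Parrot's SDK) and the single-shot sending are assumptions, and a real client must keep the sequence numbers strictly increasing and re-send commands continuously.

```python
# Minimal sketch of sending AT commands to the AR.Drone over UDP.
import socket
import struct

DRONE_IP = "192.168.1.1"        # assumed default access-point address
AT_PORT = 5556                  # AT command channel (UDP)

TAKEOFF = 0x11540000 | (1 << 9) # bit 9 = take-off, plus constant bits used by the SDK
LAND = 0x11540000               # bit 9 cleared = land

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
seq = 0

def send(command):
    """Send one AT command string, filling in the running sequence number."""
    global seq
    seq += 1
    sock.sendto(command.format(seq=seq).encode("ascii"), (DRONE_IP, AT_PORT))

def f2i(value):
    """Reinterpret a float as the signed 32-bit integer the AT protocol expects."""
    return struct.unpack("<i", struct.pack("<f", value))[0]

send("AT*FTRIM={seq}\r")                       # horizontal calibration before take-off
send("AT*REF={{seq}},{}\r".format(TAKEOFF))    # take off
# Progressive command: flag 1 enables the arguments, pitch -0.1 tilts forward.
send("AT*PCMD={{seq}},1,{},{},{},{}\r".format(
    f2i(0.0), f2i(-0.1), f2i(0.0), f2i(0.0)))
send("AT*REF={{seq}},{}\r".format(LAND))       # land
```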
Control Channel
This is a Transmission Control Protocol (TCP) channel on port 5559. TCP provides reliable communication, and the channel is used to transfer critical data such as configuration information. One configuration we used during our work was switching between the available video channels.
Video Channel
This channel sends the video stream from the front camera or the bottom camera, depending on the configuration set through the control channel. It is a UDP channel on port 5555. The video stream is encoded with the standard H.264 codec and can be decoded, for example with ffmpeg, for image processing.
2.1.4 Open Source Software
There are numerous open source programs for basic navigation of the AR.Drone, such as [7], [11] and [5]. They all use the basic client protocol of the software development kit (SDK) provided by Parrot. The SDK is written in C and includes basic example applications. It is documented in [1] and handles tasks such as setting up the communication channels, receiving and decoding video, receiving and decoding navigational data, and encoding and sending control commands [23]. Since this thesis requires functionality that solves localization and mapping together with navigation of the drone, we use the open source software [14]. The software is a Robot Operating System (ROS) package based on the paper [23]. It is well documented, and its autonomous flight and visual navigation using Parallel Tracking and Mapping enable the AR.Drone to navigate autonomously in previously unknown and GPS-denied environments.
2.2 Reinforcement Learning
Reinforcement learning [27] is a subfield of machine learning in which an optimal solution is learned by trial and error in a dynamic environment. It is the process of learning what to do in a given state, i.e. how to map situations to actions, so as to maximize a numerical reward. In reinforcement learning the agent is not told how to perform the task in a given state, as in most forms of machine learning, but instead must discover which actions yield the maximum reward by trying them [16]. The actions so discovered affect not only the immediate reward but also all subsequent rewards. This method is useful when the agent operates in an unknown, dynamic and large environment, such as a robot navigating in a previously unknown, non-static environment.
Figure 2.4: The interaction between the agent and the environment in reinforcement learning [16].
In the standard setting of reinforcement learning the environment is modeled as a Markov Decision Process (MDP), which consists of a set of states, a set of actions and a transition model. At each discrete time t the agent senses a state and chooses an action. The action stochastically causes a transition to a new state, and the agent receives a reward from the environment. Formally, reinforcement learning helps the agent learn a policy π : S → A which, when followed, maximizes the cumulative reward.
V^\pi(s_t) = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots   (2.1)

V^\pi(s_t) = \sum_{i=0}^{\infty} \gamma^i r_{t+i}   (2.2)
Here V^\pi is the cumulative reward, r is the immediate reward, \pi is the policy, t is the discrete time step and \gamma is a constant between 0 and 1 that determines the relative value of delayed versus immediate rewards.
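As a small worked example of equations 2.1 and 2.2, the sketch below computes the discounted cumulative reward for a short, hypothetical reward sequence.

```python
# Discounted return V^pi(s_t) = sum_i gamma^i * r_{t+i}  (equation 2.2).
def discounted_return(rewards, gamma=0.9):
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# Hypothetical trajectory: a delayed reward of 10 after three zero-reward
# steps is worth 0.9**3 * 10 = 7.29 at time t.
print(discounted_return([0.0, 0.0, 0.0, 10.0], gamma=0.9))  # 7.29
```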
Reinforcement learning algorithms are categorized as model-based or model-free. If the agent first has to learn a model of the environment in order to find the optimal policy, the algorithm is model-based; if the agent learns directly without an explicit model of the environment, it is model-free [27]. In this thesis we focus on model-free methods, because the environment is previously unknown and the drone has to explore it and choose the best action in each state to achieve the maximum reward.
2.2.1 Exploration versus Exploitation
Reinforcement learning differs from supervised learning mainly in that, in supervised learning, the agent learns from examples provided by an external supervisor, whereas in reinforcement learning the agent must learn from its own experience by explicitly exploring the environment. During exploration there is always a trade-off between exploration and exploitation: to obtain reward the agent has to exploit what it already knows, but to make better action selections in the future it also has to explore the environment. One of the elementary reinforcement learning problems is the k-armed bandit problem, which is widely studied in the statistics and applied mathematics literature [27] and depicts this trade-off. Among the large variety of solutions for the trade-off, three techniques used in practice are as follows:
• Dynamic Programming Approach
Dynamic programming is widely considered the optimal method for solving general stochastic optimal control problems. It is a collection of algorithms used to obtain the optimal policy given a perfect model of the environment as a Markov decision process. This approach suffers from the curse of dimensionality: the computational requirements grow exponentially with the number of states.
• Greedy Strategies
In this strategy the action with the highest estimated reward is always chosen. The problem is that if early sampling indicates that a suboptimal action yields more reward than the best action, the suboptimal action is always chosen; the optimal action never receives enough data and its superiority is never discovered.
• Randomized Strategies
In this strategy the action with the best estimated expected reward is chosen by default, but with probability p a random action is explored instead. To encourage initial exploration, some versions of this strategy start with a high value of p, which is decreased slowly to increase exploitation and decrease exploration later [27]; a small simulation sketch follows this list.
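To illustrate the randomized strategy on the k-armed bandit problem mentioned above, the sketch below explores with a probability p that decays over time. The arm payoffs, noise level and decay schedule are arbitrary values chosen for this example.

```python
# Randomized exploration on a k-armed bandit: with probability p pick a
# random arm, otherwise pick the arm with the best estimated reward.
# p starts high and is decayed to shift from exploration to exploitation.
import random

true_means = [0.2, 0.5, 0.8]            # hypothetical expected payoffs per arm
estimates = [0.0] * len(true_means)     # running estimate of each arm's reward
counts = [0] * len(true_means)

p = 1.0                                  # initial exploration probability
for step in range(2000):
    if random.random() < p:
        arm = random.randrange(len(true_means))                        # explore
    else:
        arm = max(range(len(true_means)), key=lambda a: estimates[a])  # exploit
    reward = random.gauss(true_means[arm], 0.1)                        # noisy payoff
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]          # incremental mean
    p = max(0.05, p * 0.995)             # slowly decrease exploration

print("estimated arm values:", [round(v, 2) for v in estimates])
```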
2.2.2 Delayed Reward
In reinforcement learning the agent's action generally determines not only its immediate reward but also the next state of the environment. Such an environment can be imagined as a network of k-armed bandit problems, but the agent must evaluate the next state as well as the immediate reward when deciding on an action. To reach a state with high reinforcement the agent may have to go through a long sequence of low-reward actions; it must therefore learn which action in the current state is best for achieving a greater reward in the far future. Problems with delayed reward can be modeled as Markov Decision Processes (MDPs). An MDP consists of:
• a set of states S,
• a set of actions A,
• a reward function R : S × A → ℝ, and
• a state transition function T : S × A → Π(S)
Two algorithms for learning the optimal policy in MDP environments, given a correct model, are:
• Value Iteration
• Policy Iteration
To obtain the optimal policy in this way the agent must already know the model, which consists of the state transition probability function T(s, a, s') and the reinforcement function R(s, a). Since some form of model of the environment is available, these problems can be considered model-based planning problems; if the model of the environment is not available, the problem is model-free and is considered a genuine reinforcement learning problem.
2.2.3 Q-Learning
Q-learning is a model-free reinforcement learning method. Agents are given the ability to learn to act optimally in Markovian domains by experiencing the consequences of actions, without having to build a model of the domain [43]. Q-learning is the most popular and appears to be the most effective model-free algorithm for learning from delayed reinforcement [27]. By repeatedly trying all actions in all states, the agent learns which actions are best, judged by the long-term discounted reward. Even though Q-learning is a primitive form of learning, it can serve as the basis of far more sophisticated systems [43]. In the remainder of this section we use the following notation:
• a : action of the agent
• s : state of the agent in the environment
• Q^*(s, a) : expected discounted reinforcement of taking action a in state s and then continuing by choosing optimal actions
• V^*(s) : value of s assuming the best action is taken initially, which can be written as V^*(s) = \max_a Q^*(s, a)
The equation for Q^*(s, a) can be written as

Q^*(s, a) = R(s, a) + \gamma \sum_{s' \in S} T(s, a, s') \max_{a'} Q^*(s', a')   (2.3)

Since V^*(s) = \max_a Q^*(s, a), we have \pi^*(s) = \arg\max_a Q^*(s, a) as an optimal policy.
Because the action is made explicit by the Q-function, we can estimate Q values and also use them to define the policy, since an action can be chosen simply by selecting the one with the maximum Q value for the current state. The Q-learning update rule can hence be written as:

Q(s, a) := Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)   (2.4)
where (s, a, r, s') is an experience tuple, 0 ≤ α ≤ 1 is the learning rate parameter and γ is the discount factor. If each action is selected in each state an infinite number of times over an infinite run and α is decayed appropriately, the Q values converge with probability 1 to Q^* [44]. The Q-learning algorithm is described in Algorithm 1.
initialize number of episodes;
for each s, a do
    initialize Q(s, a) := 0;
end
e := 0;
while e < number of episodes do
    e := e + 1;
    create a starting state s_0;
    i := 0;
    while not goal(s_i) do
        select an action a_i from the policy for state s_i and execute it;
        receive the immediate reward r_i = r(s_i, a_i) from the environment;
        observe the new state s_{i+1};
        i := i + 1;
    end
    for k = i - 1 down to 0 do
        update Q(s_k, a_k) := r_k + γ max_{a'} Q(s_{k+1}, a');
    end
end
Algorithm 1: Q-learning Algorithm
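A tabular implementation in the spirit of Algorithm 1 is sketched below for a small grid world; unlike the listing, it applies the update rule of equation 2.4 online during each episode. The grid, rewards and parameter values are assumptions for illustration only, and the thesis itself uses the function approximation variant of Section 2.2.5 rather than a plain table.

```python
# Tabular Q-learning on a tiny deterministic grid world (illustrative only).
import random
from collections import defaultdict

ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
GOAL, SIZE = (3, 3), 4
alpha, gamma, epsilon, episodes = 0.5, 0.9, 0.2, 500

def step(state, action):
    """Deterministic transition; +10 at the goal, -1 per move."""
    r = max(0, min(SIZE - 1, state[0] + action[0]))
    c = max(0, min(SIZE - 1, state[1] + action[1]))
    nxt = (r, c)
    return nxt, (10.0 if nxt == GOAL else -1.0)

Q = defaultdict(float)                          # Q(s, a) initialized to 0
for _ in range(episodes):
    s = (0, 0)
    while s != GOAL:
        # epsilon-greedy action selection during learning
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda a_: Q[(s, a_)])
        s2, r = step(s, a)
        target = r + gamma * max(Q[(s2, a_)] for a_ in ACTIONS)
        Q[(s, a)] += alpha * (target - Q[(s, a)])   # update rule (2.4)
        s = s2

print("Q values for the start state:",
      {a: round(Q[((0, 0), a)], 2) for a in ACTIONS})
```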
When the Q values are close to convergence, it is appropriate for the agent to act greedily and take the action with the highest Q value. During the learning phase, however, there exists the difficult exploration-versus-exploitation trade-off discussed in Section 2.2.1. Q-learning is exploration insensitive [27], meaning that the Q values converge to the optimum regardless of how the agent chooses its actions while the initial data is being collected, as long as all state-action pairs are tried often enough. Thus, although the exploration-versus-exploitation trade-off remains, the exploration technique does not affect the convergence of the learning algorithm. A technique that makes the algorithm robust to this trade-off is to select action a in state s stochastically, so that Pr(a|s) depends on Q(s, a):

Pr(a_i \mid s) = \frac{T^{-Q(s, a_i)}}{\sum_j T^{-Q(s, a_j)}}   (2.5)

Because of this convergence independent of the exploration technique, Q-learning is the most widely used and most effective model-free algorithm for learning from delayed reinforcement in small domains. For larger domains it is infeasible, since the computational requirements grow exponentially with the number of states, as do the memory requirements for storing all state-action pairs. Relational Reinforcement Learning, which we discuss in the next section, is one approach to this problem [26].
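The stochastic action selection of equation 2.5 can be written directly as a small helper; with a temperature T between 0 and 1, T^{-Q} grows with Q, so actions with higher estimated value are chosen more often. The function below is an illustrative sketch with made-up action names, not code from the thesis.

```python
# Stochastic action selection following equation 2.5:
# Pr(a_i | s) = T**(-Q(s, a_i)) / sum_j T**(-Q(s, a_j)).
# For 0 < T < 1 the probability increases with Q(s, a_i).
import random

def select_action(q_values, T=0.8):
    """q_values: dict mapping action -> Q(s, action) for the current state."""
    weights = {a: T ** (-q) for a, q in q_values.items()}
    total = sum(weights.values())
    pick, acc = random.uniform(0.0, total), 0.0
    for action, w in weights.items():
        acc += w
        if pick <= acc:
            return action
    return action  # numerical fall-through

print(select_action({"left": 1.0, "right": 2.0, "forward": 4.0}))
```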
2.2.4 Relational Reinforcement Learning
Relational reinforcement learning was introduced in [19]. RRL emphasizes a relational rather than the traditional attribute-value representation. It is obtained by combining the classical Q-learning algorithm, probabilistic selection of actions and a relational regression algorithm. Using RRL, even if the agent learns only on a group of small problem instances and is then exposed to a more complex environment, it performs well provided the environment has a similar structure. The relational reinforcement learning algorithm is described in Algorithm 2.
initialize number of episodes;
for each s, a do
    initialize Q(s, a) := 0;
end
e := 0;
while e < number of episodes do
    e := e + 1;
    create a starting state s_0;
    i := 0;
    while not goal(s_i) do
        select an action a_i stochastically as in (2.5) and execute it;
        receive the immediate reward r_i = r(s_i, a_i) from the environment;
        observe the new state s_{i+1};
        i := i + 1;
    end
    for k = i - 1 down to 0 do
        q_k := r_k + γ max_{a'} Q_e(s_{k+1}, a');
        generate example x = (s_k, a_k, q_k);
        if an example (s_j, a_j, q_old) for the same state-action pair exists in the example set then
            replace it with x;
        else
            add x to the example set;
        end
        update the value of Q;
    end
end
Algorithm 2: Relational Reinforcement Learning Algorithm
2.2.5 Function Approximation and Feature-Based Method
This method of reinforcement learning was introduced in [26]: a function approximation approach is used to reduce the computational requirements while achieving results similar to those of RRL. In this algorithm each state is represented by a small, fixed number of features that does not depend on the number of states, i.e. states with similar features can be represented as one. Because the algorithm is independent of the number of states, it can operate in high-dimensional state spaces. In the remainder of this section we use the following notation:
• Q-function : linear combination of features (the features describe a state)
• f_1, f_2, ..., f_n : set of features, where n is the number of features
• a : action
• θ_1, θ_2, ..., θ_n : weight of each feature
• α : learning rate
• γ : the discount factor
• r : the immediate reward of the current state
The Q-function after an action a can be written as:

Q^a(s, a) = \theta^a_1 f_1 + \dots + \theta^a_n f_n   (2.6)

The update of the feature weights is calculated as follows:

\theta^a_k := \theta^a_k + \alpha \left( r + \gamma \max_{a'} Q^{a'}(s', a') - Q^a(s, a) \right) \frac{dQ^a(s, a)}{d\theta^a_k}   (2.7)
The weight θ^a_k represents the weight of feature k after action a is performed. The update minimizes the error using the well-known least squares method: first, equation 2.6 is used to compute the Q^a value of the state, represented by its features multiplied by their respective weights θ_1, θ_2, ..., which initially have some constant value. After the action is performed we can compute the target Q value, as in equation 2.4, which is Q = r + γ max_{a'} Q^{a'}(s', a'); we then have both the computed Q^a value and the target Q value, so the error can be obtained simply by subtracting Q^a from Q. Minimizing this error with the least squares method yields equation 2.7. Since a state of the environment is represented by its features, adding more features makes classifying states easier, but new features may also introduce overfitting when minimizing the error term. The number of features should therefore be chosen carefully: with a large number of features the probability of overfitting grows, which makes the overall algorithm less optimal.
Reinforcement learning with the function approximation and feature-based method is described in Algorithm 3.
initialize number of episodes;
for each s, a do
    initialize θ(s, a) := 0;
end
e := 0;
while e < number of episodes do
    e := e + 1;
    create a starting state s_0;
    i := 0;
    while not goal(s_i) do
        select an action a_i stochastically as in (2.5) using the current values of θ and execute it;
        receive the immediate reward r_i = r(s_i, a_i) from the environment;
        observe the new state s_{i+1};
        i := i + 1;
    end
    for k = i - 1 down to 0 do
        update the value of Q using (2.6);
        update the values of θ for all actions using (2.7);
    end
end
Algorithm 3: Function Approximation and Feature-Based Method Algorithm [26]
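A minimal sketch of the weight update in Algorithm 3 is given below: each state is reduced to a small feature vector and one weight vector per action is adjusted with equation 2.7. The feature definitions, action names and parameter values are assumptions for illustration; they are not the features used later in the thesis.

```python
# Q-learning with linear function approximation (equations 2.6 and 2.7).
# Each state is described by a fixed number of features; one weight vector
# theta[a] is kept per action, so memory does not grow with the state space.

def q_value(theta_a, features):
    # Equation 2.6: Q^a(s) = theta_1*f_1 + ... + theta_n*f_n
    return sum(t * f for t, f in zip(theta_a, features))

def update(theta, a, features, reward, next_features, alpha=0.1, gamma=0.9):
    """Equation 2.7: move theta[a] towards r + gamma * max_a' Q(s', a')."""
    target = reward + gamma * max(q_value(theta[b], next_features) for b in theta)
    error = target - q_value(theta[a], features)
    # dQ/dtheta_k = f_k for a linear approximator
    theta[a] = [t + alpha * error * f for t, f in zip(theta[a], features)]

# Two hypothetical actions and three hypothetical state features
theta = {"avoid": [0.0, 0.0, 0.0], "forward": [0.0, 0.0, 0.0]}
features = [1.0, 0.5, 0.0]        # e.g. bias, distance to goal, obstacle flag
next_features = [1.0, 0.4, 0.0]
update(theta, "forward", features, reward=-1.0, next_features=next_features)
print(theta["forward"])
```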
2.3 Simultaneous Localization and Mapping
Simultaneous localization and mapping (SLAM) is the process of estimating the location of a robot while simultaneously acquiring an approximate map of an unknown environment. SLAM combines the localization problem, where the map is known but the pose of the robot is unknown, and the mapping problem, where the pose of the robot is known but the map is unknown. SLAM is an important component of autonomous systems for navigation in unknown and Global Positioning System (GPS) denied environments. Different kinds of sensors can be used in a SLAM system, and the choice of sensor affects the algorithm used. Range sensors such as lasers, ultrasound or stereo cameras provide depth information in addition to colour information; alternatively, a monocular camera can be used. Sensors which provide more information, as in the case of a stereo camera, can greatly reduce the computational cost, because the availability of depth information eliminates several challenges. Nevertheless, one drawback of depth-measuring devices is that they operate accurately only over a limited range, which constrains them in open spaces. Because the AR.Drone provides a monocular camera, in this thesis we use monocular SLAM, i.e. SLAM based on a monocular camera. In this section we first describe monocular SLAM and some commonly used methodologies in Section 2.3.1, introduce Parallel Tracking and Mapping in Section 2.3.2, and finally list the available SLAM software in Section 2.3.3.
2.3.1 Monocular SLAM
Monocular SLAM has advantages over other systems in terms of cost, flexibility and weight, which is an important factor for airborne robots. Even though a considerable amount of research has been carried out on monocular SLAM, it is still a challenging problem, particularly with respect to dynamic environments, efficient detection of loop closures, computational complexity and scale ambiguity. A camera provides rich information about the environment, but a single image only provides the direction of features present in the environment and not their depth, which creates the problem of scale ambiguity. To obtain depth information, multiple images from different camera positions are required. Different SLAM methodologies have been implemented using a monocular camera; some of the methodologies studied in the course of preparing this thesis are as follows:
• Extended Kalman Filter SLAM (EKF-SLAM)
EKF-SLAM was the first proposed solution to the SLAM problem. The current position of the robot and the positions of the landmarks are represented as a state vector. The main problem with this approach is that its computational cost grows quadratically with the number of landmarks, which limits the number of landmarks to around 100 [18].
• FastSLAM [34]
This method uses a particle filter instead of the traditionally used Kalman filter, where each particle represents a single possible path taken by the robot and maintains its own estimate of all landmark positions. This reduces the computational cost from quadratic to logarithmic, making larger environments possible [34].
• RatSLAM [32]
RatSLAM is a bio-inspired SLAM system derived from neural models of the rat hippocampus. A considerable amount of enhancement has been carried out on RatSLAM to increase its performance and its utility as a real-time SLAM system, e.g. [30], [31], [33].
• Parallel Tracking and Mapping (PTAM) [28]
PTAM is a method for estimating the camera pose in an unknown scene. Tracking and mapping are separated into two different tasks, one responsible for robustly tracking the erratic motion of the device, the other producing a 3D map of point features from previous video frames [28]. The drawback of this technique is that it can only be used in small environments.
Among the different methodologies, two have been predominant. One is the filtering approach, such as EKF-SLAM or FastSLAM, where the measurements from all images are fused sequentially by updating probability distributions over features and camera pose parameters [39]. The other is the keyframe-based approach, which retains a selected subset of previous observations, called keyframes, explicitly representing the past knowledge gained. The keyframe-based approach has a computational advantage over the filtering approaches.
2.3.2 Description of the PTAM Algorithm
Parallel Tracking and Mapping (PTAM) was introduced in [28]. The algorithm is keyframe-based and hence computationally inexpensive and robust. As mentioned in Section 2.3.1, it splits the simultaneous localization and mapping task into two separate threads running independently: a tracking thread and a mapping thread. The tracking thread is responsible for tracking features in the image frame to determine the pose of the camera. The mapping thread is responsible for optimizing the map and integrating new keyframes and landmarks. As a prerequisite, both processes require an initial map, which is created by a separate initialization procedure.
Optimizing a map of an unknown environment while simultaneously tracking the position within the same map, which is the SLAM problem, is harder in the case of a monocular camera because of the missing depth information. To create the initial map an initialization process is required; the general initialization algorithm is as follows, and an illustrative sketch follows the steps:
1. get the first frame from the camera
2. fetch keypoints p_1, p_2, ..., p_n from the frame and take the first keyframe K_1
3. move the camera and track the keypoints p_1, p_2, ..., p_n in the image frames
4. determine how far the camera has moved, and if the keypoints cannot be triangulated repeat step 3
5. extract new keypoints p_1, p_2, ..., p_n and take the second keyframe K_2
6. create the initial map
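The initialization steps can be illustrated with OpenCV: track keypoints between two keyframes, estimate the relative camera motion and triangulate an initial set of map points. This is a sketch of the same idea, not the PTAM implementation used in the thesis; the camera matrix K and the image file names are placeholders.

```python
# Illustrative two-view map initialization with OpenCV (not PTAM itself).
import cv2
import numpy as np

K = np.array([[560.0, 0.0, 320.0],      # placeholder camera intrinsics
              [0.0, 560.0, 240.0],
              [0.0, 0.0, 1.0]])

img1 = cv2.imread("keyframe1.png", cv2.IMREAD_GRAYSCALE)   # first keyframe K1
img2 = cv2.imread("keyframe2.png", cv2.IMREAD_GRAYSCALE)   # second keyframe K2

# Step 2: fetch keypoints from the first frame.
p1 = cv2.goodFeaturesToTrack(img1, maxCorners=500, qualityLevel=0.01, minDistance=7)
# Step 3: track the keypoints into the second frame.
p2, status, _ = cv2.calcOpticalFlowPyrLK(img1, img2, p1, None)
good1, good2 = p1[status.ravel() == 1], p2[status.ravel() == 1]

# Steps 4-5: estimate the relative pose; this fails if the baseline is too small.
E, mask = cv2.findEssentialMat(good1, good2, K, method=cv2.RANSAC)
_, R, t, mask = cv2.recoverPose(E, good1, good2, K, mask=mask)

# Step 6: triangulate the tracked keypoints to create the initial map.
P1 = K @ np.hstack((np.eye(3), np.zeros((3, 1))))
P2 = K @ np.hstack((R, t))
pts4d = cv2.triangulatePoints(P1, P2, good1.reshape(-1, 2).T, good2.reshape(-1, 2).T)
map_points = (pts4d[:3] / pts4d[3]).T       # 3D points up to an unknown global scale
print("initial map points:", map_points.shape[0])
```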
Figure 2.5: Keypoints used for tracking in the PTAM algorithm. Different colors represent different scales: red points represent small patches and blue points represent large patches.
Keypoints
Keypoints are local feature points, as described in Section 2.4. Since the PTAM algorithm requires keypoints from each frame for tracking, they must be generated with minimal delay; therefore the FAST corner detector, described in Section 2.4.1, is used. The FAST corner detector used in the PTAM algorithm operates on four different scales. Even though FAST has the disadvantage of offering only a small number of scales due to its discrete nature, a translation warp is sufficient for frame-to-frame tracking, because the changes in scale and perspective between frames are small.
2.3.3 Software
Even though the SLAM problem has been intensively studied in the robotics community and different methodologies have been proposed, only a few implementations are available. Some of the available software packages are RatSLAM [12], PTAM [10], HectorSLAM [4] and Linear SLAM [6]. In this thesis we use the open source software [14], which uses the PTAM software [10] for SLAM. The software is a Robot Operating System (ROS) package that contains a modified version of the monocular SLAM framework of [28].
2.4 Object Recognition
Object recognition is the process of detecting objects of a certain class in a digital image. It consists of feature detection, where features are detected in the images; feature extraction, where each detected feature is represented such that it can be compared with other features; and feature matching, where the extracted features are matched using classification algorithms in order to determine the class of the object.
2.4.1 Feature Detection and Extraction
Feature detection is a method which computes image information by making a local decision at every point of the image as to whether a feature of the defined type is present or not [15]. There are different types of features in an image: edges, corners, blobs or regions, and ridges. To extract information about these features, different feature detectors can be used, such as the Harris corner detector, the Canny edge detector, or Features from Accelerated Segment Test (FAST).
Feature extraction reduces the amount of information required to describe a large set of data. An initial set of measured data is converted into a set of features, known as feature descriptors or feature vectors, which is more informative and non-redundant, simplifying the subsequent learning and generalization steps. Feature extraction can also provide additional attributes such as the edge orientation, gradient magnitude, polarity and strength. Some of the well-known feature descriptors are listed below:
• Binary Robust Independent Elementary Features (BRIEF)
• Speeded Up Robust Features (SURF)
• Scale-Invariant Feature Transform (SIFT)
• Oriented FAST and Rotated BRIEF (ORB)
Harris Corner Detector
The Harris corner detector was presented in [25] as a combined edge and corner detector and is one of the earliest and most widely used corner detectors. Corners are regions of the image where the intensity varies strongly in all directions. Expressed mathematically, the difference of intensity for a displacement (u, v) in all directions is given by:

E(u, v) = \sum_{x,y} w(x, y) \left[ I(x + u, y + v) - I(x, y) \right]^2   (2.8)

Here E(u, v) denotes the change in appearance, w(x, y) is the window or weighting function, I(x + u, y + v) is the shifted intensity and I(x, y) is the intensity. For corner detection E(u, v) should be maximized; applying a Taylor expansion to I(x + u, y + v) and simplifying, the equation can be written as:

E(u, v) \approx \begin{pmatrix} u & v \end{pmatrix} M \begin{pmatrix} u \\ v \end{pmatrix}, \quad \text{where} \quad M = \sum_{x,y} w(x, y) \begin{pmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{pmatrix}   (2.9)

Figure 2.6: The corner is represented by the red area where both eigenvalues are high; for the edges, represented in blue, only one of the eigenvalues is high, and for a flat surface both eigenvalues are close to 0.

Here I_x and I_y are the image derivatives in the x and y directions. From M, which is also called the structure tensor of a patch, the two values max_{u,v} E(u, v) and min_{u,v} E(u, v) can be computed; these are the two eigenvalues, denoted λ_1 and λ_2. They determine whether a window contains a corner, using the following score:

R = \lambda_1 \lambda_2 - k(\lambda_1 + \lambda_2)^2 = \det(M) - k \, (\mathrm{trace}(M))^2   (2.10)
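For reference, the Harris response of equation 2.10 is computed directly by OpenCV; the sketch below marks the detected corners in an image. The file name, block size and constant k are example values.

```python
# Harris corner detection with OpenCV (cv2.cornerHarris computes the
# response R = det(M) - k * trace(M)^2 of equation 2.10 for every pixel).
import cv2
import numpy as np

img = cv2.imread("frame.png")                       # placeholder input image
gray = np.float32(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY))
response = cv2.cornerHarris(gray, blockSize=2, ksize=3, k=0.04)
img[response > 0.01 * response.max()] = (0, 0, 255) # mark strong corners in red
cv2.imwrite("corners.png", img)
```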
FAST Corner Detector
The Features from Accelerated Segment Test (FAST) corner detector was presented by Edward Rosten and Tom Drummond in 2006 [36]. The algorithm is significantly faster than other methods, which makes it usable in real-time applications such as SLAM. A basic summary of the algorithm is as follows:
1. Select a pixel p from the image and compute its intensity I_p.
2. Define a threshold value t.
3. Compute the intensity values of the 16 pixels on a circle around pixel p.
4. Pixel p is a corner if a contiguous sequence of pixels among the 16 has intensity higher than I_p + t or lower than I_p − t.
5. A high-speed test can then be applied to quickly reject non-corners by examining only four of the 16 pixels, at positions 1, 9, 5 and 13.
Figure 2.7: FAST corner detection with the 16 pixels arranged in a circle around the investigated pixel p [29].
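A minimal use of the FAST detector in OpenCV is shown below; the threshold corresponds to the value t defined in step 2 of the summary above, and the file name is a placeholder.

```python
# FAST corner detection with OpenCV; non-maximum suppression keeps only
# the strongest corner in each neighbourhood.
import cv2

gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)   # placeholder image
fast = cv2.FastFeatureDetector_create(threshold=25, nonmaxSuppression=True)
keypoints = fast.detect(gray, None)
print("FAST corners found:", len(keypoints))
cv2.imwrite("fast.png", cv2.drawKeypoints(gray, keypoints, None, color=(0, 255, 0)))
```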
SIFT
Scale-Invariant Feature Transform (SIFT) is an algorithm for detecting local features in an image, published by David Lowe in 1999 [29]. The features extracted by this algorithm are scale invariant. The algorithm consists of the following steps:
1. Scale-space extrema detection
To detect keypoints at different scales, scale-space filtering is used. The Laplacian of Gaussian (LoG) can be used to find local maxima across scale and space (x, y, σ) for different values of the scaling parameter σ. Since the LoG is computationally expensive, SIFT uses the Difference of Gaussians (DoG), which is computed as the difference between Gaussian-blurred versions of the image with different scaling factors σ. Once the DoG is computed, the images are searched for local extrema over scale and space.

Figure 2.8: Gaussian pyramid for different scales [29].

2. Keypoint localization
After the keypoints are located they are refined to get better results. A Taylor series expansion of the scale space gives a more accurate location of each extremum, and if the extremum's value is below a contrast threshold of 0.03 it is discarded. Similarly, keypoints whose ratio of eigenvalues from the Harris corner measure exceeds an edge threshold of 10 are discarded as edge responses. After eliminating edge and low-contrast keypoints, only strong interest points remain.
3. Orientation assignment
To achieve invariance to image rotation, each keypoint is assigned an orientation. The gradient magnitude and direction are calculated in the neighbourhood of the keypoint, and an orientation histogram with 36 bins covering 360 degrees is created. This provides stability when matching keypoints with the same location and scale but different orientations.
4. Keypoint descriptor
A 16×16 neighbourhood around the keypoint is taken and divided into 16 blocks of 4×4 pixels each; for each 4×4 block an 8-bin orientation histogram is created. In total 128 bin values are obtained, which are represented as a vector forming the keypoint descriptor.
5. Keypoint matching
A nearest-neighbour algorithm is used to match keypoints between two images. To improve the matching accuracy, the ratio between the closest and second-closest distance is computed, and if the ratio is greater than 0.8 the match is rejected. Eliminating keypoint matches with this ratio test removes about 90% of the false matches while discarding only about 5% of the correct matches [29].
Figure 2.9: SIFT matching using 3 octave layers, a contrast threshold of 0.03, an edge threshold of 10 and a scaling parameter of 1.6. The middle figure shows successful matching of the SIFT keypoints when the object is rotated by 90 degrees; the right figure shows successful matching when the object is rotated by 180 degrees.
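The keypoint matching with the 0.8 ratio test described in step 5 can be reproduced with OpenCV as sketched below; the two image files are placeholders, and OpenCV 4.4 or later is assumed for SIFT_create.

```python
# SIFT feature matching with Lowe's ratio test (threshold 0.8).
import cv2

img1 = cv2.imread("object.png", cv2.IMREAD_GRAYSCALE)   # placeholder images
img2 = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

matcher = cv2.BFMatcher()
matches = matcher.knnMatch(des1, des2, k=2)              # two nearest neighbours
# Keep a match only if the closest distance is clearly smaller than the second.
good = [m for m, n in matches if m.distance < 0.8 * n.distance]
print("good matches:", len(good))
```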
25
2.4. Object Recognition
BRIEF
Binary Robust Independent Elementary Features (BRIEF) is only a feature descriptor; it requires a feature detector to find the features in the image. Descriptors such as SIFT require a large amount of memory, which also increases the computational time of the matching process. The BRIEF descriptor relies on a relatively small number of intensity-difference tests to represent an image patch as a binary string. Building and matching the descriptor is much faster than with other methods, with a high recognition rate unless there are large in-plane rotations [17].
ORB
Oriented FAST and Rotated BRIEF (ORB) was introduced by Ethan Rublee, Vincent Rabaud, Kurt Konolige and Gary R. Bradski in 2011 as an alternative to SIFT and Speeded Up Robust Features (SURF) [37]. ORB improves the matching performance and decreases the computational cost compared to SIFT or SURF. It uses the FAST keypoint detector for feature detection and a modified BRIEF descriptor for describing the detected features. ORB contributes the addition of a fast and accurate orientation component to FAST, an efficient computation of oriented BRIEF features, an analysis of variance and correlation of oriented BRIEF features, and a method for de-correlating BRIEF features under rotational invariance, which leads to better performance in nearest-neighbour applications [37].
Figure 2.10: ORB using FAST keypoints and BRIEF for features matching.
2.5 Control
To control a dynamic system, the behaviour of the system with respect to a given input has to be modelled, analysed and controlled. This process of modelling, analysis and control is the subject of control theory. The primary objective of control theory is to calculate the input value that has to be given to the system in order to reach the desired goal. Mathematically, if the input to the system at time t is u(t), the desired value at time t is w(t), the measured output at time t is y(t) and the error between the desired and the measured output is e(t), then the goal of the control system is to minimize e(t) and to converge to the desired value without oscillation around it.
Figure 2.11: General control of a dynamic system: the feedback y(t) is subtracted from the desired value w(t), which gives the error e(t) that the controller tries to minimize.
2.5.1 Proportional Integral Derivative Controller
A proportional integral derivative controller (PID controller) is a control-loop feedback controller that consists of three separate control mechanisms whose outputs are added to generate the required input or control signal. The three mechanisms of the PID controller are:
• Proportional Controller
The proportional part depends directly on the present error e(t), i.e. when the error is large the proportional controller applies a stronger control signal to drive the system towards the goal:
$$P_{out} \propto e(t), \qquad P_{out} = K_p\, e(t)$$
Here the constant $K_p$ is known as the proportional gain. A high proportional gain results in a strong control signal for a small change in the error; if the constant is too high the system can become unstable and oscillates around the goal.
• Integral Controller
The integral part is proportional to the accumulated past error $\int_0^t e(\tau)\,d\tau$. The integral term is responsible for removing the steady-state error that occurs with a pure proportional controller. The constant $K_i$, known as the integral gain, has to be chosen carefully so that only minimal overshoot is caused by the integral controller.
$$I_{out} \propto \int_0^t e(\tau)\,d\tau, \qquad I_{out} = K_i \int_0^t e(\tau)\,d\tau$$
• Derivative Controller
The derivative part is proportional to the predicted future error, which is determined from the slope of the error over time. The derivative controller is responsible for dampening the oscillations and reducing the overshoot caused by the other controllers. The constant $K_d$ is known as the derivative gain.
$$D_{out} \propto \frac{de(t)}{dt}, \qquad D_{out} = K_d\, \frac{de(t)}{dt}$$
The mathematical expression for the PID controller combining the proportional, integral and derivative controllers is
$$u(t) = K_p\, e(t) + K_i \int_0^t e(\tau)\,d\tau + K_d\, \frac{de(t)}{dt} \qquad (2.11)$$
Figure 2.12: PID controller as the combination of the Proportional, Integral and Derivative controllers.
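In a digital control loop, equation (2.11) is evaluated at discrete time steps: the integral becomes a running sum and the derivative a finite difference. The following minimal sketch illustrates this discretization; the gain and timing values are arbitrary illustration values, not the gains used for the drone.

```python
class PIDController:
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, setpoint, measurement, dt):
        # e(t): difference between the desired value w(t) and the measured output y(t)
        error = setpoint - measurement
        # accumulate the integral term and approximate the derivative numerically
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        # u(t) = Kp*e + Ki*integral(e) + Kd*de/dt, cf. equation (2.11)
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Example: one axis controlled at 100 Hz with illustrative gains.
pid = PIDController(kp=0.5, ki=0.1, kd=0.05)
u = pid.update(setpoint=1.0, measurement=0.8, dt=0.01)
```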
Figure 2.13: Proportional controller represented with a green dotted line, Proportional Derivative controller represented with a blue dotted line and Proportional Integral Derivative controller represented with a red line. With the combination of the proportional and derivative controller, represented by the blue dotted line, the overshoot and oscillation are decreased but the steady-state error remains. To reduce the steady-state error the integral controller is added, which removes the steady-state error but introduces some overshoot and oscillation. The amount of overshoot, oscillation and settling time depends on the values of the constants Kp, Ki and Kd.
2.6 Sensor Integration and Filtering
Robotics generally deals with systems that change their state over time. Such systems require both software and hardware to gather, process and integrate information accumulated from various sources (sensors). The information gathered from these sources is vital for autonomous behaviour of the robot, but the information available from real-world sensors is incomplete, inconsistent or imprecise. Depending only on the current sensor information leads to unstable and poor behaviour of the robot. To achieve a better estimate of the state, the collective use of multiple sensors is an important factor that enables the system to interact and operate in an unstructured environment without complete control by a human operator. In order to achieve autonomy and efficiency, sensor integration and filtering is a crucial element that can fuse the available information, model the dynamics of the system and interpret the information for knowledge assimilation and decision making [38].
In this section we introduce the Kalman filter in section 2.6.1 and the extended Kalman filter, its extension to nonlinear systems, in section 2.6.2. In sections 2.6.3 and 2.6.4 we give a brief description of the unscented Kalman filter and of particle filters.
2.6.1 Kalman Filter
The Kalman filter is a method used in many applications for filtering noisy measurements, fusing measurements from different sensors, estimating non-observable states and predicting future states. The Kalman filter produces an optimal estimate of the state if the system is linear, the sensor values have a Gaussian distribution and the measurement noise is independent.
The Kalman filter is a two-step process. The first step is the prediction, also called the motion model, where the filter estimates the current state variables together with their uncertainties. The second step is the update, also called the sensor model, which incorporates the sensor data weighted according to the accuracy of each sensor. The filter continuously repeats the prediction and update steps to estimate the state of the system along with its uncertainty. We use the following notation in the remainder of this section:
the following notation in the remaining section:
• $x_k \in \mathbb{R}^n$: state at time k; $\hat{x}_{k|j}$ denotes the estimate of the state incorporating the measurements up to time j,
• $P_{k|j} \in \mathbb{R}^{n \times n}$: covariance of the estimate $\hat{x}_{k|j}$,
• $B \in \mathbb{R}^{n \times d}$: control input model, mapping the effect of the control input vector $u_k \in \mathbb{R}^d$ onto the state; it represents how the control changes the state from $x_{k-1}$ to $x_k$,
• $F \in \mathbb{R}^{n \times n}$: state transition model, mapping the state at time k − 1 to the state at time k without considering the controls,
• $z_k \in \mathbb{R}^m$: observation at time k,
• $H \in \mathbb{R}^{m \times n}$: observation model, mapping the state $x_k$ to the corresponding observation $z_k$ at time k,
Figure 2.14: Motion model which is the prediction step and Observation model which
is the updating step of the Kalman filter in continuous loop to estimate the state with
the help of sensor data.
• S: innovation covariance,
• K: Kalman Gain,
• I: Identity Matrix,
In the prediction step (motion model) the state of the system is estimated and updated from the system dynamics. Here the uncertainty of the system grows, since no observation is made.
$$\hat{x}_{k|k-1} = F\,\hat{x}_{k-1|k-1} + B\,u_k$$
$$P_{k|k-1} = F\,P_{k-1|k-1}\,F^T + Q \qquad (2.12)$$
In the updating step (observation model) the expected value of the sensor reading is calculated and the difference between the expected value and the real sensor value is determined. Then the covariance of the sensor reading is calculated and the Kalman gain is computed. The Kalman gain is then used to estimate the new state. Here the uncertainty of the system decreases.
$$y_k = z_k - H\,\hat{x}_{k|k-1}$$
$$S_k = H\,P_{k|k-1}\,H^T + R$$
$$K_k = P_{k|k-1}\,H^T S_k^{-1}$$
$$\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k\,y_k$$
$$P_{k|k} = (I - K_k H)\,P_{k|k-1} \qquad (2.13)$$
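Equations (2.12) and (2.13) translate almost line by line into code. The sketch below assumes a simple one-dimensional constant-velocity model chosen purely for illustration; the state is position and velocity, an acceleration command is the control input, and only the position is measured.

```python
import numpy as np

dt = 0.1
F = np.array([[1.0, dt], [0.0, 1.0]])   # state transition (position, velocity)
B = np.array([[0.0], [dt]])             # control input model (acceleration command)
H = np.array([[1.0, 0.0]])              # observation model (position is measured)
Q = 0.01 * np.eye(2)                    # process noise covariance
R = np.array([[0.1]])                   # measurement noise covariance

x = np.zeros((2, 1))                    # initial state estimate
P = np.eye(2)                           # initial covariance

def predict(x, P, u):
    # Eq. (2.12): propagate state and covariance through the motion model
    x = F @ x + B @ u
    P = F @ P @ F.T + Q
    return x, P

def update(x, P, z):
    # Eq. (2.13): correct the prediction with the measurement z
    y = z - H @ x                       # innovation
    S = H @ P @ H.T + R                 # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)      # Kalman gain
    x = x + K @ y
    P = (np.eye(2) - K @ H) @ P
    return x, P

x, P = predict(x, P, u=np.array([[0.5]]))
x, P = update(x, P, z=np.array([[0.12]]))
```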
2.6.2 Extended Kalman Filter
The extended Kalman filter is the nonlinear version of the linear Kalman filter. Here the state transition model and the observation model can be arbitrary differentiable functions.
$$x_k = f(x_{k-1}, u_{k-1}) + w_{k-1}$$
$$z_k = h(x_k) + v_k \qquad (2.14)$$
The functions f and h describe the state transition and the observation, depending on the previous estimate of the state; w and v are the process and observation noise, respectively. In order to apply the Kalman filter algorithm, the functions f and h are linearized using a first-order Taylor approximation:
$$F_{k-1} := \left.\frac{\partial f}{\partial x}\right|_{\hat{x}_{k-1|k-1},\,u_k}, \qquad H_k := \left.\frac{\partial h}{\partial x}\right|_{\hat{x}_{k|k-1}} \qquad (2.15)$$
In the prediction step (motion model) the state of the system is propagated using the function f, and in the updating step (observation model) the expected observation is computed using the function h. Apart from these changes, the algorithm is the same as for the linear Kalman filter, as shown below.
$$\hat{x}_{k|k-1} = f(\hat{x}_{k-1|k-1}, u_k)$$
$$P_{k|k-1} = F\,P_{k-1|k-1}\,F^T + Q \qquad (2.16)$$
$$y_k = z_k - h(\hat{x}_{k|k-1})$$
$$S_k = H\,P_{k|k-1}\,H^T + R$$
$$K_k = P_{k|k-1}\,H^T S_k^{-1}$$
$$\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k\,y_k$$
$$P_{k|k} = (I - K_k H)\,P_{k|k-1} \qquad (2.17)$$
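Structurally, the only change compared with the linear filter is that the motion and observation models are arbitrary differentiable functions whose Jacobians are evaluated at the current estimate. The sketch below uses a deliberately simple, made-up nonlinear observation model and numerical Jacobians purely for illustration.

```python
import numpy as np

def f(x, u, dt=0.1):
    # nonlinear motion model: position integrates velocity, velocity follows input
    return np.array([x[0] + x[1] * dt, x[1] + u * dt])

def h(x):
    # made-up nonlinear observation model: range from a fixed point to the position
    return np.array([np.sqrt(x[0] ** 2 + 1.0)])

def jacobian(func, x, *args, eps=1e-6):
    # numerical Jacobian of func at x (Eq. 2.15), computed by finite differences
    fx = func(x, *args)
    J = np.zeros((fx.size, x.size))
    for i in range(x.size):
        dx = np.zeros_like(x)
        dx[i] = eps
        J[:, i] = (func(x + dx, *args) - fx) / eps
    return J

x = np.array([0.0, 1.0]); P = np.eye(2)
Q = 0.01 * np.eye(2);     R = np.array([[0.1]])
u, z = 0.2, np.array([1.05])

Fk = jacobian(f, x, u)                    # linearize the motion model
x = f(x, u); P = Fk @ P @ Fk.T + Q        # prediction (Eq. 2.16)
Hk = jacobian(h, x)                       # linearize the observation model
S = Hk @ P @ Hk.T + R
K = P @ Hk.T @ np.linalg.inv(S)
x = x + K @ (z - h(x)); P = (np.eye(2) - K @ Hk) @ P   # update (Eq. 2.17)
```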
2.6.3 Unscented Kalman Filter
The unscented Kalman filter (UKF) is an extension of the extended Kalman filter (EKF). In the EKF the distribution of the state is approximated by a Gaussian random variable, which is propagated analytically through a first-order linearization of the nonlinear system. This linearization produces large errors when the system is highly nonlinear, which can lead to sub-optimal performance and divergence of the filter. The UKF approaches this problem using a deterministic sampling approach [42]. It does not approximate the nonlinear process and observation models; rather, it uses the nonlinear models directly and approximates the distribution of the state random variable [41].
2.6.4 Particle Filters
Particle filters, or sequential Monte Carlo methods, allow a complete representation of the posterior distribution of the states, so statistical estimates such as the mean and the variance are easy to compute. As a result, particle filters can deal with strong nonlinearities [41]. Neither the state nor the observation model is required to be Gaussian, which enables the method to track multiple hypotheses. In a particle filter the posterior density is represented by a set of particles. Since the number of particles required grows exponentially with the dimension of the state, this approach is computationally expensive for large state dimensions.
2.7 Robot Operating System
The Robot Operating System (ROS) is an open-source meta-operating system for robots. It is not an operating system in the sense of process management and scheduling, but it provides services similar to an operating system, such as hardware abstraction, low-level device control, implementation of commonly used functionality, message passing between processes and package management. All of these services contribute to a structured communications layer above the host operating systems of a heterogeneous compute cluster [35].
Figure 2.15: Typical ROS network configuration [35]
The scale and scope of robotics is very large and continues to grow, which makes writing software for robots difficult. The variety of hardware used in robots makes the problem even more challenging, since code reusability becomes nontrivial. As the software includes functionality ranging from the driver level up through perception, abstract reasoning and beyond, the size of the code base is daunting. Hence there must be a framework that manages this complexity, encourages modular software and makes tool-based software development possible. ROS provides these capabilities, enabling robotics software architectures to support large-scale software integration efforts. A brief description of the philosophical goals of ROS follows in the next sections.
2.7.1 Goals of ROS
Peer-to-peer
Robotics software contains many processes, such as computer vision or speech recognition, that can be computationally expensive and, if not run in separate threads, can cause lag in the whole system. In a system built using ROS there are numerous processes running in parallel, potentially on a number of different hosts, which are connected at runtime in a peer-to-peer topology as shown in figure 2.15. Peer-to-peer connectivity with buffering allows the system to avoid unnecessary traffic flowing across the wireless link, which would occur if a central server, either onboard or offboard, routed messages that could stay within their subnets. To allow processes to find each other, ROS uses a lookup mechanism called the master or name service. Services are defined by a string name used for the request and a pair of strictly typed messages used for the response.
Multi-lingual
ROS is designed as a language-neutral software platform, because each programming language has its own benefits and trade-offs with respect to runtime, debugging, syntax and so on. ROS supports four programming languages: C++, Python, Octave and LISP. To support cross-language development, it adopts a language-neutral interface definition language (IDL) for the messages that are sent between modules. With an IDL used for messaging, modules developed in different languages can be mixed and matched as required. In our thesis, too, modules written in Python and modules written in C++ are combined to achieve the required objective.
Tools-based
ROS has a microkernel design in which a large number of small tools are used to build and run its different components, which helps to manage complexity. These small tools are responsible for navigating the source code tree, getting and setting configuration parameters, visualizing the peer-to-peer connection topology, measuring bandwidth utilization, and so on [35].
Thin
Reusability of code is a vital part of software development; creating small modules that accomplish small tasks allows those modules to be reused. A modular design is also very useful for unit testing, where each unit or module is tested for its respective outputs. The ROS build system performs modular builds inside the source code tree and, with the help of CMake, enables ROS to follow this thin ideology easily. In our thesis we follow the same ideology by reusing the libraries created for controlling the AR.Drone and for parallel tracking and mapping.
Free and Open-Source
ROS makes its full source code publicly available, which facilitates debugging at all levels of the software stack [35]. ROS is licensed under the BSD license, which allows both commercial and non-commercial project development.
2.7.2 Terminology
• Nodes
Nodes are software modules, each responsible for a single process in the ROS system. Each node may be executed on a separate host and communicates with other nodes via peer-to-peer links.
• Messages
Communication between nodes is done with messages, which are strictly typed data structures. The data structure of a message is defined in a small text file containing the basic information about the types it holds, which enables ROS to be multi-lingual.
• Topic
A node receives or sends messages by subscribing to or publishing on a given topic. A topic is simply a string such as tum_ardrone or cmd_vel. A single node can subscribe to multiple topics and can also publish messages to multiple topics.
Figure 2.16: An example of ROS nodes and their graphical representation as used in our thesis. Two nodes, ardrone_driver and ardrone_controller, are shown in the figure. The links between the nodes are the peer-to-peer links. The node ardrone_controller publishes strongly typed messages to the respective topics /cmd_vel, /ardrone/takeoff, /ardrone/land and /ardrone/reset, and the node ardrone_driver subscribes to the specific topics to receive the messages sent by the publisher. The messages are strongly typed: for example, the image on the topic /ardrone/image_raw, which is published by the node ardrone_driver and subscribed to by the node ardrone_controller, is handled via image transport, and the message type of /cmd_vel is geometry_msgs::Twist.
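As an illustration of the publish/subscribe mechanism, the following minimal rospy node publishes velocity commands of the type geometry_msgs/Twist on the /cmd_vel topic, as in figure 2.16. It is a generic sketch of the ROS API rather than code taken from our packages.

```python
#!/usr/bin/env python
import rospy
from geometry_msgs.msg import Twist

def main():
    # Register this process as a ROS node; the master resolves topic names for us.
    rospy.init_node("velocity_commander")
    pub = rospy.Publisher("/cmd_vel", Twist, queue_size=1)

    rate = rospy.Rate(10)  # publish at 10 Hz
    while not rospy.is_shutdown():
        cmd = Twist()
        cmd.linear.x = 0.1   # move slowly forward
        cmd.angular.z = 0.0  # no rotation
        pub.publish(cmd)
        rate.sleep()

if __name__ == "__main__":
    main()
```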
3. Methodology
In this chapter we describe our approach to achieve the objectives of this thesis. In section 3.1 we give a synopsis of our approach and how its components relate to each other. In section 3.2 we describe the implementation of the SLAM algorithm and the employed software used to achieve the localization and navigation of the AR.Drone. In section 3.3 we describe how object recognition is used to recognize obstacles. In section 3.4 we describe how we implemented reinforcement learning, including the initialization procedure and a description of the features, actions and rewards used by the learning algorithm for autonomous avoidance of obstacles. Finally, in section 3.5 we present the software architecture.
3.1 Outline
Our approach to achieve the objectives of the thesis have following components:
• Localization and Navigation of the AR.Drone: In order to get the current pose of the quadcopter and to control it, the approach of autonomous camera-based navigation of a quadrocopter [23] is applied. The approach estimates the drone's pose with the help of a monocular SLAM algorithm (PTAM), described in section 2.3.2. It applies the extended Kalman filter described in section 2.6.2 to fuse the pose estimate from PTAM with the available sensor data and to estimate the scale of the map, and it uses a PID controller as described in section 2.5.1 to calculate appropriate control commands to navigate to the target position.
• Obstacle Recognition: To detect obstacles, the object recognition algorithm ORB described in section 2.4.1 is applied to the incoming video frames. Classifying different classes of objects requires a training phase in which each object class has its FAST corner feature points and BRIEF descriptors defined. After matching the feature descriptors of the incoming frame against the known feature descriptors of the trained set, the matched features are passed to the obstacle classification process, where it is determined whether the matches are good enough to classify the object as an obstacle.
• Autonomous avoidance of obstacles: Based on the estimate of the drone's position, the detection of obstacles and the map of the environment, the function approximation and feature-based method of reinforcement learning described in section 2.2.5 is used to calculate the optimal policy, which provides the control command the drone must apply to avoid obstacles and navigate to the target positions.
Figure 3.1: Figure shows the outline of our approach. The red block represents the
components of the autonomous navigation, the yellow block represents the components
to acquire the obstacle recognition and the blue block represents the components of
the reinforcement learning.
In figure 3.1 the connections between all the components needed to achieve the objective are shown. The blue arrows show the respective communication channels: the control channel, used to send the AT commands to the drone; the navigation channel, used to get the sensor data from the drone; and the video channel, used to get the video frames from the drone's camera. A detailed description of these channels is given in section 2.1.3.
The red block elements in the figure are the components required for localization and navigation of the drone. The sensor measurements and the video frames from the drone are given to the extended Kalman filter and to the monocular SLAM respectively, which together compute the estimate of the drone's pose in the environment. This estimated pose consists of the roll, pitch and yaw angles and the flight altitude of the drone, together with the velocities along all axes. A detailed description of the process is given in section 3.2.
The blue block elements in the figure are the main components of the reinforcement learning algorithm. They consist of the environment, which is built from the drone pose estimated by the extended Kalman filter and the obstacle classification element; the agent, which computes the action to be taken for the given state and the reward associated with that state from the environment; and finally the policy finder, which finds the best action to take in the given state, represented as a set of features defined in section 3.4.1. The set of actions and the rewards associated with each state are also defined in section 3.4.1. A detailed description of the process is given in section 3.4.
The yellow block elements in the figure are the components required for obstacle recognition. The video frames received from the drone's video channel are used as the primary image source, together with the known features of the obstacles computed during the training phase. A feature matcher matches these two sets of features and outputs the matched features. Finally, these matched features are classified to decide whether the scene seen by the front camera contains an obstacle or not. The obstacle recognition process is described in section 3.3.
The element Calibrate Required Drone Pose in figure 3.1 calculates the position the drone should move to in order to reach the defined target positions while avoiding the obstacles. This element requires the past drone pose and the action selected by the reinforcement learning agent to calculate the required drone pose for obstacle avoidance and for the exploration of the target locations.
To view our system as a dynamic system, the red labels in the figure denote the system input, which is the command sent to the drone; the system output, which is the sensor output of the drone; the measured output, which is the drone pose estimated by the extended Kalman filter; the reference, which is the required pose computed by the reinforcement learning; and the measured error, which is the difference between the measured output and the reference.
3.2 Localization and Navigation of the Drone
To achieve the task of localization of the drone and enable the drone to navigate, the
methodology used in autonomous camera-based navigation of a quadrocopter [23] is
applied. This methodology is based on three major components which are listed below:
• Monocular SLAM
To solve the SLAM problem, the PTAM algorithm described in section 2.3.2 is applied to the video frames, which allows the drone's pose to be estimated. The scale of the map is essential for navigating the quadcopter, defining the flight plan, calculating control commands and fusing the visual pose from PTAM with the sensor values. While tracking the pose of the drone, the scale of the map is therefore also estimated with the help of the sensor data provided by the AR.Drone, such as the ultrasound sensor values and the horizontal speed.
• Extended Kalman Filter
To integrate all the sensor data provided by the drone, the pose estimate provided by PTAM and the effect of the control commands sent to the drone, an extended Kalman filter (EKF) as described in section 2.6.2 is used. The filter improves the estimate of the drone's pose and also gives a good prediction of the future state after a command has been sent to the drone, compensating for the delays in the communication process.
• PID controller
To navigate the drone to a given target position, the estimated velocity and pose of the drone provided by the EKF are used to calculate the required control commands with the PID controller described in section 2.5.1. The approach allows the drone to navigate to the target position with a speed of up to 2 m/s and to hover accurately at a given position with a root mean squared error (RMSE) of only 8 cm [23].
Figure 3.2: Map window of the drone. The red points represent the keypoints and the green line represents the path that the drone executed. The window also shows the estimated drone pose: its x, y, z position and the roll, pitch and yaw angles.
3.2.1 Implementation
In order to navigate a quadcopter, an estimate of its pose and velocity, knowledge about the scale of the map and the control signals that need to be sent to the quadcopter to reach the target position are essential. With the methodology of [23] we are able to obtain all of the above.
After the initialization procedure of PTAM, which consists of flying the quadcopter 1 m up and 1 m down, the methodology is able to estimate the pose and velocity of the quadcopter in the environment. Once the pose estimation is running, the methodology provides the functionality of navigating the quadcopter to a target location. This functionality has been implemented and tested in [20], where the quadcopter navigates through predefined figures with good accuracy. To navigate through the predefined target locations, the PID controller is provided with the target location and calculates the required control signal that has to be sent to the quadcopter in order to reach it, using the pose estimate provided by the EKF. When the target location is reached, the quadcopter holds this position until there is another target to be reached.
Figure 3.3: Graphical user interface of the node drone_gui. The user interface contains the control source, where the means of controlling the drone (keyboard, joystick, etc.) can be selected. It also provides information on the state estimation status, i.e. whether PTAM tracking is good or bad, and an interface to take off, land and toggle the camera of the drone.
3.2.2 Employed Software
The ROS package tum_ardrone, which contains the implementation of the publications [22], [21] and [20], is used to achieve the localization and navigation of the drone. The package is developed for the AR.Drone and is licensed under the GNU General Public License Version 3; it also includes PTAM, which has its own license. The package contains three ROS nodes:
• drone_stateestimate
This node provides the estimate of the drone's pose. It implements the previously described EKF for state estimation, which includes the dynamics of the drone, time delay compensation and scale estimation, and it integrates the PTAM algorithm.
• drone_autopilot
This node is the PID controller implementation for the drone and requires the drone_stateestimate node to be running. It provides functionality such as goto (x, y, z, yaw), which sends the required commands to the drone to reach the given destination, as well as basic functionality such as takeoff, land and the PTAM initialization procedure.
• drone_gui
This node is a graphical user interface for the functionality of the nodes drone_autopilot and drone_stateestimate. The interface contains information about the network status, the state estimation status and the autopilot status, and can be used to switch control between keyboard, joystick and autopilot.
To reach the final goal position autonomously, the node drone_autopilot is continuously provided with the command for the next required action. The next action to be executed in order to reach the final goal is decided with the help of the function approximation and feature-based reinforcement learning described in section 2.2.5. The reinforcement learning is provided with the state of the drone, such as its position and velocity, from the node drone_stateestimate in order to calculate the optimal action to take to reach the final goal.
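A minimal sketch of how the selected action could be forwarded to the autopilot is given below. The topic name /tum_ardrone/com and the text command syntax "c goto x y z yaw" are assumptions about the tum_ardrone command interface and should be checked against the installed package version.

```python
import rospy
from std_msgs.msg import String

rospy.init_node("rl_action_forwarder")
# Command channel of the tum_ardrone package (topic name and syntax assumed).
cmd_pub = rospy.Publisher("/tum_ardrone/com", String, queue_size=10)

def goto(x, y, z, yaw):
    # Forward a goto command to the drone_autopilot node.
    cmd_pub.publish(String(data="c goto %.2f %.2f %.2f %.2f" % (x, y, z, yaw)))

# Example: the action "Move Forward" translates into a 0.5 m step in Y.
goto(0.0, 0.5, 0.0, 0.0)
```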
3.3 Obstacle Recognition
To achieve the task of obstacle recognition, which enables the quadcopter to detect obstacles, we implement Oriented FAST and Rotated BRIEF (ORB) as described in section 2.4.1. This object recognition method consists of Features from Accelerated Segment Test (FAST) for feature detection and Binary Robust Independent Elementary Features (BRIEF) for feature description.
In order to classify objects, the recognition algorithm first has to be trained with known objects. The features extracted from these known objects are later used to decide whether an object is an obstacle or not. To match the features of the known objects with those of unknown objects, a brute-force descriptor matcher is used. The brute-force matcher finds, for each descriptor of the known feature set, the closest descriptor in the feature set of the unknown object. The matcher returns the matches found and the distance between the descriptors: the lower the distance, the better the match. In our thesis we filter out matches with a distance greater than 40, leaving only good matches.
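A minimal sketch of this matching step with OpenCV is shown below. The training image file name is a placeholder; the distance threshold of 40 is the value used in this thesis.

```python
import cv2

# ORB features of the known obstacle are computed once during the training phase.
orb = cv2.ORB_create()
train_img = cv2.imread("obstacle_training.png", cv2.IMREAD_GRAYSCALE)
kp_train, des_train = orb.detectAndCompute(train_img, None)

def good_matches(frame_gray, max_distance=40):
    """Match ORB descriptors of the current frame against the trained obstacle."""
    kp_frame, des_frame = orb.detectAndCompute(frame_gray, None)
    if des_frame is None:
        return []
    # Hamming distance is the natural metric for the binary BRIEF descriptors.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des_train, des_frame)
    return [m for m in matches if m.distance < max_distance]
```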
The AR.Drone transmits video frames from the camera at 18 Hz. If every received frame went through the object recognition process, the process would lag, so to reduce both the lag and the computational cost the sampling rate for object recognition was chosen as 5 Hz. Since the object recognition algorithm also suffers from false positive results, we classify an object as an obstacle only after the same object has been found in 5 consecutive sampled frames, or when the number of matches with a distance below 40 exceeds 20.
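This temporal filtering reduces to a simple counter, sketched below under the assumption that the matching step returns the already distance-filtered matches of the current sampled frame.

```python
CONSECUTIVE_FRAMES = 5   # frames of the same class needed to confirm an obstacle
STRONG_MATCH_COUNT = 20  # matches below distance 40 that confirm immediately

consecutive_hits = 0

def is_obstacle(matches):
    """Return True once the object in front is confirmed as an obstacle."""
    global consecutive_hits
    if len(matches) >= STRONG_MATCH_COUNT:
        consecutive_hits = 0
        return True                      # high confidence: classify immediately
    if matches:                          # object of the known class seen this frame
        consecutive_hits += 1
    else:
        consecutive_hits = 0
    return consecutive_hits >= CONSECUTIVE_FRAMES
```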
Figure 3.4: Time graph of the video frames from the camera. The video frames neglected due to the computation of the object recognition algorithm during the 25 ms interval are shown in red. The video frames selected for object recognition are shown in blue and the corresponding classified object is shown in green as Obj1. After object recognition, in order to remove false positive results, we sample 5 classified objects, and only if all 5 objects are of the same class is the object classified as an obstacle, which is represented in the figure in orange.
Figure 3.5: The figure shows the plot between number of features matched and distance
between the matched features.
In figure 3.5 the graphs show that false positive matches occur, represented by the red line in the first graph and the blue line in the second graph. To remove these false positives a threshold of 40 was defined; matches below this distance can be regarded as good matches that classify the object with higher accuracy. As the number of matches with a distance below 40 increases, the accuracy of the classification also increases. In our algorithm, if the number of matches with a distance below 40 exceeds 20, the object is classified immediately without classifying the next 4 frames. This case is shown in figure 3.4 with the orange line representing Obs2. If the confidence is low, as in the case of the second graph, we classify 4 more frames to increase it, which is shown in figure 3.4 with the orange line representing Obs1.
Since the ORB descriptor is more sensitive to scale changes than other feature descriptors, we use this to our advantage to detect obstacles at a certain distance from the quadcopter. When training the classifier, the scale of the training image is equal to the scale of the image seen when the quadcopter is one step size, i.e. 0.5 m, away from the obstacle. After an obstacle is detected, a notification is sent to the reinforcement learning algorithm, which then selects the required action to avoid the obstacle.
3.4 Autonomous avoidance of obstacles
In order to navigate autonomously in an environment with obstacles and explore all the required target locations, the quadcopter must have a proper planning algorithm to navigate to the goal positions and must also have knowledge of the obstacles to avoid collisions. We use the Function Approximation and Feature-Based Method (FAFBRL) described in section 2.2.5 to train the quadcopter to compute the optimal action to take in the current state, in which the positions of the quadcopter and of the obstacle are known along with the positions the quadcopter must explore.
3.4.1 Initialization and Training
As previously stated in section 2.2.5, the basis of FAFBRL is formed by the following two equations:
$$Q^a(s,a) = \theta^a_1 f_1 + \ldots + \theta^a_n f_n \qquad (3.1)$$
$$\theta^a_k = \theta^a_k + \alpha\left[r + \gamma \max_{a'} Q^{a'}(s',a') - Q^a(s,a)\right]\frac{dQ^a(s,a)}{d\theta^a_k} \qquad (3.2)$$
In our thesis we initialized the learning rate α = 0.2, which represents the step size, and the discount rate γ = 0.8, which represents how strongly future rewards are discounted with respect to the current reward.
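A compact sketch of equations (3.1) and (3.2) with these parameter values is given below. It only illustrates the update rule; the surrounding training loop that observes the reward and the features of the next state is assumed and not shown.

```python
ALPHA = 0.2   # learning rate used in this thesis
GAMMA = 0.8   # discount rate used in this thesis

def q_value(theta, features):
    # Eq. (3.1): Q(s,a) as a weighted sum of the feature values f1..fn
    return sum(t * f for t, f in zip(theta, features))

def update_weights(theta, features, reward, best_next_q):
    # Eq. (3.2): each weight moves along the temporal-difference error,
    # and dQ/dtheta_k is simply the feature value f_k.
    td_error = reward + GAMMA * best_next_q - q_value(theta, features)
    return [t + ALPHA * td_error * f for t, f in zip(theta, features)]

# The weights start at zero and converge over many training iterations.
theta = [0.0, 0.0, 0.0, 0.0]
```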
Features
The features, which represent the state of the quadcopter together with their respective values, are:
• target position (f1): the predefined position that the quadcopter must explore.
The constant value for this feature f1 = 1.
• obstacle in the path (f2): an obstacle is found within a distance of 1 m on the path the quadcopter planned in order to reach the target position. The value of this feature is f2 = 1 if an obstacle is found and f2 = 0 otherwise.
• target position reached (f3): no obstacle is found nearby, i.e. within a distance of 1 m, and the target location lies within 1 m on the path the quadcopter planned in order to reach the target position. The constant value of this feature is f3 = 1.
• distance to target position (f4): the distance the quadcopter needs to travel in order to reach the nearest target position. The value is normalized by the maximum distance the quadcopter can travel, so that it lies between 0 and 1.
These four features represent the state of the environment; each unique combination of feature values represents a different state. In the training stage the weights θ1, θ2, θ3, θ4 of the feature values f1, f2, f3, f4 are initially set to zero. As the number of training iterations increases, these weights converge towards constant values. After the weights have converged, they can be used to calculate the optimal policy for any given state of the environment.
Figure 3.6: The drone is shown at position (0,0) together with two target positions t1 and t2. The obstacle, which is in front of the drone and within a distance of 1 m, is also shown. The green paths are the paths the drone planned in order to explore the target positions. The red path is a path on which a collision might occur, so the drone ignores it.
Actions
The list of actions (a) that the quadcopter can perform are as follows:
• Move Forward
When this action is performed the quadcopter moves 0.5 m forward, i.e. 0.5 m in the Y direction from the current estimated position.
• Move Backward
When this action is performed the quadcopter moves 0.5 m backward, i.e. 0.5 m in the −Y direction from the current estimated position.
Figure 3.7: The drone has moved to the right relative to the situation shown in figure 3.6. The obstacle has also changed its position, so the drone cannot move right any further and takes a new path to reach the target location.
• Move Right
When this action is performed the quadcopter moves 0.5 m to the right, i.e. 0.5 m in the X direction from the current estimated position.
• Move Left
When this action is performed the quadcopter moves 0.5 m to the left, i.e. 0.5 m in the −X direction from the current estimated position.
• Take Off
When this action is performed the quadcopter takes off and then initializes its position at x = 0, y = 0, z = 0 and yaw angle = 0.
• Land
When this action is performed the quadcopter lands at its current position.
Exploration vs Exploitation
To address the exploration versus exploitation problem stated in section 2.2.1, we train our learning algorithm to choose legal actions (actions that do not lead to a collision with an obstacle) stochastically. Among the legal actions the drone can perform, to prevent pure exploitation, in which the action with the maximum Q value is always chosen, we use a randomized strategy in which the action with the maximum Q value is chosen only with a probability of 0.8.
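Sketched below is what this randomized strategy amounts to, reusing the q_value helper from the listing above; the per-action weight table theta and the helper features_for, which returns the feature values of the state reached by an action, are assumed for illustration.

```python
import random

def choose_action(theta, legal_actions, features_for, greedy_prob=0.8):
    """Greedy legal action with probability 0.8, otherwise a random legal action.

    features_for(a) returns the feature values f1..f4 of the state reached by action a.
    """
    if random.random() < greedy_prob:
        return max(legal_actions, key=lambda a: q_value(theta[a], features_for(a)))
    return random.choice(legal_actions)
```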
Rewards
The rewards associated with each state of the quadcopter, which is represented by the features f1, f2, f3, f4, are defined as follows:
• for each step of movement that the quadcopter makes, which in our case is 0.5 m, it gets a negative reward of −1
• if the quadcopter reaches a location that needs to be explored (a target location) it gets a reward of 10
• if the quadcopter reaches a target location and there are no more locations that need to be explored it gets a reward of 500
• if the quadcopter collides with an obstacle it gets a negative reward of −500
3.4.2 Simulation of training
In order to train the quadcopter to navigate without colliding with obstacles in a real-world scenario, the training phase would have to include collisions of the quadcopter with obstacles, which could damage the quadcopter itself. When the quadcopter is in flight and collides, it may stop its rotors and fall straight to the ground, which can cause its components to malfunction. To protect the quadcopter from such accidents and still be able to train it, a simulation environment is necessary.
A simulator was therefore developed for the training phase, in which the algorithm calculates the required weights of the features that represent the environment. The simulation consists of the rewards associated with each state of the quadcopter, the actions the quadcopter can perform, the map of the environment, and the obstacles and their random movements. With the help of the simulation environment, the iterations required for the training phase can be performed rapidly.
Simulation Capability
The simulator developed is capable of simulating the following:
• different maps can be added manually, with different arrangements of static obstacles such as walls and doors,
• a varying number of target locations that the quadcopter must explore in the map,
• the positions of the target locations,
• a varying number of dynamic obstacles,
• the positions of the dynamic obstacles,
• the starting position of the quadcopter,
• the total reward gained by the quadcopter.
Figure 3.8: Simulator with the given map, static obstacles, target positions and reward
gained during the current training episode.
In the simulator the quadcopter is denoted by a circle and the target positions are denoted by another symbol, as in figure 3.8. In figure 3.8, the top left image shows the starting position of the quadcopter and its accumulated reward of 0. The top right image shows the position of the quadcopter after a target position has been reached, which adds a reward of 10, minus two steps with a reward of −1 each, resulting in a total reward of 8. The bottom left image shows the quadcopter just after reaching a target position; once the exploration of this position is complete it will gain another reward of 10. Finally, the bottom right image shows that the quadcopter has completed the exploration of all target positions in the given map and therefore gains the reward of 500.
Figure 3.9: Simulator with the given map, dynamic obstacles, target positions and the reward gained during the current training episode. The top images represent the scenario with one dynamic obstacle, which is represented by a blue circle. The bottom images represent the scenario with two dynamic obstacles.
Calibrated Value
After numerous training iterations in the simulator, the converged values of the feature weights were as follows:

Feature                              Weight   Value
Target position (f1)                 θ1        62.2
Obstacle in the path (f2)            θ2       -20
Target position reached (f3)         θ3        83.3
Distance to target position (f4)     θ4        -1.2
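As a worked illustration with these weights (the feature values are chosen purely as an example): for a candidate action whose resulting state still has a target to explore ($f_1 = 1$), an obstacle in the planned path ($f_2 = 1$), the target not yet reached ($f_3 = 0$) and a normalized distance to the target of 0.4,
$$Q(s,a) = 62.2 \cdot 1 + (-20) \cdot 1 + 83.3 \cdot 0 + (-1.2) \cdot 0.4 = 41.72,$$
while the same situation without the obstacle ($f_2 = 0$) yields $Q(s,a) = 61.72$, so the policy prefers the action that avoids the obstacle.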
3.4.3 Software
In order to create the simulator and to enable it to communicate with the ROS nodes that control and gather information from the quadcopter, a ROS package was created. After the weights of the features have converged, these values can be used in the reinforcement learning algorithm to navigate through the target positions while avoiding the obstacles. The action that the algorithm computes as the optimal action for the given state of the environment is then converted into a command that can be sent to the ROS node drone_autopilot.
51
thesis-2
thesis-2
thesis-2
thesis-2
thesis-2
thesis-2
thesis-2
thesis-2
thesis-2
thesis-2
thesis-2
thesis-2
thesis-2
thesis-2
thesis-2
thesis-2
thesis-2
thesis-2
thesis-2
thesis-2

More Related Content

Viewers also liked

IROS 2011 talk 1 (Filippo's file)
IROS 2011 talk 1 (Filippo's file)IROS 2011 talk 1 (Filippo's file)
IROS 2011 talk 1 (Filippo's file)Gianluca Antonelli
 
Drone Academy Thailand Profile (May 2017)
Drone Academy Thailand Profile (May 2017)Drone Academy Thailand Profile (May 2017)
Drone Academy Thailand Profile (May 2017)Drone Academy Thailand
 
Global Positioning System
Global Positioning System Global Positioning System
Global Positioning System Varun B P
 
Using mapping drones for disaster prevention & response
Using mapping drones for disaster prevention & responseUsing mapping drones for disaster prevention & response
Using mapping drones for disaster prevention & responseDrone Adventures
 
From Maps to Apps the Future of Drone Technology
From Maps to Apps the Future of Drone TechnologyFrom Maps to Apps the Future of Drone Technology
From Maps to Apps the Future of Drone TechnologyGodfrey Nolan
 
Energy Oil Gas Presentation
Energy  Oil  Gas  PresentationEnergy  Oil  Gas  Presentation
Energy Oil Gas Presentationjlai
 
Drone-Unmanned Aerial Vehicle
Drone-Unmanned Aerial VehicleDrone-Unmanned Aerial Vehicle
Drone-Unmanned Aerial Vehicleshivu1234
 
TUGI PETR CV_2015
TUGI PETR CV_2015TUGI PETR CV_2015
TUGI PETR CV_2015Peter Tugi
 
Advantages of a combined sonar data acquisition system for AUVs and ASVs
Advantages of a combined sonar data acquisition system for AUVs and ASVsAdvantages of a combined sonar data acquisition system for AUVs and ASVs
Advantages of a combined sonar data acquisition system for AUVs and ASVsHydrographic Society Benelux
 
Under water communication ppt
Under water communication ppt Under water communication ppt
Under water communication ppt asharanick
 
under water wireless communication
under water wireless communicationunder water wireless communication
under water wireless communicationDheeresh Kumar
 
un drone-campus à Nantes ?
un drone-campus à Nantes ?un drone-campus à Nantes ?
un drone-campus à Nantes ?drone-campus
 
Autonomous Unmanned Vehicles
Autonomous Unmanned VehiclesAutonomous Unmanned Vehicles
Autonomous Unmanned VehiclesAnirudh Kannan
 
“Securing underwater wireless communication networks” 2
“Securing underwater wireless communication networks” 2“Securing underwater wireless communication networks” 2
“Securing underwater wireless communication networks” 2Naveena N
 
Underwater Wireless Communication
Underwater Wireless CommunicationUnderwater Wireless Communication
Underwater Wireless CommunicationShubham Srivastava
 
Iver2 AUV Control Design Thesis Defense
Iver2 AUV Control Design Thesis DefenseIver2 AUV Control Design Thesis Defense
Iver2 AUV Control Design Thesis Defenseelev0083
 

Viewers also liked (20)

IROS 2011 talk 1 (Filippo's file)
IROS 2011 talk 1 (Filippo's file)IROS 2011 talk 1 (Filippo's file)
IROS 2011 talk 1 (Filippo's file)
 
Underwater Robotics
Underwater RoboticsUnderwater Robotics
Underwater Robotics
 
Drone Academy Thailand Profile (May 2017)
Drone Academy Thailand Profile (May 2017)Drone Academy Thailand Profile (May 2017)
Drone Academy Thailand Profile (May 2017)
 
Global Positioning System
Global Positioning System Global Positioning System
Global Positioning System
 
Missile Technology
Missile TechnologyMissile Technology
Missile Technology
 
Using mapping drones for disaster prevention & response
Using mapping drones for disaster prevention & responseUsing mapping drones for disaster prevention & response
Using mapping drones for disaster prevention & response
 
From Maps to Apps the Future of Drone Technology
From Maps to Apps the Future of Drone TechnologyFrom Maps to Apps the Future of Drone Technology
From Maps to Apps the Future of Drone Technology
 
Drone Technology
Drone TechnologyDrone Technology
Drone Technology
 
Kongsberg Maritime AUVs
Kongsberg Maritime AUVs Kongsberg Maritime AUVs
Kongsberg Maritime AUVs
 
Energy Oil Gas Presentation
Energy  Oil  Gas  PresentationEnergy  Oil  Gas  Presentation
Energy Oil Gas Presentation
 
Drone-Unmanned Aerial Vehicle
Drone-Unmanned Aerial VehicleDrone-Unmanned Aerial Vehicle
Drone-Unmanned Aerial Vehicle
 
TUGI PETR CV_2015
TUGI PETR CV_2015TUGI PETR CV_2015
TUGI PETR CV_2015
 
Advantages of a combined sonar data acquisition system for AUVs and ASVs
Advantages of a combined sonar data acquisition system for AUVs and ASVsAdvantages of a combined sonar data acquisition system for AUVs and ASVs
Advantages of a combined sonar data acquisition system for AUVs and ASVs
 
Under water communication ppt
Under water communication ppt Under water communication ppt
Under water communication ppt
 
under water wireless communication
under water wireless communicationunder water wireless communication
under water wireless communication
 
un drone-campus à Nantes ?
un drone-campus à Nantes ?un drone-campus à Nantes ?
un drone-campus à Nantes ?
 
Autonomous Unmanned Vehicles
Autonomous Unmanned VehiclesAutonomous Unmanned Vehicles
Autonomous Unmanned Vehicles
 
“Securing underwater wireless communication networks” 2
“Securing underwater wireless communication networks” 2“Securing underwater wireless communication networks” 2
“Securing underwater wireless communication networks” 2
 
Underwater Wireless Communication
Underwater Wireless CommunicationUnderwater Wireless Communication
Underwater Wireless Communication
 
Iver2 AUV Control Design Thesis Defense
Iver2 AUV Control Design Thesis DefenseIver2 AUV Control Design Thesis Defense
Iver2 AUV Control Design Thesis Defense
 

Similar to thesis-2

Master_Thesis_Jiaqi_Liu
Master_Thesis_Jiaqi_LiuMaster_Thesis_Jiaqi_Liu
Master_Thesis_Jiaqi_LiuJiaqi Liu
 
AUGUMENTED REALITY FOR SPACE.pdf
AUGUMENTED REALITY FOR SPACE.pdfAUGUMENTED REALITY FOR SPACE.pdf
AUGUMENTED REALITY FOR SPACE.pdfjeevanbasnyat1
 
Particle Filter Localization for Unmanned Aerial Vehicles Using Augmented Rea...
Particle Filter Localization for Unmanned Aerial Vehicles Using Augmented Rea...Particle Filter Localization for Unmanned Aerial Vehicles Using Augmented Rea...
Particle Filter Localization for Unmanned Aerial Vehicles Using Augmented Rea...Ed Kelley
 
aniketpingley_dissertation_aug11
aniketpingley_dissertation_aug11aniketpingley_dissertation_aug11
aniketpingley_dissertation_aug11Aniket Pingley
 
TFG_Cristobal_Cuevas_Garcia_2018.pdf
TFG_Cristobal_Cuevas_Garcia_2018.pdfTFG_Cristobal_Cuevas_Garcia_2018.pdf
TFG_Cristobal_Cuevas_Garcia_2018.pdfGerard Labernia
 
Au anthea-ws-201011-ma sc-thesis
Au anthea-ws-201011-ma sc-thesisAu anthea-ws-201011-ma sc-thesis
Au anthea-ws-201011-ma sc-thesisevegod
 
Project report on Eye tracking interpretation system
Project report on Eye tracking interpretation systemProject report on Eye tracking interpretation system
Project report on Eye tracking interpretation systemkurkute1994
 
Vehicle to Vehicle Communication using Bluetooth and GPS.
Vehicle to Vehicle Communication using Bluetooth and GPS.Vehicle to Vehicle Communication using Bluetooth and GPS.
Vehicle to Vehicle Communication using Bluetooth and GPS.Mayur Wadekar
 
iGUARD: An Intelligent Way To Secure - Report
iGUARD: An Intelligent Way To Secure - ReportiGUARD: An Intelligent Way To Secure - Report
iGUARD: An Intelligent Way To Secure - ReportNandu B Rajan
 
Neural Networks on Steroids
Neural Networks on SteroidsNeural Networks on Steroids
Neural Networks on SteroidsAdam Blevins
 
Fulltext02
Fulltext02Fulltext02
Fulltext02Al Mtdrs
 
High Performance Traffic Sign Detection
High Performance Traffic Sign DetectionHigh Performance Traffic Sign Detection
High Performance Traffic Sign DetectionCraig Ferguson
 

Similar to thesis-2 (20)

Master_Thesis_Jiaqi_Liu
Master_Thesis_Jiaqi_LiuMaster_Thesis_Jiaqi_Liu
Master_Thesis_Jiaqi_Liu
 
AUGUMENTED REALITY FOR SPACE.pdf
AUGUMENTED REALITY FOR SPACE.pdfAUGUMENTED REALITY FOR SPACE.pdf
AUGUMENTED REALITY FOR SPACE.pdf
 
Particle Filter Localization for Unmanned Aerial Vehicles Using Augmented Rea...
Particle Filter Localization for Unmanned Aerial Vehicles Using Augmented Rea...Particle Filter Localization for Unmanned Aerial Vehicles Using Augmented Rea...
Particle Filter Localization for Unmanned Aerial Vehicles Using Augmented Rea...
 
aniketpingley_dissertation_aug11
aniketpingley_dissertation_aug11aniketpingley_dissertation_aug11
aniketpingley_dissertation_aug11
 
Thesis
ThesisThesis
Thesis
 
TFG_Cristobal_Cuevas_Garcia_2018.pdf
TFG_Cristobal_Cuevas_Garcia_2018.pdfTFG_Cristobal_Cuevas_Garcia_2018.pdf
TFG_Cristobal_Cuevas_Garcia_2018.pdf
 
Au anthea-ws-201011-ma sc-thesis
Au anthea-ws-201011-ma sc-thesisAu anthea-ws-201011-ma sc-thesis
Au anthea-ws-201011-ma sc-thesis
 
MSc_Thesis
MSc_ThesisMSc_Thesis
MSc_Thesis
 
Malware Analysis
Malware Analysis Malware Analysis
Malware Analysis
 
MASci
MASciMASci
MASci
 
Report
ReportReport
Report
 
Project report on Eye tracking interpretation system
Project report on Eye tracking interpretation systemProject report on Eye tracking interpretation system
Project report on Eye tracking interpretation system
 
MSc_Thesis
MSc_ThesisMSc_Thesis
MSc_Thesis
 
Milan_thesis.pdf
Milan_thesis.pdfMilan_thesis.pdf
Milan_thesis.pdf
 
E.M._Poot
E.M._PootE.M._Poot
E.M._Poot
 
Vehicle to Vehicle Communication using Bluetooth and GPS.
Vehicle to Vehicle Communication using Bluetooth and GPS.Vehicle to Vehicle Communication using Bluetooth and GPS.
Vehicle to Vehicle Communication using Bluetooth and GPS.
 
iGUARD: An Intelligent Way To Secure - Report
iGUARD: An Intelligent Way To Secure - ReportiGUARD: An Intelligent Way To Secure - Report
iGUARD: An Intelligent Way To Secure - Report
 
Neural Networks on Steroids
Neural Networks on SteroidsNeural Networks on Steroids
Neural Networks on Steroids
 
Fulltext02
Fulltext02Fulltext02
Fulltext02
 
High Performance Traffic Sign Detection
High Performance Traffic Sign DetectionHigh Performance Traffic Sign Detection
High Performance Traffic Sign Detection
 

thesis-2

  • 1. Autonomous Intelligent System Prof. Dr. Peter Nauth Autonomous Behaviour of Drone using Machine Learning Master’s Thesis in Machine Learning written by Danish Shrestha 1029721 supervised by Prof. Dr. Peter Nauth (first supervisor) Prof. Dr. Andreas Pech (second supervisor) April 2015
  • 2. Declaration To the best of my knowledge and belief I hereby declare that this thesis, entitled Autonomous Behaviour of Drone using Machine Learning was prepared without aid from any other sources except where indicated. Reference to material previously published by any other person has been duly acknowledged. This work contains no material which has been submitted or accepted for the award of any other degree in any institution. I am aware that my thesis, developed under guidance, is part of the examination and may not be commercially used or transferred to a third party without written permission from my supervisory professor. Frankfurt, 23.04.2015 Danish Shrestha
  • 3. Abstract In this thesis, we developed a system that enables a quadcopter to navigate to all the target locations predefined avoiding the obstacles that are detected. For detection of the obstacle, localization of the quadcopter and navigation we utilize the data available from the monocular camera and other onboard sensors available in quadcopter, and does not require any external sensors. Our method consist of three main components. First, we achieve the estimate of the quadcopter pose in the environment and also calibrate the control signal required to reach the given target position in the map. To achive this localization and map- ping problem we implement the approach of autonomous camera-based navigation of a quadcopter [21]. Second, we use Oriented FAST and Rotated BRIEF (ORB) to ex- tract features from the video frames and match the extracted features with the known features to classify if the object is an obstacle. Finally, given that the quadcopter pose, presence of obstacle is known we implement the Function approximation and feature-based method of reinforcement learning to navigate to all the target locations defined avoiding the obstacle in the path. To validate our approach we implemented our approach on a AR.Drone and tested it in different environments. In our approach, all the computation of localization and mapping, object recognition and reinforcement learning are performed on a ground station, which is connected to the drone via wireless LAN. The required traning for object classification and convergence of the weight values of features for reinforcement learning is performed previously and stored in specific location in the ground station. Experiments have validated that the presented methods work in various environ- ment and is able to navigate accurately to predefined target locations. iii
  • 4. Acknowledgement I take this opportunity to express my gratitude to all who supported me wholeheartedly to come through many impediments without whom my thesis would not have been possible. Foremost, I would like to express my sincere gratitude to Prof. Dr. Peter Nauth for inspiring me to the field of machine learning and giving me an oppurtunity to choose my thesis under it. I would like to express my sincere appreciation to my supervisor Prof. Dr. Andreas Pech for his valuable support and help. Finally, I would like to thank all my colleagues and friends for all their support.
  • 5. Abbreviations Abbreviation Meaning MAV Miniature aerial vehicals GPS Global Positioning System SLAM Simultaneous Localization and Mapping API Application Programming Interface UDP User Datagram Protocol TCP Transmission Control Protocol SDK Software Development Kit ROS Robot Operating System MDP Markov Decision Process RRL Relational Reinforcement Learning FAFBRL Function Approximation and Feature Based Reinforcement Learning PTAM Parallel Tracking and Mapping FAST Features from accelerated segment test BRIEF Binary Robust Independent Elementary Features SURF Speeded Up Robust Features SIFT Scale-Invariant Feature Transform ORB Oriented FAST and Rotated BRIEF LoG Laplacian of Gaussian DoG Different of Gaussian PID Proportional Integral Derivative Controller EKF Extended Kalman Filter UKF Unscented Kalman Filter IDL Interface Defination Language BSD Berkeley Software Distribution v
  • 6. Contents Declaration ii Abstract iii Acknowledgement iv Abbreviations v 1 Introduction 2 1.1 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2 Theoretical background 6 2.1 Quadcopter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.1.1 Basic Mechanics . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.1.2 Parrot AR.Drone 2 . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.1.3 AR.Drone Software . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.1.4 Open Source Software . . . . . . . . . . . . . . . . . . . . . . . 11 2.2 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.2.1 Exploration versus Exploitation . . . . . . . . . . . . . . . . . . 13 2.2.2 Delayed Reward . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.2.3 Q-Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.2.4 Relational Reinforcement Learning . . . . . . . . . . . . . . . . 16 2.2.5 Function Approximation and Feature-Based Method . . . . . . 17 2.3 Simultaneous Localization and Mapping . . . . . . . . . . . . . . . . . 19 2.3.1 Monocular SLAM . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.3.2 Description of the PTAM Algorithm . . . . . . . . . . . . . . . 20 2.3.3 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.4 Object Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.4.1 Feature Detection and Extraction . . . . . . . . . . . . . . . . . 22 2.5 Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.5.1 Proportional Integral Derivative Controller . . . . . . . . . . . . 27 2.6 Sensor Integration and Filtering . . . . . . . . . . . . . . . . . . . . . . 30 2.6.1 Kalman Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 2.6.2 Extended Kalman Filter . . . . . . . . . . . . . . . . . . . . . . 32 2.6.3 Unscented Kalman Filter . . . . . . . . . . . . . . . . . . . . . . 33 2.6.4 Particle Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 2.7 Robot Operating System . . . . . . . . . . . . . . . . . . . . . . . . . . 34 2.7.1 Goals of ROS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 vi
  • 7. Contents (continued)
        2.7.2 Terminology 35
    3 Methodology 37
      3.1 Outline 37
      3.2 Localization and Navigation of the Drone 40
        3.2.1 Implementation 41
        3.2.2 Employed Software 42
      3.3 Obstacle Recognition 43
      3.4 Autonomous avoidance of obstacles 45
        3.4.1 Initialization and Training 45
        3.4.2 Simulation of training 48
        3.4.3 Software 51
      3.5 Software Architecture 52
    4 Results 54
      4.1 Simulation Results 54
        4.1.1 Reinforcement Learning Accuracy 54
      4.2 AR.Drone Results 59
        4.2.1 Obstacle Detection Accuracy 59
        4.2.2 Localization of the quadcopter 60
        4.2.3 Collision Avoidance Evaluation 61
    5 Conclusion and Future Work 62
      5.1 Conclusion 62
      5.2 Future Work 63
    List of Figures 67
    Bibliography 67
  • 8. 1. Introduction

In recent years, miniature aerial vehicles (MAVs) have become an important tool in areas such as aerial surveillance, visual inspection, military operations, remote farming and filming. Quadcopters are particularly popular among MAVs because their small size, high maneuverability, stability and agility enable them to fly in small indoor environments. With these attractive features, new application ideas have been emerging recently, including package delivery, the ambulance drone [2] and swarms of quadcopters that collectively build structures. For all these applications to operate optimally in real-world situations, the MAVs must be fully autonomous rather than merely automated. In an automated system the actions are pre-determined, and if the system encounters an unplanned situation it stops and waits for human help. In an autonomous system, by contrast, actions are taken without human intervention, even when encountering uncertainty or unanticipated events [24].

The ultimate goal of robotics is to develop systems that are fully autonomous in real-world scenarios. Although a lot of progress has been made in the field of autonomous systems in recent years, a fully autonomous system in a real-world scenario is yet to be achieved. One of the hindrances to achieving a fully autonomous system is navigating towards a goal without any collision with the environment. To navigate from one point to another, the system must have a map of the environment, be able to localize itself within it, and be able to detect and avoid obstacles autonomously. The task of creating a map of an unknown environment is known as the Mapping problem, the task of localizing the robot in the map is the Localization problem, the task of detecting an object and classifying it as an obstacle is Object Recognition, and the task of avoiding obstacles is the Planning problem. Similar problems of localization and of detecting and avoiding obstacles have been solved with considerable accuracy for other autonomous systems, such as autonomous driving [40], with the help of multiple sensors like lasers, radars, the Global Positioning System (GPS) and stereo cameras. However, simply adding sensors is not an option for small-scale quadcopters because of the increased weight, power consumption and space requirements; only lightweight sensors such as a monocular camera are feasible. Hence, we are interested in methods for autonomous navigation using sensors that are feasible for MAVs.

The task of tracking the motion of the robot while incrementally constructing a map of the environment has been widely researched in the field of computer vision and is known as Simultaneous Localization and Mapping (SLAM). A wide range of SLAM algorithms have been presented, which provide good localization and mapping with
  • 9. Figure 1.1: Various applications of quadcopters. Top left: the DHL quadcopter called Parcelcopter [8], designed for delivery of medications and other urgently needed goods. Top right: the Amazon quadcopter called Prime Air [9] for delivery of packages. Middle: quadcopters programmed by ETH Zurich which are able to lift, transport and assemble 1500 small bricks of polystyrene to build a 6-meter tall tower [3]. Bottom: Skycatch providing aerial mapping on building sites [13].
  • 10. the use of multiple sensors; solving the SLAM problem with a monocular camera is still a challenging problem and is known as the Monocular SLAM problem.

Once the robot is capable of localizing itself in the environment, it must also be able to detect obstacles in its path in order to navigate towards the goal position. Enabling the robot to detect obstacles and their distance relative to itself using the minimal set of sensors available on MAVs is another challenging task. Object detection and classification is commonly accomplished using one of the many proposed Object Recognition algorithms. The general idea behind object recognition is to compute feature points from the images and use these features to classify the images.

Only when both the robot and the obstacles are localized with respect to the robot's position can the robot navigate past the obstacles without collision. Obstacle avoidance is accurate only when the belief about the robot's position relative to the obstacle is accurate. When the obstacles are static, path-planning algorithms can provide an optimal path to the target position, but avoiding dynamic obstacles is still an open problem because of their changing behaviour. A well-known approach for learning within autonomous systems is reinforcement learning. The learning uses reward and punishment from the environment to learn an optimal policy for the experienced environmental states. This method can be used to gather experience of obstacles and their behaviour, and to avoid collisions with them in the future.

1.1 Objectives

The objective of this thesis is to take advantage of SLAM algorithms developed for monocular cameras for the localization of the AR.Drone, to detect static and dynamic obstacles, and to avoid collisions with them while navigating to the goal position using reinforcement learning.

• Localization and Navigation of the Drone: With the help of the localization algorithms, compute an estimate of the quadcopter's pose and calculate appropriate control commands to navigate to and hold a given position.
• Detection of the obstacle: Applying object recognition algorithms, classify obstacles using the frontal camera of the AR.Drone.
• Avoiding the obstacle: Applying reinforcement learning, learn an optimal policy to avoid obstacles.
  • 11. 1.2 Overview

The paper starts with the introduction and the objectives of this study. In the second chapter all the necessary theoretical background is explained: the chapter starts with a description of the quadcopter, continues with reinforcement learning, followed by the simultaneous localization and mapping problem, the description of object recognition, control algorithms and filters, and ends with a description of ROS. Chapter three details the methodology used to achieve the objectives of the thesis; it starts with an outline of the method and then describes each part of the objective (localization and navigation of the drone, obstacle recognition and the autonomous avoidance of obstacles), and finally the software architecture designed to achieve the objectives. Chapter four contains the results that were achieved when the suggested methodology was implemented, and finally chapter five describes the conclusion and future work.
  • 12. 2. Theoretical background

This chapter includes a detailed description of the devices and technologies used to achieve the main objective of the paper. The subsections of the chapter include a description of the quadcopter and an introduction to the AR.Drone, and the related theory of reinforcement learning, SLAM, object recognition, control algorithms, filters and ROS.

2.1 Quadcopter

Figure 2.1: Left: AR.Drone with indoor hull. Right: AR.Drone with outdoor hull.

The AR.Drone is a commercially available electric quadrotor which was introduced in January 2010, originally designed as a high-tech toy that can be controlled from a mobile phone or tablet running Android or iOS. Currently the AR.Drone is widely used for research in the fields of robotics, artificial intelligence and computer vision due to its low cost, robustness to crashes, safety and reliability for both indoor and outdoor use, and the availability of an open application programming interface (API) which provides onboard sensor values and can also be used to control the drone. In this chapter we present the working mechanics of a quadcopter in Section 2.1.1, describe the AR.Drone, its available sensors and its communication interfaces in Section 2.1.2, and in Section 2.1.3 briefly present the available software development kit for the AR.Drone as well as the software package tum_ardrone, which we used for our work.
  • 13. Figure 2.2: AR.Drone with its coordinate axes, the roll, pitch and yaw angles, and the rotation direction of each rotor.

2.1.1 Basic Mechanics

A quadcopter is a multirotor helicopter which is lifted and maneuvered with the help of four rotors. Quadcopters have four fixed-pitch propellers, two rotating clockwise and two counter-clockwise, cancelling out their respective torques. All four rotors produce thrust as well as torque about their centers of rotation, and also produce drag opposite to the drone's direction of flight. Ignoring external influences and mechanical inaccuracies, when all rotors spin with the same angular velocity, with two rotors rotating clockwise and two counter-clockwise, the overall torque and the angular acceleration about the yaw axis are exactly zero, which allows the quadcopter to remain stable in the air without moving. Even if the control mechanics of the quadcopter seem theoretically simple, stable flight is made possible only through a number of sensors combined with the onboard stabilization algorithms. The different actions that can be taken by a quadcopter are listed below:

• A quadcopter can adjust its altitude by applying equal angular velocity to all four rotors. Vertical acceleration is achieved by increasing or decreasing the angular velocity of all four rotors equally.
• A quadcopter can adjust its yaw, without changing the upward thrust and balance, by increasing the angular velocity of rotors 1 and 3 while decreasing the angular velocity of rotors 2 and 4, which results in an overall clockwise torque. Similarly, an anti-clockwise torque can be generated by increasing the angular velocity of rotors 2 and 4 while decreasing the angular velocity of rotors 1 and 3.
  • 14. • A quadcopter can adjust its pitch or roll by increasing the angular velocity of one rotor and decreasing the angular velocity of the diametrically opposite rotor, which changes the roll or pitch angle and hence produces horizontal acceleration.

2.1.2 Parrot AR.Drone 2

The Parrot AR.Drone measures 52.5 cm × 51.5 cm with, and 45 cm × 29 cm without, its hull. It has four rotors attached to its body, each with a 20 cm diameter. A removable styrofoam hull protects the drone, and particularly the rotors, during flights, which makes it well suited for experimental flying and development. An alternative outdoor hull does not provide the rotor protection and hence offers less protection against collisions, but makes the drone more agile and maneuverable. The drone weighs 380 g with the outdoor hull and 420 g with the indoor hull. Each rotor is powered by its own brushless motor. A range of sensors inside the drone helps keep it stable and makes advanced flight easier. The onboard controller runs a Linux operating system, and communication with the pilot is achieved through a self-generated Wi-Fi hotspot. Energy is provided by a 1000 mAh lithium polymer battery, which allows a flight time of approximately 15 minutes.

Figure 2.3: AR.Drone and its sensors and hardware parts.

The onboard sensors present on the drone include:

• Ultrasonic altimeter: used for vertical stabilisation. The sensor is attached to the bottom of the AR.Drone and is directed downward in order to measure the distance to the floor. It has an effective range of approximately 20 cm to 6 m.
• Pressure sensor: together with the ultrasonic altimeter, used for more stable flight and hovering.
  • 15. • Two cameras: one directed forward horizontally and one directed downwards. The front-facing camera uses a fish-eye lens and delivers 18 frames per second at a resolution of 640 × 480 pixels with a field of view of 73.4° × 58.5°. The video obtained from the front camera suffers from radial distortion due to the fish-eye lens, motion blur due to rapid movement of the drone, and linear distortion due to the rolling shutter of the camera. The downward-facing camera delivers 60 frames per second at a resolution of 176 × 144 pixels with a field of view of only 17.5° × 36.5°; it has lower radial distortion, motion blur and rolling-shutter effects. Both cameras have automatic brightness and contrast adjustment [23]. The bottom-facing camera is also used by the onboard controller for horizontal stabilization and to estimate the velocity of the drone.
• 3-axis accelerometer: used to measure the acceleration of the drone.
• 2-axis gyroscope: used to measure the roll and pitch angles of the drone. The measured angles deviate by only 0.5° and do not suffer from drift over time.
• 1-axis yaw precision gyroscope: used to measure the yaw angle. The yaw angle suffers from drift over time [23].

2.1.3 AR.Drone Software

As the AR.Drone is a commercial product, there is no open source code or documentation of the control software. Only basic functionality can be accessed through the available communication channels and interfaces. Communication with the AR.Drone can be achieved through four main communication channels:

• navigation channel
• command channel
• control channel
• video channel

Navigation Channel

This is a User Datagram Protocol (UDP) channel on port 5554 which broadcasts basic navigational data. In the default mode of operation the data is broadcast every 30 ms; in debug mode it is broadcast every 5 ms. Since the channel is UDP, packets can be lost. The channel provides various onboard sensor values; the main values used in our thesis are described below.

• orientation of the drone: as roll, pitch and yaw angles
• altitude in millimeters
• velocity
• battery status: a value between 0 and 100
  • 16. • state: the current state of the drone, which can be 'LANDED', 'TAKEOFF', 'HOVERING', etc.
• timestamp: the drone's internal timestamp, in milliseconds, at which the data was sent.

Command Channel

This is also a UDP channel, on port 5555, which receives data for external navigation control of the drone. Normal navigation control is achieved by assigning the desired value to the respective variable; the range of values that can be assigned is between -1 and 1. Respective values can be assigned for the desired vertical velocity, roll angle, pitch angle and rotational speed. A flag can be set or cleared to switch between hover mode and manual mode, and similarly flags can be set to take off and land. Since this communication channel is UDP, it suffers from the possibility of losing data, so the command data must be re-sent. The commands used in our thesis are described below (an illustrative example of composing these commands is sketched at the end of Section 2.1):

• AT*FTRIM: This command is called before take-off for horizontal calibration.
• AT*PCMD: This command is used to control the flight motion of the quadcopter. The syntax is AT*PCMD=[Sequence number],[Flag bitfield],[Roll],[Pitch],[Gaz],[Yaw]. The sequence number is 1 if it is the first command sent to the quadcopter, otherwise the sequence number of the previous command plus 1. The flag bitfield is a 32-bit integer: if the bit has value 0, the values of Roll, Pitch, Gaz and Yaw are ignored and the quadcopter enters hovering mode; if the bit has value 1, the quadcopter changes its angles as provided in the arguments. Roll, Pitch, Gaz and Yaw can have values between -1 and 1. The roll value determines the left-right tilt, the pitch value determines the front-back tilt, the yaw value determines the rotation of the quadcopter about the vertical axis as shown in Figure 2.2, and Gaz determines the vertical speed and hence the flight height of the quadcopter.
• AT*REF: This command is used to control basic actions such as take-off, landing, emergency stop and reset. The syntax is AT*REF=[Sequence number],[Input], where Input is a 32-bit integer value. Bits 0 to 7 are not used, bit 8 is set when emergency mode is required, and bit 9 determines take-off when set to 1 and landing when set to 0.

Control Channel

This is a Transmission Control Protocol (TCP) channel on port 5559 which provides reliable communication and is used to transfer critical data such as configuration information. One of the configurations we used during our work was switching between the available video channels.

Video Channel

This channel is responsible for sending the video stream from the front camera or the bottom camera, according to the configuration set through the control channel. This is a
  • 17. UDP channel on port 5555. The video stream provided by this channel uses standard H.264 encoding and can be decoded using ffmpeg for image processing.

2.1.4 Open Source Software

There are numerous open source software packages for basic navigation of the AR.Drone, such as [7], [11] and [5]. All of these use the basic client protocol of the software development kit (SDK) provided by Parrot. The SDK is written in C and also contains basic example applications. It is documented in [1] and can accomplish tasks such as setting up the communication channels, receiving and decoding video, receiving and decoding the navigational data, and encoding and sending the control commands [23]. Since our thesis needs functionality that solves the problem of localization and mapping along with navigation of the drone, we used the open source software [14]. The software is a Robot Operating System (ROS) package based on the paper [23]. It is well documented, and its autonomous flight and visual navigation functionality based on Parallel Tracking and Mapping enables the AR.Drone to navigate autonomously in previously unknown and GPS-denied environments.
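The following sketch (not part of the thesis software) illustrates how the AT commands described in Section 2.1.3 could be composed and sent over the UDP command channel. The drone IP address and the encoding of the AT*PCMD float arguments as 32-bit integer bit patterns follow the Parrot SDK conventions as we understand them and should be treated as assumptions; the thesis itself only specifies the command syntax and value ranges.

```python
# Illustrative sketch (not the thesis implementation): building and sending
# AT*FTRIM, AT*REF and AT*PCMD commands over the UDP command channel.
# The IP address and float-to-integer encoding are assumptions based on the
# Parrot SDK; the port follows the command channel description above.
import socket
import struct

def float_to_int(value):
    """Assumed SDK convention: a float in [-1, 1] is transmitted as the
    32-bit integer that shares its IEEE-754 bit pattern."""
    return struct.unpack("<i", struct.pack("<f", float(value)))[0]

class ATCommandClient:
    def __init__(self, ip="192.168.1.1", port=5555):
        self.addr = (ip, port)
        self.seq = 0
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def _send(self, cmd_template):
        self.seq += 1                      # sequence number grows by one per command
        packet = cmd_template.format(self.seq) + "\r"
        self.sock.sendto(packet.encode("ascii"), self.addr)

    def ftrim(self):
        self._send("AT*FTRIM={}")          # horizontal calibration before take-off

    def ref(self, takeoff):
        # bit 9 = 1 requests take-off, bit 9 = 0 requests landing
        self._send("AT*REF={{}},{}".format(1 << 9 if takeoff else 0))

    def pcmd(self, roll, pitch, gaz, yaw, hover=False):
        flag = 0 if hover else 1           # flag 0 ignores the angles and hovers
        args = ",".join(str(float_to_int(v)) for v in (roll, pitch, gaz, yaw))
        self._send("AT*PCMD={{}},{},{}".format(flag, args))

# Example usage: take off, drift slowly forward, then land.
# client = ATCommandClient()
# client.ftrim(); client.ref(takeoff=True)
# client.pcmd(roll=0.0, pitch=-0.1, gaz=0.0, yaw=0.0)
# client.ref(takeoff=False)
```

Because the channel is UDP, in practice such commands are re-sent at a fixed rate rather than only once, as noted in the command channel description.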
  • 18. 2.2 Reinforcement Learning

Reinforcement learning [27] is a subfield of machine learning where an optimal solution is found through a trial-and-error learning process in a dynamic environment. It is the process of learning what to do in a given state, i.e. how to map situations to actions, so that the numerical reward is maximized. In reinforcement learning the agent has no prior knowledge of how to perform the task in a given state, as in most forms of machine learning, but instead must discover which actions yield maximum reward by trying them [16]. The actions so discovered affect not only the immediate reward but also the subsequent rewards. This method is useful when the agent is in an unknown, dynamic, large environment, such as a robot navigating in a previously unknown environment which is not static.

Figure 2.4: The cooperation between the agent and the environment in reinforcement learning [16].

In a standard setting of reinforcement learning the environment is modeled as a Markov Decision Process (MDP), which consists of a set of states, a set of actions and a transition model. At each discrete time t, the agent senses a state and chooses an action. The action stochastically causes the transition to a new state, and the agent receives a reward from the environment. Formally, reinforcement learning helps the agent to learn a policy $\pi : S \to A$ that, when followed, maximizes the cumulative reward:

$V^{\pi}(s_t) = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots$   (2.1)

$V^{\pi}(s_t) = \sum_{i=0}^{\infty} \gamma^i r_{t+i}$   (2.2)

Here V is the cumulative reward, r is the immediate reward, $\pi$ is the policy, t is the discrete time and $\gamma$ is a constant between 0 and 1 that determines the relative value of delayed rewards versus immediate rewards. Reinforcement learning algorithms are categorized as model-based or model-free. If the agent first has to learn a model of the environment in order to find the optimal policy, the algorithm is model-based; if the agent learns directly without an explicit model of the environment, it is model-free [27]. In our thesis we focus on the model-free setting, because the environment is previously unknown and the drone has to explore and choose the best action in each state to achieve the maximum reward.
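As a small illustrative aid (not part of the thesis), the following helper evaluates the discounted return of equation (2.2) for a finite reward sequence; the rewards and discount factor in the example are arbitrary.

```python
# Illustrative helper (not from the thesis): discounted return of eq. (2.2),
# V(s_t) = sum_i gamma^i * r_{t+i}, for a finite reward sequence.
def discounted_return(rewards, gamma=0.9):
    total = 0.0
    for i, r in enumerate(rewards):
        total += (gamma ** i) * r
    return total

# Example: three small penalties followed by a large goal reward.
# discounted_return([-1, -1, -1, 100], gamma=0.9)  # = -1 - 0.9 - 0.81 + 72.9
```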
  • 19. 2.2.1 Exploration versus Exploitation

Reinforcement learning differs from supervised learning mainly in that, in supervised learning, the agent learns from examples provided by an external supervisor, whereas in reinforcement learning the agent must learn from its own experience by explicitly exploring the environment. While exploring the environment there is always a trade-off between exploration and exploitation: in order to obtain reward, the agent has to exploit what it already knows, but to make better action selections in the future it also has to explore the environment. One of the elementary reinforcement learning problems is the k-armed bandit problem, which is widely studied in the statistics and applied mathematics literature [27] and depicts the trade-off between exploration and exploitation. There is a large variety of solutions for this trade-off; three techniques used in practice are as follows:

• Dynamic Programming Approach: Dynamic programming is widely considered the optimal method for solving general stochastic optimal control problems. It is a collection of algorithms used to obtain the optimal policy given a perfect model of the environment as a Markov decision process. This approach suffers from the curse of dimensionality, where the computational requirements grow exponentially with the number of states.
• Greedy Strategies: In this strategy the action with the highest estimated reward is always chosen. The problem is that if early sampling indicates that a suboptimal action's reward is higher than the best action's reward, the suboptimal action is chosen from then on, which means the optimal action is never sampled enough and its superiority is never discovered.
• Randomized Strategies: In this strategy the action with the best estimated expected reward is chosen by default, but with probability p a random action is chosen instead. To encourage initial exploration, some versions of this strategy start with a high value of p, which is decreased slowly to increase exploitation and decrease exploration later [27].

2.2.2 Delayed Reward

In general, in reinforcement learning the agent's action determines not only its immediate reward but also the next state of the environment. This kind of environment can be imagined as a network of k-armed bandit problems, but the agent must evaluate the next state as well as the immediate reward when deciding which action to take. To reach a state with high reinforcement the agent may have to go through a long sequence of low-reward actions. Ultimately the agent must be able to learn which action in the current state is best for achieving greater reward in the far future. The problems that arise due to delayed reward can be modeled as Markov Decision Processes (MDPs). An MDP consists of:

• a set of states S,
  • 20. • a set of actions A,
• a reward function $R : S \times A \to \mathbb{R}$, and
• a state transition function $T : S \times A \to \Pi(S)$.

Two algorithms for learning the optimal policy in MDP environments, given that a correct model is provided, are:

• Value Iteration
• Policy Iteration

To achieve the optimal policy in MDP environments with these algorithms, the agent must already know the model, which consists of the state transition probability function $T(s, a, s')$ and the reinforcement function $R(s, a)$. Since some form of the model of the environment is available, these problems can be considered model-based planning problems; if the model of the environment is not available, the problem is model-free, which is considered the genuine reinforcement learning problem.

2.2.3 Q-Learning

Q-learning is a model-free reinforcement learning algorithm. It provides agents with the capability of learning to take actions which are optimal in Markovian domains by experiencing the consequences of actions, without requiring the agents to build a model of the domain [43]. Q-learning is the most popular and seems to be the most effective model-free algorithm for learning from delayed reinforcement [27]. By repeatedly trying all the actions in all the states, the agent learns which actions are best, judged by the long-term discounted reward. Even though Q-learning is a primitive form of learning, it can operate as the basis of far more sophisticated devices [43]. In the remainder of this section we use the following notation:

• a : action of the agent
• s : state of the agent in the environment
• $Q^*(s, a)$ : expected discounted reinforcement of taking action a in state s and continuing by choosing optimal actions
• $V^*(s)$ : value of s assuming the best action is taken in the initial state, which can be written as $V^*(s) = \max_a Q^*(s, a)$

The equation for $Q^*(s, a)$ can be written as

$Q^*(s, a) = R(s, a) + \gamma \sum_{s' \in S} T(s, a, s') \max_{a'} Q^*(s', a')$   (2.3)

Since $V^*(s) = \max_a Q^*(s, a)$, we have $\pi^*(s) = \arg\max_a Q^*(s, a)$ as an optimal policy. Because the action is made explicit by the Q function, we can estimate the Q values and also use them to define the policy, as an action can be chosen simply by selecting the one which returns the maximum Q value for the current state. The Q-learning update rule can hence be written as:

$Q(s, a) := Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)$   (2.4)
  • 21. where $(s, a, r, s')$ is an experience tuple, $0 \le \alpha \le 1$ is the learning rate parameter and $\gamma$ is the discount factor. If each action is selected in each state an infinite number of times over an infinite run and $\alpha$ is decayed appropriately, the Q values will converge with probability 1 to $Q^*$ [44]. The Q-learning algorithm is described in Algorithm 1.

Algorithm 1: Q-learning Algorithm
  initialize the number of episodes;
  for each s, a do
      initialize Q(s, a) = 0;
  end
  e := 0;
  while e is less than the number of episodes do
      e := e + 1;
      create a starting state s_0;
      i := 0;
      while not goal(s_i) do
          select an action a_i from the policy for the state and execute it;
          receive the immediate reward r_i = r(s_i, a_i) from the environment;
          observe the new state s_{i+1};
          i := i + 1;
      end
      for k = i - 1 down to 0 do
          update the value Q(s_k, a_k) := r_k + γ max_{a'} Q(s_{k+1}, a');
      end
  end

When the values of Q are about to converge to their optimal values, it is appropriate for the agent to act greedily and take the action with the highest Q value. During the learning phase, however, there exists the difficult exploration versus exploitation trade-off discussed in Section 2.2.1. Q-learning is exploration insensitive [27], meaning that the Q values converge to their optimal values independently of how the agent chooses its actions while the initial data is being collected (as long as all state-action pairs are tried often enough). In other words, although the exploration versus exploitation trade-off remains, the exploration technique does not affect the convergence of the learning algorithm. One technique that makes the algorithm independent of the exploration and exploitation problem is to select action a in state s stochastically, such that Pr(a|s) depends on Q(s, a):

$\Pr(a_i \mid s) = \frac{T^{-Q(s, a_i)}}{\sum_j T^{-Q(s, a_j)}}$   (2.5)

Because its convergence is independent of the exploration technique, Q-learning is the most widely used and most effective model-free algorithm for learning from delayed reinforcement in small domains. For larger domains it is infeasible, since the computational requirements grow exponentially with the number of states, as do the memory requirements for storing all the state-action pairs. Relational Reinforcement Learning, which we discuss in the next section, is one approach to solving this problem [26].
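The following sketch (not the thesis implementation) shows a standard online variant of tabular Q-learning: the update rule (2.4) combined with the stochastic action selection of equation (2.5). The environment interface with reset(), step() and an action list is a hypothetical placeholder, and the temperature value is an arbitrary choice.

```python
# Illustrative sketch (not the thesis implementation): tabular Q-learning with
# update rule (2.4) and the stochastic selection of eq. (2.5). The `env` object
# (reset/step/actions) is a hypothetical interface, not the AR.Drone software.
import random
from collections import defaultdict

def select_action(Q, s, actions, T=0.5):
    # Pr(a_i|s) proportional to T^(-Q(s, a_i)) as in equation (2.5);
    # with T < 1 higher-valued actions become more likely.
    weights = [T ** (-Q[(s, a)]) for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9):
    Q = defaultdict(float)                       # Q(s, a) initialized to 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = select_action(Q, s, env.actions)
            s_next, r, done = env.step(a)        # immediate reward and next state
            best_next = max(Q[(s_next, a2)] for a2 in env.actions)
            # update rule (2.4)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```

Unlike Algorithm 1, which replays the episode backwards at the end, this sketch applies the update after every step; both are common formulations of the same rule.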
  • 22. 2.2.4 Relational Reinforcement Learning

Relational reinforcement learning (RRL) was introduced by [19]. RRL emphasizes a relational rather than the traditional attribute-value representation. It is obtained by combining the classical Q-learning algorithm, probabilistic selection of actions and a relational regression algorithm. Using RRL, even if the agent only learns on a group of small problem instances and is then exposed to a more complex environment, it still performs well, provided that the environment has a similar structure. The Relational Reinforcement Learning algorithm is described in Algorithm 2.

Algorithm 2: Relational Reinforcement Learning Algorithm
  initialize the number of episodes;
  for each s, a do
      initialize Q(s, a) = 0;
  end
  e := 0;
  while e is less than the number of episodes do
      e := e + 1;
      create a starting state s_0;
      i := 0;
      while not goal(s_i) do
          select an action a_i stochastically as in (2.5) and execute it;
          receive the immediate reward r_i = r(s_i, a_i) from the environment;
          observe the new state s_{i+1};
          i := i + 1;
      end
      for k = i - 1 down to 0 do
          q_k := r_k + γ max_{a'} Q_e(s_{k+1}, a');
          generate the example x = (s_k, a_k, q_k);
          if an example (s_j, a_j, q_old) already exists in the example set then
              replace it;
          else
              add x to the example set;
          end
          update the value of Q;
      end
  end
  • 23. 2.2.5 Function Approximation and Feature-Based Method

[26] introduced this method of reinforcement learning, in which a function approximation approach is used to reduce the computational requirements while providing results similar to RRL. In this algorithm each state is represented by a small, fixed number of features that does not depend on the number of states, i.e. states with similar features can be represented as one. As the algorithm is independent of the number of states, it can operate in high-dimensional state spaces. In the remainder of this section we use the following notation:

• Q-function : a linear combination of features (the features describe a state)
• $f_1, f_2, \dots, f_n$ : the set of features, where n is the number of features
• a : action
• $\theta_1, \theta_2, \dots, \theta_n$ : the specific weight of each feature
• $\alpha$ : learning rate
• $\gamma$ : the discount factor
• r : the immediate reward of the current state

The equation for the Q-function after an action a can be written as:

$Q^a(s, a) = \theta^a_1 f_1 + \dots + \theta^a_n f_n$   (2.6)

The update of the feature weights is calculated as follows:

$\theta^a_k = \theta^a_k + \alpha \left( r + \gamma \max_{a'} Q^{a'}(s', a') - Q^a(s, a) \right) \frac{dQ^a(s, a)}{d\theta^a_k}$   (2.7)

The feature weight $\theta^a_k$ represents the weight of feature k after action a is performed. The update minimizes the error using the well-known least squares method: first, equation (2.6) is used to compute the $Q^a$ value of the state, represented by its features multiplied by their respective weights $\theta_1, \theta_2, \dots$, which initially have some constant value. After the action is performed we can calculate the target Q value, as in equation (2.4), which is $Q = r + \gamma \max_{a'} Q^{a'}(s', a')$. Having both the computed $Q^a$ value and the target Q value, the error can be calculated simply by subtracting $Q$ and $Q^a$. Minimizing this error with the least squares method results in the update equation (2.7). As a state of the environment is represented by its features, adding more features makes the classification of states easier, but introducing new features to represent the states may also introduce an overfitting problem when minimizing the error term. The number of features should therefore be selected carefully, as a larger number of features increases the probability of overfitting, which would make the overall algorithm less optimal.
  • 24. The reinforcement learning algorithm with function approximation and a feature-based method is described in Algorithm 3.

Algorithm 3: Function Approximation and Feature-Based Method Algorithm [26]
  initialize the number of episodes;
  for each s, a do
      initialize θ(s, a) = 0;
  end
  e := 0;
  while e is less than the number of episodes do
      e := e + 1;
      create a starting state s_0;
      i := 0;
      while not goal(s_i) do
          select an action a_i stochastically as in (2.5) using the current values of θ and execute it;
          receive the immediate reward r_i = r(s_i, a_i) from the environment;
          observe the new state s_{i+1};
          i := i + 1;
      end
      for k = i - 1 down to 0 do
          update the value of Q using (2.6);
          update the values of θ for all actions using (2.7);
      end
  end
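The sketch below (not the thesis implementation) shows the core of this method in code: one weight vector per action, the Q-value of equation (2.6) as a dot product, and the gradient step of equation (2.7), in which the derivative of a linear Q-function with respect to a weight is simply the corresponding feature value. The feature names in the example are hypothetical.

```python
# Illustrative sketch (not the thesis implementation) of the feature-based
# Q-function (2.6) and the weight update (2.7). One weight vector is kept per
# action; the feature vector f_1..f_n of a state is supplied by the caller.
import numpy as np

class LinearQ:
    def __init__(self, actions, n_features, alpha=0.05, gamma=0.9):
        self.actions = actions
        self.alpha, self.gamma = alpha, gamma
        # theta[a] holds the weights theta_1^a .. theta_n^a of equation (2.6)
        self.theta = {a: np.zeros(n_features) for a in actions}

    def q(self, f, a):
        # Q^a(s) = theta_1^a f_1 + ... + theta_n^a f_n      (equation 2.6)
        return float(np.dot(self.theta[a], f))

    def update(self, f, a, r, f_next, done):
        # target = r + gamma * max_a' Q(s', a'); dQ/dtheta_k = f_k, so the
        # least-squares gradient step of equation (2.7) becomes:
        target = r if done else r + self.gamma * max(self.q(f_next, a2) for a2 in self.actions)
        error = target - self.q(f, a)
        self.theta[a] += self.alpha * error * f

# Example with hypothetical obstacle-avoidance features
# (distance to obstacle, heading error, distance to target):
# agent = LinearQ(actions=["left", "right", "forward"], n_features=3)
# agent.update(np.array([0.8, 0.1, 2.0]), "forward", r=-1.0,
#              f_next=np.array([0.6, 0.1, 1.8]), done=False)
```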
  • 25. 2.3 Simultaneous Localization and Mapping

Simultaneous localization and mapping (SLAM) is the process of estimating the location of the robot while simultaneously acquiring an approximate map of the unknown environment. SLAM contains the localization problem, where the map is known while the pose of the robot is unknown, and the mapping problem, where the pose of the robot is known while the map is unknown. SLAM is an important component of autonomous systems for navigation in unknown and Global Positioning System (GPS)-denied environments. Different kinds of sensors can be used in a SLAM system, and they have a different impact on the algorithm used. Range sensors like lasers, ultrasound or a stereo camera provide depth information along with color information; alternatively, a monocular camera can be used. Sensors which provide more information, as in the case of a stereo camera, can greatly reduce the computational cost because the available depth information eliminates several challenges. Nevertheless, one drawback of depth-measuring devices is that they operate accurately only within a limited range, which constrains their use in open spaces. Due to the availability of a monocular camera on the AR.Drone, in this thesis we use monocular SLAM, i.e. SLAM based on a monocular camera. In this section we first describe monocular SLAM and give a brief description of some commonly used methodologies in Section 2.3.1. In Section 2.3.2 we introduce Parallel Tracking and Mapping, and finally in Section 2.3.3 we list the available software for SLAM.

2.3.1 Monocular SLAM

Monocular SLAM has advantages over other systems in terms of cost, flexibility and weight, which is an important factor for airborne robots. Even though a considerable amount of research has been carried out on monocular SLAM, it is still a challenging problem, particularly with respect to dealing with dynamic environments, efficient detection of loop closures, reducing computational complexity, and scale ambiguity. With a camera, rich information about the environment can be obtained, but a single image only provides the direction of features present in the environment and no depth information, which creates the problem of scale ambiguity. To obtain depth information, multiple images from different camera positions are required. Different SLAM methodologies have been implemented using a monocular camera; some of the methodologies studied in the course of preparing this work are as follows:

• Extended Kalman Filtering (EKF-SLAM): EKF-SLAM was the first proposed solution to the SLAM problem. The current position of the robot and the positions of the landmarks are represented as a state vector. The main problem with this approach is that its computational cost increases quadratically with the number of landmarks, which limits the number of landmarks to about 100 [18].
• FastSLAM [34]: This method uses a particle filter instead of the traditionally used Kalman filter, where each particle is responsible for representing a single possible path that could be taken by the robot and maintaining its own estimate of all landmark
  • 26. positions. This reduces the computational cost from quadratic to logarithmic, making it possible to handle larger environments [34].
• RatSLAM [32]: RatSLAM is a bio-inspired SLAM system derived from neural models of the rat hippocampus. A considerable amount of enhancement has been carried out on RatSLAM to increase its performance and its utility as a real-time SLAM system, e.g. [30], [31], [33].
• Parallel Tracking and Mapping (PTAM) [28]: PTAM is a method for estimating the camera pose in an unknown scene. Here tracking and mapping are separated into two different tasks, one responsible for robustly tracking the erratic motion of the device, while the other produces a 3D map of point features from previous video frames [28]. The drawback of this technique is that it can only be used in small environments.

Among the different methodologies, two have been predominant. One is the filtering approach, such as EKF-SLAM or FastSLAM, where the measurements from all images are fused together sequentially by updating probability distributions over features and camera pose parameters [39]. The other is the keyframe-based approach, which retains a selected subset of previous observations, called keyframes, explicitly representing the past knowledge gained. The keyframe-based approach has a computational advantage compared to the filtering approaches.

2.3.2 Description of the PTAM Algorithm

Parallel Tracking and Mapping (PTAM) was introduced in [28]. The algorithm is keyframe-based and hence computationally less expensive and robust. As previously mentioned in Section 2.3.1, the algorithm splits the simultaneous localization and mapping task into two separate threads running independently: the tracking thread and the mapping thread. The tracking thread is responsible for tracking features in the image frame to determine the pose of the camera. The mapping thread is responsible for optimizing the map and integrating new keyframes and landmarks. As a prerequisite, both processes require an initial map, which is created by a separate initialization procedure.

Optimizing a map of an unknown environment while simultaneously tracking the position within the same map, which is the SLAM problem, is made harder in the case of a monocular camera due to the lack of depth information. To create the initial map an initialization process is required; the general initialization algorithm is as follows (an illustrative triangulation sketch follows Section 2.3.3):

1. get the first frame from the camera
2. fetch keypoints p1, p2, ..., pn from the frame and take the first keyframe K1
3. move the camera and track the keypoints p1, p2, ..., pn in the image frames
4. determine the distance the camera has moved, and if the keypoints cannot be triangulated yet, repeat step 3
  • 27. 5. extract new keypoints p1, p2, ..., pn and take the second keyframe K2
6. create the initial map

Figure 2.5: Keypoints in the PTAM algorithm used for tracking. Different colors represent different scales: red points represent small keypoints and blue points represent large patches.

Keypoints

Keypoints are the local feature points described in Section 2.4. Since the PTAM algorithm requires keypoints from each frame for tracking, the keypoints should be generated with no time delay; therefore the FAST corner detector is used, which is described in Section 2.4.1. The FAST corner detector used in the PTAM algorithm operates on four different scales. Even though the FAST corner detector has the disadvantage of only a small number of available scales due to its discrete nature, a translation warp function is sufficient for frame-to-frame tracking because the changes in scale and perspective are small.

2.3.3 Software

Even though the SLAM problem has been intensively studied in the robotics community and different methodologies have been proposed, only a few implementations are available. Some of the available software packages are RatSLAM [12], PTAM [10], HectorSLAM [4] and Linear SLAM [6]. In our thesis we use the open source software [14], which uses the PTAM software [10] for SLAM. The software is a Robot Operating System (ROS) package that is a modified version of the monocular SLAM framework [28].
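The core of the map initialization above is triangulating the matched keypoints of the two keyframes once the camera has moved far enough. The following sketch (not part of PTAM or the tum_ardrone package) illustrates this step with OpenCV; the intrinsic matrix K and the relative pose R, t are made-up placeholder values.

```python
# Illustrative sketch (not part of PTAM or tum_ardrone): triangulating matched
# keypoints from two keyframes with a known relative pose to obtain the first
# 3D map points. The camera matrix and the pose below are placeholders.
import numpy as np
import cv2

def triangulate_initial_map(pts1, pts2, K, R, t):
    """pts1, pts2: Nx2 arrays of matched keypoints in keyframes K1 and K2."""
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])   # first keyframe at the origin
    P2 = K @ np.hstack([R, t.reshape(3, 1)])            # second keyframe pose
    pts4d = cv2.triangulatePoints(P1, P2, pts1.T.astype(float), pts2.T.astype(float))
    return (pts4d[:3] / pts4d[3]).T                     # homogeneous -> 3D map points

# Example with placeholder values (pure sideways translation of 0.1 units):
# K = np.array([[560., 0., 320.], [0., 560., 240.], [0., 0., 1.]])
# R, t = np.eye(3), np.array([0.1, 0., 0.])
# map_points = triangulate_initial_map(pts1, pts2, K, R, t)
```

Note that with a monocular camera the translation, and therefore the map, is only known up to scale, which is exactly the scale ambiguity discussed in Section 2.3.1.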
  • 28. 2.4 Object Recognition

Object recognition is the process of detecting objects of a certain class in a digital image. It consists of Feature Detection, where features are detected in the images; Feature Extraction, where the detected features are represented such that they can be compared with other features in an image; and Feature Matching, where the extracted features are matched using classification algorithms in order to determine the class of the object.

2.4.1 Feature Detection and Extraction

Feature Detection is a method which computes image information by making a local decision at every point of the image as to whether the point contains the type of feature defined or not [15]. There are different types of features in an image: edges, corners, blobs or regions, and ridges. To extract information from these features, different feature detectors can be used, such as the Harris corner detector, the Canny edge detector or Features from Accelerated Segment Test (FAST). Feature Extraction reduces the information required to describe a large set of data. It is a method where an initial set of measured data is converted to a set of features, known as feature descriptors or feature vectors, which is more informative and non-redundant, simplifying the learning and generalization steps. Feature extraction can also provide additional attributes such as the edge orientation, gradient magnitude, polarity and strength. Some well-known feature descriptors are listed below:

• Binary Robust Independent Elementary Features (BRIEF)
• Speeded Up Robust Features (SURF)
• Scale-Invariant Feature Transform (SIFT)
• Oriented FAST and Rotated BRIEF (ORB)

Harris Corner Detector

The Harris corner detector was presented in [25] as a combined edge and corner detector; it is one of the earliest and most widely used corner detectors. Corners are regions of an image where there is a large variation in intensity in all directions. Expressed mathematically, the difference of intensity for a displacement (u, v) in all directions is given by:

$E(u, v) = \sum_{x,y} w(x, y) \left[ I(x + u, y + v) - I(x, y) \right]^2$   (2.8)

Here the change in appearance is denoted by E(u, v), w(x, y) is the window or weighting function, I(x + u, y + v) is the shifted intensity value and I(x, y) is the intensity. For detection of a corner, E(u, v) should be maximized, so maximizing
  • 29. $I(x + u, y + v)$ using a Taylor expansion and simplifying, the equation can be written as:

$E(u, v) \approx \begin{pmatrix} u & v \end{pmatrix} M \begin{pmatrix} u \\ v \end{pmatrix}, \quad \text{where } M = \sum_{x,y} w(x, y) \begin{pmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{pmatrix}$   (2.9)

Figure 2.6: The corner is represented by the red area where both eigenvalues are high; for the edges, represented in blue, only one of the eigenvalues is high; and for a flat surface both eigenvalues are close to 0.

Here $I_x$ and $I_y$ are the image derivatives in the x and y directions. From M, which is also called the structure tensor of a patch, the two values $\max_{u,v} E(u, v)$ and $\min_{u,v} E(u, v)$ can be computed, which correspond to the two eigenvalues of M, denoted by $\lambda_1$ and $\lambda_2$. These two values can be used to determine whether a window contains a corner or not, using the following equation:

$R = \lambda_1 \lambda_2 - k(\lambda_1 + \lambda_2)^2 = \det(M) - k\,(\mathrm{trace}(M))^2$   (2.10)

FAST Corner Detector

The Features from Accelerated Segment Test (FAST) corner detector was presented by Edward Rosten and Tom Drummond in 2006 [36]. This algorithm is significantly faster than other methods, which enables its use in real-time applications such as the SLAM problem. A basic summary of the algorithm is as follows:

1. Select a pixel p from the image and compute the intensity of the pixel, Ip
2. Define a threshold value t
3. Compute the intensity values of the 16 pixels on a circle around the pixel p
4. Pixel p is a corner if a contiguous sequence of pixels among the 16 pixels have intensity higher than Ip + t, or intensity lower than Ip − t
  • 30. 5. A high-speed test is then introduced to reduce the number of non-corners, by first examining only four of the 16 pixels, namely pixels 1, 9, 5 and 13.

Figure 2.7: FAST corner detection with the 16 pixels aligned in a circle around the investigated pixel p [29].

SIFT

The Scale-Invariant Feature Transform (SIFT) is an algorithm to detect local features in an image, published by David Lowe in 1999 [29]. The features extracted using this algorithm are scale invariant. The algorithm consists of the following steps:

1. Scale-space feature detection: To detect keypoints at different scales, scale-space filtering is used. The Laplacian of Gaussian (LoG) can be used to find local maxima across scale and space (x, y, σ) for the image with different values of the scaling parameter σ. Since the LoG is computationally expensive, the SIFT algorithm uses the Difference of Gaussians (DoG) instead. The Difference of Gaussians is computed as the difference between Gaussian-blurred images with different scaling factors σ. Once the DoG is computed, the images are searched for local extrema over scale and space.

Figure 2.8: Gaussian pyramid for different scales [29].

2. Keypoint localization: After the keypoints are located, they are refined to get better results. A Taylor
  • 31. series expansion of the scale space results in a more accurate location of the extremum, and if the intensity at the extremum is less than a threshold (the contrast threshold) of 0.03, it is rejected. Similarly, keypoints whose ratio of eigenvalues from the Harris corner detector (the edge threshold) is greater than 10 are also rejected, in order to remove detected edges. After eliminating edge keypoints and low-contrast keypoints, only strong interest points remain.

3. Orientation assignment: To achieve invariance to image rotation, each keypoint is assigned an orientation. The gradient magnitude and direction are calculated in the neighbourhood of the keypoint, and an orientation histogram with 36 bins covering 360 degrees is created. This enables stability in matching keypoints with the same location and scale but different orientations.

4. Keypoint descriptor: A 16 × 16 neighbourhood around the keypoint is taken and divided into 16 blocks of 4 × 4 pixels each, and for each 4 × 4 block an 8-bin orientation histogram is created. In total 128 bin values are created, which are represented in vector form as the keypoint descriptor.

5. Keypoint matching: A nearest-neighbour algorithm is used to match the keypoints between two images. To obtain better accuracy while matching, the ratio between the closest distance and the second closest distance is computed, and if the ratio is greater than 0.8 the keypoints are rejected. This method of eliminating keypoints using the ratio removes 90% of the false matches while discarding only 5% of the correct matches [29].

Figure 2.9: SIFT algorithm using 3 octave layers, 0.03 as contrast threshold, 10 as edge threshold and 1.6 as scaling parameter. The middle figure shows successful matching of the SIFT keypoints when the object is rotated by 90 degrees; similarly, the right figure shows successful matching of the SIFT keypoints when the object is rotated by 180 degrees.
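As a brief illustration (not the thesis code), the sketch below runs SIFT detection and the distance-ratio matching of step 5 with OpenCV. The parameters mirror the values quoted in Figure 2.9, the image file names are placeholders, and an OpenCV build that ships SIFT is assumed.

```python
# Illustrative sketch (not the thesis code): SIFT keypoint matching with the
# ratio test of step 5. Parameters follow Figure 2.9; file names are
# placeholders. Requires an OpenCV build that includes SIFT (e.g. >= 4.4).
import cv2

img1 = cv2.imread("object.png", cv2.IMREAD_GRAYSCALE)     # reference object
img2 = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)      # current camera frame

sift = cv2.SIFT_create(nOctaveLayers=3, contrastThreshold=0.03,
                       edgeThreshold=10, sigma=1.6)
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# nearest-neighbour matching followed by the 0.8 distance-ratio test
matcher = cv2.BFMatcher(cv2.NORM_L2)
good = [m for m, n in matcher.knnMatch(des1, des2, k=2) if m.distance < 0.8 * n.distance]
print("good matches:", len(good))
```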
  • 32. BRIEF

Binary Robust Independent Elementary Features (BRIEF) is only a feature descriptor; it requires a feature detector to find the features in the image. Feature descriptors like SIFT require a large amount of memory to store the descriptors, which also increases the computational time of the matching process. The BRIEF descriptor relies on a relatively small number of intensity difference tests to represent an image patch as a binary string. Building and matching the descriptor is much faster than with other methods, with a high recognition rate as long as there are no large in-plane rotations [17].

ORB

Oriented FAST and Rotated BRIEF (ORB) was introduced by Ethan Rublee, Vincent Rabaud, Kurt Konolige and Gary R. Bradski in 2011 as an alternative to SIFT and Speeded Up Robust Features (SURF) [37]. ORB increases the matching performance and decreases the computational cost compared to the SIFT or SURF algorithms. ORB uses the FAST keypoint detector for feature detection and a modified BRIEF descriptor to describe the detected features, which enhances the performance. ORB contributes the addition of a fast and accurate orientation component to FAST, the efficient computation of oriented BRIEF features, an analysis of the variance and correlation of oriented BRIEF features, and a method for de-correlating BRIEF features under rotational invariance, which leads to better performance in nearest-neighbour applications [37].

Figure 2.10: ORB using FAST keypoints and BRIEF for feature matching.
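Since ORB is the descriptor used for obstacle classification in this work, the following sketch (not the thesis implementation) shows how ORB features can be extracted and matched against a known object with OpenCV. The image file names and the match-count threshold are placeholders.

```python
# Illustrative sketch (not the thesis implementation): extracting ORB features
# and matching them against a known object, similar in spirit to the obstacle
# classification in this work. File names and the threshold are placeholders.
import cv2

obstacle = cv2.imread("known_obstacle.png", cv2.IMREAD_GRAYSCALE)
frame = cv2.imread("drone_frame.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=500)               # FAST keypoints + rotated BRIEF
kp1, des1 = orb.detectAndCompute(obstacle, None)
kp2, des2 = orb.detectAndCompute(frame, None)

# Hamming distance is the natural metric for BRIEF-style binary descriptors
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

MIN_MATCHES = 25                                   # placeholder decision threshold
is_obstacle = len(matches) >= MIN_MATCHES
print("obstacle detected:", is_obstacle, "matches:", len(matches))
```

The binary descriptors and the Hamming-distance matching are what make ORB fast enough to run on every video frame from the drone's front camera.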
  • 33. 2.5 Control

To control a dynamic system, the behaviour of the system with respect to a given input has to be modeled, analysed and controlled. This process of modelling, analysis and control of a system is the subject of control theory. The primary objective of control theory is to calculate the input value that has to be given to the system in order to reach the desired goal. Mathematically, if the input to the system at time t is u(t), the desired goal at time t is w(t), the measured output at time t is y(t) and the error between the desired and the measured output is e(t), then the goal of the control system is to minimize e(t) and to make the system converge without oscillation around the desired goal.

Figure 2.11: General control of a dynamic system: the feedback y(t) is subtracted from the desired value w(t), which gives the error e(t) that the controller tries to minimize.

2.5.1 Proportional Integral Derivative Controller

A proportional integral derivative controller (PID controller) is a control-loop feedback controller consisting of three separate control mechanisms whose outputs are added to generate the required input or control signal. The three mechanisms of the PID controller are:

• Proportional Controller: The proportional part depends directly on the present error e(t), i.e. when the error is large the proportional controller applies a stronger control signal to drive the system towards the goal.

$P_{out} \propto e(t), \qquad P_{out} = K_p e(t)$

Here the constant $K_p$ is known as the proportional gain. A high proportional gain results in a strong control signal for a small change in the error; if the constant is too high, the system can become unstable and oscillations around the goal are observed.

• Integral Controller: The integral part is proportional to the accumulated past error $\int_0^t e(\tau)\, d\tau$. The integral term is responsible for removing the steady-state error
  • 34. which occurs with a purely proportional controller. The constant $K_i$, known as the integral gain, has to be chosen carefully so that minimal overshoot is caused by the integral controller.

$I_{out} \propto \int_0^t e(\tau)\, d\tau, \qquad I_{out} = K_i \int_0^t e(\tau)\, d\tau$

• Derivative Controller: The derivative part is proportional to the predicted future error, which is determined from the slope of the error over time. The derivative controller is responsible for dampening the oscillations and reducing the overshoot caused by the other controllers. The constant $K_d$ is known as the derivative gain.

$D_{out} \propto \frac{de(t)}{dt}, \qquad D_{out} = K_d \frac{de(t)}{dt}$

The mathematical expression for the PID controller, combining the proportional, integral and derivative controllers, is

$u(t) = K_p e(t) + K_i \int_0^t e(\tau)\, d\tau + K_d \frac{de(t)}{dt}$   (2.11)

Figure 2.12: PID controller as the combination of the Proportional, Integral and Derivative controllers.
  • 35. Figure 2.13: The proportional controller is represented by the green dotted line, the proportional-derivative controller by the blue dotted line and the proportional-integral-derivative controller by the red line. With the combination of the proportional and derivative controllers (blue dotted line), the overshoot and oscillation are decreased but the steady-state error remains. To reduce the steady-state error the integral controller is added, which removes the steady-state error but introduces some overshoot and oscillation. The amount of overshoot, oscillation and settling time depends on the values of the constants $K_p$, $K_i$ and $K_d$.
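The following sketch (not the controller used on the AR.Drone in this work) shows equation (2.11) in discrete time; the gains and the altitude example are arbitrary placeholders and would have to be tuned for the actual quadcopter dynamics.

```python
# Illustrative sketch (not the controller used in this work): a discrete-time
# PID controller implementing equation (2.11). The gains are placeholders.
class PID:
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, setpoint, measurement):
        error = setpoint - measurement            # e(t) = w(t) - y(t)
        self.integral += error * self.dt          # accumulated past error
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        # u(t) = Kp e(t) + Ki * integral + Kd * derivative   (equation 2.11)
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Example: hold a target altitude of 1.5 m, updating at 30 Hz (placeholder gains).
# altitude_pid = PID(kp=0.8, ki=0.05, kd=0.3, dt=1.0 / 30.0)
# gaz_command = altitude_pid.step(setpoint=1.5, measurement=1.2)
```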
  • 36. 2.6 Sensor Integration and Filtering

Robotics generally deals with systems that change their state over time. Such systems require both software and hardware to gather, process and integrate information accumulated from various sources (sensors). The information gathered from these sources is vital for the autonomous behaviour of the robot, but the information available from real-world sensors is incomplete, inconsistent or imprecise. Depending only on the current information available from the sensors leads to unstable and poor behaviour of the robot. To achieve a better estimate of the state, the combined use of multiple sensors is an important factor in enabling the system to interact and operate in an unstructured environment without the complete control of a human operator. In order to achieve autonomy and efficiency, sensor integration and filtering is a crucial element that can fuse the data, model the dynamics of the system and interpret the available information for knowledge assimilation and decision making [38]. In this section we introduce the Kalman filter in Section 2.6.1 and the extended Kalman filter, its extension to nonlinear systems, in Section 2.6.2. In Sections 2.6.3 and 2.6.4 we give brief descriptions of the unscented Kalman filter and particle filters.

2.6.1 Kalman Filter

The Kalman filter is a method used in many applications for filtering noisy measurements, fusing measurements from different sensors, estimating non-observable states and predicting future states. The Kalman filter produces an optimal estimate of the state if the system is linear, the sensor values have Gaussian distributions and the measurement noises are independent. The Kalman filter is a two-step process: the first step is the prediction (also called the motion model), where the filter estimates the current state variables together with their respective uncertainties, and the second step is the update (also called the sensor model), which gathers the sensor data and weights them according to the accuracy of the sensors. The filter continuously repeats the process of prediction and updating to estimate the state of the system along with its uncertainty. We use the following notation in the remainder of this section:

• $x_k \in \mathbb{R}^n$ : state at time k; $\hat{x}_{k|j}$ denotes the estimate of the state incorporating the measurements up to time j,
• $P_{k|j} \in \mathbb{R}^{n \times n}$ : covariance of $\hat{x}_{k|j}$,
• $B \in \mathbb{R}^{n \times d}$ : control input model, mapping the effect of the control input vector $u_k \in \mathbb{R}^d$ onto the state; it represents how the control changes the state from $x_{k-1}$ to $x_k$,
• $F \in \mathbb{R}^{n \times n}$ : state transition model, mapping the state at time k − 1 to the state at time k without considering the controls,
• $z_k \in \mathbb{R}^m$ : observation at time k,
• $H \in \mathbb{R}^{m \times n}$ : observation model, mapping the state $x_k$ to the corresponding observation $z_k$ at time k,
• S: innovation covariance,
• K: Kalman gain,
• I: identity matrix.

Figure 2.14: Motion model (the prediction step) and observation model (the update step) of the Kalman filter, repeated in a continuous loop to estimate the state with the help of the sensor data.

In the prediction step (motion model) the state of the system is propagated through the system dynamics. Since no observation is made, the uncertainty of the system grows:

x̂_{k|k−1} = F x̂_{k−1|k−1} + B u_k
P_{k|k−1} = F P_{k−1|k−1} F^T + Q    (2.12)

In the update step (observation model) the expected sensor reading is computed and the difference between the expected value and the actual sensor value (the innovation) is calculated. The innovation covariance and the Kalman gain are then computed, and the Kalman gain is used to correct the state estimate. Here the uncertainty of the system decreases:

y_k = z_k − H x̂_{k|k−1}
S_k = H P_{k|k−1} H^T + R
K_k = P_{k|k−1} H^T S_k^{−1}
x̂_{k|k} = x̂_{k|k−1} + K_k y_k
P_{k|k} = (I − K_k H) P_{k|k−1}    (2.13)
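Equations (2.12) and (2.13) translate almost directly into code. The following is a minimal NumPy sketch; the constant-velocity example matrices at the bottom are illustrative assumptions, not the models used on the drone.

```python
import numpy as np

def kalman_predict(x, P, F, B, u, Q):
    """Prediction (motion model), equation (2.12): uncertainty grows."""
    x_pred = F @ x + B @ u
    P_pred = F @ P @ F.T + Q
    return x_pred, P_pred

def kalman_update(x_pred, P_pred, z, H, R):
    """Update (observation model), equation (2.13): uncertainty shrinks."""
    y = z - H @ x_pred                       # innovation
    S = H @ P_pred @ H.T + R                 # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)      # Kalman gain
    x_new = x_pred + K @ y
    P_new = (np.eye(P_pred.shape[0]) - K @ H) @ P_pred
    return x_new, P_new

# Toy 1D constant-velocity example (matrices are illustrative only).
dt = 0.1
F = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
H = np.array([[1.0, 0.0]])
Q = 0.01 * np.eye(2)
R = np.array([[0.1]])
x, P = np.zeros((2, 1)), np.eye(2)
x, P = kalman_predict(x, P, F, B, np.array([[1.0]]), Q)
x, P = kalman_update(x, P, np.array([[0.05]]), H, R)
```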
2.6.2 Extended Kalman Filter

The extended Kalman filter (EKF) is the nonlinear version of the linear Kalman filter. Here the state transition model, the observation model and the control model may be defined by differentiable functions:

x_k = f(x_{k−1}, u_{k−1}) + w_{k−1}
z_k = h(x_k) + v_k    (2.14)

The functions f and h represent the state transition and the observation, depending on the previous state estimate and the control input; w and v are the process and observation noise, respectively. In order to apply the Kalman filter algorithm, the functions f and h are linearized using a first-order Taylor approximation:

F_{k−1} := ∂f/∂x |_{x̂_{k−1|k−1}, u_k}
H_k := ∂h/∂x |_{x̂_{k|k−1}}    (2.15)

In the prediction step (motion model) the state of the system is propagated through the function f, and in the update step (observation model) the expected observation is computed with the function h. Apart from these changes the algorithm is the same as for the linear Kalman filter:

x̂_{k|k−1} = f(x̂_{k−1|k−1}, u_k)
P_{k|k−1} = F P_{k−1|k−1} F^T + Q    (2.16)

y_k = z_k − h(x̂_{k|k−1})
S_k = H P_{k|k−1} H^T + R
K_k = P_{k|k−1} H^T S_k^{−1}
x̂_{k|k} = x̂_{k|k−1} + K_k y_k
P_{k|k} = (I − K_k H) P_{k|k−1}    (2.17)
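A minimal sketch of one EKF cycle following equations (2.16) and (2.17). The user supplies the nonlinear functions f and h together with their Jacobians; the 2D position state with a range-to-origin measurement at the bottom is an assumption used only to make the example self-contained.

```python
import numpy as np

def ekf_step(x, P, u, z, f, h, F_jac, H_jac, Q, R):
    """One EKF cycle: nonlinear f/h, with Jacobians evaluated at the current estimate."""
    # Prediction (2.16)
    x_pred = f(x, u)
    F = F_jac(x, u)
    P_pred = F @ P @ F.T + Q
    # Update (2.17)
    y = z - h(x_pred)
    H = H_jac(x_pred)
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + K @ y
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new

# Illustrative example: 2D position state, range-only measurement to the origin.
f = lambda x, u: x + u                              # additive motion model
h = lambda x: np.array([[np.linalg.norm(x)]])       # range to origin
F_jac = lambda x, u: np.eye(2)
H_jac = lambda x: (x / np.linalg.norm(x)).reshape(1, 2)
x, P = np.array([[1.0], [1.0]]), np.eye(2)
x, P = ekf_step(x, P, np.array([[0.1], [0.0]]), np.array([[1.6]]),
                f, h, F_jac, H_jac, 0.01 * np.eye(2), np.array([[0.05]]))
```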
2.6.3 Unscented Kalman Filter

The unscented Kalman filter (UKF) is an extension of the extended Kalman filter (EKF). In the EKF the distribution of the state is approximated by a Gaussian random variable which is propagated analytically through a first-order linearization of the nonlinear system. This linearization produces large errors when the system is highly nonlinear, which can lead to sub-optimal performance and divergence of the filter. The UKF addresses this problem with a deterministic sampling approach [42]. The UKF does not approximate the nonlinear process and observation models; instead it uses the nonlinear models directly and approximates the distribution of the state random variable [41].

2.6.4 Particle Filters

Particle filters, or sequential Monte Carlo methods, allow a complete representation of the posterior distribution of the states, so statistical estimates such as the mean and variance are easy to compute. Because of this, particle filters can deal with strong nonlinearities [41]. Neither the state model nor the observation model is required to be Gaussian, which also enables the method to track multiple hypotheses. In a particle filter the posterior density is represented by a set of particles. Since the number of particles required grows exponentially with the dimension of the state, this approach is computationally expensive for large state dimensions.
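For illustration, a minimal bootstrap particle filter for a one-dimensional toy state is sketched below. The Gaussian motion and measurement noise, the multinomial resampling and all names are illustrative assumptions, not the filtering method used in this thesis.

```python
import numpy as np

def particle_filter_step(particles, weights, u, z, motion_noise, meas_fn, meas_noise):
    """One bootstrap particle filter cycle: propagate, weight, resample (illustrative)."""
    n = len(particles)
    # Propagate each particle through the motion model plus random noise
    particles = particles + u + np.random.normal(0.0, motion_noise, particles.shape)
    # Weight particles by the likelihood of the measurement under a Gaussian model
    expected = meas_fn(particles)
    weights = weights * np.exp(-0.5 * ((z - expected) / meas_noise) ** 2)
    weights += 1e-300                       # avoid an all-zero weight vector
    weights /= weights.sum()
    # Multinomial resampling keeps particles in regions of high posterior probability
    idx = np.random.choice(n, size=n, p=weights)
    return particles[idx], np.full(n, 1.0 / n)

# Toy 1D example: state is a scalar position, measurement is the position plus noise.
particles = np.random.uniform(-1.0, 1.0, 500)
weights = np.full(500, 1.0 / 500)
particles, weights = particle_filter_step(
    particles, weights, u=0.1, z=0.12,
    motion_noise=0.05, meas_fn=lambda p: p, meas_noise=0.1)
estimate = np.average(particles, weights=weights)
```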
2.7 Robot Operating System

The Robot Operating System (ROS) is an open-source meta-operating system for robots. It is not an operating system in the sense of process management and scheduling, but it provides services similar to those of an operating system, such as hardware abstraction, low-level device control, implementation of commonly used functionality, message passing between processes and package management. All of these services contribute to a structured communications layer above the host operating systems of a heterogeneous compute cluster [35].

Figure 2.15: Typical ROS network configuration [35]

The scale and scope of robotics is very large and continues to grow, which makes writing software for robots difficult. The variety of hardware found in robots makes the problem even more challenging, because code reusability becomes nontrivial. Since the software includes functionality ranging from the driver level up through perception, abstract reasoning and beyond, the size of the code base becomes daunting. Hence there must be a framework that manages this complexity, encourages modular software and makes tool-based software development possible. ROS provides these capabilities, enabling robotics software architectures to support large-scale software integration efforts. A brief description of the philosophical goals of ROS is given in the following sections.

2.7.1 Goals of ROS

Peer-to-peer

In robotics software there are many processes, such as computer vision or speech recognition, that can be computationally expensive and, if not run separately, would cause lag in the whole system. In a system built using ROS there are numerous processes running in parallel, potentially on a number of different hosts, which are connected at runtime in a peer-to-peer topology as shown in Figure 2.15. Peer-to-peer connectivity with buffering allows the system to avoid the unnecessary traffic across the wireless link that would occur if a central server, either onboard or offboard, routed the messages between subnets. To allow processes to find each other, ROS uses a lookup mechanism called the master or name service. Services are defined by a string name and a pair of strictly typed messages, one for the request and one for the response.
Multi-lingual

ROS is designed as a language-neutral software platform, since each programming language has its own benefits and trade-offs with respect to runtime behaviour, debugging, syntax, etc. ROS supports four programming languages: C++, Python, Octave and LISP. To support cross-language software development, it adopts a language-neutral interface definition language (IDL) for the messages that are sent between modules. By using the IDL for messaging, modules developed in different languages can be mixed and matched as required. In this thesis, too, modules written in Python and modules written in C++ are combined to achieve the required objective.

Tools-based

ROS has a microkernel design in which numerous small tools are used to build its different components, which helps to manage complexity. These small tools are responsible for tasks such as navigating the source code tree, getting and setting configuration parameters, visualizing the peer-to-peer connection topology and measuring bandwidth utilization [35].

Thin

Reusability of code is a vital part of software development; creating small modules that accomplish small tasks allows those modules to be reused. A modular software design is also very useful for unit testing, where each unit or module is tested against its expected outputs. The ROS build system performs modular builds inside the source code tree and, with the help of CMake, makes it easy for ROS to follow this thin ideology. In this thesis we follow the same ideology by reusing the libraries created for controlling the AR.Drone and for parallel tracking and mapping.

Free and Open-Source

The full ROS source code is publicly available, which facilitates debugging at all levels of the software stack [35]. ROS is licensed under the BSD license, which allows both commercial and non-commercial development.

2.7.2 Terminology

• Nodes
Nodes are software modules, each responsible for a single process in the ROS system. Each node may be executed on a separate host and communicates with other nodes over peer-to-peer links.

• Messages
Communication between nodes is done with messages, which are strictly typed data structures. The structure of a message is defined in a small text file describing the fields it contains, which is what enables ROS to be multi-lingual.
• Topic
A node can receive or send messages by subscribing to or publishing on a given topic, respectively. A topic is simply a string such as tum_ardrone or cmd_vel. A single node can subscribe to multiple topics and can also publish messages to multiple topics.

Figure 2.16: An example of ROS nodes and their graphical representation as used in this thesis. Two nodes, ardrone_driver and ardrone_controller, are shown; the links between the nodes are peer-to-peer links. The node ardrone_controller publishes strongly typed messages to the topics /cmd_vel, /ardrone/takeoff, /ardrone/land and /ardrone/reset, and the node ardrone_driver subscribes to the respective topics to receive the messages sent by the publisher. The messages are strongly typed: for example, the message on the topic /ardrone/image_raw, which is published by the node ardrone_driver and subscribed to by the node ardrone_controller, is transported via image_transport, while the message type of /cmd_vel is geometry_msgs::Twist.
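The publish/subscribe pattern in Figure 2.16 can be illustrated with a minimal rospy node. This is a sketch only: it assumes the topic names shown above and the standard message types (geometry_msgs/Twist, std_msgs/Empty, sensor_msgs/Image); it is not the controller used in this thesis.

```python
#!/usr/bin/env python
# Minimal rospy node illustrating the publish/subscribe pattern of Figure 2.16.
import rospy
from geometry_msgs.msg import Twist
from std_msgs.msg import Empty
from sensor_msgs.msg import Image

def image_callback(msg):
    # Called for every frame published on /ardrone/image_raw.
    rospy.loginfo("received frame %dx%d", msg.width, msg.height)

if __name__ == "__main__":
    rospy.init_node("example_controller")
    cmd_pub = rospy.Publisher("/cmd_vel", Twist, queue_size=1)
    takeoff_pub = rospy.Publisher("/ardrone/takeoff", Empty, queue_size=1)
    rospy.Subscriber("/ardrone/image_raw", Image, image_callback)

    rospy.sleep(1.0)            # give the peer-to-peer connections time to establish
    takeoff_pub.publish(Empty())

    cmd = Twist()
    cmd.linear.x = 0.1          # gentle forward velocity command (illustrative value)
    rate = rospy.Rate(10)
    while not rospy.is_shutdown():
        cmd_pub.publish(cmd)
        rate.sleep()
```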
3. Methodology

In this chapter we describe our approach to achieving the objectives of this thesis. In Section 3.1 we give a synopsis of the components of our approach and their relation to each other. In Section 3.2 we describe the implementation of the SLAM algorithm and the employed software used to achieve localization and navigation of the AR.Drone. In Section 3.3 we describe how object recognition is used to recognize obstacles. In Section 3.4 we describe how we implemented reinforcement learning: the initialization procedure, along with the description of the features, actions and rewards used by the learning algorithm for autonomous obstacle avoidance. Finally, in Section 3.5 we present the software architecture.

3.1 Outline

Our approach to achieving the objectives of this thesis has the following components:

• Localization and Navigation of the AR.Drone: In order to obtain the current pose of the quadcopter and to control it, the approach of autonomous camera-based navigation of a quadrocopter [23] is applied. The approach estimates the drone's pose with the monocular SLAM algorithm PTAM, described in Section 2.3.2. It applies the extended Kalman filter described in Section 2.6.2 to fuse the pose estimate from PTAM with the available sensor data and to estimate the scale of the map, and it uses the PID controller described in Section 2.5.1 to calculate the control commands required to navigate to the target position.

• Obstacle Recognition: To detect obstacles, the object recognition algorithm ORB, described in Section 2.4.1, is applied to the incoming video frames. The process of classifying different classes of objects consists of a training phase in which each object class has its FAST corner feature points and BRIEF descriptors defined. After matching the feature descriptors of the incoming frame against the known feature descriptors of the training set, the matched feature values are passed to the obstacle classification process, which determines whether the matches are good enough to classify the object as an obstacle.

• Autonomous avoidance of obstacles: Based on the estimate of the drone's position, the detection of obstacles and the map of the environment, the function approximation and feature-based method of reinforcement learning described in Section 2.2.5 is used to calculate the optimal policy, which provides the control command the drone must apply to avoid obstacles and navigate to the target position.
Figure 3.1: Outline of our approach. The red blocks represent the components of autonomous navigation, the yellow blocks represent the components of obstacle recognition and the blue blocks represent the components of reinforcement learning.
Figure 3.1 shows how all the components are connected to achieve the objective. The blue arrows show the respective communication channels: the control channel, which is used to send AT commands to the drone; the navigation channel, which is used to receive the sensor data from the drone; and the video channel, which is used to receive the video frames from the drone's camera. A detailed description of these channels is given in Section 2.1.3.

The red block elements in the figure are the components required for localization and navigation of the drone. The sensor measurements and the video frames from the drone are fed to the extended Kalman filter and to the monocular SLAM module respectively, which together compute the estimate of the drone's pose in the environment. This pose estimate consists of the roll, pitch and yaw angles and the flight altitude of the drone, along with the velocities along all axes. The process is described in detail in Section 3.2.

The blue block elements in the figure are the main components of the reinforcement learning algorithm. They consist of the environment, which is built from the drone pose estimated by the extended Kalman filter and the output of the obstacle classification; the agent, which determines the action to be taken given the state and the reward associated with that state; and the policy finder, which finds the best action to take in the given state. States are represented as a set of features defined in Section 3.4.1, the set of actions is defined in Section 3.4.1 and the reward associated with a given state is defined in Section 3.4.1. The process is described in detail in Section 3.4.

The yellow block elements in the figure are the components required for obstacle recognition. The video frames received over the drone's video channel are used as the primary image source, together with the known features of the obstacles computed during the training phase. A feature matcher matches these two sets of features and outputs the matched features. Finally, these matched features are classified to decide whether the scene seen by the front camera contains an obstacle. The obstacle recognition process is described in Section 3.3.

The element Calibrate Required Drone Pose in Figure 3.1 calculates the position the drone should move to in order to reach the predefined target positions while avoiding the obstacles. This element requires the past drone pose and the action selected by the reinforcement learning agent to calculate the required drone pose for obstacle avoidance and exploration of the target locations.

To view our system as a dynamic system, the red labels in the figure denote, respectively, the system input, which is the command sent to the drone; the system output, which is the sensor output of the drone; the measured output, which is the drone pose estimated by the extended Kalman filter; the reference, which is the required pose calculated by the reinforcement learning; and the measured error, which is the difference between the measured output and the reference.
3.2 Localization and Navigation of the Drone

To localize the drone and enable it to navigate, the methodology of autonomous camera-based navigation of a quadrocopter [23] is applied. This methodology is based on three major components, listed below:

• Monocular SLAM
To solve the SLAM problem, the PTAM algorithm described in Section 2.3.2 is applied to the video frames, which allows the drone's pose to be estimated. The scale of the map is essential for navigating the quadcopter, defining the flight plan, calculating control commands and fusing the visual pose from PTAM with the sensor values. While the drone's pose is being tracked, the scale of the map is also estimated with the help of the sensor data provided by the AR.Drone, such as the ultrasound altitude measurements and the horizontal speed.

• Extended Kalman Filter
An extended Kalman filter (EKF), as described in Section 2.6.2, is used to integrate all the sensor data provided by the drone, the pose estimate provided by PTAM and the effect of the control commands sent to the drone. The filter improves the estimate of the drone's pose and also gives a good prediction of the future state when a command is sent to the drone, compensating for the delays in the communication process.

• PID controller
To navigate the drone to a given target position, the estimated velocity and pose of the drone given by the EKF are used to calculate the required control commands with the PID controller described in Section 2.5.1.

The approach allows the drone to navigate to the target position with a speed of up to 2 m/s and to hover accurately at a given position with a root mean squared error (RMSE) of only 8 cm [23].

Figure 3.2: Map window of the drone. The red points represent the keypoints and the green line represents the path the drone executed. The window also shows the estimated drone pose: its x, y, z position and the roll, pitch and yaw angles.
3.2.1 Implementation

In order to navigate a quadcopter, an estimate of its pose and velocity, knowledge of the scale of the map and the control signal that must be sent to the quadcopter to reach the target position are all essential. With the methodology of [23] we are able to obtain all of the above.

After the PTAM initialization procedure, which consists of flying the quadcopter 1 m up and 1 m down, the methodology is able to estimate the pose and velocity of the quadcopter in the environment. Once the pose estimation is running, the methodology provides the functionality of navigating the quadcopter to a target location. This functionality has been implemented and tested in [20], where the quadcopter navigates through predefined figures with good accuracy. To navigate through the predefined target locations, the PID controller is provided with the target location; the controller then calculates the control signal that has to be sent to the quadcopter in order to reach the target location, using the pose estimate provided by the EKF. When the target location is reached, the quadcopter holds this position until another target has to be reached.

Figure 3.3: Graphical user interface of the node drone_gui. The user interface provides the control source, where we can select the means of controlling the drone (keyboard, joystick, etc.); it shows the state estimation status, i.e. whether the PTAM tracking is good or bad; and it provides an interface to take off, land and toggle the camera of the drone.
3.2.2 Employed Software

The ROS package tum_ardrone, which contains the implementation of the publications [22], [21] and [20], is used to achieve the localization and navigation of the drone. The package is developed for the AR.Drone and is licensed under the GNU General Public License Version 3; it also includes PTAM, which has its own license. The package contains three ROS nodes, which are as follows:

• drone_stateestimate
This node is responsible for providing the estimate of the drone's position. It implements the previously described EKF for state estimation, which includes the dynamics of the drone, time delay compensation and scale estimation, and it integrates the PTAM algorithm.

• drone_autopilot
This node is the PID controller implementation for the drone and requires the drone_stateestimate node to be running. The node provides functionality such as goto x y z yaw, which sends the required commands to the drone to reach the given destination. It also provides basic functionality such as takeoff, land and the PTAM initialization procedure.

• drone_gui
This node is a graphical user interface for the functionality of the nodes drone_autopilot and drone_stateestimate. The interface shows information about the network status, the state estimation status and the autopilot status, and it can be used to switch control between keyboard, joystick and autopilot.

To reach the final goal position autonomously, the node drone_autopilot is continuously provided with the command for the next required action, as illustrated in the sketch below. The next action to be executed in order to reach the final goal is decided with the help of the function approximation and feature-based reinforcement learning described in Section 2.2.5. The reinforcement learning is provided with the state of the drone, such as the position and velocity from the node drone_stateestimate, to calculate the optimum action that has to be taken to reach the final goal.
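A minimal sketch of how such a command could be handed to the autopilot is shown below. It assumes the tum_ardrone string-command interface on the /tum_ardrone/com topic with commands of the form "c goto x y z yaw"; the exact topic name and command syntax may differ between package versions, so this is an illustration rather than the exact interface used here.

```python
#!/usr/bin/env python
# Sketch of handing the action chosen by the reinforcement learner to the autopilot.
import rospy
from std_msgs.msg import String

def send_goto(pub, x, y, z, yaw):
    # Assumed tum_ardrone string command; units and semantics follow that package.
    pub.publish(String(data="c goto %.2f %.2f %.2f %.2f" % (x, y, z, yaw)))

if __name__ == "__main__":
    rospy.init_node("rl_action_bridge")
    com_pub = rospy.Publisher("/tum_ardrone/com", String, queue_size=10)
    rospy.sleep(1.0)
    # Example: the learner selected "Move Forward", i.e. 0.5 m in the y direction.
    send_goto(com_pub, 0.0, 0.5, 0.0, 0.0)
```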
3.3 Obstacle Recognition

To enable the quadcopter to detect obstacles, we implement Oriented FAST and Rotated BRIEF (ORB), which is described in Section 2.4.1. This method of object recognition consists of Features from Accelerated Segment Test (FAST) for feature detection and Binary Robust Independent Elementary Features (BRIEF) for feature description.

In order to classify objects, the recognition algorithm first has to be trained with known objects. The features extracted from these known objects are later used to decide whether an object is an obstacle or not. To match the features of a known object against those of an unknown object, a brute-force descriptor matcher is used. The brute-force matcher finds, for each descriptor of the known feature set, the closest descriptor in the feature set of the unknown object. The matcher returns the matches found and the distance between the descriptors; the lower the distance, the better the match. In this thesis we discard matches with a distance greater than 40, leaving only good matches; a sketch of this matching and filtering step is given below, after Figure 3.4.

The AR.Drone transmits video frames from the camera at 18 Hz. If every received frame went through the object recognition process, the process would lag, so to reduce the lag and the computational cost the sampling rate for object recognition was chosen as 5 Hz. Since the object recognition algorithm also suffers from false positives, we classify an object as an obstacle only after the same object has been found in 5 consecutive frames, or when the number of matches with a distance below 40 is more than 20.

Figure 3.4: Time graph of the video frames from the camera. The video frames that are neglected because of the computation time of the object recognition algorithm during the 25 ms interval are shown in red. The video frames selected for object recognition are shown in blue, and the corresponding classified object is shown in green as Obj1. After object recognition, in order to remove false positives, we sample 5 classified objects, and if all 5 objects are of the same class the object is classified as an obstacle, which is represented in the figure in orange.
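A minimal OpenCV sketch of the ORB extraction, brute-force matching and distance filtering described above. The training image path, the helper name and the API variant (cv2.ORB_create in OpenCV 3+ versus cv2.ORB in the 2.4 series) are illustrative assumptions; the thresholds 40 and 20 are the values stated in the text.

```python
import cv2

# Training phase: extract ORB features from a known obstacle image (path is illustrative).
orb = cv2.ORB_create()                       # use cv2.ORB() in the OpenCV 2.4 series
train_img = cv2.imread("obstacle_train.png", cv2.IMREAD_GRAYSCALE)
train_kp, train_des = orb.detectAndCompute(train_img, None)

# Hamming distance matches the binary BRIEF descriptors used by ORB.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

def classify_frame(frame_gray, distance_threshold=40, min_good_matches=20):
    """Return True if the frame contains enough good matches to suggest an obstacle."""
    kp, des = orb.detectAndCompute(frame_gray, None)
    if des is None:
        return False
    matches = matcher.match(train_des, des)
    good = [m for m in matches if m.distance < distance_threshold]
    return len(good) > min_good_matches
```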
Figure 3.5: Plot of the number of matched features against the distance between the matched features.

Figure 3.5 shows that false positive matches are found; they are represented by the red line in the first graph and the blue line in the second graph. To remove these false positives, a distance threshold of 40 was defined, below which matches can be regarded as good matches that classify the object with higher accuracy. As the number of matches with a distance below 40 increases, the accuracy of the classification also increases. In our algorithm, if the number of matches with a distance below 40 is more than 20, the object is classified immediately, without classifying the next 4 frames; this case is shown in Figure 3.4 by the orange line representing Obs2. If the probability is low, as in the case of the second graph, we classify 4 more frames to increase the confidence, which is shown in Figure 3.4 by the orange line representing Obs1.

Since the ORB descriptor is more sensitive to scale changes than some other feature descriptors, we use this to our advantage to detect obstacles at a certain distance from the quadcopter. When training the classifier, the scale of the training image is chosen to match the scale of the image seen when the quadcopter is one step size, i.e. 0.5 m, away from the obstacle. After an obstacle is detected, a notification is sent to the reinforcement learning algorithm, which then issues the action required to avoid the obstacle.
3.4 Autonomous avoidance of obstacles

In order to navigate autonomously in an environment with obstacles and to explore all the required target locations, the quadcopter must have a proper planning algorithm to navigate to the goal position and must also have knowledge of the obstacles to avoid collisions. We use the function approximation and feature-based method (FAFBRL) described in Section 2.2.5 to train the quadcopter to determine the optimal action to take in the current state, in which the positions of the quadcopter and of the obstacle are known, along with the positions the quadcopter must explore.

3.4.1 Initialization and Training

As stated in Section 2.2.5, the basis of FAFBRL are the following two equations:

Q^a(s, a) = θ^a_1 f_1 + ... + θ^a_n f_n    (3.1)

θ^a_k = θ^a_k + α [ r + γ max_{a'} Q^{a'}(s', a') − Q^a(s, a) ] dQ^a(s, a) / dθ^a_k    (3.2)

In this thesis we initialized the learning rate α = 0.2, which represents the step size, and the discount rate γ = 0.8, which represents how strongly future rewards are discounted with respect to the current reward. A minimal code sketch of these two update rules is given after the feature list below.

Features

The features responsible for representing the state of the quadcopter, together with their respective values, are:

• target position (f1): the predefined position that the quadcopter must explore. The constant value for this feature is f1 = 1.

• obstacle in the path (f2): an obstacle is found within a distance of 1 m on the path the quadcopter planned in order to reach the target position. The value of this feature is f2 = 1 if an obstacle is found and f2 = 0 otherwise.

• target position reached (f3): no obstacle is found nearby, i.e. within a distance of 1 m, and the target location is within 1 m on the path the quadcopter planned in order to reach the target position. The constant value for this feature is f3 = 1.

• distance to target position (f4): the distance the quadcopter needs to travel in order to reach the nearest target position. The value is normalized by the maximum distance the quadcopter can travel, so that it lies between 0 and 1.
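The two equations above can be written down almost literally in code. The sketch below is illustrative: the four-element feature vector and the values of α and γ follow this section, but the action names, the per-action weight table and the function names are assumptions, not the thesis implementation.

```python
ALPHA, GAMMA = 0.2, 0.8                     # learning rate and discount rate from the text
ACTIONS = ["forward", "backward", "right", "left"]   # illustrative subset of the action set

# One weight per (action, feature) pair, initialised to zero as in the training stage.
weights = {a: [0.0, 0.0, 0.0, 0.0] for a in ACTIONS}

def q_value(features, action):
    """Equation (3.1): Q^a(s, a) = sum_k theta^a_k * f_k."""
    return sum(w * f for w, f in zip(weights[action], features))

def td_update(features, action, reward, next_features, next_legal_actions):
    """Equation (3.2): gradient step on the TD error; dQ^a/dtheta^a_k is simply f_k."""
    target = reward + GAMMA * max(q_value(next_features, a) for a in next_legal_actions)
    td_error = target - q_value(features, action)
    weights[action] = [w + ALPHA * td_error * f
                       for w, f in zip(weights[action], features)]

# Example call with feature vectors [f1, f2, f3, f4]; the values are made up.
td_update([1.0, 0.0, 0.0, 0.4], "forward", -1, [1.0, 0.0, 0.0, 0.3], ACTIONS)
```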
These four features represent the state of the environment; each unique combination of these features represents a different state. In the training stage the feature values f1, f2, f3, f4 all have their respective weights θ1, θ2, θ3, θ4 initially set to zero. As the number of training iterations increases, these weights converge towards constant values. After the weights have converged, we can use them to calculate the optimal policy for any given state of the environment.

Figure 3.6: The drone is shown at position (0, 0) together with two target positions t1 and t2. The obstacle, which is in front of the drone and within a distance of 1 m, is also shown. The green paths are the paths the drone planned in order to explore the target positions. The red path is a path on which a collision might occur, so the drone ignores it.

Actions

The list of actions a that the quadcopter can perform is as follows:

• Move Forward
When this action is performed the quadcopter moves 0.5 m forward, i.e. 0.5 m in the Y direction from the current estimated position.

• Move Backward
When this action is performed the quadcopter moves 0.5 m backward, i.e. 0.5 m in the −Y direction from the current estimated position.
Figure 3.7: The drone has moved to the right compared to Figure 3.6. The obstacle has also changed its position, so the drone can no longer move right and takes a new path to reach the target location.

• Move Right
When this action is performed the quadcopter moves 0.5 m to the right, i.e. 0.5 m in the X direction from the current estimated position.

• Move Left
When this action is performed the quadcopter moves 0.5 m to the left, i.e. 0.5 m in the −X direction from the current estimated position.

• Take Off
When this action is performed the quadcopter takes off and then initializes its position at x = 0, y = 0, z = 0 and yaw angle = 0.

• Land
When this action is performed the quadcopter lands at the current position.

Exploration vs Exploitation

To deal with the exploration vs. exploitation problem stated in Section 2.2.1, we train our learning algorithm to choose legal actions (actions that do not lead to a collision with an obstacle) stochastically. Among the legal actions the drone can perform, to prevent pure exploitation, in which the action with the maximum Q value is always chosen, we use a randomized strategy in which the action with the maximum Q value is chosen only with a probability of 0.8; a sketch of this selection rule is given below.
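A minimal sketch of this randomized selection rule. The Q function is passed in as a parameter; the placeholder Q function and the action names in the example call are assumptions purely for illustration.

```python
import random

def choose_action(q_value, features, legal_actions, greedy_prob=0.8):
    """Randomized strategy described above: exploit the action with the highest Q
    value with probability 0.8, otherwise explore a random legal action."""
    if random.random() < greedy_prob:
        return max(legal_actions, key=lambda a: q_value(features, a))
    return random.choice(legal_actions)

# Example with a placeholder Q function (uniform values), purely for illustration.
action = choose_action(lambda f, a: 0.0, [1.0, 0.0, 0.0, 0.4],
                       ["forward", "backward", "right", "left"])
```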
Rewards

The rewards associated with each state of the quadcopter, which is represented by the features f1, f2, f3, f4, are defined as follows (a sketch of how this reward schedule can be encoded in the simulator is given after the capability list below):

• for each step of movement that the quadcopter makes, in our case 0.5 m, it receives a negative reward of −1,

• if the quadcopter reaches a location that needs to be explored (a target location) it receives a reward of 10,

• if the quadcopter reaches a target location and there are no more locations left to explore it receives a reward of 500,

• if the quadcopter collides with an obstacle it receives a negative reward of −500.

3.4.2 Simulation of training

Training the quadcopter to navigate without colliding with obstacles in a real-world scenario would require the training phase to include collisions of the quadcopter with obstacles, which could damage the quadcopter itself. When the quadcopter is in flight and a collision occurs, it may stop its rotors and fall straight to the ground, which could cause its components to malfunction. To protect the quadcopter from such accidents and still be able to train it, a simulation environment is necessary.

A simulator was developed to simulate the training phase, in which the algorithm calculates the required weights of the features that represent the environment. The simulation consists of the rewards associated with each state of the quadcopter, the actions the quadcopter can perform, the map of the environment, and the obstacles with their random movements. With the help of the simulation environment, the iterations required for the training phase can be performed rapidly.

Simulation Capability

The simulator developed was capable of simulating the following features:

• different maps can be added manually, with different arrangements of static obstacles such as walls, doors, etc.,

• a varying number of target locations in the map that the quadcopter must explore,

• defining the locations of the target positions,

• a varying number of dynamic obstacles,

• defining the locations of the dynamic obstacles,

• defining the starting position of the quadcopter,

• calculating the total reward gained by the quadcopter.
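The reward schedule above could be encoded in the simulator roughly as follows. This is a sketch under the assumption, based on the worked example for Figure 3.8, that the step cost and the target bonus combine additively; the function and variable names are illustrative.

```python
STEP_REWARD = -1          # every 0.5 m movement step
TARGET_REWARD = 10        # reaching a target location that still had to be explored
FINAL_REWARD = 500        # reaching the last remaining target location
COLLISION_REWARD = -500   # colliding with an obstacle

def reward(collided, reached_target, targets_remaining):
    """Reward schedule from the list above; step cost and target bonus are assumed
    to add up (10 + 2 * (-1) = 8 in the Figure 3.8 example)."""
    if collided:
        return COLLISION_REWARD
    r = STEP_REWARD
    if reached_target:
        r += FINAL_REWARD if targets_remaining == 0 else TARGET_REWARD
    return r
```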
Figure 3.8: Simulator with the given map, static obstacles, target positions and reward gained during the current training episode.
In the simulator the quadcopter is denoted by a circle, and the target positions are denoted by a different symbol, as shown in Figure 3.8. In Figure 3.8, the top left image shows the starting position of the quadcopter with its reward return at 0. The top right image shows the position of the quadcopter after a target position has been reached, which adds a reward of 10, and two movement steps with a reward of −1 each, resulting in a total reward of 8. The bottom left image shows the quadcopter just after reaching a target position; it is exploring the position and will gain a reward of 10 once the exploration is complete. Finally, the bottom right image shows that the quadcopter has completed the exploration of all the target positions in the given map and therefore gains the reward of 500.

Figure 3.9: Simulator with the given map, dynamic obstacles, target positions and the reward gained during the current training episode. The top images represent the scenario with one dynamic obstacle, which is represented by a blue circle. The bottom images represent the scenario with two dynamic obstacles.
Calibrated Value

After numerous training iterations in the simulator, the calibrated values of the feature weights were as follows:

Feature                              Weight    Weight Value
Target position (f1)                 θ1        62.2
Obstacle in the path (f2)            θ2        -20
Target position reached (f3)         θ3        83.3
Distance to target position (f4)     θ4        -1.2

3.4.3 Software

In order to create the simulator and to enable it to communicate with the ROS nodes that control and gather information from the quadcopter, a ROS package was created. After the weights of the features have converged, these values can be used by the reinforcement learning algorithm to navigate through the target positions while avoiding the obstacles. The action that the algorithm determines to be optimal for the given state of the environment is then converted into a command that is compatible with the ROS node drone_autopilot; a sketch of this final policy-extraction step is given below.
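A minimal sketch of how the converged weights from the table above could be used at run time to pick the next action. The feature-extraction stub, the action names and the function names are assumptions for illustration; in the thesis the selected action is then translated into a drone_autopilot command.

```python
# Converged weights theta_1 .. theta_4 taken from the table above.
THETA = [62.2, -20.0, 83.3, -1.2]

def q_value(features):
    """Q(s, a) as the weighted sum of the feature values, as in equation (3.1)."""
    return sum(t * f for t, f in zip(THETA, features))

def best_action(candidate_actions, feature_fn):
    """Evaluate the features the drone would observe after each legal action and
    return the action with the highest Q value."""
    return max(candidate_actions, key=lambda a: q_value(feature_fn(a)))

# Illustrative call: feature_fn(a) would return [f1, f2, f3, f4] for action a.
action = best_action(["forward", "right", "left"],
                     lambda a: [1.0, 0.0, 0.0, 0.3])
```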