SELF-DRIVING CAR
Faculty of Engineering - Suhag University
Suhag University
Faculty of Engineering
Communication and Electronics Department
Graduation Project
Self-Driving Car
BY:
Khaled Mohamed
Ahmed Mohamed
Ali Mohamed
Rana Mohamed
Abdelbaset Abdelmoaty
Ahmed Rashad
Supervised by:
Dr. Ahmed Soliman
Dr. Mostafa Salah
ACKNOWLEDGMENT
First and foremost, we thank Allah Almighty who paved the path for us in achieving the desired goal. We would like to express our sincere gratitude to our mentors, Dr. Ahmed Soliman and Dr. Mostafa Salah, for the continuous support of our studies and research, and for their patience, motivation, enthusiasm, and immense knowledge. Their guidance helped us at all times in achieving the goals of our graduation project. We could not have imagined having better advisors and mentors for our graduation project. Our thanks and appreciation also go to our colleagues in developing the project and to the people who have willingly helped us out with their abilities. Finally, an honorable mention goes to our parents, brothers, sisters and families. Words cannot express how grateful we are; your prayers for us were what sustained us this far. Without the help of those mentioned above, we would have faced many difficulties while doing this work.
ABSTRACT
Whether you call them self-driving, driverless, automated, or autonomous, these vehicles are on the move. Recent announcements by Google (which drove over 500,000 miles on its original prototype vehicles) and other major automakers indicate the potential for development in this area. Driverless cars are often discussed as a "disruptive technology" with the ability to transform transportation infrastructure, expand access, and deliver benefits to a variety of users. Some observers estimate limited availability of driverless cars by 2020, with wide availability to the public by 2040. The following sections describe the development and implementation of an autonomous car model with some features. They provide a history of autonomous cars and describe the entire development process of such cars. The development of our prototype was completed through the use of two controllers: a Raspberry Pi and an Arduino. The main parts of our model include the Raspberry Pi, the Arduino controller board, motors, ultrasonic sensors, infrared sensors, an optical encoder, an XBee module, and lithium-ion batteries.
It also describes speed control of the car's motion by means of a process known as PID tuning to make correct adjustments to the behavior of the vehicle.
Table of Contents

Chapter 1: Introduction
1.1 History.
1.2 Why autonomous car is important.
1.3 What Are Autonomous and Automated Vehicles.
1.4 Advanced Driver Assistance System (ADAS).
1.5 Project description.
1.6 Related work.
Chapter 2: Car Design and Hardware
2.1 System Design.
2.2 Chassis.
2.3 Raspberry Pi 3 Model B.
2.4 Arduino Uno.
2.5 Ultrasonic Sensor.
2.6 servo Motor.
2.7 L298N Dual H-Bridge Motor Driver and DC Motors.
2.8 the Camera Module.
Chapter 3: Deep learning
3.1 Introduction.
3.2 What is machine learning.
3.3 Representation of Neural Networks.
3.4 Training a neural network.
3.6 Problem setting in image recognition.
3.7 Convolutional Neural Network (CNN).
3.8 Deep learning-based autonomous driving.
3.9 Conclusion.
Chapter 4: Object Detection
4.1 Introduction.
4.2 General object detection framework.
4.3 Region proposals.
4.4 Steps of how the NMS algorithm works.
4.5 Precision-Recall Curve (PR Curve).
4.6 Conclusion.
4.7 Region-Based Convolutional Neural Networks (R-CNNs) [high mAP andlow
FPS].
4.8 Fast R-CNN
4.9 Faster R-CNN
4.10 Single Shot Detection (SSD) [Detection Algorithm Used In Our Project]
4.11 Base network.
4.12 Multi-scale feature layers.
4.13 (YOLO) [high speed but low mAP][5].
4.14 What Colab Offers You?
Chapter 5: Transfer Machine Learning
5.1 Introduction.
5.1.1 Definition and why transfer learning?
5.1.2 How transfer learning works.
5.1.3 Transfer learning approaches.
5.2 Detecting traffic signs and pedestrians.
5.2.1 Model selection (most bored).
5.3 Google’s edge TPU. What? How?
why?
Chapter 6: Lane Keeping System
6.1 introduction
6.2 timeline of available systems
6.3 current lane keeping system in market
6.4 overview of lane keeping algorithms
6.5 perception: lane Detection
6.6 motion planning: steering
6.7 lane keeping via deep learning
6.8 Google Colab for training
Chapter 7: System Integration
7.1 introduction.
7.2 connection between laptop and raspberry-pi.
7.3 connection between raspberry-pi and Arduino.
Chapter 8: software for connections
8.1 Introduction
8.2 Network
8.3 laptop to Raspberry-pi
8.4 Arduino to Raspberry-pi
References
List of Figures

Figure 1-1 Google car.
Figure 1-2 progression of automated vehicle technologies.
Figure 2.1 System block
Figure 2.2 Top View of Chassis.
Figure 2.3 Raspberry Pi 3 model B.
Figure 2.4 Arduino Uno Board
Figure 2.5 connection of Arduino and Ultrasonic
Figure 2.6 Servo motor connection with Arduino
Figure 2.6 Turn Robot Right
Figure 2.7 Turn Robot Left
Figure 2.7 Pulse Width Modulation.
Figure 2.8 Controlling DC motor using MOSFET.
Figure 2.9 H-Bridge DC Motor Control.
Figure 2.10 L298N Motor Driver Specification.
Figure 2.11 L298N Motor Driver.
Figure 2.12 L298 control pins.
Figure 2.13 Arduino and L298N connection.
Figure 2.14 Camera module
Figure 2.15 Raspberry pi.
Figure3.1 Learning Algorithm.
Figure3-2 General Object Recognition
Figure3-3 Conventional machine learning and deep learning.
Figure3-4 Basic structure of CNN.
Figure3-5 Network structure of AlexNet and Kernels.
Figure3-6 Application of CNN to each image recognitiontask.
Figure3-7 Faster R-CNN structure.
Figure3-8 YOLO structure and examples of multiclass object detection.
Figure3-9 Fully Convolutional Network (FCN) Structure.
Figure3-10 Example of PSPNet-based Semantic Segmentation Results (cited from Reference).
Figure3-11 attention maps of CAM and Grad-CAM. (cite from reference).
Figure3-12 Regression-type Attention Branch Network. (cite from reference).
Figure3-13 Attention map-based visual explanation for self-driving.
Figure4-1 Classification and Object Detection.
Figure4-2 Low and High objectness score.
Figure4.3 An example of selective search applied to an image. A threshold can be tuned in the SS
algorithm to generate more or fewer proposals.
Figure4.4 Class prediction.
Figure4-5 Predictions before and after NMS.
Figure4-6 Intersection Over Union (EQU).
Figure4-7 Ground-truth and predicted box.
Figure4-8 PR Curve.
Figure4-9 Regions with CNN features.
Figure4-10 Input.
Figure4-11 Output.
Figure4.12 Fast R-CNN.
Figure4-14 The RPN classifier predicts the objectness score which is the probability of an image
containing an object (foreground) or a background.
Figure 4-15 Anchor boxes.
Figure4-16 R-CNN, Fast R-CNN, Faster R-CNN.
Figure4-17 Comparison between R-CNN, Fast R-CNN, Faster R-CNN.
Figure4-18 SSD architecture.
Figure4.19 SSD Base Network looks at the anchor boxes to find features of a
boat. Green (solid) boxes indicate that the network has found boat features. Red (dotted) boxes indicate
no boat features.
Figure4-20 Right image - lower resolution feature maps detect larger scale objects. Left image – higher
resolution feature maps detect smaller scale objects.
Figure4-21 the accuracy with different number of feature map layers.
Figure4-22 Architecture of the multi-scale layers.
Figure4-23 YOLO splits the image into grids, predicts objects for eachgrid, then use NMS to finalize
predictions.
Figure4-24 YOLOv3 workflow.
Figure4-25 YOLOV3 Output bounding boxes.
Figure4-26 neural network architecture.
Figure4-27 Python. Figure4-28 Open CV.
Figure4-29 TensorFlow.
Figure5-1 Traditional ML vs. Transfer Learning.
Figure5-2 Extracted features.
Figure5-3 Feature maps.
Figure5-4 CNN Architecture Diagram, Hierarchical Feature Extraction in stages.
Figure5-5 features start to be more specific.
Figure 5-6 Dataset that is different from the source dataset.
Figure5-7 ImageNet Challenge top error.
Figure5-8 Tensorflow detection model zoo.
Figure5-9 COCO-trained models.
Figure5-10 Download pre-trained model.
Figure5-11 Part of the config file that contains information about the image resizer, which makes the image suitable for the CNN (300x300), and the architecture of the box predictor CNN, which includes regularization and dropout to avoid overfitting.
Figure5-12 Part of the config file indicating the batch size, optimizer type and learning rate (which varies in this case).
Figure5-13 mAP (top left), a measure of precision, keeps on increasing.
Figure5-14 Google edge TPU.
Figure5-15 Quantization.
Figure5-16 Accuracy of non-quantized model’s vs quantized models.
Figure6.1 the lane departure warning algorithm.
Figure6.2 Image in HSV Color.
Figure6.4 Hue in 0-360 degrees scale.
Figure6.3 OpenCV command.
Figure6.5 code to lift Blue out via OpenCV, and rendered mask image.
Figure6.6 OpenCV recommends.
Figure6.7 Edges of all Blue Areas.
Figure6.8 Cropped Edges.
Figure6.9 CNN architecture.
Figure6.10 Method Training.
Figure6.11 Angles distribution from Data acquisition.
Figure 6.12 mount Drive to colab
Figure 6.13 python package for training.
Figure 6.14 Loading our data from drive.
Figure 6.15 train and test data.
Figure 6.16 load model.
Figure 6.17 Summary Model.
Figure6.18 Graph of training and validation loss.
Figure6.19 Result of our model on our data.
Figure7-1 Diagram for connections.
Figure7.2 Server.
Figure7.3 Client.
Figure7.4 Recording stream on connection file.
Figure7.5 Reading from connection file and Split frames on server side.
Figure7.6 Terminate connection in two sides.
Figure7.7 Operation of TCP.
Figure7.8 Graphical representation of the i2c bus.
Figure7.9 Structure of a base i2c message.
Figure7-10 I2c code in raspberry-pi.
Figure7.11 I2c code in Arduino.
Figure 8.1 Block diagram of system.
Figure 8.2 Raspberry-pi.
Figure 8.3 Using PuTTY or VNC to access the Pi and control it remotely from its terminal.
Figure 8.4 VNC Viewer.
Figure 8.5 Putty Viewer.
Figure 8.6 Arduino board.
Figure 8.7 Access point.
Figure 8.8 Hardware schematics.
List of Components

No. | Item | Qty
1 | 4-Wheel Robot Smart Car Chassis Kit (car model) with Speed Encoder for Arduino | 1
2 | Servo Motor Standard (120) 6 kg.cm Plastic Gears "FS5106B" | 1
3 | RS Raspberry Pi Camera V2 Module Board 8MP Webcam Video | 1
4 | Arduino UNO Microcontroller Development Board + USB Cable | 1
5 | Raspberry Pi 3 Model B+ RS Version, UK Version | 1
6 | Micro SD 16GB-HC10 with Raspbian OS for Raspberry Pi | 1
7 | USB Cable to Micro, 1.5 m | 1
8 | Motor Driver L298N | 1
9 | 17 Values 1% Resistor Kit Assortment, 0 Ohm - 1M Ohm | 1
10 | Mixed Color LEDs, Size 3 mm | 10
11 | Ultrasonic Sensor HC-04 + Ultrasonic Sensor Holder | 3
12 | 9V Battery Energizer Alkaline | 1
13 | 9V Battery Clip with DC Plug | 1
14 | LiPo Battery 11.1V, 5500 mAh, 35C | 1
15 | Wires 20 cm Male to Male | 20
16 | Wires 20 cm Male to Female | 20
17 | Wires 20 cm Female to Female | 20
18 | Breadboard 400 pin | 1
19 | Breadboard 170 pin (White Color) | 1
20 | Rocker Switch On/Off with Lamp, Red (KDC 2) | 3
21 | Power Bank 10000 mAh | 1
Chapter 1
Introduction
1.1 History
1930s
An early representation of the autonomous car was Norman Bell Geddes's Futurama
exhibit sponsored by General Motors at the 1939 World's Fair, which depicted electric
cars powered by circuits embedded in the roadway and controlled by radio.
1950s
In 1953, RCA Labs successfully built a
miniature car that was guided and controlled
by wires that were laid in a pattern on a
laboratory floor. The system sparked the
imagination of Leland M. Hancock, traffic
engineer in the Nebraska Department of
Roads, and of his director, L. N. Ress, state
engineer. The decision was made to experiment with the system in actual highway installations. In 1958, a full-size system was successfully demonstrated by RCA Labs and the State of Nebraska on a 400-foot strip of public highway just outside Lincoln, Nebraska.
1980s
In the 1980s, a vision-guided Mercedes-Benz robotic van, designed by Ernst
Dickmanns and his team at the Bundeswehr University Munich in Munich, Germany,
achieved a speed of 39 miles per hour (63 km/h) on streets without traffic.
Subsequently, EUREKA conducted the €749 million Prometheus Project on
autonomous vehicles from 1987 to 1995.
1990s
In 1991, the United States Congress passed the ISTEA Transportation Authorization
bill, which instructed USDOT to "demonstrate an automated vehicle and highway
system by 1997." The Federal Highway Administration took on this task, first with a
series of Precursor Systems Analyses and then by establishing the National
Automated Highway System Consortium (NAHSC). This cost-shared project was led
by FHWA and General Motors, with Caltrans, Delco, Parsons Brinkerhoff, Bechtel, UC-Berkeley, Carnegie Mellon University, and Lockheed Martin as additional partners.
Extensive systems engineering work and research culminated in Demo '97 on I-15 in
San Diego, California, in which about 20 automated vehicles, including cars, buses,
and trucks, were demonstrated to thousands of onlookers, attracting extensive
media coverage. The demonstrations involved close-headway platooning intended to
operate in segregated traffic, as well as "free agent" vehicles intended to operate in
mixed traffic.
2000s
The US Government funded three military efforts known as Demo I (US Army), Demo
II (DARPA), and Demo III (US Army). Demo III (2001) demonstrated the ability of
unmanned ground vehicles to navigate miles of difficult off-road terrain, avoiding
obstacles such as rocks and trees. James Albus at the National Institute for Standards
and Technology provided the Real-Time Control System which is a hierarchical
control system. Not only were individual vehicles controlled (e.g., throttle, steering, and brake), but groups of vehicles had their movements automatically coordinated in response to high-level goals. The Park Shuttle, a driverless public road transport system, became operational in the Netherlands in the early 2000s. In January 2006,
the United Kingdom's 'Foresight' think-tank revealed a report which predicts RFID-
tagged driverless cars on UK's roads by 2056 and the Royal Academy of Engineering
claimed that driverless trucks could be on Britain's motorways by 2019.
Autonomous vehicles have also been used in mining. Since December 2008, Rio
Tinto Alcan has been testing the Komatsu Autonomous Haulage System – the world's
first commercial autonomous mining haulage system – in the Pilbara iron ore mine in
Western Australia. Rio Tinto has reported benefits in health, safety, and productivity.
In November 2011, Rio Tinto signed a deal to greatly expand its fleet of driverless
trucks. Other autonomous mining systems include Sandvik Automine’s underground
loaders and Caterpillar Inc.'s autonomous hauling.
In 2011
the Freie Universität Berlin developed two autonomous cars to drive in the inner-city traffic of Berlin, Germany. Led by the AUTONOMOS group, the two vehicles, "Spirit of Berlin" and "MadeInGermany", handled inner-city traffic, traffic lights and roundabouts between the International Congress Centrum and the Brandenburg Gate. It was
the first car licensed for autonomous driving on the streets and highways in Germany
and financed by the German Federal Ministry of Education and Research.
The 2014 Mercedes S-Class has options for autonomous steering, lane keeping,
acceleration/braking, parking, accident avoidance, and driver fatigue detection, in
both city traffic and highway speeds of up to 124 miles (200 km) per hour.
Released in 2013, the 2014 Infiniti Q50 uses cameras, radar and other technology to
deliver various lane-keeping, collision avoidance and cruise control features. One
reviewer remarked, "With the Q50 managing its own speed and adjusting course, I could sit back and simply watch, even on mildly curving highways, for three or more miles at a stretch," adding that he wasn't touching the steering wheel or pedals.
Although as of 2013, fully autonomous vehicles are not yet available to the public,
many contemporary car models have features offering limited autonomous
functionality. These include adaptive cruise control, a system that monitors distances to adjacent vehicles in the same lane and adjusts the speed to the flow of traffic; lane keeping, which monitors the vehicle's position in the lane and either warns the driver when the vehicle is leaving its lane or, less commonly, takes corrective action; and parking assist, which assists the driver in the task of parallel parking.
In 2013
on July 12, VisLab conducted another pioneering test of autonomous vehicles,
during which a robotic vehicle drove in downtown Parma with no human control,
successfully navigating roundabouts, traffic lights, pedestrian crossings and other
common hazards.
1.2 Why autonomous car is important
1.2.1 Benefits of Self-Driving Cars
1. Fewer accidents
The leading cause of most automobile accidents today is driver error. Alcohol,
drugs, speeding, aggressive driving, over-compensation, inexperience, slow
reaction time, inattentiveness, and ignoring road conditions are all
contributing factors. Given that some 40 percent of accidents can be traced to the abuse of drugs and/or alcohol, self-driving cars would practically eliminate those accidents altogether.
2. Decreased (or Eliminated) Traffic Congestion
One of the leading causes of traffic jams is selfish behavior among drivers. It has been shown that when drivers space out and allow each other to move freely between lanes on the highway, traffic continues to flow smoothly, regardless of the number of cars on the road.
3. Increased Highway Capacity
There is another benefit to cars traveling down the highway and communicating with one another at regularly spaced intervals. More cars could be on the highway simultaneously because they would need to occupy less space on the highway.
4. Enhanced Human Productivity
Currently, the time spent in our cars is largely given over to simply getting the car and us from place to place. Interestingly though, even doing nothing at all would serve to increase human productivity: studies have shown that taking short breaks increases overall productivity.
You could also finish up a project, type a letter, monitor the progress of your kids' schoolwork, return phone calls, take phone calls safely, text to your heart's content, read a book, or simply relax and enjoy the ride.
5. Hunting for Parking Eliminated
Self-driving cars can be programmed to let you off at the front door of your
destination, park themselves, and come back to pick you up when you
summon them. You’re freed from the task of looking for a parking space,
because the car can do it all.
6. Improved Mobility for Children, The Elderly, And the Disabled
Programming the car to pick up people, drive them to their destination and
Then Park by themselves, will change the lives of the elderly and disabled by
providing them with critical mobility.
7. Elimination of Traffic Enforcement Personnel
If every car is "plugged" into the grid and driving itself, then speeding, along with stop-sign and red-light running, will be eliminated. The cop on the side of the road measuring the speed of traffic for enforcement purposes? Gone. Cars won't speed anymore, so there will be no need for traffic enforcement personnel.
8. Higher Speed Limits
Since all cars are in communication with one another, and they’re all
programmed to maintain a specific interval between one another, and they all
know when to expect each other to stop and start, the need to accommodate
human reflexes on the highway will be eliminated. Thus, cars can maintain
higher average speeds.
9. Lighter, More Versatile Cars
The vast majority of the weight in today’s cars is there because of the need to
incorporate safety equipment. Steel door beams, crumple zones and the need
to build cars from steel in general relate to preparedness for accidents. Self-
driving cars will crash less often, accidents will be all but eliminated, and so
the need to build cars to withstand horrific crashes will be reduced. This
means cars can be lighter, which will make them more fuel-efficient.
1.3 What Are Autonomous and Automated Vehicles
Technological advancements are creating a continuum between conventional, fully
human-driven vehicles and automated vehicles, which partially or fully drive
themselves and which may ultimately require no driver at all. Within this continuum
are technologies that enable a vehicle to assist and make decisions for a human
driver. Such technologies include crash warning systems, adaptive cruise control
(ACC), lane keeping systems, and self-parking technology.
•Level 0 (no automation):
The driver is in complete and sole control of the primary vehicle functions (brake,
steering, throttle, and motive power) at all times, and is solely responsible for
monitoring the roadway and for safe vehicle operation.
•Level 1 (function-specific automation):
Automation at this level involves one or more specific control functions; if multiple
functions are automated, they operate independently of each other. The driver has
overall control, and is solely responsible for safe operation, but can choose to cede
limited authority over a primary control (as in ACC); the vehicle can automatically
assume limited authority over a primary control (as in electronic stability control); or
the automated system can provide added control to aid the driver in certain normal
driving or crash-imminent situations (e.g., dynamic brake support in emergencies).
•Level 2 (combined-function automation):
This level involves automation of at least two primary control functions designed to
work in unison to relieve the driver of controlling those functions. Vehicles at this
level of automation can utilize shared authority when the driver cedes active primary
control in certain limited driving situations. The driver is still responsible for
monitoring the roadway and safe operation, and is expected to be available for
control at all times and on short notice. The system can relinquish control with no
advance warning and the driver must be ready to control the vehicle safely.
•Level 3 (limited self-driving automation):
Vehicles at this level of automation enable the driver to cede full control of all
safety-critical functions under certain traffic or environmental conditions, and in
those conditions to rely heavily on the vehicle to monitor for changes in those
conditions requiring transition back to driver control. The driver is expected to be
available for occasional control, but with sufficiently comfortable transition time.
•Level 4 (full self-driving automation):
The vehicle is designed to perform all safety-critical driving functions and monitor
roadway conditions for an entire trip. Such a design anticipates that the driver will
provide destination or navigation input, but is not expected to be available for control
at any time during the trip. This includes both occupied and unoccupied vehicles.
Our project can be considered a prototype of a Level 4 (full self-driving automation) system.
1.4 Advanced Driver Assistance System (ADAS)
A rapid growth has been seen worldwide in the development of Advanced Driver
Assistance Systems (ADAS) because of improvements in sensing, communicating
and computing technologies. ADAS aim to support drivers by either providing
warning to reduce risk exposure, or automating some of the control tasks to
relieve a driver from manual control of a vehicle. From an operational point of
view, such systems are a clear departure from a century of automobile
development where drivers have had control of all driving tasks at all times.
ADAS could replace some of the human driver decisions and actions with precise
machine tasks, making it possible to eliminate many of the driver errors which
could lead to accidents, and achieve more regulated and smoother vehicle
control with increased capacity and associated energy and environmental
benefits.
Autonomous ADAS systems use on-board equipment, such as ranging
sensors and machine/computer vision, to detect the surrounding environment.
The main advantages of such an approach are that the system operation does not
rely on other parties and that the system can be implemented on the current road
infrastructure. Now many systems have become available on the market including
Adaptive Cruise Control (ACC), Forward Collision Warning (FCW) and Lane Departure
Warning systems, and many more are under development. Currently, radar sensors
are widely used in the ADAS applications for obstacle detection. Compared with
optical or infrared sensors, the main advantage of radar sensors is that they perform
equally well during day time and night time, and in most weather conditions. Radar
can be used for target identification by making use of scattering signature
information.
It is widely used in ADAS for supporting lateral control such as lane departure
warning systems and lane keeping systems.
Currently computer vision has not yet gained a large enough acceptance in
automotive applications. Applications of computer vision depend much on the
capability of image process and pattern recognition (e.g. artificial intelligence). The
fact that computer vision is based on a passive sensory principle creates detection
difficulties in conditions with adverse lighting or in bad weather situations.
1.5 Project description
1.5.1 Auto-parking
The aim of this function is to design and implement self-parking car system that
moves a car from a traffic lane into a parking spot through accurate and realistic
steps which can be applied on a real car.
1.5.2 Adaptive cruise control (ACC)
Also known as radar cruise control or traffic-aware cruise control, this is an optional cruise control system for road vehicles that automatically adjusts the vehicle speed to maintain a safe distance from vehicles ahead. It makes no use of satellite or roadside infrastructure, nor of any cooperative support from other vehicles. Hence, control is imposed based on information from on-board sensors only.
1.5.3 Lane Keeping Assist
It is a feature that in addition to Lane Departure Warning System automatically
takes steps to ensure the vehicle stays in its lane. Some vehicles combine adaptive
cruise control with lane keeping systems to provide additional safety. A lane
keeping assist mechanism can either reactively turn a vehicle back into the lane if
it starts to leave or proactively keep the vehicle in the center of the lane. Vehicle
companies often use the term "Lane Keep(ing) Assist" to refer to both reactive
Lane Keep Assist (LKA) and proactive Lane Centering Assist (LCA) but the terms are
beginning to be differentiated.
1.5.4 Lane departure
Our car moves using adaptive cruise control according to the distance to the vehicle in front. If the front vehicle is very slow and would cause our car to slow down, the car will start to check the lane next to it and then depart to the next lane in order to speed up again.
1.5.5 Indoor Positioning system
An indoor positioning system (IPS) is a system to locate objects or people inside a
building using radio waves, magnetic fields, acoustic signals, or other sensory
information collected by mobile devices. There are several commercial systems on
the market, but there is no standard for an IPS system.
IPS systems use different technologies, including distance measurement to nearby
anchor nodes (nodes with known positions, e.g., Wi-Fi access points), magnetic
positioning, dead reckoning. They either actively locate mobile devices and tags or
provide ambient location or environmental context for devices to get sensed. The
localized nature of an IPS has resulted in design fragmentation, with systems making
use of various optical, radio, or even acoustic technologies.
1.6 Related work
The appearance of driverless and automated vehicle technologies offers
enormous opportunities to remove human error from driving. It will make
driving easier, improve road safety, and ease congestion. It will also enable
drivers to choose to do other things than driving during the journey. Shown in Figure 1-1 is the first driverless electric car prototype built by Google to test its self-driving car project. It looks like a Smart car, with two seats and room enough for a small amount of luggage.
Figure 1-1 Google car.
It operates in and around California, primarily around the Mountain View area
where Google has its headquarters.
It moves two people from one place to another without any user interaction. The car
is called by a smartphone for pick up at the user’s location with the destination set.
There is no steering wheel or manual control, simply a start button and a big red
emergency stop button. In front of the passengers there is a small screen showing the
weather and the current speed. Once the journey is done, the small screen displays a
message to remind you to take your personal belongings. Seat belts are also provided in the car to protect the passengers in case the primary systems fail, plus an emergency stop button that passengers can hit at any time.
Powered by an electric motor with around a 100-mile range, the car uses a
combination of sensors and software to locate itself in the real world combined with
highly accurate digital maps. A GPS is used, just like the satellite navigation systems in most cars, while lasers and cameras monitor the world around the car in 360 degrees.
The software can recognize objects, people, cars, road marking, signs and traffic
lights, obeying the rules of the road. It can even detect road works and safely
navigate around them.
The new prototype has more sensors fitted to it that can see further (up to 600 feet in all directions).
The simultaneous development of a combination of technologies has brought
about this opportunity. For example, some current production vehicles now
feature adaptive cruise control and lane keeping technologies which allow the
automated control of acceleration, braking and steering for periods of time on
motorways, major A-roads and in congested traffic. Advanced emergency braking systems automatically apply the brakes to help drivers avoid a collision. Self-parking systems allow a vehicle to parallel or reverse park completely hands-free.
Developments in vehicle automation technology in the short and medium term will
move us closer to the ultimate scenario of a vehicle which is completely “driverless”.
Figure 1-2 progression of automated vehicle technologies
VOLVO autonomous CAR
Its semi-autonomous driving features include sensors that can detect lanes and a car in front of it, and a button on the steering wheel to let the system know the driver wants it to use Adaptive Cruise Control with Pilot Assist.
If the XC90 lost track of the lanes, it would ask the driver to handle steering duties with a ping and a message in the dashboard. This is called the human-machine interface.
BMW autonomous CAR
A new i-Series car will include forms of automated driving and digital connectivity, most likely Wi-Fi, high-definition digital maps, sensor technology, cloud technology and artificial intelligence.
Nissan autonomous CAR
Nissan offers driver assistance in the form of its Safety Shield-inspired technologies. These technologies can monitor a nearly 360-degree view around a vehicle for risks, offering warnings to the driver and taking action to help avoid crashes if necessary.
Chapter 2
Car Design and Hardware
2.1 System Design
Figure 2.1 System block
Figure 2.1 shows the block diagram of the system. The ultrasonic sensor, servo motor and L298N motor driver are connected to the Arduino Uno, while the Raspberry Pi Camera is connected to the camera module port on the Raspberry Pi 3. The laptop and Raspberry Pi are connected together via the TCP protocol, and the Raspberry Pi and Arduino are connected via the I2C protocol (see the System Integration chapter).
The Raspberry Pi 3 and Arduino Uno were powered by a 5V power bank, and the DC motors were powered by a 7.2V battery. The DC motors are controlled by the L298N motor driver: the Arduino sends the control signals to the L298N motor driver, which drives the DC motors clockwise or anticlockwise. The servo motor is powered by the 5V on-board voltage regulator (red arrow) of the L298N motor driver, fed from the 7.2V battery. The ultrasonic sensor is powered by 5V from the Arduino's 5V output pin.
2.2 Chassis
The chassis is the body of the RC car on which all the components are mounted: the webcam, battery, power bank, servo motor, Raspberry Pi 3, L298N motor driver and ultrasonic sensor. The chassis consists of two plates that were cut by a laser cutting machine. We added two DC motors attached to the back wheels, and a servo attached to the front wheels that is used to steer the RC car.
Figure 2.2 Top View of Chassis.
2.3 Raspberry Pi 3 Model B
Figure 2.3 Raspberry Pi 3 model B.
The Raspberry Pi is a low cost, credit-card sized computer that plugs into a computer
monitor or TV, and uses a standard keyboard and mouse. It is a capable little device
that enables people of all ages to explore computing, and to learn how to program in
languages like Scratch and Python. It’s capable of doing everything you’d expect a
desktop computer to do, from browsing the internet and playing high-definition
video, to making spreadsheets, word-processing, and playing games. The Raspberry
Pi has the ability to interact with the outside world, and has been used in a wide array
of digital maker projects, from music machines and parent detectors to weather
stations and tweeting birdhouses with infra-red cameras. We want to see the
Raspberry Pi being used by kids all over the world to learn to program and understand
how computers work (1). For more details visit www.raspberrypi.org .
In our project, the Raspberry Pi is used to connect the camera, Arduino and laptop together. The Raspberry Pi sends the images taken by the Raspberry Pi camera to the laptop, which processes them and makes a decision; the Raspberry Pi then sends the corresponding orders to the Arduino to control the servo and motors. The basic specifications of the Raspberry Pi 3 Model B are given in the next table:
Raspberry Pi 3 Model B
Processor Chipset: Broadcom BCM2837 64-bit quad-core processor (single-board computer running at 1250 MHz)
GPU: VideoCore IV
Processor Speed: Quad core @ 1250 MHz
RAM: 1 GB SDRAM @ 400 MHz
Storage: MicroSD
USB 2.0: 4x USB ports
Power Draw / Voltage: 2.5 A @ 5 V
GPIO: 40-pin
Ethernet Port: Yes
Wi-Fi: Built in
Bluetooth LE: Built in
Figure 2.3 Raspberry-pi 3 model B pins layout.
2.4 Arduino Uno
Figure 2.4 Arduino Uno Board
Arduino is an open-source electronics platform based on easy-to-use hardware and
software. Arduino boards are able to read inputs - light on a sensor, a finger on a
button, or a Twitter message - and turn it into an output - activating a motor, turning
on an LED, publishing something online. You can tell your board what to do by sending
a set of instructions to the microcontroller on the board. To do so you use the Arduino
programming language (based on Wiring), and the Arduino Software (IDE), based on
Processing. (2)
The Arduino Uno is a microcontroller board based on the 8-bit ATmega328P microcontroller. Along with the ATmega328P, it contains other components such as a crystal oscillator, serial communication, a voltage regulator, etc. to support the microcontroller. The Arduino Uno has 14 digital input/output pins (of which 6 can be used as PWM outputs), 6 analog input pins, a USB connection, a power barrel jack, an ICSP header and a reset button. The next table contains the technical specifications of the Arduino Uno:
Arduino Uno Technical Specifications
Microcontroller ATmega328P – 8 bit AVR family
microcontroller
Operating Voltage 5V
Recommended Input Voltage 7-12V
Input Voltage Limits 6-20V
Analog Input Pins 6 (A0 – A5)
Digital I/O Pins 14 (Out of which 6 provide PWM output)
DC Current on I/O Pins 40 mA
DC Current on 3.3V Pin 50 mA
Flash Memory 32 KB (0.5 KB is used for Bootloader)
SRAM 2 KB
EEPROM 1 KB
Frequency (Clock Speed) 16 MHz
In our project, the Raspberry Pi sends the decision to the Arduino, which controls the DC motors, servo motor and ultrasonic sensor (Figure 2.1).
Arduino IDE (Integrated Development Environment) is required to program the
Arduino Uno board. For more details you can visit www.arduino.cc .
2.5 Ultrasonic Sensor
The HC-SR04 ultrasonic sensor is a sensor that can measure distance. It emits ultrasound at 40,000 Hz (40 kHz), which travels through the air; if there is an object or obstacle in its path, it bounces back to the module. Considering the travel time and the speed of sound, you can calculate the distance. The pins of the HC-SR04 are VCC (1), TRIG (2), ECHO (3), and GND (4). The supply voltage of VCC is +5V, and you can attach the TRIG and ECHO pins to any digital I/O on your Arduino board.
Figure 2.5. connection of Arduino and Ultrasonic
In order to generate the ultrasound, we need to set the trigger pin to a HIGH state for 10 µs. That sends out an 8-cycle sonic burst which travels at the speed of sound and is received on the echo pin. The echo pin outputs the time in microseconds that the sound wave traveled. For example, if the object is 20 cm away from the sensor, and the speed of sound is 340 m/s or 0.034 cm/µs, the sound wave will need to travel for about 588 microseconds. But what you get from the echo pin will be double that number, because the sound wave needs to travel forward and bounce back. So, in order to get the distance in cm, we need to multiply the received travel time from the echo pin by 0.034 and divide it by 2. The Arduino code is written below.
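A minimal sketch of such a program, assuming the TRIG pin is wired to digital pin 9 and the ECHO pin to digital pin 10 (the pin numbers are an assumption for illustration):

const int trigPin = 9;   // assumed wiring: TRIG on pin 9
const int echoPin = 10;  // assumed wiring: ECHO on pin 10

void setup() {
  pinMode(trigPin, OUTPUT);
  pinMode(echoPin, INPUT);
  Serial.begin(9600);
}

void loop() {
  // Generate the 10 us trigger pulse.
  digitalWrite(trigPin, LOW);
  delayMicroseconds(2);
  digitalWrite(trigPin, HIGH);
  delayMicroseconds(10);
  digitalWrite(trigPin, LOW);

  // Round-trip travel time of the echo, in microseconds.
  long duration = pulseIn(echoPin, HIGH);

  // Distance in cm = time * 0.034 cm/us / 2 (the wave travels there and back).
  float distanceCm = duration * 0.034 / 2;

  Serial.print("Distance: ");
  Serial.println(distanceCm);
  delay(100);
}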
Results: After uploading the code, open the Serial Monitor to display the data. Now place an object in front of the sensor and watch the measurement.
2.6 Servo Motor
Servo motors are great devices that can turn to a specified position. They have a servo arm that can turn 180 degrees. Using the Arduino, we can tell a servo to go to a specified position and it will go there.
We used Servo motors in our project to control the steering of the car.
A servo motor has everything built in: a motor, a feedback circuit, and a motor driver.
It just needs one power line, one ground, and one control pin.
Figure 2.6 Servo motor connection with Arduino
Following are the steps to connect a servo motor to the Arduino:
1. The servo motor has a female connector with three pins. The darkest or even
black one is usually the ground. Connect this to the Arduino GND.
2. Connect the power cable that in all standards should be red to 5V on the
Arduino.
3. Connect the remaining line on the servo connector to a digital pin on the
Arduino.
The following code will turn a servo motor to 0 degrees, wait 1 second, then turn it to
90, wait one more second, turn it to 180, and then go back.
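A minimal sketch along those lines, using the standard Arduino Servo library and assuming the signal wire is on digital pin 3 (the pin number is an assumption):

#include <Servo.h>

Servo myServo;          // servo object from the Servo library

void setup() {
  myServo.attach(3);    // assumed wiring: signal wire on pin 3
}

void loop() {
  myServo.write(0);     // turn to 0 degrees
  delay(1000);          // wait 1 second
  myServo.write(90);    // turn to 90 degrees
  delay(1000);
  myServo.write(180);   // turn to 180 degrees
  delay(1000);          // then the loop repeats, going back to 0
}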
The servo motor is used to turn the robot right or left, depending on the angle commanded from the Arduino.
Figure 2.6 Turn Robot Right. Figure 2.7 Turn Robot Left.
2.7 L298N Dual H-Bridge Motor Driver and DC Motors
DC Motor:
A DC motor (Direct Current motor) is the most common type of motor. DC motors
normally have just two leads, one positive and one negative. If you connect these
two leads directly to a battery, the motor will rotate. If you switch the leads, the
motor will rotate in the opposite direction.
PWM DC Motor Control:
PWM, or pulse width modulation, is a technique which allows us to adjust the average value of the voltage going to an electronic device by turning the power on and off at a fast rate. The average voltage depends on the duty cycle, i.e. the amount of time the signal is ON versus the amount of time the signal is OFF within a single period. We can control the speed of the DC motor simply by controlling its input voltage, and the most common method of doing that is by using a PWM signal.
So, depending on the size of the motor, we can simply connect an Arduino PWM output to the base of a transistor or the gate of a MOSFET and control the speed of the motor by controlling the PWM output. The low-power Arduino PWM signal switches the MOSFET gate on and off, and through the MOSFET the high-power motor is driven. Note that the Arduino GND and the motor power supply GND should be connected together.
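As a small illustration of this idea (not the exact project code), the sketch below reads a potentiometer and writes the corresponding PWM duty cycle to the MOSFET gate; the pins are assumptions:

const int gatePin = 6;   // assumed: MOSFET gate driven from PWM pin 6
const int potPin  = A0;  // assumed: potentiometer on analog pin A0

void setup() {
  pinMode(gatePin, OUTPUT);
}

void loop() {
  int reading = analogRead(potPin);             // 0..1023
  int duty    = map(reading, 0, 1023, 0, 255);  // scale to 0..255
  analogWrite(gatePin, duty);                   // PWM to the gate sets the motor speed
}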
Figure 2.7 Pulse Width Modulation.
Figure 2.8 Controlling DC motor using MOSFET.
Figure 2.9 H-Bridge DC Motor Control.
H-Bridge DC Motor Control:
For controlling the rotation direction, we just need to reverse the direction of the current flow through the motor, and the most common method of doing that is by using an H-bridge. An H-bridge circuit contains four switching elements, transistors or MOSFETs, with the motor at the center forming an H-like configuration. By activating two particular switches at the same time, we can change the direction of the current flow and thus change the rotation direction of the motor.
So, if we combine these two methods, the PWM and the H-Bridge, we can have a
complete control over the DC motor. There are many DC motor drivers that have
these features and the L298N is one of them.
L298N Driver:
The L298N is a dual H-Bridge motor driver which allows speed and direction control
of two DC motors at the same time. The module can drive DC motors that have
voltages between 5 and 35V, with a peak current up to 2A.
L298N module has two screw terminal blocks for the motor A and B, and another
screw terminal block for the Ground pin, the VCC for motor and a 5V pin which can
either be an input or output. This depends on the voltage used at the motors VCC.
The module has an onboard 5V regulator which is either enabled or disabled using a jumper. If the motor supply voltage is up to 12V, we can enable the 5V regulator and the 5V pin can be used as an output, for example for powering our Arduino board. But if the motor voltage is greater than 12V, we must disconnect the jumper because those voltages would damage the onboard 5V regulator. In this case the 5V pin is used as an input, as we need to connect it to a 5V power supply in order for the IC to work properly. Note also that this IC causes a voltage drop of about 2V.
L298N Motor Driver Specification
Operating Voltage 5V ~ 35V
Logic Power Output Vss 5V ~ 7V
Logical Current 0 ~36mA
Drive Current 2A (Max single bridge)
Max Power 25W
Controlling Level Low: -0.3V ~ 1.5V
High: 2.3V ~ Vss
Enable Signal Low: -0.3V ~ 1.5V
High: 2.3V ~ Vss
Table 2.10 L298N Motor Driver Specification.
Figure 2.11 L298N Motor Driver.
So, for example, if we use a 12V power supply, the voltage at motors terminals
will be about 10V, which means that we won’t be able to get the maximum speed
out of our 12V DC motor.
Next are the logic control inputs. The Enable A and Enable B pins are used for enabling and controlling the speed of the motor. If a jumper is present on this pin, the motor will be enabled and will work at maximum speed; if we remove the jumper, we can connect a PWM input to this pin and in that way control the speed of the motor. If we connect this pin to ground, the motor will be disabled.
Figure 2.12 L298 control pins.
The Input 1 and Input 2 pins are used for controlling the rotation direction of motor A, and inputs 3 and 4 for motor B. Using these pins, we actually control the switches of the H-bridge inside the L298N IC. If input 1 is LOW and input 2 is HIGH, the motor will move forward; if input 1 is HIGH and input 2 is LOW, the motor will move backward. If both inputs are the same, either LOW or HIGH, the motor will stop. The same applies to inputs 3 and 4 and motor B.
Arduino and L298N
For example, we will control
the speed of the motor using a
potentiometer and change the
rotation direction using a push
button.
Figure 2.13 Arduino and L298N connection
Here’s the Arduino code:
Figure2-14 Arduino code.
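A minimal sketch of such a program, assuming ENA on PWM pin 9, IN1/IN2 on pins 6 and 7, the potentiometer on A0, and the push button wired to ground on pin 4 using the internal pull-up (all pin choices are assumptions):

const int enA    = 9;    // assumed: ENA (PWM) on pin 9
const int in1    = 6;    // assumed: IN1 on pin 6
const int in2    = 7;    // assumed: IN2 on pin 7
const int potPin = A0;   // assumed: speed potentiometer on A0
const int button = 4;    // assumed: push button to GND on pin 4

bool forward = true;     // current rotation direction
bool lastReading = HIGH; // previous button state (with pull-up, HIGH = released)

void setup() {
  pinMode(enA, OUTPUT);
  pinMode(in1, OUTPUT);
  pinMode(in2, OUTPUT);
  pinMode(button, INPUT_PULLUP);
}

void loop() {
  // Speed: map the potentiometer reading (0-1023) to a PWM duty cycle (0-255).
  int duty = map(analogRead(potPin), 0, 1023, 0, 255);
  analogWrite(enA, duty);

  // Direction: toggle on each new button press (falling edge), with a crude debounce.
  bool reading = digitalRead(button);
  if (lastReading == HIGH && reading == LOW) {
    forward = !forward;
    delay(20);
  }
  lastReading = reading;

  if (forward) {
    digitalWrite(in1, LOW);   // IN1 LOW, IN2 HIGH -> motor moves forward
    digitalWrite(in2, HIGH);
  } else {
    digitalWrite(in1, HIGH);  // IN1 HIGH, IN2 LOW -> motor moves backward
    digitalWrite(in2, LOW);
  }
}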
In the loop, we check whether the button has been pressed, and if so, we change the rotation direction of the motor by inverting the Input 1 and Input 2 states. The push button works as a toggle button: each time we press it, it changes the rotation direction of the motor.
2.8 The Camera Module
The Raspberry Pi Camera Module is used to take pictures, record video, and apply
image effects. In our project we used it for computer vision to detect the lanes of the
road and detect the signs and cars.
Figure 2.14 Camera module. Figure 2.15 Raspberry Pi camera.
Figure 2.16 Camera connection with Raspberry Pi.
The Python picamera library allows us to control the Camera Module and create amazing projects. A small example of using the camera on the Raspberry Pi to start a camera preview for 5 seconds:
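A minimal version of that example, using the picamera library's preview interface (run on the Raspberry Pi itself):

from picamera import PiCamera
from time import sleep

camera = PiCamera()      # open the camera module
camera.start_preview()   # show the live preview on the Pi's display
sleep(5)                 # keep it on screen for 5 seconds
camera.stop_preview()    # close the preview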
You can go to the system integration chapter to see how the camera is used in our project, and you can also visit the raspberrypi.org website to see many projects based on the Raspberry Pi camera.
Chapter 3
Deep Learning
3.1 Introduction:
Machine learning:
1) Grew out of work in AI.
2) Provides a new capability for computers.
Examples:
1) Database mining: large datasets from the growth of automation and the web, e.g., web click data, medical records, biology, engineering.
2) Applications that can't be programmed by hand, e.g., autonomous helicopters, handwriting recognition, most of Natural Language Processing (NLP), and Computer Vision.
3.2 What is machine learning:
Arthur Samuel (1959):
Machine Learning: Field of study that gives computers the ability to learn
without being explicitly programmed.
Tom Mitchell (1998):
Well-posed Learning Problem: A computer program is said to learn from
experience E with respect to some task T and some performance measure P, if
its performance on T, as measured by P, improves with experience E.
3.2.1 Machine learning algorithms:
1) Supervised learning
2) Unsupervised learning
Others:
Reinforcement learning, recommender systems.
In supervised learning, a training set is fed to a learning algorithm, which outputs a hypothesis h that maps an input to a predicted output (Figure 3.1).
Supervised Learning: the "right answer" is given for each example in the data.
Regression: predict a continuous-valued output.
Classification: predict a discrete-valued output (0 or 1).
Figure 3.1 Learning Algorithm.
3.2.2 Classification:
Multiclass classification examples:
1) Email foldering/tagging: Work, Friends, Family, Hobby.
2) Medical diagnosis: Not ill, Cold, Flu.
3) Weather: Sunny, Cloudy, Rain, Snow.
In a two-feature space (x1, x2), binary classification separates examples into two classes, while multi-class classification separates them into several classes.
3.2.3 The problem of overfitting:
Overfitting: If we have too many features, the learned hypothesis may fit the
training set very well but fail to generalize to new examples (predict prices on new
examples).
3.3 Representation of Neural Networks:
What does a computer see when it looks at an image? Where a human sees, for example, a car, the camera sees only a grid of pixel intensity values.
Computer Vision example: car detection. A learning algorithm is trained on labeled images of cars and non-cars, each represented by its pixel intensities (e.g. plotted along pixel 1 and pixel 2); at test time it is asked whether a new image is a car.
The feature vector lists the intensity of every pixel (pixel 1 intensity, pixel 2 intensity, ..., pixel 2500 intensity). A 50 x 50-pixel image already gives 2500 pixel features (7500 if RGB), and including all quadratic features (products of pairs of pixels) gives roughly 3 million features. That is why we need neural networks.
Neural networks handle multi-class classification with multiple output units (one-vs-all): for example, one output each for pedestrian, car, motorcycle and truck, and the desired output vector has a 1 in the position of the true class (when the image is a pedestrian, when it is a car, when it is a motorcycle, and so on).
3.4 Training a neural network:
1) Pick a network architecture (connectivity pattern between neurons).
No. of input units: dimension of features.
No. of output units: number of classes.
Reasonable default: 1 hidden layer; or if >1 hidden layer, have the same number of hidden units in every layer (usually the more the better).
2) Randomly initialize the weights.
3) Implement forward propagation to get h_theta(x^(i)) for any x^(i).
4) Implement code to compute the cost function J(theta) (one common form is sketched after this list).
5) Implement backpropagation to compute the partial derivatives.
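One standard choice for the cost function in step 4 (this particular form is an assumption; the text does not spell it out), for a K-class network with sigmoid output units, is the regularized cross-entropy:

\[
J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[ y_k^{(i)}\log\big(h_\Theta(x^{(i)})\big)_k + \big(1-y_k^{(i)}\big)\log\Big(1-\big(h_\Theta(x^{(i)})\big)_k\Big)\right] + \frac{\lambda}{2m}\sum_{l}\sum_{i}\sum_{j}\big(\Theta_{ji}^{(l)}\big)^{2}
\]

The first term measures how well the predictions match the training labels, and the second term is the regularization that penalizes large weights (controlled by lambda).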
3.4.1 Advice for applying machine learning:
Debugging a learning algorithm:
Suppose you have implemented regularized linear regression to predict housing
prices. However, when you test your hypothesis on a new set of houses, you find that it makes unacceptably large errors in its predictions. What should you try next?
1) Get more training examples.
2) Try smaller sets of features.
3) Try getting additional features.
4) Try adding polynomial features.
5) Try decreasing lambda.
6) Try increasing lambda.
3.4.2 Neural networks and overfitting:
“Small” neural network
(fewer parameters; more prone to underfitting)
Computationally cheaper
“Large” neural network
(more parameters; more prone to overfitting)
Computationally more expensive.
Use regularization (lambda) to address overfitting.
3.5 Neural Networks and Introduction to Deep Learning:
Deep learning is a set of learning methods attempting to model data with
complex architectures combining different non-linear transformations. The
elementary bricks of deep learning are the neural networks, that are combined
to form the deep neural networks.
These techniques have enabled significant progress in the fields of sound
and image processing, including facial recognition, speech recognition, computer
vision, automated language processing, text classification (for example
spam recognition). Potential applications are very numerous. A spectacular example is the AlphaGo program, which learned to play the game of Go by the deep learning method and beat the world champion in 2016.
There exist several types of architectures for neural networks:
• The multilayer perceptrons: that are the oldest and simplest ones.
• The Convolutional Neural Networks (CNN): particularly adapted for image processing.
• The recurrent neural networks: used for sequential data such as text or time series.
They are based on deep cascade of layers. They need clever stochastic
optimization algorithms, and initialization, and also a clever choice of the
structure. They lead to very impressive results, although very few theoretical
foundations are available till now.
Neural networks:
An artificial neural network is a mapping, nonlinear with respect to its parameters θ, that associates to an input x an output y = f(x; θ). For the sake of simplicity, we assume that y is one-dimensional, but it could also be multidimensional. This mapping f has a particular form that we will make precise. Neural networks can be used for regression or classification. As usual in statistical learning, the parameters θ are estimated from a learning sample. The function to minimize is not convex, leading to local minimizers. The success of the method came from a universal approximation theorem due to Cybenko (1989) and Hornik (1991). Moreover, Le Cun (1986) proposed an efficient way to compute the gradient of a neural network, called backpropagation of the gradient, which allows a local minimizer of the quadratic criterion to be obtained easily.
Artificial Neuron:
An artificial neuron is a function f_j of the input x = (x_1, ..., x_d), weighted by a vector of connection weights w_j = (w_j,1, ..., w_j,d), completed by a neuron bias b_j, and associated with an activation function φ, namely
y_j = f_j(x) = φ(⟨w_j, x⟩ + b_j).
Several activation functions can be considered.
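Common examples (added here for illustration) include the sigmoid, hyperbolic tangent and ReLU activations:

\[
\varphi(t)=\frac{1}{1+e^{-t}}\ \text{(sigmoid)},\qquad \varphi(t)=\tanh(t),\qquad \varphi(t)=\max(0,t)\ \text{(ReLU)}.
\]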
Convolutional neural networks:
For some types of data, especially for images, multilayer perceptrons are not
well adapted. Indeed, they are defined for vectors as input data, hence, to apply
them to images, we would have to transform the images into vectors, thereby losing the spatial information contained in the images, such as shapes. Before the development of deep learning for computer vision, learning was based on the extraction of variables of interest, called features, but these methods need a lot of experience in image processing. The convolutional neural networks (CNN) introduced by LeCun have revolutionized image processing and removed the manual extraction of features. CNNs act directly on matrices, or even on tensors for images with three RGB color channels. CNNs are now
widely used for image classification, image segmentation, object recognition,
face recognition.
Layers in a CNN:
A Convolutional Neural Network is composed of several kinds of layers,
which are described in this section: convolutional layers, pooling layers and fully connected layers. After several convolution and pooling layers, the CNN generally ends with several fully connected layers. The tensor at the output of these layers is transformed into a vector, and then several perceptron layers are added.

Deep learning-based image recognition for autonomous driving:
Various image recognition tasks were handled in the image recognition field prior to 2010 by combining image local features manually designed by researchers (called handcrafted features) with machine learning methods. Since the 2010s, however, many image recognition methods that use deep learning have been proposed.
The image recognition methods using deep learning are far superior to the
methods used prior to the appearance of deep learning in general object
recognition competitions. Hence, this paper will explain how deep learning is
applied to the field of image recognition, and will also explain the latest trends of
deep learning-based autonomous driving.
In the late 1990s, it became possible to process a large amount of data at high speed
with the evolution of general-purpose computers. The mainstream method was to
extract a feature vector (called the image local features) from the image and apply a
machine learning method to perform image recognition. Supervised machine
learning requires a large amount of class-labeled training samples, but it does not
require researchers to design some rules as in the case of rule-based methods. So,
versatile image recognition can be realized. In the 2000s, handcrafted features such as the scale-invariant feature transform (SIFT) and the histogram of oriented gradients (HOG), designed based on the knowledge of researchers, were actively researched as image local features. By combining image local features with machine learning, practical applications of image recognition technology advanced, as represented by face detection. Then, in the 2010s, deep learning, which performs the feature extraction process through learning, came under the spotlight. A handcrafted feature is not necessarily optimal because it extracts and expresses feature values using a designed algorithm based on the knowledge of researchers. Deep learning is an approach that can automate the feature extraction process and is effective for image recognition. Deep learning has accomplished impressive results in the general object recognition competitions, and the use of image recognition required for autonomous driving (such as object detection and semantic
segmentation) is in progress. This paper explains how deep learning is applied to
each task in image recognition and how it is solved, and describes the trend of deep
learning-based autonomous driving and related problems.
3.6 Problem setting in image recognition:
In conventional machine learning (here, it is defined as a method prior to the time
when deep learning gained attention), it is difficult to directly solve general object
recognition tasks from the input image. This problem can be solved by distinguishing
the tasks of image verification, image classification, object detection, scene
understanding, and specific object recognition. Definitions of each task
and approaches to each task are described below.
Image verification:
Image verification is a problem to check whether the object in the
image is the same as the reference pattern. In image verification, the
distance between the feature vector of the reference pattern and the
feature vector of the input image is calculated. If the distance value is less
than a certain value, the images are determined as identical, and if the
value is more, they are determined to be different. Fingerprint, face, and person verification are tasks in which it is required to determine whether the presented person is the claimed person. In deep learning, the problem of person identification is solved by designing a loss function (the triplet loss function) that makes the distance between two images of the same person small, and the distance to another person's image large [1].
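The triplet loss is commonly written as follows, for an anchor image a, a positive image p of the same person, a negative image n of a different person, an embedding network f, and a margin α:

\[
L(a,p,n)=\max\big(\lVert f(a)-f(p)\rVert^{2}-\lVert f(a)-f(n)\rVert^{2}+\alpha,\ 0\big)
\]

Minimizing this loss pushes same-person distances below different-person distances by at least the margin α.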
Figure3-2 General Object Recognition.
Object detection:
Object detection is the problem of finding the location of an object of a certain
category in the image. Practical face detection and pedestrian detection are
included in this task. Face detection uses a combination of Haar-like features [2]
and AdaBoost, and pedestrian detection uses HOG features [3] and support vector
machine (SVM). In conventional machine learning, object detection is achieved by
training 2-class classifiers corresponding to a certain category and raster scanning
in the image. In deep learning-based object detection, multiclass object detection
targeting several categories can be achieved with one network.
Image classification:
Image classification is the problem of finding the category, among predefined
categories, to which an object in an image belongs. In conventional machine
learning, an approach called bag-of-features (BoF) has been used: it vector-quantizes
the image local features and expresses the whole image as a histogram of visual
words. Deep learning, however, is well suited to the image classification task, and it
gained wide attention in 2015 by achieving an accuracy exceeding human recognition
performance in the 1000-class image classification task.
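To make the bag-of-features idea concrete, here is a minimal Python sketch; the descriptor dimensionality, the codebook size, and the use of random data in place of real SIFT descriptors and a k-means codebook are illustrative assumptions.

```python
import numpy as np

def bag_of_features(local_features, codebook):
    """Assign each local feature (e.g., a SIFT descriptor) to its nearest
    codeword and describe the whole image as a histogram of codeword counts."""
    # distances between every descriptor and every codeword
    dists = np.linalg.norm(local_features[:, None, :] - codebook[None, :, :], axis=2)
    words = np.argmin(dists, axis=1)                       # vector quantization
    hist, _ = np.histogram(words, bins=np.arange(len(codebook) + 1))
    return hist / max(hist.sum(), 1)                       # normalized BoF vector

descriptors = np.random.rand(200, 128)   # 200 SIFT-like descriptors (dummy data)
codebook = np.random.rand(50, 128)       # 50 visual words (normally from k-means)
print(bag_of_features(descriptors, codebook).shape)        # (50,)
```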
Scene understanding (semantic segmentation):
Scene understanding is the problem of understanding the scene structure in an
image. In particular, semantic segmentation, which assigns an object category to
each pixel in an image, has been considered difficult to solve using conventional machine learning.
Therefore, it has been regarded as one of the ultimate problems of computer vision,
but it has been shown that it is a problem that can be solved by applying deep
learning.
Specific object recognition:
Specific object recognition is the problem of finding a specific object. By giving
attributes to objects with proper nouns, specific object recognition is defined as a
subtask of the general object recognition problem. Specific object recognition is
achieved by detecting feature points using SIFT from images, and a voting process
based on the calculation of distance from feature points of reference patterns.
Machine learning is not directly used here, but the learned invariant feature
transform (LIFT) proposed in 2016 achieved an improvement in performance by
learning and replacing each process in SIFT through deep learning.
3.7 Convolutional Neural Network (CNN):
Image recognition prior to deep learning is not always optimal because image
features are extracted and expressed using an algorithm designed based on the
knowledge of researchers, which is called a handcrafted feature. The convolutional
neural network (CNN) [7], which is one type of deep learning, is an approach for
learning feature extraction and classification from training samples, as shown in
Fig. 3-3. This section describes CNN, focuses on object detection and scene
understanding (semantic segmentation), and describes its application to image
recognition and its trends.
Figure3-3 Conventional machine learning and deep learning.
As shown in Fig. 3-4, CNN computes the feature map corresponding to a kernel by
convolving the kernel (weight filter) over the input image. As many feature maps as
there are kernels can be computed. Next, the size of the feature map is reduced by
pooling. As a result,
it is possible to absorb geometrical variations such as slight translation and rotation
of the input image. The convolution process and the pooling process are applied
repeatedly to extract the feature map. The extracted feature map is input to fully-
connected layers, and the probability of each class is finally output. Here, the input
layer has units corresponding to the image, and the output layer has units
corresponding to the number of classes.
Figure3-4 Basic structure of CNN.
Training of CNN is achieved by updating the parameters of the network by the
backpropagation method. The parameters in CNN refer to the kernel of the
convolutional layer and the weights of all coupled layers. The process flow of the
backpropagation method is shown in Fig. 3-4. First, training data is input to the
network using the current parameters to obtain the predictions (forward
propagation). The error is calculated from the predictions and the training label; the
update amount of each parameter is obtained from the error, and each parameter
in the network is updated from the output layer toward the input layer (back
propagation). Training of CNN refers to repeating these processes to acquire good
parameters that can recognize the images correctly.
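This forward propagation / back propagation cycle can be sketched with a small PyTorch example; the layer sizes, the 10 classes, and the random batch are illustrative assumptions rather than the network actually used in this project.

```python
import torch
import torch.nn as nn

# A minimal CNN in the spirit of Fig. 3-4: convolution + pooling repeated,
# then fully connected layers that output class scores.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),          # 10 classes (illustrative)
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

images = torch.randn(4, 3, 32, 32)      # dummy training batch
labels = torch.randint(0, 10, (4,))     # dummy training labels

logits = model(images)                  # forward propagation
loss = criterion(logits, labels)        # error from predictions and labels
optimizer.zero_grad()
loss.backward()                         # back propagation (gradient computation)
optimizer.step()                        # update kernels and weights
```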
3.7.1 Advantages of CNN compared to conventional machine learning:
Fig.3-5 shows some visualization examples of kernels at the first convolution layer of
AlexNet, which is designed for the 1,000-class object classification task at the
ILSVRC (ImageNet Large Scale Visual Recognition Challenge) 2012.
AlexNet consists of five convolution layers and three fully-connected layers, whose
output layer has 1,000 units corresponding to the number of classes. We see that
AlexNet has automatically acquired various filters that extract edge, texture, and
color information with directional components. We investigated the effectiveness of
the CNN filters as local image features by comparing them with HOG in the human
detection task. The detection miss rate for the CNN filters is 3%, while that for HOG
is 8%. Although the CNN kernels of AlexNet were not trained for the human
detection task, the detection accuracy improved over the HOG feature, which is a
traditional handcrafted feature.
Figure3-5 Network structure of AlexNet and Kernels.
As shown in Fig.3-6, CNN can perform not only image classification but also object
detection and semantic segmentation by designing the output layer according to
each task of image recognition. For example, if the output layer is designed to
output class probability and detection region for each grid, it will become a network
structure that can perform object detection. In semantic segmentation, the output
layer should be designed to output the class probability for each pixel. Convolution
and pooling layers can be used as common modules for these tasks.
On the other hand, in the conventional machine learning method, it was necessary
to design image local features for each task and combine them with machine learning.
CNN has the flexibility to be applied to various tasks by changing the network
structure, and this property is a great advantage in achieving image recognition.
Figure3-6 Application of CNN to each image recognition task.
3.7.2 Application of CNN to object detection task:
Conventional machine learning-based object detection is an approach that raster-scans
a two-class classifier over the image. In this case, because the aspect ratio of the
detection window is fixed, only the single category learned as positive samples can be
detected. On the other hand, in object detection using CNN, object proposal regions
with different aspect ratios are detected by CNN, and multiclass object detection is
possible using the region proposal approach, which performs multiclass classification
with CNN for each detected region.
Faster R-CNN introduces Region Proposal Network (RPN) as shown in Fig.3-7, and
simultaneously detects object candidate regions and recognizes object classes in
those regions. First, convolution processing is performed on the entire input image
to obtain a feature map. In RPN, an object is detected by raster scanning the
detection windows on the obtained feature map. In this raster scanning, k detection
windows of different shapes are applied, centered on locations known as anchors.
The region specified by an anchor is input to the RPN, which outputs an objectness
score and the detected coordinates on the input image. In addition, the region
specified by the anchor is also input to a fully-connected network, and object
recognition is performed when the RPN determines it to be an object. Therefore, the
number of units in the output layer for one rectangle is the number of classes plus
((x, y, w, h) × number of classes). These
Region Proposal methods have made it possible to detect multiple classes of objects
with different aspect ratios.
Figure3-7 Faster R-CNN structure.
In 2016, the single-shot method was proposed as a new multiclass object detection
approach. This is a method to detect multiple objects only by giving the whole image
to CNN without raster scanning the image. YOLO (You Only Look Once) is a
representative method, in which an object rectangle and an object category are output
for each local region divided by a 7 × 7 grid, as shown in Fig. 3-8. First, feature maps
are generated through convolution and pooling of the input image. The position (i, j) of
each channel of the obtained feature map (7 × 7 × 1024) corresponds to a region
feature of the grid cell (i, j) of the input image, and this feature map is input to fully
connected layers. The output values obtained through the fully connected layers are
the scores of the object categories (20 categories) at each grid position and the
position, size, and confidence of two object rectangles. Therefore, the number of
units in the output layer is 1470, obtained by adding the position, size, and confidence
of the two object rectangles ((x, y, w, h, confidence) × 2) to the number of categories
(20) and multiplying by the number of grid cells (7 × 7). In YOLO, it is not necessary to
detect object region candidates as in Faster R-CNN; therefore, object detection can be
performed in real time. Fig. 3-8 shows an example of YOLO-based multiclass object
detection.
Figure3-8 YOLO structure and examples of multiclass object detection.
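The unit count of the YOLO output layer quoted above can be checked with a few lines of Python; the numbers simply restate the 7 × 7 grid, 20 categories, and 2 boxes per cell from the text.

```python
# Illustrative check of the output size of the YOLO head described above.
grid = 7
num_classes = 20
boxes_per_cell = 2
values_per_box = 5           # x, y, w, h, confidence

units_per_cell = num_classes + boxes_per_cell * values_per_box   # 30
output_units = grid * grid * units_per_cell
print(output_units)          # 1470, matching the unit count in the text
```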
3.7.3 Application of CNN to semantic segmentation:
Semantic segmentation is a difficult task and has been studied for many years in the
field of computer vision. However, as with other tasks, deep learning-based methods
have been proposed and achieved much higher performance than conventional
machine learning methods. Fully convolutional network (FCN) is a method that
enables end-to-end learning and can obtain segmentation results using only CNN.
The structure of FCN is shown in Fig.3-9. The FCN has a network structure that does
not have a fully-connected layer.
The size of the generated feature map is reduced by repeatedly applying
convolutional and pooling layers to the input image. To make it the same size as the
original image, the feature map is enlarged 32 times in the final layer and convolution
processing is performed; this is called deconvolution. The final layer
outputs the probability map of each class. The probability map is trained so that the
probability of the class in each pixel is obtained, and the unit of output of the end-
to-end segmentation model is (w × h × number of classes). Generally, the feature
map of the middle layer of CNN captures more detailed information as it is closer to
the input layer, and the pooling process integrates these pieces of information,
resulting in the loss of detailed information. When this feature map is expanded,
coarse segmentation results are obtained. Therefore, high accuracy is achieved by
integrating and using the feature map of the middle layer. Additionally, FCN
performs processing to integrate feature maps in the middle of the network.
Convolution process is performed by connecting mid-feature maps in the channel
direction, and segmentation results of the same size as the original image are
output.
Figure3-9 Fully Convolutional Network (FCN) Structure.
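The following minimal PyTorch sketch illustrates the FCN idea of upsampling a coarse class-score map and fusing it with a middle-layer map; bilinear interpolation is used here in place of a learned deconvolution for brevity, and all tensor sizes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Coarse class scores from the deepest layer are upsampled and fused with a
# finer middle-layer map before producing per-pixel class probabilities.
num_classes = 21
coarse = torch.randn(1, num_classes, 8, 8)     # deep, low-resolution scores
middle = torch.randn(1, num_classes, 16, 16)   # scores from a middle layer

up = F.interpolate(coarse, scale_factor=2, mode="bilinear", align_corners=False)
fused = up + middle                            # skip connection
full = F.interpolate(fused, size=(256, 256), mode="bilinear", align_corners=False)
probs = full.softmax(dim=1)                    # (1, num_classes, H, W)
print(probs.shape)
```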
When expanding the feature map obtained on the encoder side, PSPNet can capture
information of different scales by using the Pyramid Pooling Module, which expands
at multiple scales. The Pyramid Pooling Module is used to pool feature maps with 1
× 1, 2 × 2, 3 × 3, and 6 × 6 grids, in which the vertical and horizontal sizes of the
original image are reduced to 1/8, respectively, on the encoder side. Then, a
convolution process is performed on each pooled feature map.
Next, the feature maps are expanded to the same size and concatenated, a
convolution process is applied, and probability maps of each class are output.
PSPNet is the method that won the “Scene parsing” category of ILSVRC held in 2016. Also, high
accuracy has been achieved with the Cityscapes Dataset taken with a dashboard
camera. Fig3-10 shows the result of PSPNet-based semantic segmentation.
Figure3-10 Example of PSPNet-based Semantic Segmentation Results (cited
from Reference [11]).
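A minimal sketch of the Pyramid Pooling Module idea is shown below in PyTorch; the per-level convolution that PSPNet applies after pooling is omitted for brevity, and the feature-map size is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def pyramid_pooling(feature_map, bin_sizes=(1, 2, 3, 6)):
    """Pool the encoder feature map at several grid sizes, upsample each
    result back to the input size, and concatenate along the channels."""
    n, c, h, w = feature_map.shape
    pooled = [feature_map]
    for b in bin_sizes:
        p = F.adaptive_avg_pool2d(feature_map, output_size=b)        # b x b grid
        p = F.interpolate(p, size=(h, w), mode="bilinear", align_corners=False)
        pooled.append(p)
    return torch.cat(pooled, dim=1)

fmap = torch.randn(1, 512, 60, 60)       # encoder output (1/8 of a 480x480 image)
print(pyramid_pooling(fmap).shape)       # torch.Size([1, 2560, 60, 60])
```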
3.7.4 CNN for ADAS application:
Machine learning techniques are applicable to system intelligence implementation in
ADAS (Advanced Driver Assistance Systems). The purpose of ADAS is to provide the
driver with up-to-date information about the surroundings, obtained by sonar,
radar, and cameras. Although ADAS typically utilizes radar and sonar for long-range
detection, CNN-based systems can now play a significant role in pedestrian
detection, lane detection, and redundant object detection at moderate distances.
For autonomous driving, the core components can be categorized into perception,
localization, planning, and control. Perception refers to the understanding of the
environment, such as where obstacles are located, detection of road signs and
markings, and categorizing objects by their semantic labels such as pedestrians,
bikes, and vehicles.
Localization refers to the ability of the autonomous vehicle to determine its position
in the environment. Planning refers to the process of making decisions in order to
achieve the vehicle's goals, typically to bring the vehicle from a start location to a
goal location while avoiding obstacles and optimizing the trajectory. Finally, the
control refers to the vehicle's ability to execute the planned actions. CNN-based
object detection is suitable for perception because it can handle multi-class objects.
Also, semantic segmentation provides useful information for decision-making in
planning, for example for avoiding obstacles by referring to the pixels categorized as
road.
3.8 Deep learning-based autonomous driving:
This section introduces end-to-end learning, which can infer the control values of the
vehicle directly from the input image, as a use of deep learning for autonomous
driving, and describes the visual explanation of judgment grounds, which is a problem
of deep learning models, as well as future challenges.
3.8.1 End-to-end learning-based autonomous driving:
In most of the research on autonomous driving, the environment around the vehicle
is understood using a dashboard camera and Light Detection and Ranging (LiDAR),
appropriate traveling position is determined by motion planning, and the control
value of the vehicle is determined. Autonomous driving based on these three
processes is common, and deep learning-based object detection and semantic
segmentation, introduced earlier in this chapter, are beginning to be used to understand the
surrounding environment. On the other hand, with the progress in CNN research,
end-to-end learning-based methods have been proposed that can infer the control
values of the vehicle directly from the input image.
In these methods, the network is trained using dashboard-camera images recorded
while a person drives, with the vehicle control value corresponding to each frame
as training data. End-to-end learning-based autonomous driving control has the
advantage that the system configuration is simplified because CNN learns
automatically and consistently without explicit understanding of the surrounding
environment and motion planning.
To this end, Bojarski et al. proposed an end-to-end learning method for autonomous
driving, which inputs dashboard-camera images into a CNN and outputs the steering
angle directly. Starting from this work, several studies have been conducted: a method
considering the temporal structure of a dashboard-camera video, and a method to
train the CNN with a driving simulator and use the trained network to control the
vehicle in a real environment. These methods basically control only the steering angle,
while the throttle (i.e., accelerator and brake) is controlled by a human.
The autonomous driving model in Reference infers not only the steering but also the
throttle as the control values of the vehicle. The network structure is composed of
five convolutional layers with pooling and three fully-connected layers. In addition,
the inference takes the vehicle's own state into consideration by giving the vehicle
speed to the fully-connected layers in addition to the dashboard images, since it is
necessary to infer the change of the vehicle's own speed for throttle control. In this
manner, high-precision control of steering and throttle can be achieved in various
driving scenarios.
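A minimal PyTorch sketch of such an end-to-end network is shown below; the five convolutional layers, three fully-connected layers, and the concatenation of the vehicle speed follow the description above, but the exact layer sizes and the 66 × 200 input resolution are illustrative assumptions borrowed from the well-known NVIDIA-style architecture rather than the model of the cited reference.

```python
import torch
import torch.nn as nn

class DrivingNet(nn.Module):
    """Sketch: five conv layers extract features from the dashboard image,
    the vehicle speed is concatenated before the fully connected layers,
    and the network outputs steering and throttle."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 24, 5, stride=2), nn.ReLU(),
            nn.Conv2d(24, 36, 5, stride=2), nn.ReLU(),
            nn.Conv2d(36, 48, 5, stride=2), nn.ReLU(),
            nn.Conv2d(48, 64, 3), nn.ReLU(),
            nn.Conv2d(64, 64, 3), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64 * 1 * 18 + 1, 100), nn.ReLU(),   # +1 for the speed input
            nn.Linear(100, 50), nn.ReLU(),
            nn.Linear(50, 2),                             # steering, throttle
        )

    def forward(self, image, speed):
        x = self.features(image)
        x = torch.cat([x, speed], dim=1)   # add the vehicle's own state
        return self.head(x)

out = DrivingNet()(torch.randn(1, 3, 66, 200), torch.tensor([[5.0]]))
print(out.shape)    # torch.Size([1, 2])
```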
3.8.2 Visual explanation of end-to-end learning:
CNN-based end-to-end learning has a problem in that the basis of the output control
value is not known. To address this problem, research is being conducted on
approaches that present the judgment grounds (such as turning the steering wheel to
the left or right, or stepping on the brakes) in a form that can be understood by humans.
The common approach to clarify the reason of the network decision-making is a
visual explanation. Visual explanation method outputs an attention map that
visualizes the region in which the network focused as a heat map. Based on the
obtained attention map, we can analyze and understand the reason of the decision-
making. To obtain more explainable and clearer attention map for efficient visual
explanation, a number of methods have been proposed in the computer vision field.
Class activation mapping (CAM) generates attention maps by weighting the feature
maps obtained from the last convolutional layer in a network. Gradient-weighted
class activation mapping (Grad-CAM) is another common method, which generates
an attention map using gradient values calculated in the backpropagation process.
This method is widely used for general analysis of CNNs because it can be applied
to any network. Fig. 3-11 shows example attention maps of CAM and Grad-CAM.
Figure3-11 Attention maps of CAM and Grad-CAM (cited from reference [22]).
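The Grad-CAM idea can be sketched in a few lines of PyTorch: weight the feature maps of the last convolutional block by the gradient of the class score and sum them into a heat map. The choice of ResNet-18 (with random weights) and of its last block as the target layer are illustrative assumptions, not the networks discussed above.

```python
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(weights=None).eval()     # example model, random weights

image = torch.randn(1, 3, 224, 224)              # dummy input image
# forward pass up to the last convolutional block, keeping its output
x = model.conv1(image)
x = model.maxpool(model.relu(model.bn1(x)))
x = model.layer3(model.layer2(model.layer1(x)))
feats = model.layer4(x)                          # (1, 512, 7, 7) feature maps
feats.retain_grad()                              # keep gradients for the map
scores = model.fc(torch.flatten(model.avgpool(feats), 1))
scores[0, scores.argmax()].backward()            # gradient of the top class

weights = feats.grad.mean(dim=(2, 3), keepdim=True)        # channel importance
cam = F.relu((weights * feats).sum(dim=1, keepdim=True))   # weighted sum of maps
cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
print(cam.shape)                                 # (1, 1, 224, 224) attention map
```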
Visual explanation methods have been developed for general image recognition
tasks, and visual explanations for autonomous driving have also been proposed.
VisualBackProp was developed to visualize the intermediate values in a CNN; it
accumulates the feature maps of each convolutional layer into a single map.
This enables us to understand where the network highly responds to the input
image. Reference proposes a Regression-type Attention Branch Network in which
a CNN is divided into a feature extractor and a regression branch, as shown in
Fig3-12, with an attention branch inserted that outputs an attention map that
serves as a visual explanation. By providing the vehicle speed to the fully connected
layers and training each branch of the Regression-type Attention Branch Network
end to end, control values for steering and throttle can be output for various scenes,
together with an attention map that describes the locations on the input image from
which the control values were derived.
Fig3-13 shows an example of visualization of attention map during Regression-type
Attention Branch Network-based autonomous driving. S and T in the figure are the
steering value and the throttle value, respectively. Fig. 3-13 (a) shows a scene where
the road curves to the right; there is a strong response to the center line of the road,
and the steering output value is a positive value indicating the right direction.
On the other hand, Fig3-13 (b) is a scene where the road curves to the left, the
steering output value is a negative value indicating the left direction, and the
attention map responds strongly to the white line on the right. By visualizing the
attention map in this way, it can be said that the center line of the road and the
position of the lane are observed for estimation of the steering value. Also, in the
scene where the car stops as shown in Fig3-13 (c), the attention map strongly
responds to the brake lamp of the vehicle ahead. The throttle output is 0, which
indicates that the accelerator and the brake are not pressed. Therefore, it is
understood that the condition of the vehicle ahead is closely watched in the
determination of the throttle. In addition, the night travel scenario in Fig3-13 (d)
shows a scene of following a car ahead, and it can be seen that the attention map
strongly responds to the car ahead because the road shape ahead is unknown. It is
possible to visually explain the judgment grounds through output of attention map
in this way.
Figure3-12 Regression-type Attention Branch Network (cited from reference [17]).
Figure3-13 Attention map-based visual explanation for self-driving.
3.8.3 Future challenges:
The visual explanations enable us to analyze and understand the internal state of
deep neural networks, which is useful for engineers and researchers. One of the
future challenges is explanation for end users, i.e., passengers on a self-driving
vehicle. In case of fully autonomous driving, for instance, when lanes are suddenly
changed even when there are no vehicles ahead or on the side, the passenger in the
car may be concerned as to why the lanes were changed. In such cases, the
attention map visualization technology introduced in Section 3.8.2 enables people to
understand the reason for changing lanes. However, visualizing the attention map in
a fully automated vehicle does not make sense unless a person on the autonomous
vehicle always sees it. A person in an autonomous car, that is, a person who receives
the full benefit of AI, needs to be informed of the judgment grounds in the form of
text or voice stating, “Changing to the left lane as a vehicle is approaching from the
rear at speed.” Transitioning from recognition results and visual explanations to
verbal explanations will be a challenge to confront in the future. Although several
attempts have been made for this purpose, they do not yet achieve sufficient
accuracy or flexible verbal explanations.
Also, in the more distant future, such verbal explanation functions may eventually no
longer be needed. At first, people who receive the full benefit of autonomous driving
may find it difficult to accept, but a sense of trust will gradually be created by
repeated verbal explanations. Thus, once confidence is established between the
autonomous driving AI and the person, the verbal explanation functions will not be
required, and it can be expected that AI-based autonomous driving will be widely and
generally accepted.
3.9 Conclusion:
This chapter explained how deep learning is applied to image recognition tasks and
introduced the latest image recognition technology using deep learning. Image
recognition using deep learning is the problem of finding an appropriate mapping
function from a large amount of data and teacher labels. Further, it is possible to
solve several problems simultaneously by using multitask learning.
Future prospects include not only “recognition” of input images, but also high
expectations for the development of end-to-end learning and deep reinforcement
learning technologies for “judgment” and “control” of autonomous vehicles.
Moreover, citing judgment grounds for output of deep learning and deep
reinforcement learning is a major challenge in practical application, and it is
desirable to expand from visual explanation to verbal explanation through
integration with natural language processing.
Object Detection
Chapter 4
4.1 Introduction:
Object detection is a computer vision task that involves two main tasks:
1) Localizing one or more objects within an image, and
2) Classifying each object in the image
This is done by drawing a bounding box around the identified object with its
predicted class. This means that the system doesn’t just predict the class of the
image like in image classification tasks. It also predicts the coordinates of the
bounding box that fits the detected object.
It is a challenging computer vision task because it requires both successful object
localization in order to locate and draw a bounding box around each object in an
image, and object classification to predict the correct class of object that was
localized.
Image Classification vs. Object Detection tasks.
Figure4-1 Classification and Object Detection.
Object detection is widely used in many fields. For example, in self-driving
technology, we need to plan routes by identifying the locations of vehicles,
pedestrians, roads, and obstacles in the captured video image. Robots often
perform this type of task to detect targets of interest. Systems in the security field
need to detect abnormal targets, such as intruders or bombs.
Now that you understand what object detection is and what differentiates it from
image classification tasks, let’s take a look at the general framework of object
detection projects.
1. First, we will explore the general framework of the object detection algorithms.
2. Then, we will dive deep into three of the most popular detection algorithms:
a. R-CNN family of networks
b. SSD
c. YOLO family of networks [3]
4.2 General object detection framework:
Typically, there are four components of an object detection framework.
4.2.1 Region proposal:
An algorithm or a deep learning model is used to generate regions of interest (ROI)
to be further processed by the system. These region proposals are regions that the
network believes might contain an object; this step outputs a large number of
bounding boxes, each with an objectness score. Boxes with a large objectness score
are then passed along to the next network layers for further processing.
4.2.2 Feature extraction and network predictions:
Visual features are extracted for each of the bounding boxes; they are evaluated,
and it is determined whether and which objects are present in the proposals based
on these features (i.e., an object classification component).
4.2.3 Non-maximum suppression (NMS):
At this step, the model has likely found multiple bounding boxes for the same
object. Non-max suppression helps avoid repeated detection of the same instance
by combining overlapping boxes into a single bounding box for each object.
4.2.4 Evaluation metrics:
Similar to accuracy, precision, and recall metrics in image classification tasks, object
detection systems have their own metrics to evaluate their detection performance.
In this section we will explain the most popular metrics like mean average precision
(mAP), precision-recall curve (PR curve), and intersection over union (IoU).
Now, let’s dive one level deeper into each one of these components to build an
intuition on what their goals are.
4.3 Region proposals:
In this step, the system looks at the image and proposes regions of interest for
further analysis. The regions of interest (ROI) are regions that the system believes
have a high likelihood of containing an object; this likelihood is called the objectness
score. Regions with a high objectness score are passed to the next steps, whereas
regions with a low score are abandoned.
Figure4-2 Low and high objectness score.
4.3.1 approaches to generate region proposals:
Originally, the ‘selective search’ algorithm was used to generate object proposals.
Other approaches use more complex visual features extracted by a deep neural
network from the image to generate regions (for example, based on the features
from a deep learning model). This step produces a lot (thousands) of bounding
boxes to be further analyzed and classified by the network. If the objectness score
is above a certain threshold, then this region is considered a foreground and
pushed forward in the network. Note that this threshold is configurable based on
your problem. If the threshold is too low, your network will exhaustively generate
all possible proposals and you will have a better chance of detecting all objects in the
image.
On the flip side, this will be very computationally expensive and will slow down
your detections. So the trade-off made in region proposal generation is the number
of regions vs. the computational complexity, and the right approach is to use
problem-specific information to reduce the number of ROIs.
Figure4.3 An example of selective search applied to an image. A threshold can be tuned in the SS
algorithm to generate more or fewer proposals.
Network predictions:
This component includes the pretrained CNN that is used for feature extraction: it
extracts features from the input image that are representative for the task at hand
and uses these features to determine the class of the image.
In object detection frameworks, people typically use pretrained image classification
models to extract visual features, as these tend to generalize fairly well. For
example, a model trained on the MS COCO or ImageNet dataset is able to extract
fairly generic features.
In this step, the network analyzes all the regions that have been identified with high
likelihood of containing an object and makes two predictions for each region:
Bounding box prediction:
the coordinates that locate the box surrounding the object. The bounding box
coordinates are represented as the tuple (x, y, w, h), where x and y are the
coordinates of the center point of the bounding box and w and h are the width and
height of the box.
Class prediction:
this is the classic softmax function that predicts the class probability
for each object.
Since there are thousands of proposed regions, each object will always have
multiple bounding boxes surrounding it with the correct classification.
We just need one bounding box for each object for most problems. But what if
we are building a system to count dogs in an image? Our current system would count
5 dogs, because the detector predicts 5 bounding boxes for the single dog in the
image. We don’t want that. This is where the non-maximum suppression technique
comes in handy.
Figure4.4 Class prediction.
4.3.2 Non-maximum suppression (NMS):
One of the problems of object detection algorithms is that they may find multiple
detections of the same object. So, instead of creating only one bounding box
around the object, they draw multiple boxes for the same object. Non-maximum
suppression (NMS) is a technique used to make sure that the detection algorithm
detects each object only once. As the name implies, the NMS technique looks at all
the boxes surrounding an object to find the box that has the maximum prediction
probability and suppresses or eliminates the other boxes (hence the name,
non-maximum suppression).
Figure4-5 Predictions before and after NMS.
4.4 Steps of how the NMS algorithm works:
1) Discard all bounding boxes that have prediction probabilities less than a certain
threshold, called the confidence threshold. This threshold is tunable; a box is
suppressed if its prediction probability is less than the set threshold.
2) Look at all the remaining boxes and select the bounding box with the highest
probability.
3) Then calculate the overlap of the remaining boxes that have the same class
prediction with the selected box. This overlap metric is called Intersection Over
Union (IOU); boxes that have a high overlap with the selected box and predict the
same class are considered detections of the same object.
4) The algorithm then suppresses any box whose IOU with the selected box is larger
than a certain threshold (called the NMS threshold). Usually the NMS threshold is
equal to 0.5, but it is tunable as well if you want to output fewer or more
bounding boxes.
NMS techniques are typically standard across the different detection frameworks.
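As a concrete illustration of the steps above, here is a minimal class-agnostic NMS sketch in Python with NumPy; the corner-coordinate box format and the default thresholds are assumptions, and per-class NMS would simply run this once per predicted class.

```python
import numpy as np

def nms(boxes, scores, conf_thresh=0.5, iou_thresh=0.5):
    """Minimal NMS over boxes given as an (N, 4) array of [x1, y1, x2, y2]."""
    keep = []
    order = np.argsort(-scores)                      # highest score first
    order = order[scores[order] >= conf_thresh]      # step 1: confidence filter
    while order.size > 0:
        best = order[0]                              # step 2: best remaining box
        keep.append(int(best))
        rest = order[1:]
        # steps 3-4: IoU of the best box with every remaining box
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_best = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
        area_rest = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_best + area_rest - inter)
        order = rest[iou < iou_thresh]               # suppress overlapping boxes
    return keep

boxes = np.array([[10, 10, 60, 60], [12, 12, 62, 62], [100, 100, 160, 160]], dtype=float)
scores = np.array([0.9, 0.8, 0.75])
print(nms(boxes, scores))    # [0, 2]: the near-duplicate box 1 is suppressed
```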
4.4.1 Object detector evaluation metrics:
When evaluating the performance of an object detector, we use two main evaluation
metrics:
FRAMES PER SECOND (FPS) TO MEASURE THE DETECTION SPEED:
The most common metric used to measure the detection speed is the number of
frames per second (FPS). For example, Faster R-CNN operates at only 7 FPS whereas
SSD operates at 59 FPS.
MEAN AVERAGE PRECISION (MAP):
The most common evaluation metric that is used in object recognition tasks is
‘mAP’, which stands for mean average precision. It is a percentage from 0 to 100
and higher values are typically better, but its value is different from the accuracy
metric in classification. To understand mAP, we need to understand Intersection
Over Union (IOU) and the Precision-Recall Curve (PR Curve).
4.4.1.3 Intersection Over Union (IOU):
It is a measure that evaluates the overlap between two bounding boxes: the
ground truth bounding box Bground truth (the box that we feed the network to
train on) and the predicted bounding box Bpredicted (the output of the network).
By applying the IOU, we can tell if a detection is valid (True Positive) or not (False
Positive).
Figure4-6 Intersection Over Union (EQU)
Figure4-7 Ground-truth and
predicted box.
The intersection over union value ranges from 0, meaning no overlap at all, to 1,
which means that the two bounding boxes overlap 100%. The higher the overlap
between the two bounding boxes (IOU value), the better.
To calculate the IoU of a prediction, we need the ground-truth bounding box
(Bground truth), the hand-labeled bounding box created during the labeling process,
and the predicted bounding box (Bpredicted) from our model.
IoU is used to define a “correct prediction”: a “correct” prediction (True Positive) is
one that has an IoU greater than some threshold. This threshold is a tunable value
depending on the challenge, but 0.5 is a standard value. For example, challenges
like MS COCO use mAP@0.5 (meaning IOU threshold = 0.5) or mAP@0.75 (meaning
IOU threshold = 0.75). This means that if the IoU is above the threshold, the
detection is considered a True Positive (TP), and if it is below, it is considered a
False Positive (FP).
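A minimal Python implementation of this IoU computation is shown below; the boxes are assumed to be given in corner coordinates [x1, y1, x2, y2], and the example values are made up purely for illustration.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as [x1, y1, x2, y2] (corner coordinates)."""
    # coordinates of the intersection rectangle
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

ground_truth = [50, 50, 150, 150]
predicted = [60, 60, 170, 160]
print(iou(ground_truth, predicted))   # about 0.63, a True Positive at IoU > 0.5
```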
4.5 Precision-Recall Curve (PR Curve):
Precision:
is a metric that quantifies the number of correct positive predictions made. It is
calculated as the number of true positives divided by the total number of true
positives and false positives.
Precision = TruePositives / (TruePositives + FalsePositives)
Recall:
is a metric that quantifies the number of correct positive predictions made out of
all positive predictions that could have been made. It is calculated as the number of
true positives divided by the total number of true positives and false negatives (e.g.
it is the true positive rate).
Recall = TruePositives / (TruePositives + FalseNegatives)
The result is a value between 0.0 for no recall and 1.0 for full or perfect recall. Both
the precision and the recall are focused on the positive class (the minority class)
and are unconcerned with the true negatives (majority class).
PR Curve:
Plot of Recall (x) vs Precision (y). A model with perfect skill is depicted as a point at
a coordinate of (1,1). A skillful model is represented by a curve that bows towards a
coordinate of (1,1). A no-skill classifier will be a horizontal line on the plot with a
precision that is proportional to the number of positive examples in the dataset.
For a balanced dataset this will be 0.5.
Figure4-8 PR Curve.
A detector is considered good if its precision stays high as recall increases, which
means that if you vary the confidence threshold, the precision and recall will still be
high. On the other hand, a poor detector needs to increase the number of FPs
(lower precision) in order to achieve a high recall. That's why the PR curve usually
starts with high precision values, decreasing as recall increases.
Now, that we have the PR curve, we calculate the AP (Average Precision) by
calculating the Area Under the Curve (AUC). Then finally, mAP for object detection
is the average of the AP calculated for all the classes. It is also important to note
that some papers use AP and mAP interchangeably.
4.6 Conclusion:
To recap, the mAP is calculated as follows:
1. Each bounding box will have an objectness score associated (probability of
the box containing an object).
2. Precision and recall are calculated.
3. The precision-recall curve (PR curve) is computed for each class by varying the
score threshold.
4. Calculate the average precision (AP): it is the area under the PR curve. In this
step, the AP is computed for each class.
5. Calculate mAP: the average AP over all the different classes.
Most deep learning object detection implementations handle computing mAP
for you.
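As a simplified sketch of steps 3–5, the snippet below approximates AP as the area under a PR curve using the trapezoidal rule and averages it over classes; real benchmarks typically use interpolated precision, and the PR points here are made-up illustrative values.

```python
import numpy as np

def average_precision(precisions, recalls):
    """Area under the precision-recall curve by the trapezoidal rule,
    assuming the points are sorted by increasing recall."""
    return np.trapz(precisions, recalls)

# illustrative PR points for one class, obtained by varying the score threshold
recalls    = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
precisions = np.array([1.0, 0.95, 0.9, 0.8, 0.6, 0.45])
ap = average_precision(precisions, recalls)

# mAP is simply the mean of the per-class APs
ap_per_class = [ap, 0.7, 0.65]          # e.g. three classes (made-up values)
print(ap, sum(ap_per_class) / len(ap_per_class))
```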
Now, that we understand the general framework of object detection algorithms,
let’s dive deeper into three of the most popular detection algorithms.
4.7 Region-Based Convolutional Neural Networks (R-CNNs) [high mAP
and low FPS]:
Developed by Ross Girshick et al. in 2014 in their paper “Rich feature hierarchies for
accurate object detection and semantic
segmentation”.
The R-CNN family has since expanded to include Fast R-CNN and Faster R-CNN, which
came out in 2015 and 2016.
R-CNN:
R-CNN is the least sophisticated region-based architecture in its family, but it is the
basis for understanding how the rest of the family works. It was one of the first large
and successful applications of convolutional neural networks to the problem of object
detection and localization, and it paved the way for the other, more advanced
detection algorithms.
Figure4-9 Regions with CNN features.
The R-CNN model is comprised of four components:
Extract regions of interest (RoI):
- also known as extracting region proposals. These are regions that have a
high probability of containing an object. The way this is done is by using an
algorithm, called Selective Search, to scan the input image to find regions
that contain blobs and propose them as regions of interest to be processed
by the next modules in the pipeline. The proposed regions of interest are
then warped to have a
fixed size, because they usually vary in size and CNNs require a fixed input
image size.
What is Selective Search?
Selective search is a greedy search algorithm that is used to provide region
proposals that potentially contain objects. It tries to find the areas that might
contain an object by combining similar pixels and textures into several rectangular
boxes. Selective Search combines the strength of both exhaustive search algorithm
(which examines all possible locations in the image) and bottom-up segmentation
algorithm (that hierarchically groups similar regions) to capture all possible object
locations.
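For reference, OpenCV's contrib module exposes a ready-made selective search implementation; the snippet below is a minimal sketch that assumes the opencv-contrib-python package is installed and that an image file named input.jpg exists.

```python
import cv2

# Generate region proposals with OpenCV's selective search implementation.
image = cv2.imread("input.jpg")                  # assumed example image
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(image)
ss.switchToSelectiveSearchFast()                 # trade quality for speed
rects = ss.process()                             # proposals as (x, y, w, h)
print(len(rects), "region proposals")            # typically thousands of boxes
```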
Figure4-9 Input. Figure4-10 Output.
Feature Extraction module
- we run a pretrained convolutional network on top of the region proposals to
extract features from each candidate region. This is the typical CNN feature
extractor.
Classification module
- train a classifier like Support Vector Machine (SVM), a traditional machine
learning algorithm, to classify candidate detections based on the extracted
features from the previous step.
Localization module
- also known as, bounding box regressor. Let’s take a step back to understand
regression. Machine learning problems are categorized as classification
and regression problems. Classification algorithms output discrete, predefined
classes (dog, cat, and elephant), whereas regression algorithms output continuous-value
predictions. In this module, we want to predict the location and size of the
bounding box that surrounds the object. The bounding box is represented by
identifying four values: the x and y coordinates of the box’s origin (x, y), the width,
and the height of the box (w, h). Putting this together, the regressor predicts the
four real-valued numbers that define the bounding box as the following tuple (x,y,
w, h).
Figure4-11 R-CNN architecture. Each proposed ROI is passed through the CNN to
extract features then an SVM.
4.7.1 HOW DO WE TRAIN R-CNNS?
R-CNNs are composed of four modules: selective search region proposal, feature
extractor, classifier, and bounding box regressor. All the R-CNN modules need to be
trained except for the selective search algorithm. So, in order to train R-CNNs, we
need to train the following modules:
1. Feature extractor CNN:
this is a typical CNN training process. Here, we either train a network from scratch
(which rarely happens) or fine-tune a pretrained network, as we learned in the
transfer learning part.
2. Train the SVM classifier:
this is the Support Vector Machine algorithm, a traditional machine learning
classifier that is no different from deep learning classifiers in the sense that it needs
to be trained on labeled data.
3. Train the bounding box regressors:
another model that outputs four real-valued numbers for each of the K object
classes for tightening region bounding boxes.
Looking through the R-CNN learning steps, you can easily see that training an
R-CNN model is expensive and slow. The training process involves training three
separate modules
without much shared computation. This multi-stage pipeline training is one of the
disadvantages of R-CNNs as we will see next.
4.7.2 WHAT ARE THE DISADVANTAGES OF R-CNN?
1. Very slow object detection:
the selective search algorithm proposes about 2,000 regions of interest per image
to be examined by the entire pipeline (CNN feature extractor + classifier). This is
very computationally expensive because a ConvNet forward pass is performed for
each object proposal without sharing computation, which makes it incredibly slow,
since the Selective Search algorithm extracts thousands of regions that need to be
investigated for our objects. This high computational cost makes R-CNN a poor fit
for many applications, especially real-time applications that require very fast
inference, like self-driving cars and many others.
2. Training is a multi-stage pipeline:
as we discussed earlier, R-CNN requires the training of three modules: the CNN
feature extractor, the SVM classifier, and the bounding-box regressors, which makes
the training process very complex and not end-to-end.
3. Training is expensive in space and time:
when training the SVM classifier and the bounding-box regressor, features
are extracted from each object proposal in each image and written to disk.
With very deep networks, such as VGG16, the training
process of a few thousand images takes days using GPUs. The training
process is expensive in space as well because the extracted features require
hundreds of gigabytes of storage.
4.8 Fast R-CNN:
Fast R-CNN resembles the R-CNN technique in many ways, but improves on its
detection speed while also increasing detection accuracy through two main changes:
1. Instead of starting with the region’s proposal module then the feature
extraction module like R-CNN, Fast-RCNN proposes that we apply the CNN feature
extractor first to the entire input image then propose regions. This way we only
run one ConvNet over the entire image instead of 2000 ConvNets over 2000
overlapping regions.
2. Extend the ConvNet to do the classification part as well by replacing the
traditional machine learning algorithm, SVM, with a softmax layer. This way we
have only one model to perform both tasks: 1) feature extraction and 2) object
classification.
4.8.1 FAST R-CNN ARCHITECTURE:
Instead of training many different SVM algorithms to classify each object class,
there is a single softmax layer that outputs the class probabilities directly. Now there
is only one neural network to train, as opposed to one neural network and many SVMs.
The architecture of Fast R-CNN consists of the following modules:
1. Feature extractor module:
the network starts with a ConvNet to extract features from the full image.
2. Regions of Interest (RoI) extractor:
selective search algorithm to propose 2,000 regions candidates per image.
3. RoI Pooling layer - this is a new component that was introduced in Fast R-CNN
architecture to extract a fixed-size window from the feature map before feeding
the RoIs to the fully-connected layers. It uses max pooling to convert the features
inside any valid region of interest into a small feature map with a fixed spatial
extent of Height × Width (HxW). The RoI pooling layer will be explained in more
detail in the Faster R-CNN section, but for now, understand that the RoI pooling
layer is applied on the last feature map extracted from the CNN, and its goal is to
extract fixed-size regions of interest to feed them into the FC layers and then the
output layers (a minimal sketch of RoI pooling follows this list).
4. Two-head output layer:
the model branches into two heads:
a. A softmax classifier layer that outputs a discrete probability distribution per RoI
b. A bounding-box regressor layer to predict offsets relative to the original RoI
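The RoI pooling step mentioned in the list above can be sketched with torchvision's built-in operator; the feature-map size, the 400 × 400 input image implied by the spatial scale, and the example boxes are illustrative assumptions.

```python
import torch
from torchvision.ops import roi_pool

# Crop each proposed region out of the CNN feature map and max-pool it to a
# fixed 7 x 7 size so it can be fed to the fully connected layers regardless
# of the region's original size.
feature_map = torch.randn(1, 256, 50, 50)              # feature extractor output
rois = torch.tensor([[0, 10.0, 10.0, 200.0, 150.0],    # [batch_idx, x1, y1, x2, y2]
                     [0, 40.0, 60.0, 300.0, 380.0]])   # in input-image coordinates
pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=50 / 400)
print(pooled.shape)     # torch.Size([2, 256, 7, 7]) -> fixed size per RoI
```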
4.8.2 DISADVANTAGES OF FAST R-CNN:
The selective search algorithm for generating region proposals is very slow, and the
proposals are generated separately by another model.
4.9 Faster R-CNN:
Similar to Fast R-CNN, the image is provided as an input to a convolutional network
which provides a convolutional feature map. Instead of using a selective search
algorithm on the feature map to identify the region proposals, a network is used to
predict the region proposals as part of the training process, called Region Proposal
Network (RPN). The predicted region proposals are then reshaped using an RoI pooling
layer and used to classify the image within the proposed region and predict the
offset values for the bounding boxes. These improvements both reduce the
number of region proposals and accelerate the
test-time operation of the model to near real-time with then-state-of-the-art
performance.
4.9.1 FASTER R-CNN ARCHITECTURE:
The architecture of Faster R-CNN can be described by two main networks:
1. Region Proposal Network (RPN)
Selective search is replaced by a ConvNet that proposes regions of interest (RoI)
from the last feature maps of the feature extractor, to be considered for
investigations. RPN has two outputs; the “objectness score” (object or no object)
and the box location
2. Fast R-CNN
consists of the typical components of Fast R-CNN:
a. Base network for Feature extractor: a typical pre-trained CNN model to extract
features from the input image.
b. ROI pooling layer: to extract fixed-size regions of interest.
c. Output layer: contains 2 fully-connected layers:
1) a softmax classifier to output the class probability, and
2) a bounding-box regressor to refine the bounding box predictions.
Figure4.12 Fast R-CNN.
Faster R-CNN architecture has two main components:
a. A region proposal network (RPN):
which identifies regions that may contain objects of interest and their
approximate location.
b. A Fast R-CNN network:
which classifies objects and refines their location, defined using bounding boxes.
The two components share the convolutional layers of the pre-trained VGG-16. As
you can see in the Faster R-CNN architecture diagram (Figure), the input image is
presented to the network and its features are extracted via a pre-trained CNN.
These features, in parallel, are sent to two different components of the Faster R-
CNN architecture:
1. The RPN to determine where in an image a potential object could be. At this point
we do not know what the object is, just that there is potentially an object at a
certain location in the image.
2. ROI pooling to extract fixed-size windows of features.
This architecture achieves an end-to-end trainable, complete object detection
pipeline where all the required components take place inside the network, including:
1. Base network feature extractor
2. Region proposal
3. ROI pooling
4. Object classification
5. Bounding box regressor
BASE NETWORK TO EXTRACT FEATURES:
Similar to Fast R-CNN, the first step is using a pretrained CNN and slice off its
classification part. The base network is used to extract features from the input
image. In this component, you can use any of the popular CNN architectures based
on the problem that you are trying to solve.
For example, MobileNet, a smaller and efficient network architecture optimized for
speed, has approximately 3.3M parameters, while ResNet-152 (152 layers), once
the state of the art in the ImageNet classification competition, has around 60M.
Most recently, new architectures like DenseNet are both improving results while
lowering the number of parameters.
REGION PROPOSAL NETWORK (RPN):
The region proposal network identifies regions that could potentially contain
objects of interest, based on the last feature map of the pre-trained convolutional
neural network. RPN is also known as the ‘attention network’ because it guides the
network's attention to interesting regions in the image. Faster R-CNN uses the Region
Proposal Network (RPN) to bake the region
proposal directly into the R-CNN architecture instead of running a Selective Search
algorithm to extract regions of interest.
Figure4-14 The RPN classifier predicts the objectness score which is the probability
of an image containing an object (foreground) or a background.
Fully-convolutional networks (FCN):
One important aspect of object detection networks is that they should be fully
convolutional. A fully convolutional neural network means that the network does
not contain any fully-connected (FC) layers, typically found at the end of a network
prior to making output predictions.
In the context of image classification, removing the fully-connected layers is
normally accomplished by applying average pooling across the entire volume prior
to a single dense softmax classifier used to output the final predictions.
A fully-convolutional network (FCN) has two main benefits:
1. Faster because it contains only convolution operations and no FC layers.
2. Able to accept images of any spatial resolution (width and height), provided that
the image and network can fit into memory, of course.
Note:
Being an FCN makes the network invariant to the size of the input image. However,
in practice, we might want to stick to a constant input size due to various problems
that only show their heads when we are implementing the algorithm. A big one
amongst these problems is that if we want to process our images in batches(images
in batches can be processed in parallel by the GPU, leading to speed boosts), we
need to have all images of fixed height and width.
HOW DOES THE REGRESSOR PREDICT THE BOUNDING BOX?
To answer this question, let’s first define the bounding box. It is the box that
surrounds the object and is identified by the tuple (x, y, w, h), where x and y are
the coordinates in the image that describe the center of the bounding box, and h
and w are the height and width of the bounding box. Researchers
found that defining the (x, y) coordinates of the center point could be challenging
because we have to enforce some rules to make sure that the network predicts
values inside the boundaries of the image. Instead, we can create reference boxes
called anchor boxes in the image and make the regression layer predict the offsets
from these boxes called deltas (Δx, Δy, Δw, Δh) to adjust the anchor boxes to better
fit the object to get the final proposals.
Figure 4-15 Anchor boxes at the center of each sliding window. IoU is calculated to
select the bounding box that overlaps the most with the ground truth.
Anchor boxes:
Using a sliding window approach, the RPN generates k regions for each location in the
feature map. These regions are represented as boxes called anchor boxes. The
anchors are all centered in the middle of their corresponding sliding window, and
differ in terms of scale and aspect ratio to cover a wide variety of objects. These are
fixed bounding boxes that are placed throughout the image to be used for reference
when first predicting object locations. In their paper, Ross Girshick et al. generated
9 anchor boxes that all have the same center but with 3 different aspect ratios and
3 different scales, as shown below.
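A minimal sketch of generating such anchors is shown below; the base size, scales, and aspect ratios are illustrative values in the spirit of the description above, not the exact settings of the Faster R-CNN paper.

```python
import numpy as np

def generate_anchors(base_size=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Generate k = 9 anchor boxes (3 scales x 3 aspect ratios) centered on
    one sliding-window position, as (x1, y1, x2, y2) around the origin."""
    anchors = []
    for scale in scales:
        for ratio in ratios:
            w = base_size * scale * np.sqrt(ratio)   # wider for larger ratio
            h = base_size * scale / np.sqrt(ratio)   # taller for smaller ratio
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

print(generate_anchors().shape)   # (9, 4): nine anchors per feature-map location
```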
HOW DO WE TRAIN THE RPN?
The RPN is trained using bounding boxes labeled by human annotators; these labeled
boxes are called the ground truth. For each anchor box, the overlap probability value
(p) is computed, which indicates how much the anchor overlaps with the ground-truth
bounding boxes.
If an anchor has high overlap with a ground-truth bounding box, then it is likely that
the anchor box includes an object of interest, and it is labeled as positive with
respect to the object versus no object classification task.
Nowadays, ResNet architectures have mostly replaced VGG as a base network for
extracting features. The obvious advantage of ResNet over VGG is that it has many
more layers (it is deeper), giving it more capacity to learn very complex features. This is
true for the classification task and should be equally true in the case of object
detection. Also, ResNet makes it easy to train deep models with the use of residual
connections and batch normalization, which was not invented when VGG was first
released.
Figure4-16 R-CNN,Fast R-CNN,Faster R-CNN.
Comparison:
Figure4-17 Comparison between R-CNN,Fast R-CNN,Faster R-CNN.
4.10 Single Shot Detection (SSD) [Detection Algorithm UsedIn
Our Project]:
The paper about SSD: Single Shot MultiBox Detector was released in November
2016 by C. Szegedy et al. Single Shot Detection network reached new records in
terms of performance and precision for object detection tasks, scoring over 74%
mAP (mean Average Precision) at 59 frames per second (FPS) on standard datasets
such as PascalVOC and MS COCO.
4.10.1 Very important note:
The most common metric that is used to measure the detection speed is the
number of frames per second (FPS). For example,
Faster R-CNN operates at only 7 frames per second (FPS). There have been many
attempts to build faster detectors by attacking each stage of the detection pipeline
but so far, significantly increased speed comes only at the cost of significantly
decreased detection accuracy. In this section you will see why single-stage
networks like SSD can achieve faster detections that are more suitable for real-time
detections.
For benchmarking, SSD300 achieves 74.3% mAP at 59 FPS while SSD512 achieves
76.8% mAP at 22 FPS, which outperforms Faster R-CNN (73.2% mAP at 7 FPS).
SSD300 refers to the size of the input image of 300x300 and SSD512 refers to an
input image of size = 512x512.
4.10.2 SSD IS A SINGLE-STAGE DETECTOR:
The R-CNN family is a multi-stage detector: the network first predicts the objectness
score of the bounding box and then passes this box through a classifier to predict
the class probability.
In single-stage detectors like SSD and YOLO (explained later), the convolutional
layers make both predictions directly in one shot, hence the name Single Shot
Detector. The image is passed once through the network, and the objectness score
for each bounding box is predicted using logistic regression to indicate the level of
overlap with the ground truth. If the bounding box overlaps 100% with the ground
truth, the objectness score is 1, and if there is no overlap, the objectness score is 0.
We then set a threshold value (0.5) that says: “if the objectness score is
above 50%, then this bounding box likely has an object of interest and we get
predictions, if it is less than 50%, we ignore the prediction.”
4.10.3 High level SSD architecture:
Figure4-18 SSD architecture.
The single-shot name comes from the fact that SSD is a single-stage detector: it
doesn’t follow the R-CNN approach of having two separate stages, one for region
proposals and one for detections.
The SSD approach is based on a feed-forward convolutional network that produces
a fixed-size collection of bounding boxes and scores for the presence of object class
instances in those boxes, followed by a non-maximum suppression step to produce
the final detections.
The architecture of the SSD model is composed of three main parts:
1. Base network to extract feature maps:
a standard pretrained network used for high-quality image classification and truncated before any classification layers. In their paper, the authors used the VGG16 network. Other networks like VGG19 and ResNet can be used and should produce good results.
2. Multi-scale feature layers:
a series of convolution filters are added after the base network. These layers
decrease in size progressively to allow predictions of detections at multiple scales.
3. Non-maximum suppression:
to eliminate overlapping boxes and keep only one box for each object
detected.
What does the output prediction look like?
For each bounding box, the network predicts the following:
● 4 values that describe the bounding-box (x, y, w, h)
● + 1 value for the objectness score
● + C values that represent the probability of each class.
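To make the size of this output concrete, the small sketch below counts the values predicted for one feature map layer under this formulation; the class count, the number of priors per cell, and the feature map size are illustrative assumptions, not values fixed by the SSD paper.

# Illustrative numbers only: 7 classes, 4 priors per cell, a 38x38 feature map
num_classes = 7
values_per_box = 4 + 1 + num_classes      # box (x, y, w, h) + objectness + class scores
priors_per_cell = 4
feature_map_size = 38

total_values = feature_map_size * feature_map_size * priors_per_cell * values_per_box
print(values_per_box)    # 12 values per predicted box
print(total_values)      # 69312 values for this single feature map layer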
4.11 Base network:
As you can see from the SSD diagram, the SSD architecture builds on the VGG16
architecture after slicing off the fully connected classification layers (VGG16 is
explained in detail earlier). VGG16 was used as the base network because of its strong performance in high-quality image classification tasks and its popularity for problems where transfer learning helps improve results.
4.11.1 HOW DOES THE BASE NETWORK MAKE PREDICTIONS?
Consider the following example,
Suppose you have the image below (figure) and the network’s job is to draw
bounding boxes around all the boats in the image. The process goes as follows:
1. Similar to the anchors concept in R-CNN, SSD overlays a grid of anchors around
the image and for each anchor, the network will create bounding boxes at its
center. In SSD, anchors are called priors.
2. The base network looks at each bounding box as a separate image. Within each
bounding box, the network will ask the question: is there a boat in this box? In
other words, it will ask: did I extract any features of a boat in this box?
3. When the network finds a bounding box that contains boat features, it sends its coordinate predictions and object classification to the non-maximum suppression layer.
4. Non-maximum suppression will then eliminate all the boxes except the one that overlaps the most with the ground-truth bounding box.
A final note on the base network: the authors used VGG16 because of its strong performance in complex image classification tasks. You can use other networks, like the deeper VGG19 or ResNet, for the base network, and they should perform as well if not better in accuracy, but they could be slower if you choose to implement a deeper network. MobileNet is a good choice if you want to balance between a complex, high-performing deep network and speed.
Figure4-19 SSD Base Network looks at the anchor boxes to find features of a boat. Green (solid)
boxes indicate that the network has found boat features. Red (dotted) boxes indicate no boat
features.
4.12 Multi-scale feature layers:
They are convolutional feature layers added to the end of the truncated base network. These layers decrease in size progressively to allow predictions of detections at multiple scales.
As you can see, the base network might be able to detect the horses' features in the background, but it might fail to detect the horse that is closest to the camera. Can you see horse features in this bounding box in the figure? No. To deal with objects of different scales in an image, some methods suggest preprocessing the image at different sizes and combining the results afterwards. However, by using convolutional layers that vary in size, we can utilize feature maps from several different layers in a single network for prediction, mimicking the same effect while also sharing parameters across all object scales.
As the CNN gradually reduces the spatial dimensions, the resolution of the feature maps also decreases. SSD uses lower-resolution layers to detect larger-scale objects. For example, the 4x4 feature maps are used for larger-scale objects.
Figure4-20 Right image - lower resolution feature maps detect larger scale objects. Left image –
higher resolution feature maps detect smaller scale objects.
The multi-scale feature layers resize the image dimensions while keeping the bounding box sizes fixed, so that they can fit the larger horse. In reality, convolutional layers do not literally reduce the size of the image; this is just an illustration to help us intuitively understand the concept. The image is not simply resized; it actually goes through the convolutional process, so it won't look anything like itself anymore. It will look like a completely random image, but it preserves its features. Using multi-scale feature maps improves the network accuracy significantly. The table below shows a decrease in accuracy with fewer layers: it reports the accuracy for different numbers of feature map layers used for object detection.
Figure4-21 Accuracy with different numbers of feature map layers.
4.12.1 ARCHITECTURE OF THE MULTI-SCALE LAYERS:
The authors decided to add 6 convolutional layers that are decreasing in size. This
has been done by a lot of tuning and trial and error until they produced the best
results.
Figure4-22 Architecture of the multi-scale layers.
4.12.2 Non-maximum Suppression:
Given the large number of boxes generated by the detection layers per class during a forward pass of SSD at inference time, it is essential to prune most of the bounding boxes by applying non-maximum suppression: boxes with a low confidence score are discarded, boxes that overlap a higher-confidence prediction of the same class by more than a certain IoU threshold are suppressed, and only the top N predictions are kept. This ensures that only the most likely predictions are retained by the network, while the noisier ones are removed.
SSD sorts the predictions by their confidence scores. Starting from the top-confidence prediction, SSD checks whether any previously kept boxes of the same class have an IoU higher than 0.45 with the current prediction; if so, the current prediction is suppressed (the threshold value of 0.45 is set by the authors of the original paper).
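The sketch below shows a minimal greedy non-maximum suppression of this kind in NumPy. The [x1, y1, x2, y2] box format and the 0.45 IoU threshold follow the description above; everything else is illustrative rather than the exact SSD implementation.

import numpy as np

def nms(boxes, scores, iou_threshold=0.45):
    # boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidence scores
    order = scores.argsort()[::-1]               # sort predictions, highest confidence first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the current top box with all remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        # drop every remaining box that overlaps the kept box too much
        order = order[1:][iou < iou_threshold]
    return keep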
4.13 You Only Look Once (YOLO) [high speed but low mAP]:
The YOLO family of models is a series of end-to-end deep learning models designed for fast object detection, developed by Joseph Redmon et al., and is considered one of the first attempts to build a fast real-time object detector. It is one of the faster object detection algorithms out there. Though it is no longer the most accurate object detection algorithm, it is a very good choice when you need real-time detection without losing too much accuracy.
The creators of YOLO took a different approach than the previous networks. YOLO does not have a region proposal step like R-CNNs. Instead, it predicts over a limited number of bounding boxes by splitting the input into a grid of cells, and each cell directly predicts a bounding box and object classification. The result is a large number of candidate bounding boxes that are consolidated into a final prediction using non-maximum suppression.
4.13.1 YOLO versions:
Figure4-23 YOLO splits the image into grids, predicts objects for each grid cell, then uses NMS to finalize predictions.
Although the accuracy of these models is close to, but not as good as, that of Region-Based Convolutional Neural Networks (R-CNNs), they are popular for object detection because of their detection speed, often demonstrated in real time on video or camera feed input.
• YOLOv1:
YOLO: Unified, Real-Time Object Detection. It is called unified because it is a single detection network that unifies the two components of a detector: the object detector and the class predictor.
• YOLOv2:
YOLO9000 - Better, Faster, Stronger, capable of detecting over 9,000 object categories, hence the name YOLO9000. It was trained on the ImageNet and MS COCO datasets and achieved 16% mean Average Precision (mAP), which is not very good, but it was very fast at test time.
• YOLOv3:
An Incremental Improvement. YOLOv3 is significantly larger than the previous models and achieved an mAP of 57.9%, the best result yet out of the YOLO family of object detectors.
4.13.2 How YOLOv3 works:
• The YOLO network splits the input image into a grid of S×S cells. If the center of the ground truth box falls into a cell, that cell is responsible for detecting the existence of that object.
• Each grid cell predicts B bounding boxes along with their objectness scores and class predictions, as follows:
1. Coordinates of B bounding boxes:
similar to previous detectors, YOLO predicts 4 coordinates for each bounding box (bx, by, bw, bh), where x and y are offsets relative to the cell location.
2. Objectness score (P0):
indicates the probability that the cell contains an object. The objectness score is passed through a sigmoid function so it can be treated as a probability with a value between 0 and 1. It is calculated as follows:
P0 = Pr(containing an object) × IoU(pred, truth).
3. Class prediction:
if the bounding box contains an object, the network predicts the probability of each of K classes, where K is the total number of classes in your problem.
Figure4-24 YOLOv3 workflow.
4.13.3 Prediction across different scales:
• YOLOv3 has 9 anchors to allow prediction at 3 different scales per cell. The detection layer makes detections on feature maps of three different sizes, with strides 32, 16, and 8 respectively. This means that, with an input image of size 416 x 416, we make detections on scales of 13 x 13, 26 x 26, and 52 x 52. The 13 x 13 layer is responsible for detecting large objects, the 26 x 26 layer detects medium objects, and the 52 x 52 layer detects the smaller objects.
• This results in the prediction of 3 bounding boxes for each cell (B = 3). That's why in the figure you see the prediction feature map predicting box1, box2, and box3. The bounding box responsible for detecting the dog will be the one whose anchor has the highest IoU with the ground truth box.
• Detections at different layers help address the issue of detecting small objects, which was a frequent complaint about YOLOv2. The upsampling layers help the network preserve and learn fine-grained features, which are instrumental for detecting small objects.
Figure4-25 YOLOv3 output bounding boxes.
• For an input image of size 416 x 416, YOLO predicts ((52 x 52) + (26 x 26) + (13 x 13)) x 3 = 10,647 bounding boxes. That is a huge number of boxes for an output.
• In our dog example, we have only one object and we want only one bounding box around it. How do we reduce the boxes from 10,647 down to 1?
• First, we filter boxes based on their objectness score. Generally, boxes having scores below a threshold are ignored.
• Second, we use Non-maximum Suppression (NMS). NMS is intended to cure the problem of multiple detections of the same object.
• For example, all 3 bounding boxes of the red grid cell at the center of the image may detect the same box, or the adjacent cells may detect the same object.
4.13.4 YOLOv3 Architecture:
• YOLO is a single neural network that unifies object detection and classification into one end-to-end network. The neural network architecture was inspired by the GoogLeNet (Inception) model for feature extraction. Instead of the inception modules used by GoogLeNet, YOLO uses 1x1 reduction layers followed by 3x3 convolutional layers. The authors called this architecture Darknet.
Figure4-26 YOLO neural network architecture.
4.13.5 Comparisons:
After exploring the different techniques used in object detection, we will now discuss the tools we used in our project and the model we chose.
First, we will talk about Google Colab and TensorFlow:
Google is quite aggressive in AI research. Over many years, Google developed an AI framework called TensorFlow and a development tool called Colaboratory. Today TensorFlow is open-sourced, and since 2017 Google has made Colaboratory free for public use. Colaboratory is now known as Google Colab or simply Colab.
Another attractive feature that Google offers to developers is the use of GPUs. Colab supports GPUs and it is totally free. The reason for making it free to the public could be to make its software a standard in academia for teaching machine learning and data science. It may also have the long-term perspective of building a customer base for Google Cloud APIs, which are sold on a per-use basis.
Irrespective of the reasons, the introduction of Colab has eased the learning and development of machine learning applications.
So, let us get started with Colab.
4.14 What Colab Offers You?
• Write and execute code in Python.
• Document your code that supports mathematical equations.
• Create/Upload/Share notebooks.
• Import/Save notebooks from/to Google Drive.
• Import/Publish notebooks from GitHub.
• Import external datasets e.g. from Kaggle.
• Integrate PyTorch, TensorFlow, Keras, OpenCV.
• Free Cloud service with free GPU, CPU, or TPU.
Because of these advantages, Colab helped us a lot in training the model, especially the free GPU, which greatly reduces training time. This allowed us to tune hyperparameters such as the number of iterations (training steps) and to try more than one model until we reached a model that satisfies the requirements.
We used the following:
1. TensorFlow
2. Python
3. OpenCV
Figure4-27 Python. Figure4-28 OpenCV. Figure4-29 TensorFlow.
4.14.1 Why TensorFlow?
1) TensorFlow is an end-to-end platform that makes it easy for you to build and deploy ML models.
2) It is open source and has a large community, which makes it easier to solve problems you may face.
3) Easy model building: TensorFlow offers multiple levels of abstraction, so you can choose the right one for your needs. Build and train models by using the high-level Keras API, which makes getting started with TensorFlow and machine learning easy. If you need more flexibility, eager execution allows for immediate iteration and intuitive debugging. For large ML training tasks, use the Distribution Strategy API for distributed training on different hardware configurations without changing the model definition.
4) Robust ML production anywhere: TensorFlow has always provided a direct path to production. Whether it's on servers, edge devices, or the web, TensorFlow lets you train and deploy your model easily, no matter what language or platform you use. Use TensorFlow Extended (TFX) if you need a full production ML pipeline. For running inference on mobile and edge devices, use TensorFlow Lite. Train and deploy models in JavaScript environments using TensorFlow.js.
5) Powerful experimentation for research: build and train state-of-the-art models without sacrificing speed or performance. TensorFlow gives you flexibility and control with features like the Keras Functional API and the Model Subclassing API for creating complex topologies. For easy prototyping and fast debugging, use eager execution.
6) TensorFlow also supports an ecosystem of powerful add-on libraries and models to experiment with, including Ragged Tensors, TensorFlow Probability, Tensor2Tensor, and BERT.
4.14.2 OpenCV (Open Source Computer Vision Library)
OpenCV is a library of programming functions mainly aimed at real-time computer vision. Originally developed by Intel, it was later supported by Willow Garage and then Itseez (which was later acquired by Intel). The library is cross-platform and free for use under the open-source BSD license.
Applications:
OpenCV's application areas include:
• 2D and 3D feature toolkits.
• Egomotion estimation.
• Facial recognition system.
• Gesture recognition.
• Human–computer interaction (HCI).
• Mobile robotics.
• Motion understanding.
• Object identification.
• Segmentation and recognition.
• Stereopsis stereo vision: depth perception from 2 cameras.
• Structure from motion (SFM).
• Motion tracking.
• Augmented reality.
To support some of the above areas,
OpenCV includes a statistical machine learning library that contains:
• Boosting.
• Decision tree learning.
• Gradient boosting trees.
• Expectation-maximization algorithm.
• k-nearest neighbor algorithm.
• Naive Bayes classifier.
• Artificial neural networks.
• Random forest.
• Support vector machine (SVM).
• Deep neural networks (DNN).
These capabilities make OpenCV very helpful for projects that use deep learning. The programming language we used is Python, which is an open-source programming language.
Python is widely regarded as the best programming language for AI and ML. AI and ML are being applied across various channels and industries, big corporations invest in these fields, and the demand for experts in ML and AI grows accordingly. Jean-Francois Puget, from IBM's machine learning department, expressed his opinion that Python is the most popular language for AI and ML, basing it on a search of trends on indeed.com.
Why?
1) A great library ecosystem:
A great choice of libraries is one of the main reasons Python is the most popular programming language used for AI. A library is a module or a group of modules published by different sources (like PyPI) that includes a pre-written piece of code allowing users to reach some functionality or perform different actions. Python libraries provide base-level items so developers don't have to code them from scratch every time. ML requires continuous data processing, and Python's libraries let you access, handle, and transform data. These are some of the most widespread libraries you can use for ML and AI:
• Scikit-learn:
for handling basic ML algorithms like clustering, linear and logistic regression, classification, and others.
• Pandas:
for high-level data structures and analysis. It allows merging and filtering of data, as
well as gathering it from other external sources like Excel, for instance.
• Keras:
for deep learning. It allows fast calculations and prototyping, as it uses the GPU in
addition to the CPU of the computer.
• TensorFlow:
for working with deep learning by setting up, training, and utilizing artificial neural
networks with massive datasets.
• Matplotlib:
for creating 2D plots, histograms, charts, and other forms of visualization.
• Scikit-image:
for image processing.
• PyBrain:
for neural networks, unsupervised and reinforcement learning.
• Caffe:
for deep learning; it allows switching between the CPU and the GPU and can process 60+ million images a day using a single NVIDIA K40 GPU.
• StatsModels:
for statistical algorithms and data exploration.
2) A low entry barrier:
The Python programming language resembles everyday English, which makes the learning process easier. Its simple syntax allows you to comfortably work with complex systems, ensuring clear relations between the system elements.
3) Flexibility:
Python for machine learning is a great choice, as this language is very flexible:
• It offers an option to choose either to use OOPs or scripting.
• There’s also no need to recompile the source code, developers can
implement any changes and quickly see the results.
• Programmers can combine Python and other languages to reach their goals.
4) Platform independence
Python is not only comfortable to use and easy to learn but also very versatile.
What we mean is that Python for machine learning development can run on any
platform including Windows, MacOS, Linux, UNIX, and twenty-one others. To
transfer the process from one platform to another, developers need to
implement several small-scale changes and modify some lines of code to create
an executable form of code for the chosen platform. Developers can use
packages like PyInstaller to prepare their code for running on different
platforms.
Transfer Learning
Chapter 5
5.1 Introduction
Figure5-1 Traditional ML vs. Transfer Learning.
When you're building a computer vision application, you can build your ConvNets
as we learned and start the training from scratch. And that is an acceptable
approach. Another much faster approach is to download a neural network that
someone else has already built and trained on a large dataset in a certain domain
and use this pretrained network as a starting point to train the network on your
new task. This approach is called transfer learning.
Transfer learning is one of the most important techniques of deep learning. When
building a vision system to solve a specific problem, you usually need to collect and
label a huge amount of data to train your network. But what if we could use an
existing neural network, that someone else has tuned and trained, and use it as a
starting point for our new task? Transfer learning allows us to do just that. We can
download an open-source model that someone else has already trained and tuned
for weeks and use their optimized parameters (weights) as a starting point to train
our model just a little bit more on a smaller dataset that we have for a given task.
This way we can train our network a lot faster and achieve very high results.
Deep learning researchers and practitioners have posted a lot of research
papers and open source projects of their trained algorithms that they have
worked on for weeks and months and trained on many GPUs to get state-of-the-
art results on many problems. The fact that someone else has done this work
and gone through the painful high-performance research process, means that
you can often download open source architecture and weights that took
someone else many weeks or months to build and tune and use that as a very
good start for your own neural network. This is transfer learning. It is referring
to the knowledge transfer from pretrained network in one domain to your own
problem in a different domain.
Note:
When we say train the model from scratch, we mean that the model starts with
zero knowledge of the world and the structure and the parameters of the model
begin as random guesses. Practically speaking, this means that the weights of the
model are randomly initialized and they need to go through a training process to be
optimized.
5.1.1 Definition and why transfer learning?
Transfer learning means transferring what a neural network has learned from being trained on a specific dataset to another related problem.
Problems transfer learning solve:
1) Data problem:
Deep learning requires a lot of data to get decent results, which is not very feasible in most cases. It is relatively rare to have a dataset of sufficient size to solve your problem. It is also very expensive to acquire and label data; labeling is mostly a manual process done by humans capturing images and labeling them one by one, which makes it a non-trivial, very expensive task.
2) Computation problem:
even if you are able to acquire hundreds of thousands of images for your
problem, it is computationally very expensive to train a deep neural network
on millions of images. The training process of a deep neural network from
scratch is very expensive because it usually requires weeks of training on
multiple GPUs. Also keep in mind that the neural network training process is
an iterative process. So, even if you happen to have the computing power
that is needed to train complex neural networks, having to spend a few
weeks experimenting different hyperparameters in each training iteration
will make the project very expensive until you finally reach satisfactory
results.
Additionally, one very important benefit of using transfer learning is that it helps
the model generalize its learnings and avoid overfitting.
Figure5-2 Extracted features.
To train an image classifier that will achieve near or above human level accuracy on
image classification, we’ll need massive amounts of data, large compute power, and
lots of time on our hands. Knowing this would be a problem for people with little or
no resources, researchers built state-of-the-art models that were trained on large
image datasets like ImageNet, MS COCO, Open Images, etc. and decided to share
their models to the general public for reuse. Even if that is the case, you might be
better off using transfer learning to fine-tune the pretrained network on your large
dataset. “In transfer learning, we first train a base network on a base dataset and
task, and then we repurpose the learned features, or transfer them to a second
target network to be trained on a target dataset and task. This process will tend to
work if the features are general, meaning suitable to both base and target tasks,
instead of specific to the base task.”
First, we need to find a dataset that has similar features to our problem at hand.
This involves spending some time exploring different open-source datasets to find
the closest one to our problem. Next, we need to choose a network that has been
trained on ImageNet (Example of datasets) and achieved good results.
For example: VGG16
To adapt the VGG16 network to our problem, we are going to download the VGG16 network with its pretrained weights, remove the classifier part, add our own classifier, and then retrain the new network. This is called using a pretrained network as a feature extractor. A pretrained model is a network that has been previously trained on a large dataset, typically on a large-scale image classification task. We can either:
1) directly use the pretrained model as it is to run our predictions, or
2) use the pretrained feature extraction part of the network and then add our own classifier. The classifier here could be one or more dense layers or even a traditional machine learning algorithm such as a Support Vector Machine (SVM).
Figure5-3 Example of applying transfer learning to VGG16 network. We freeze the feature
extraction part of the network.
To understand transfer learning more deeply, let's implement an example in Keras.
1. Download the open-source implementation of the VGG16 network and its weights to create our base model, and remove the classification layers from the VGG network (FC_4096 > FC_4096 > Softmax_1000). Luckily, Keras has a set of pretrained networks that are ready for us to download and use, as shown in the sketch below.
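A minimal sketch of this download step using Keras's built-in applications module (the 224x224 input shape is the standard ImageNet size used by VGG16):

from keras.applications.vgg16 import VGG16

# Download VGG16 pretrained on ImageNet, without the top classification layers
base_model = VGG16(weights='imagenet',
                   include_top=False,            # drops FC_4096 > FC_4096 > Softmax_1000
                   input_shape=(224, 224, 3))

# print a summary of the downloaded base model
base_model.summary()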
2. If you print a summary of the base model, you will notice that we downloaded the exact VGG16 architecture. This is a fast way to download popular networks that are supported by the deep learning library you are using. Alternatively, you can build the network yourself, as we did earlier, and download the weights separately. But for now, let's look at the summary of the base_model we just downloaded:
3. Notice that the downloaded architecture does not contain the classifier part (the 3 FC layers) at the top of the network, because we set the include_top argument to False.
4. More importantly, notice the number of trainable and non-trainable parameters in the summary. The downloaded network, as it is, makes all the network parameters trainable. As you can see above, our base model has more than 14 million trainable parameters. Now, we want to freeze all the downloaded layers and add our own classifier. Let's do that next.
5. Freeze the feature extraction layers that have been trained on the ImageNet dataset. Freezing layers means freezing their trained weights to prevent them from being retrained when we run our training, as in the sketch below.
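A minimal sketch of this freezing step, continuing from the base_model created above:

# Freeze every downloaded layer so its ImageNet weights are not updated during training
for layer in base_model.layers:
    layer.trainable = False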
The model summary is omitted in this case for brevity, as it is similar to the model summary on the previous page. The difference is that all the weights have been frozen, the trainable parameters are now equal to zero, and all the parameters of the frozen layers are non-trainable.
6. Add our own classification dense layer. Here we add a softmax layer with as many units as there are classes in our problem (the example below uses 6 units).
from keras.layers import Dense, Flatten
from keras.models import Model
# use “get_layer” method to save the last layer of the network
last_layer = base_model.get_layer('block5_pool')
# save the output of the last layer to be the input of the next layer
last_output = last_layer.output
# flatten the classifier input which is output of the last layer of VGG16 model
x = Flatten()(last_output)
# add our new softmax layer (one unit per class; 6 in this example)
x = Dense(6, activation='softmax', name='softmax')(x)
# instantiate a new_model using keras’s Model class
new_model = Model(inputs=base_model.input, outputs=x)
# print the new_model summary
new_model.summary()
Layer (type)                 Output Shape            Param #
input_1 (InputLayer)         (None, 224, 224, 3)     0
block1_conv1 (Conv2D)        (None, 224, 224, 64)    1792
block1_conv2 (Conv2D)        (None, 224, 224, 64)    36928
block1_pool (MaxPooling2D)   (None, 112, 112, 64)    0
block2_conv1 (Conv2D)        (None, 112, 112, 128)   73856
block2_conv2 (Conv2D)        (None, 112, 112, 128)   147584
block2_pool (MaxPooling2D)   (None, 56, 56, 128)     0
block3_conv1 (Conv2D)        (None, 56, 56, 256)     295168
block3_conv2 (Conv2D)        (None, 56, 56, 256)     590080
block3_conv3 (Conv2D)        (None, 56, 56, 256)     590080
block3_pool (MaxPooling2D)   (None, 28, 28, 256)     0
block4_conv1 (Conv2D)        (None, 28, 28, 512)     1180160
block4_conv2 (Conv2D)        (None, 28, 28, 512)     2359808
block4_conv3 (Conv2D)        (None, 28, 28, 512)     2359808
block4_pool (MaxPooling2D)   (None, 14, 14, 512)     0
block5_conv1 (Conv2D)        (None, 14, 14, 512)     2359808
block5_conv2 (Conv2D)        (None, 14, 14, 512)     2359808
block5_conv3 (Conv2D)        (None, 14, 14, 512)     2359808
block5_pool (MaxPooling2D)   (None, 7, 7, 512)       0
flatten_1 (Flatten)          (None, 25088)           0
softmax (Dense)              (None, 6)               150534
Total params: 14,865,222
Trainable params: 150,534
Non-trainable params: 14,714,688
7. Build your new model, which takes the input of the base model as its input and the output of your last softmax layer as its output. The new model is composed of all the feature extraction layers in VGGNet with the pretrained weights, plus our new, untrained softmax layer. In other words, when we train the model, we only train the softmax layer, in this example to detect the specific features of our new problem: red sign, car, person, stop sign, …etc.
Training the new model will be a lot faster than training the network from scratch. To verify that, compare the number of trainable params in this model (~150k) to the number of non-trainable params in the network (~14M). These "non-trainable" parameters are already trained on a large dataset, and we froze them to use the extracted features in our problem. A minimal sketch of the training step follows.
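This is a hedged sketch of how that training step might look with Keras data generators; the directory layout, batch size, and number of epochs are assumptions for illustration, and the number of class sub-folders must match the units of the softmax layer:

from keras.preprocessing.image import ImageDataGenerator

# Hypothetical layout: one sub-folder per class under data/train and data/valid
train_gen = ImageDataGenerator(rescale=1./255).flow_from_directory(
    'data/train', target_size=(224, 224), batch_size=32, class_mode='categorical')
valid_gen = ImageDataGenerator(rescale=1./255).flow_from_directory(
    'data/valid', target_size=(224, 224), batch_size=32, class_mode='categorical')

new_model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])

# Only the ~150k softmax parameters are updated; the frozen VGG16 layers stay fixed
new_model.fit_generator(train_gen, validation_data=valid_gen, epochs=10)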
5.1.2 How transfer learning works:
What is really being learned by the network during training? The short answer is
“Feature maps”.
Figure5-3 Feature maps.
How are these features learned?
During the backpropagation process, the weights are updated until we get to the
“optimized weights” that minimize the error function.
What is the relationship between features and weights?
A feature map is the result of passing the weights filter on the input image during
the convolution process.
So, what is really being transferred from one network to another? To transfer
features, we download the optimized weights of the pretrained network. These
weights are then re-used as the starting point for the training process and retrained
to adapt to the new problem.
When training is complete, we output two main items:
1) The network architecture, and 2) the trained weights.
Figure5-4 CNN Architecture Diagram, Hierarchical Feature Extraction in stages.
The neural network learns the features in your dataset step by step, in an increasing level of complexity, one layer after the other. These are called feature maps. The deeper you go through the network layers, the more image-specific the learned features are. The first layer detects low-level features such as edges and curves. The output of the first layer becomes the input to the second layer, which produces higher-level features, like semi-circles and squares. The next layer assembles the output of the previous layer into parts of familiar objects, and a subsequent layer detects the objects. As we go through more layers, the network yields activation maps that represent more and more complex features. The deeper you go into the network, the more the filters respond to larger regions of the pixel space. Higher-level layers amplify aspects of the received inputs that are important for discrimination and suppress irrelevant variations.
Note:
The earlier layer’s features are very similar for all models. The lower level features
are almost always transferable from one task to another because they contain
generic information like the structure and the nature of how images look. Transferring information like lines, dots, curves, and small parts of objects is very valuable for the network, helping it learn faster and with less data on the new task. The deeper we go into the network, the more specific the features become, until the network overfits its training data and it becomes harder to generalize to different tasks.
Figure5-5 features start to be more specific.
What about the transferability of features extracted at later layers in
the network?
The transferability of features that are extracted at later layers depends on the
similarity of the original and new datasets. The idea here is that all images must have
shapes and edges so the early layers are usually transferable between different
domains.
Based on the similarity of the source and target domains, we can decide whether to
transfer only the low-level features from the source domain or all the high level
features or somewhere in between. Source Domain: the original dataset that the
pretrained network is trained on. Target Domain: the new dataset that we want to
train the network on.
5.1.3 Transfer learning approaches:
The choice of approach depends on the similarity between the dataset the model was originally trained on and the data the model will deal with in our project.
There are three major transfer learning approaches, as follows:
1. Pretrained network as a classifier.
2. Pretrained network as a feature extractor.
3. Fine tuning.
Each approach can be effective and save significant time in developing and training a deep convolutional neural network model. We should choose the appropriate approach for each application.
1) Pretrained network as a classifier:
The pretrained model is used directly to classify new images, with no changes applied to it and no extra training. All we do here is download the network architecture and its pretrained weights and then run the predictions directly on our new data.
In this case, we are saying that the domain of our new problem is very similar to the one the pretrained network was trained on, so the network is ready to just be "deployed": the source dataset already contains the objects we want to detect, and no training is done here.
Using a pretrained network as a classifier doesn't really involve any layer freezing or extra model training. Instead, it is just taking a network that was trained on a similar problem and deploying it directly on your task.
2) Pretrained network as a feature extractor:
We take a CNN pretrained on a large dataset (ImageNet, for example), freeze its feature extraction part, remove the classifier part, and add our own new dense classifier layers.
We usually go with this scenario when our new task is similar to the original dataset that the pretrained network was trained on. This means that we can utilize the high-level features that were extracted from the ImageNet dataset in this new task. To do that, we freeze all the layers from the pretrained network and only train the classifier part that we just added, on the new dataset. This approach is called "using a pretrained network as a feature extractor" because we froze the feature extractor part to transfer all the learned feature maps to our new problem. We only added a new classifier, which will be trained from scratch, on top of the pretrained model so that we can repurpose the feature maps learned previously for our dataset. The reason we remove the classification part of the pretrained network is that it is often very specific to the original classification task, and subsequently specific to the set of classes on which the model was trained.
3) Fine-tuning:
So far, we’ve seen two basic approaches of using a pretrained network in transfer
learning: 1) pretrained network as a classifier, and 2) pretrained network as a
feature extractor. We usually use these two approaches when the target domain is
somewhat similar to the source domain.
Transfer learning works great even when the domains are very different. We just
need to extract the correct feature maps from the source domain and "fine-tune" them to fit the target domain. Fine-tuning is when you decide to freeze only part of the feature extraction layers of the network, not all of them.
We can decide to freeze the network at the appropriate level of feature maps:
1. If the domains are similar, we might want to freeze all the network up to the last feature map level.
2. If the domains are very different, we might decide to freeze the pretrained network only after the first feature map level and retrain all the remaining layers.
Between these two options lies a range of fine-tuning levels that we can apply. We typically decide the appropriate level of fine-tuning by trial and error, but there are guidelines we can follow to intuitively decide on the fine-tuning level of the pretrained network. The decision is a function of two factors:
1) The amount of data that we have.
2) The level of similarity between the source and target domains.
What is Fine Tuning?
The formal definition of fine-tuning is: freezing a few of the network layers that are used for feature extraction, and jointly training both the non-frozen layers and the newly added classifier layers of the pretrained model. It is called fine-tuning because, when we retrain the feature extraction layers, we "fine-tune" the higher-order feature representations to make them more relevant for the new task's dataset.
Why is fine-tuning better than training from scratch?
When we train a network from scratch, we usually randomly initialize the weights and apply a gradient descent optimizer to find the best set of weights that minimizes our error function. Since these weights start with random values, there is no guarantee that they begin close to the desired optimal values, and if the initial values are far from the optimum, the optimizer will take a long time to converge. This is when fine-tuning can be very useful. The pretrained network's weights have already been optimized to learn from its dataset, so when we use this network on our problem, we start with the weight values it ended with. This makes the network converge much faster than randomly initializing the weights. This is what the term "fine-tuning" refers to: we are basically fine-tuning the already-optimized weights to fit our new problem instead of training the entire network from scratch with random weights.
Use a smaller learning rate when fine-tuning:
It is common to use a smaller learning rate for the ConvNet weights that are being fine-tuned, in comparison to the (randomly initialized) weights of the new linear classifier that computes the class scores of your new dataset. This is because we expect that the ConvNet weights are already relatively good, so we don't want to distort them too quickly or too much (especially while the new classifier above them is being trained from random initialization). A minimal sketch of this setup follows.
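Continuing the earlier Keras example, this is a minimal sketch of fine-tuning only the last VGG16 block with a small learning rate; the choice of block and the learning rate value are illustrative assumptions:

from keras.optimizers import Adam

# Unfreeze only the last convolutional block (block5); earlier blocks stay frozen
for layer in base_model.layers:
    layer.trainable = layer.name.startswith('block5')

# Recompile with a much smaller learning rate so the pretrained weights are only nudged
new_model.compile(optimizer=Adam(lr=1e-5),
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])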
Choose the appropriate level of transfer learning:
Choosing the appropriate level of transfer learning is a function of two important factors:
1) The size of the target dataset (small or large): when we have a small dataset, there is probably not much information the network could learn from training more layers, so it will tend to overfit the new data. In this case we probably want to do less fine-tuning and rely more on the source dataset.
2) Domain similarity of the source and target datasets: how similar is your new problem to the domain of the original dataset? For example, if your problem is to classify cars and boats, ImageNet could be a good option because it contains a lot of images with similar features. On the other hand, if your problem is to classify chest cancer on X-ray images, this is a completely different domain that will likely require a lot of fine-tuning. These two factors lead to the four major scenarios below:
1. Target dataset is small and similar to the source dataset.
2. Target dataset is large and similar to the source dataset.
3. Target dataset is small and very different from the source dataset.
4. Target dataset is large and very different from the source dataset.
Scenario #1: target dataset is small and similar to the source dataset:
Since the original dataset is similar to our new dataset, we can expect the higher-level features in the pretrained ConvNet to be relevant to our dataset as well. It might then be best to freeze the feature extraction part of the network and only retrain the classifier. If you have a small amount of data, be careful of overfitting when you fine-tune your pretrained network.
Scenario #2: target dataset is large and similar to the source dataset:
Since both domains are similar, we can freeze the feature extraction part and retrain the classifier, similar to what we did in scenario #1. But since we have more data in the new domain, we can get a performance boost from fine-tuning through all or part of the pretrained network, with more confidence that we won't overfit. Fine-tuning through the entire network is not really needed because the higher-level features are related (since the datasets are similar). So, a good start is to freeze approximately 60%-80% of the pretrained network and retrain the rest on the new data.
Scenario #3: target dataset is small and different from the source dataset:
Since the datasets are different, it might not be best to freeze the higher-level features of the pretrained network, because they contain more dataset-specific features. Instead, it would work better to retrain layers from somewhere earlier in the network, or even to freeze no layers at all and fine-tune the entire network. However, since you have a small dataset, fine-tuning the entire network on your small dataset might not be a good idea because it makes the model prone to overfitting. A mid-way solution works better in this case. So, a good start is to freeze approximately the first third or half of the pretrained network. After all, the early layers contain very generic feature maps that will be useful for your dataset even if it is very different.
Scenario #4: target dataset is large and different from the source dataset:
Since the new dataset is large, you might be tempted to just train the entire network from scratch and not use transfer learning at all. However, in practice it is often still very beneficial to initialize with the weights of a pretrained model, because it makes the model converge faster. In this case, we have a large dataset that gives us the confidence to fine-tune through the entire network without having to worry about overfitting.
Summary:
Figure 5-6 Dataset that is different from the source dataset.
After explaining all the theory that we used in the object detection part, we will now discuss the implementation and provide a summary of the previous discussions.
5.2 Detecting traffic signs and pedestrians:
Note that this feature is not available in any 2019 vehicles, except maybe Tesla's. We will use transfer learning to adapt a pretrained MobileNet SSD (quantized) deep learning model to detect traffic signs and pedestrians. We will train the car to identify and respond to (miniaturized) traffic signs and pedestrians in real time. We first need to detect what is in front of the car; then we can use this information to tell the car to stop, go, turn, or change its speed, etc.
The model mainly consists of two parts. First, the base neural network: a CNN that extracts features from an image, from low-level features such as lines, edges, or circles, to higher-level features such as a face, a person, a traffic light, or a stop sign. A few well-known base neural networks are LeNet, InceptionNet (aka GoogLeNet), ResNet, VGGNet, AlexNet, and MobileNet.
Figure5-7 ImageNet Challenge top error.
Then, detection neural networks are attached to the end of the base neural network and used to simultaneously identify multiple objects from a single image with the help of the extracted features. Some of the popular detection networks are SSD (Single Shot MultiBox Detector), R-CNN (Region with CNN features), Faster R-CNN, and YOLO (You Only Look Once).
Note:
An object detection model is usually named as a combination of its base network type and detection network type. For example, a "MobileNet SSD" model, an "Inception SSD" model, or a "ResNet Faster R-CNN" model, to name a few.
Lastly, for pre-trained detection models, the model name also includes the type of image dataset it was trained on. A few well-known datasets used in training image classifiers and detectors are the COCO dataset (about 100 common household objects), the Open Images dataset (about 20,000 types of objects), and the iNaturalist dataset (about 200,000 types of animal and plant species).
For example, the ssd_mobilenet_v2_coco model uses the 2nd version of MobileNet to extract features and SSD to detect objects, and it is pre-trained on the COCO dataset.
Keeping track of all these combinations of models is no easy task. Thankfully, Google publishes a list of pre-trained models for TensorFlow (called the Model Zoo; indeed, it is a zoo of models), so you can just download the one that suits your needs and use it directly in your project for detection inference.
Figure5-8 Tensorflow detection model zoo.
Figure5-9 COCO-trained models.
We used a MobileNet SSD model which is pre-trained on the COCO dataset, and we applied transfer learning (the fine-tuning approach).
Transfer Learning:
We want to detect traffic signs and pedestrians, which differ from the COCO dataset classes, so we cannot use the first or second approach of transfer learning. We still want to benefit from fine-tuning to accelerate the training process, improve the accuracy of the results, and avoid overfitting. So we used the fine-tuning approach of transfer learning, which starts with the parameters of a pre-trained model, supplies it with only 100-200 of our own images and labels, and spends only a few hours training parts of the detection neural network (or a few minutes when using Google Colab). The intuition is that, in a pre-trained model, the base CNN layers are already good at extracting features from images, since these models were trained on a vast number and large variety of images. The distinction is that we now have a different set of object types (7) than those of the pre-trained models (~100-100,000 types).
Modeling Training:
1. Image collection and labeling.
2. Model selection.
3. Transfer learning/model training.
4. Save the model output in Edge TPU format and in normal format (which works on a normal laptop).
5. Run model inferences on Raspberry Pi.
Image collection and labelling:
We have 7 object types, namely: Red Light, Green Light, Stop Sign, 40 Mph Speed Limit, car, 25 Mph Speed Limit, and a few Lego figurines as pedestrians. So I took about 200 photos similar to the one above and placed the objects randomly in each image. Then I labeled each image with the bounding box for each object in the image. There is a free tool called labelImg (for Windows/Mac/Linux) which made this daunting task feel like a breeze. All I had to do was point labelImg to the folder where the training images were stored and, for each image, drag a box around each object and choose an object type (if it was a new type, I could quickly create one). Afterward, I randomly split the images (along with their label XML files) into train and test folders.
5.2.1 Model selection (the most tedious part):
On a Raspberry Pi, since we have limited computing power, we have to choose a model that runs relatively fast while staying accurate. After experimenting with a few models, we settled on the MobileNet v2 SSD COCO model as the optimal balance between speed and accuracy.
Note:
We tried faster_rcnn_inception_v2_coco; it was very accurate but very slow. We also tried YOLOv3; it was very fast but had very low accuracy, and its training process is quite complex.
Furthermore, for our model to work on the Edge TPU accelerator, we have to choose the MobileNet v2 SSD COCO Quantized model.
Quantization is a way to make model inference run faster by storing the model parameters as integer values instead of floating-point values, which decreases the required memory and computational power with very little degradation in prediction accuracy [2]. Edge TPU hardware is optimized for, and can only run, quantized models. Quantized models also run faster on a PC or laptop, which is an important factor in real-time object detection.
Our model is:
Model name: 'ssd_mobilenet_v2_quantized_300x300_coco_2019_01_03'
pipeline_file: 'ssd_mobilenet_v2_quantized_300x300_coco.config',
batch_size: 12
5.2.2 Transfer Learning/Model Training/Testing:
For this step, we will use Google Colab again. This section is based on Chengwei's excellent tutorial "How to train an object detection model easy for free".
I will present the key parts of my Jupyter Notebook below.
Section 1: Mount Google Drive
Mount my Google Drive and save the modeling output files (.ckpt) there, so that they won't be wiped out when the Colab virtual machine restarts. Colab has an idle timeout of 90 minutes and a maximum daily usage of 12 hours.
Google will ask for an authentication code when you run the following code; just follow the link in the output and allow access. You can put the model_dir anywhere in your Google Drive, but you should create this path in your Google Drive first or you will get an error.
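A minimal sketch of this mounting step (the folder name inside Drive is a placeholder):

from google.colab import drive

# Mount Google Drive so the training checkpoints survive a Colab VM restart
drive.mount('/content/gdrive')

# Placeholder path inside Drive where the .ckpt files will be stored
model_dir = '/content/gdrive/My Drive/self_driving_car/training'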
Section 2: Configs and Hyperparameters
Colab supports a variety of models; you can find more pretrained models in the TensorFlow detection model zoo (COCO-trained models), as well as their pipeline config files in object_detection/samples/configs/.
We have already discussed what "ssd_mobilenet_v2_quantized" refers to.
"300x300" refers to the input image size, so we will need to resize images to this size when using the model for testing or detection after training.
"coco_2019_01_03" refers to the dataset the model was originally trained on (COCO) and the checkpoint release date.
The pipeline file contains hyperparameter values such as the optimizer type and the learning rate, etc. We will see its content later.
Section 3: Set up the Training Environment
Install required packages: these packages are the modules and libraries that will be used in the training process.
Prepare tfrecord files:
After running this step, you will have two files, train.record and test.record, both binary files, each containing the encoded JPEGs and bounding box annotation information for the corresponding train/test set, so that TensorFlow can process them quickly. The tfrecord file format is easier to use and faster to load during the training phase compared to storing each image and annotation separately.
There are two steps in doing so:
• Converting the individual *.xml files to a unified *.csv file for each set (train/test).
• Converting the annotation *.csv and image files of each set (train/test) to *.record files (TFRecord format).
Use the following scripts to generate the tfrecord files as well as the label_map.pbtxt file. The first step, for example, can be done with a small script like the sketch below.
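A hedged sketch of that first step, collecting the labelImg (Pascal VOC style) XML annotations of a folder into one CSV file; folder and file names are placeholders:

import glob
import xml.etree.ElementTree as ET
import pandas as pd

def xml_to_csv(folder):
    # Read every XML annotation in the folder and collect one row per bounding box
    rows = []
    for xml_file in glob.glob(folder + '/*.xml'):
        root = ET.parse(xml_file).getroot()
        for obj in root.findall('object'):
            box = obj.find('bndbox')
            rows.append({'filename': root.find('filename').text,
                         'class': obj.find('name').text,
                         'xmin': int(box.find('xmin').text),
                         'ymin': int(box.find('ymin').text),
                         'xmax': int(box.find('xmax').text),
                         'ymax': int(box.find('ymax').text)})
    return pd.DataFrame(rows)

xml_to_csv('images/train').to_csv('train_labels.csv', index=False)
xml_to_csv('images/test').to_csv('test_labels.csv', index=False)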
Download Pre-trained Model
Figure5-10 Download pre-trained model.
The above code will download the pre-trained model files for the ssd_mobilenet_v2_quantized_300x300_coco_2019_01_03 model, and we will use only the model.ckpt checkpoint files, from which we will apply transfer learning.
Section 4: Transfer Learning Training:
Configuring a training pipeline: to do the transfer learning training, we first downloaded the pre-trained model weights/checkpoints and then configured the corresponding pipeline.config file to tell the trainer the following information (a sketch of filling in these values is shown after the list):
• the pre-trained model checkpoint path (fine_tune_checkpoint),
• the paths to the two tfrecord files,
• the path to the label_map.pbtxt file (label_map_path),
• the training batch size (batch_size),
• the number of training steps (num_steps),
• the number of unique object classes (num_classes),
• the type of optimizer,
• the learning rate.
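A hedged sketch of filling these values into the config from the notebook using simple string substitution; the file names, class count, and step count below are placeholders, and the tfrecord and label map paths are set in the same way for the train and eval readers:

import re

with open('ssd_mobilenet_v2_quantized_300x300_coco.config') as f:
    config = f.read()

# Point the trainer at the downloaded checkpoint and set our own hyperparameters
config = re.sub(r'fine_tune_checkpoint: ".*?"',
                'fine_tune_checkpoint: "model.ckpt"', config)
config = re.sub(r'num_classes: [0-9]+', 'num_classes: 7', config)
config = re.sub(r'batch_size: [0-9]+', 'batch_size: 12', config)
config = re.sub(r'num_steps: [0-9]+', 'num_steps: 20000', config)

with open('pipeline.config', 'w') as f:
    f.write(config)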
Figure5-11 Part of the config file indicating the batch size, optimizer type, and learning rate (which varies in this case).
Figure5-12 Part of the config file that contains the image resizer settings (resizing images to 300x300 to suit the CNN) and the architecture of the box predictor CNN, which includes regularization and dropout to avoid overfitting.
During training, we can monitor the progression of the loss and precision via TensorBoard. We can see for the test dataset that the loss was dropping and the precision was increasing throughout the training, which is a great sign that our training is working as expected.
Figure5-13 TensorBoard during training: the total loss (lower right) keeps dropping, while mAP (top left), a measure of precision, keeps increasing.
Test the Trained Model:
After the training, we ran a few images from the test dataset through our new model. As expected, almost all the objects in the images were identified with relatively high confidence. There were a few images in which objects were further away and were not detected. That is fine for our purpose, because we only wanted to detect nearby objects so we could respond to them; the further-away objects become larger and easier to detect as the car approaches them. The accuracy was about 92%.
Code and results of testing:
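The actual notebook code and its outputs were included as screenshots; the sketch below is a hedged reconstruction of such an inference step in TensorFlow 1.x style, where the frozen graph path, the 0.5 confidence threshold, and the distance heuristic based on box height are assumptions rather than the exact code we ran:

import cv2
import numpy as np
import tensorflow as tf

# Load the frozen graph exported after training (placeholder path)
detection_graph = tf.Graph()
with detection_graph.as_default():
    graph_def = tf.GraphDef()
    with tf.gfile.GFile('frozen_inference_graph.pb', 'rb') as f:
        graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name='')

with detection_graph.as_default(), tf.Session() as sess:
    image = cv2.imread('test_image.jpg')
    rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    input_tensor = np.expand_dims(rgb, axis=0)    # the exported graph resizes to 300x300 itself

    # Standard output tensor names used by TensorFlow Object Detection API exports
    boxes, scores, classes = sess.run(
        ['detection_boxes:0', 'detection_scores:0', 'detection_classes:0'],
        feed_dict={'image_tensor:0': input_tensor})

    for box, score, cls in zip(boxes[0], scores[0], classes[0]):
        if score > 0.5:
            box_height = box[2] - box[0]          # boxes are normalized [ymin, xmin, ymax, xmax]
            distance = 1.0 / box_height           # crude proxy: nearer objects appear taller
            print('class', int(cls), 'score', round(float(score), 2),
                  'distance', round(float(distance), 2))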
The printed number represents the distance of the object from the camera, which will be used to decide when to take a decision.
Problems:
We used our laptops to run the model on the images received from the Pi camera, but there was some delay due to the transmission from the Pi to the laptop via a socket over the LAN. To solve this problem we would need to buy a TPU, but this component is not available in Egypt.
Figure5-14 Google edge TPU.
We will now take a look at the TPU and explain why it is important and how it would help.
5.3 Google’s Edge TPU. What? How? Why?
The Edge TPU is basically the Raspberry Pi of machine learning. It’s a device that
performs inference at the Edge with its TPU.
Cloud vs Edge:
Running code in the cloud means that you use CPUs, GPUs and TPUs of a company
that makes those available to you via your browser.
The main advantage of running code in the cloud is that you can assign the necessary
amount of computing power for that specific code (training large models can take a lot
of computation).
The edge is the opposite of the cloud. It means that you are running your code on
premise (which basically means that you are able to physically touch the device the
code is running on).
The main advantage of running code on the edge is that there is no network latency. As
IoT devices usually generate frequent data, running code on the edge is perfect for IoT
based solutions.
Figure5-15 CPU vs GPU vs TPU.
A TPU (Tensor Processing Unit) is another kind of processing unit, like a CPU or a GPU. There are, however, some big differences between them. The biggest difference is that a TPU is an ASIC (an Application-Specific Integrated Circuit). An ASIC is optimized to perform a specific kind of application; for a TPU, this specific task is performing the multiply-add operations that are typically used in neural networks. As you probably know, CPUs and GPUs are not optimized for one specific kind of application, so they are not ASICs.
A CPU performs the multiply-add operation by reading each input and weight from memory, multiplying them with its ALU (the calculator in the figure above), writing the results back to memory, and finally adding up all the multiplied values.
Modern CPUs are strengthened by a massive cache, branch prediction, and a high clock rate on each of their cores, all of which contribute to a lower CPU latency. A GPU does the same thing but has thousands of ALUs to perform its calculations, and a calculation can be parallelized over all the ALUs. This is called SIMD, and a perfect example of it is the multiply-add operation in neural networks. A GPU, however, does not use the fancy features that lower latency (mentioned above), and it also needs to orchestrate its thousands of ALUs, which further increases its latency. In short, a GPU drastically increases its throughput by parallelizing its computation in exchange for an increase in its latency.
A TPU, on the other hand, operates very differently. Its ALUs are directly connected to each other without using memory in between: they can pass information directly to one another, which drastically decreases latency.
Performance
As a comparison, consider this:
• A CPU can handle tens of operations per cycle.
• A GPU can handle tens of thousands of operations per cycle.
• A TPU can handle up to 128,000 operations per cycle.
Purpose
• Central Processing Unit (CPU): A processor designed to solve every computational
problem in a general fashion. Its cache and memory design are optimized for any
general programming problem.
• Graphics Processing Unit (GPU): A processor designed to accelerate the rendering
of graphics.
• Tensor Processing Unit (TPU): A co-processor designed to accelerate deep learning
tasks developed using TensorFlow (a programming framework). Compilers for
general-purpose programming on the TPU have not been developed, so it requires
significant effort to do general programming on a TPU.
Usage
• Central Processing Unit (CPU): General purpose programming problems.
• Graphics Processing Unit (GPU): Graphics rendering; machine learning model
training and inference; efficient for programming problems with parallelization
scope; general purpose programming problems.
• Tensor Processing Unit (TPU): Machine learning model training and inference
(TensorFlow models only).
Manufacturers
• Central Processing Unit (CPU): Intel, AMD, Qualcomm, NVIDIA, IBM, Samsung,
Hewlett-Packard, VIA, Atmel and many others.
• Graphics Processing Unit (GPU): NVIDIA, AMD, Broadcom Limited, Imagination
Technologies (PowerVR).
• Tensor Processing Unit (TPU): Google.
Quantization
A last important note on TPUs is quantization. Google's Edge TPU uses 8-bit weights to do
its calculations, while weights are typically stored as 32-bit numbers, so we must be able to
convert weights from 32 bits to 8 bits. This process is called quantization. Quantization
basically rounds the more accurate 32-bit number to the nearest 8-bit number. The
process is shown visually in the figure below.
Figure 5-15 Quantization.
By rounding numbers, accuracy decreases. However, neural networks are very good at
generalizing (thanks to techniques such as dropout) and therefore do not take a big
accuracy hit when quantization is applied, as the figure below shows.
Figure 5-16 Accuracy of non-quantized models vs quantized models.
The advantages of quantization are significant: it reduces computation and
memory needs, which leads to more energy-efficient computation.
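We did not end up deploying on an Edge TPU, but as an illustration, quantizing a trained Keras model for it with TensorFlow Lite typically looks like the sketch below. The model file name and the calibration_images iterable are placeholders, and the exact input/output type settings vary across TensorFlow versions.

import tensorflow as tf

def representative_dataset():
    # yield a few hundred real input samples so the converter can
    # calibrate the 8-bit ranges (calibration_images is a placeholder iterable)
    for image in calibration_images:
        yield [image[None, ...].astype("float32")]

model = tf.keras.models.load_model("model.h5")          # placeholder path
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# force full 8-bit integer quantization, as the Edge TPU requires
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8                # tf.int8 on newer TF versions
converter.inference_output_type = tf.uint8

with open("model_quant.tflite", "wb") as f:
    f.write(converter.convert())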
Lane Keeping System
Chapter 6
6.1 Introduction
Currently, there are a few 2018–2019 cars on the market that have these two
features onboard, namely, Adaptive Cruise Control (ACC) and some forms of Lane
Keep Assist System (LKAS). Adaptive cruise control uses radar to detect and keep a
safe distance with the car in front of it. This feature has been around since around
2012–2013. Lane Keep Assist System is a relatively new feature, which uses a
windshield mount camera to detect lane lines, and steers so that the car is in the
middle of the lane. This is an extremely useful feature when you are driving on a
highway, both in bumper-to-bumper traffic and on long drives. When my family
drove from Chicago to Colorado on a ski trip during Christmas, we drove a total of 35
hours. Our Volvo XC 90, which has both ACC and LKAS (Volvo calls it Pilot Assist), did an
excellent job on the highway, as 95% of the long and boring highway miles were
driven by our Volvo! All I had to do was put my hand on the steering wheel (but I
didn't have to steer) and just stare at the road ahead. I didn't need to steer, brake, or
accelerate when the road curved and wound, or when the car in front of us slowed
down or stopped, not even when a car cut in front of us from another lane. The few
hours it couldn't drive itself were when we drove through a snowstorm and the lane
markers were covered by snow. (Volvo, if you are reading this, yes, I will take
endorsements!) Curious as I am, I thought to myself: I wonder how this works, and
wouldn't it be cool if I could replicate this myself (on a smaller scale)?
We will build LKAS into our DeepPiCar. Implementing ACC requires a radar, which our
PiCar doesn't have; in the future, an ultrasonic sensor could be added to the DeepPiCar.
Ultrasound, similar to radar, can also detect distances, except at closer ranges, which
is perfect for a small-scale robotic car.
6.2 Timeline of available systems
1992: Mitsubishi Motors began offering a camera-assisted lane-keeping support
system on the Mitsubishi Debonair sold in Japan.
2001: Nissan Motors began offering a lane-keeping support system on the Cima sold in
Japan.
2002: Toyota introduced its Lane Monitoring System on models such as the Cardina
and Alphard sold in Japan; this system warns the driver if it appears the vehicle is
beginning to drift out of its lane.
In 2004, Toyota added a Lane Keeping Assist feature to the Crown Majesta which
can apply a small counter-steering force to aid in keeping the vehicle in its lane.
In 2006, Lexus introduced a multi-mode Lane Keeping Assist system on the LS
460, which utilizes stereo cameras and more sophisticated object- and pattern-
recognition processors. This system can issue an audiovisual warning and also
(using the Electric Power Steering or EPS) steer the vehicle to hold its lane. It also
applies counter-steering torque to help ensure the driver does not over-correct
or "saw" the steering wheel while attempting to return the vehicle to its proper
lane. If the radar cruise control system is engaged, the Lane Keep function works
to help reduce the driver's steering-input burden by providing steering torque;
however, the driver must remain active or the system will deactivate.
2003: Honda launched its Lane Keep Assist System (LKAS) on the Inspire. It provides
up to 80% of steering torque to keep the car in its lane on the highway. It is also
designed to make highway driving less cumbersome, by minimizing the driver's
steering input. A camera, mounted at the top of the windshield just above the rear-
view mirror, scans the road ahead in a 40-degree radius, picking up the dotted white
lines used to divide lane boundaries on the highway. The computer recognizes that
the driver is "locked into" a particular lane, monitors how sharp a curve is and uses
factors such as yaw and vehicle speed to calculate the steering input required.
2004: In 2004, the first passenger-vehicle system available in North America was
jointly developed by Iteris and Valeo for Nissan on the Infiniti FX and (in 2005) the
M vehicles. In this system, a camera (mounted in the overhead console above the
mirror) monitors the lane markings on a roadway. A warning tone is triggered to
alert the driver when the vehicle begins to drift over the markings.
2005: Citroën became the first in Europe to offer LDWS on its 2005 C4 and C5 models,
and its C6. This system uses infrared sensors to monitor lane markings on the road
surface, and a vibration mechanism in the seat alerts the driver of deviations.
2007: In 2007, Audi began offering its Audi Lane Assist feature for the first time on
the Q7. This system, unlike the Japanese "assist" systems, will not intervene in
actual driving; rather, it will vibrate the steering wheel if the vehicle appears to be
exiting its lane. The LDW System in Audi is based on a forward-looking video-
camera in its visible range, instead of the downward-looking infrared sensors in the
Citroën. Also, in 2007, Infiniti offered a newer version of its 2004 system, which it
called the Lane Departure Prevention (LDP) system. This feature utilizes the vehicle
stability control system to help assist the driver maintain lane position by applying
gentle brake pressure on the appropriate wheels.
2008: General Motors introduced Lane Departure Warning on its 2008 model-year
Cadillac STS, DTS and Buick Lucerne models. The General Motors system warns the
driver with an audible tone and a warning indicator on the dashboard. BMW also
introduced Lane Departure Warning on the 5 series and 6 series, using a vibrating
steering wheel to warn the driver of unintended departures. In late 2013, BMW
updated the system with Traffic Jam Assistant, appearing first on the redesigned X5;
this system works below 25 mph. Volvo introduced the Lane Departure Warning system
and the Driver Alert Control on its 2008 model-year S80, V70 and XC70 executive
cars. Volvo's lane departure warning system uses a camera to track road markings and
sounds an alarm when drivers depart their lane without signaling. The systems used
by BMW, Volvo and General Motors are based on core technology from Mobileye.
2009: Mercedes-Benz began offering a Lane Keeping Assist function on the new E-
class. This system warns the driver (with a steering-wheel vibration) if it appears the
vehicle is beginning to leave its lane. Another feature will automatically deactivate
and reactivate if it ascertains the driver is intentionally leaving his lane (for instance,
aggressively cornering). A newer version will use the braking system to assist in
maintaining the vehicle's lane. In 2013, on the redesigned S-Class, Mercedes began
offering Distronic Plus with Steering Assist and Stop & Go Pilot.
2010: Kia Motors offered the 2011 Cadenza premium sedan with an optional Lane
Departure Warning System (LDWS) in limited markets. This system uses a flashing
dashboard icon and emits an audible warning when a white lane marking is being
crossed, and emits a louder audible warning when a yellow-line marking is crossed.
This system is canceled when a turn signal is operating, or by pressing a deactivation
switch on the dashboard; it works by using an optical sensor on both sides of the car.
Fiat is also launching its Lane Keep Assist feature based on TRW's lane keeping assist
system (also known as the Haptic Lane Feedback system). This system integrates the
lane- detection camera with TRW's electric power-steering system; when an
unintended lane departure is detected (the turn signal is not engaged to indicate the
driver's desire to change lanes), the electric power-steering system will introduce a
gentle torque that will help guide the driver back toward the center of the lane.
Introduced on the Lancia Delta in 2008, this system earned the Italian Automotive
Technical Association's Best Automotive Innovation of the Year Award for 2008.
Peugeot introduced the same system as Citroën in its new 308.
6.3 Current lane keeping system in market
Many automobile manufacturers provide optional lane keeping systems, including
Nissan, Toyota, Honda, General Motors, Ford, Tesla, and many more. However,
these systems require human monitoring, and acceleration/deceleration inputs are
not completely automatic. Ford's system [4] uses a single camera mounted behind
the windshield's rear-view mirror to monitor the road lane markings. The system
can only be used when driving above 40 mph and detecting at least one lane
marking. When the system is active, it will alert the driver if they are drifting out
of lane or provide some steering torque towards the lane center. If the system
detects no steering activity for a short period, it will alert the driver to put
their hands on the steering wheel. The lane keeping system can also be temporarily
suppressed by certain actions such as quick braking, fast acceleration, use of the
turn signal indicator, or an evasive steering maneuver. Ford's system also allows the
choice between alerting, assisting, or both when active. All these systems use
similar strategies to aid a human driver in staying in lane, but do not allow fully
autonomous driving [4-7]. GM, in particular, warns that their lane keeping system
should not be used while towing a trailer or on slippery roads, as it could cause loss
of control of the vehicle and a crash [5].
6.4 overview of lane keeping algorithms
The camera and radar system detect the relationship between the vehicle
position and the lane mark and then send this information to the lane departure
warning algorithm.
The algorithm integrates the sensors' information, the GPS position and the vehicle
state. Most of the published literature on LDWS and LKAS uses visual sensors to obtain
lane line information, combined with warning decision algorithms to identify whether
the vehicle has a tendency to depart from its original lane. The lane departure warning
algorithms used by various research institutions basically fall into two categories: one
combines a road structure model with image information, and the other uses image
information only. Eight types of departure warning algorithms are currently in common
use: the TLC algorithm, FOD algorithm, CCP algorithm, instantaneous lateral
displacement algorithm, lateral velocity algorithm, Edge Distribution Function (EDF)
algorithm, Time to Trajectory Divergence (TTD) algorithm, and Road Rumble Strips (RRS)
algorithm. The RRS algorithm belongs to the category that combines road structure and
image information, since it requires installing vibration bands or constructing new
roads: a 15 cm to 45 cm groove is placed on the shoulder of an existing road, and if the
vehicle deviates from the lane and enters the groove, the tire rubs against it and the
sound of the friction reminds the driver of the departure from the original lane. The
seven other departure warning algorithms use image information only. In order to
clearly understand the advantages and disadvantages of the various algorithms, and to
guide the study of lane departure warning algorithms, a comparative analysis of the
above eight warning algorithms is shown in the table below:
Algorithm                             Pros                                  Cons
TLC                                   Long warning time                     Fixed parameters
FOD                                   Multiple warning thresholds           Limited reaction time for the driver
CCP                                   Based on real-time position           Requires high-precision sensors
Lateral velocity                      Easy to define                        High false-alarm rate
TTD                                   Always follows the lane center        Complex algorithm; works poorly in bends
Instantaneous lateral displacement    Simple algorithm, easy to realize     Ignores the vehicle trajectory; relatively high false-alarm rate
EDF                                   No need for a camera                  Complex algorithm
RRS                                   Effective alert                       High cost
Figure 6.1 Comparison of lane departure warning algorithms.
From the above analysis, it is clear that each algorithm has different advantages and
disadvantages. The eight common departure warning algorithms have certain
limitations, and an algorithm does not change once it is chosen. But factors such as age,
gender and driving experience mean that almost every driver has their own driving
habits. Therefore, an efficient and practical warning algorithm must not only have high
precision, but should also adapt to the driving habits of different types of drivers.
Among the above methods, TLC has simple usage conditions and high precision, and is
the most widely used in LDWS- and LKAS-related products.
FOD considers driving habits when setting the virtual lane boundary line, but its
accuracy is limited. Therefore, in order to make LKAS adapt to different types of drivers
to the maximum extent, this work improves the existing TLC and FOD algorithms by
establishing TLC and FOD algorithms with selectable modes and multiple working
conditions and, based on them, proposes the concept of a dynamic warning boundary
and dynamic warning parameters.
The FOD algorithm originally matched driver habits by setting different virtual lane
boundaries. The design considered here also takes the impact of the surrounding traffic
into account, and is therefore more adaptive to diverse driving habits. The driver can
choose the appropriate LKAS working mode and warning boundary based on personal
driving habits and experience with LKAS.
6.5 Perception: Lane Detection
A lane keep assist system has two components: perception (lane detection) and
path/motion planning (steering). Lane detection's job is to turn a video of the road into
the coordinates of the detected lane lines. One way to achieve this is with the OpenCV
computer vision package. But before we can detect lane lines in a video, we must be
able to detect lane lines in a single image. Once we can do that, detecting lane lines in a
video is simply repeating the same steps for all frames. There are several steps.
1- Isolate the Color of the Lane:
When I set up lane lines for my DeepPiCar in my living room, I used blue painter's
tape to mark the lanes, because blue is a unique color in my room, and the tape won't
leave permanent sticky residue on the hardwood floor.
The first thing to do is to isolate all the blue areas in the image. To do this, we first
convert the image's color space from RGB (Red/Green/Blue) to HSV
(Hue/Saturation/Value). The main idea is that in an RGB image, different parts of the
blue tape may be lit differently, making them appear darker or lighter blue. In the HSV
color space, however, the Hue component renders the entire blue tape as one color
regardless of its shading. This is best illustrated with the following image; notice that
both lane lines are now roughly the same magenta color.
Figure 6.2 Image in HSV color space.
Below is the OpenCV command to do this.
Figure 6.3 OpenCV command.
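In essence, the call shown in Figure 6.3 is the following (the frame name here is illustrative; in practice the frame comes from the video stream):

import cv2

frame = cv2.imread("lane_frame.jpg")            # or a frame from the video stream
hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)    # OpenCV loads images as BGR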
Note that we used a BGR-to-HSV transformation, not RGB-to-HSV. This is because
OpenCV, for legacy reasons, reads images into the BGR (Blue/Green/Red) color space
by default, instead of the more commonly used RGB (Red/Green/Blue) color space.
They are essentially equivalent color spaces, just with the order of the colors swapped.
Once the image is in HSV, we can "lift" all the blueish colors from the image by
specifying a range for the color blue. In the Hue color space, blue lies roughly in the
120–300 degree range on a 0–360 degree scale. You can specify a tighter range for
blue, say 180–300 degrees, but it doesn't matter too much.
Figure 6.4 Hue on a 0–360 degree scale.
Here is the code to lift blue out via OpenCV, along with the rendered mask image.
Figure 6.5 Code to lift blue out via OpenCV, and the rendered mask image.
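The heart of the code in Figure 6.5 is a single cv2.inRange call; a sketch, using the hue bounds discussed in the note below:

import numpy as np

# blue on OpenCV's 0-180 hue scale is roughly 60-150 (see the note below);
# saturation/value bounds of 40-255 worked reasonably well in practice
lower_blue = np.array([60, 40, 40])
upper_blue = np.array([150, 255, 255])
mask = cv2.inRange(hsv, lower_blue, upper_blue)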
Blue area mask
Note
OpenCV uses a Hue range of 0–180 instead of 0–360, so the blue range we need to
specify in OpenCV is 60–150 (instead of 120–300). These are the first elements of the
lower- and upper-bound arrays. The second (Saturation) and third (Value) parameters
are not as important; I have found that a range of 40–255 works reasonably well for
both Saturation and Value.
Note
This technique is exactly what movie studios and weather forecasters use every day.
They usually use a green screen as a backdrop, so that they can swap the green color
with a thrilling video of a T-Rex charging towards us (for a movie), or a live doppler
radar map (for the weather forecast).
2- Detecting Edges of Lane Lines:
Next, we need to detect edges in the blue mask so that we have a few distinct lines that
represent the blue lane lines.
The Canny edge detection function is a powerful command that detects edges in an
image. In the code below, the first parameter is the blue mask from the previous step.
The second and third parameters are the lower and upper thresholds for edge
detection, which OpenCV recommends to be (100, 200) or (200, 400); we are using (200, 400).
Figure 6.6 Canny edge detection call.
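In essence, that call is:

edges = cv2.Canny(mask, 200, 400)   # lower/upper thresholds as recommended above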
Figure 6.7 Edges of all blue areas.
3- Isolate the Region of Interest:
From the image above, we see that we detected quite a few blue areas that are NOT
our lane lines. A closer look reveals that they are all in the top half of the screen.
Indeed, when doing lane navigation, we only care about detecting lane lines that are
closer to the car, i.e. at the bottom of the screen. So we simply crop out the
top half. Boom! Two clearly marked lane lines, as seen in the image on the right!
Figure 6.8 Cropped edges.
Here is the code to do this. We first create a mask for the bottom half of the screen,
then merge the mask with the edges image to get the cropped-edges image on the right.
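A sketch of that step (close in spirit to the code in the screenshot, though not copied from it):

import cv2
import numpy as np

def region_of_interest(edges):
    height, width = edges.shape
    mask = np.zeros_like(edges)
    # keep only the bottom half of the image, where the nearby lane lines are
    polygon = np.array([[
        (0, height // 2),
        (width, height // 2),
        (width, height),
        (0, height),
    ]], np.int32)
    cv2.fillPoly(mask, polygon, 255)
    return cv2.bitwise_and(edges, mask)

cropped_edges = region_of_interest(edges)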
4- Detect Line Segments:
In the cropped-edges image above, to us humans it is pretty obvious that we found
four lines, which represent two lane lines. However, to a computer, they are just a
bunch of white pixels on a black background. Somehow we need to extract the
coordinates of these lane lines from those white pixels. Luckily, OpenCV contains a
magical function, called the Hough Transform, which does exactly this. The Hough Transform is
a technique used in image processing to extract features like lines, circles, and ellipses.
We will use it to find straight lines from a bunch of pixels that seem to form a line. The
HoughLinesP function essentially tries to fit many lines through all the white pixels and
returns the most likely set of lines, subject to certain minimum threshold constraints.
Here is the code to detect line segments. Internally, HoughLinesP detects lines using
polar coordinates. Polar coordinates (elevation angle and distance from the origin) are
superior to Cartesian coordinates (slope and intercept), as they can represent any line,
including vertical lines, which Cartesian coordinates cannot because the slope of a
vertical line is infinity. HoughLinesP takes a number of parameters:
vertical line is infinity. Hough Line takes a lot of parameters:
1) rho is the distance precision in pixel. We will use one pixel.
2) angle is angular precision in radian. (Quick refresher on Trigonometry:
radian is another way to express the degree of angle. i.e. 180 degrees in radian is
3.14159, which is π) We will use one degree.
3) Min threshold is the number of votes needed to be considered a line
164 | P a g e
segment. If a line has more votes, Hough Transform considers them tobe
more likely to have detected a line segment.
4) Min LineLength is the minimum length ofthe line segment in pixels. Hough
Transformwon’t return any line segments shorter than this minimum length.
5)maxLineGap is the maximum in pixels that two-line segments that canbe
separated and still be considered a single line segment.
For example, if we had dashed lane markers, then by specifying a reasonable max line gap,
the Hough Transform would consider the entire dashed lane line as one straight line, which is
desirable. Setting these parameters is really a trial-and-error process. Below are the
values that worked well for my robotic car with a 320x240-resolution camera running
between solid blue lane lines. Of course, they would need to be re-tuned for a life-sized car
with a high-resolution camera running on a real road with white/yellow dashed lane
lines.
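A sketch of the call (the threshold, minLineLength and maxLineGap values below are illustrative of what worked for the small car, not prescriptive):

import numpy as np

rho = 1                  # distance precision: 1 pixel
angle = np.pi / 180      # angular precision: 1 degree
min_threshold = 10       # minimum number of votes
line_segments = cv2.HoughLinesP(cropped_edges, rho, angle, min_threshold,
                                np.array([]), minLineLength=8, maxLineGap=4)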
Figure 6.9 Line segments detected by the Hough Transform.
Combine Line Segments into Two Lane Lines:
Now that we have many small line segments with their endpoint coordinates (x1, y1)
and (x2, y2), how do we combine them into just the two lines that we really care
about, namely the left and right lane lines? One way is to classify the line segments
by their slopes. We can see from the picture above that all line segments belonging to
the left lane line should be upward sloping and on the left side of the screen, whereas
all line segments belonging to the right lane line should be downward sloping and on
the right side of the screen. Once the line segments are classified into two groups,
we just take the average of the slopes and intercepts of the line segments in each
group to get the slope and intercept of the left and right lane lines.
The average_slope_intercept function below implements the above logic.
make_points is a helper function for average_slope_intercept; it takes a line's slope and
intercept and returns the endpoints of the corresponding line segment.
Other than the logic described above, there are a couple of special cases worth
discussing.
1. One lane line in the image:
In normal scenarios, we would expect the camera to see both lane lines.
However, there are times when the car starts to wander out of the lane,
maybe due to flawed steering logic, or when the lane bends too sharply. At
such times, the camera may capture only one lane line. That's why the code
above needs to check len(right_fit) > 0 and len(left_fit) > 0.
2. Vertical line segments:
Vertical line segments are detected occasionally as the car is turning. Although
they are not erroneous detections, vertical lines have a slope of infinity, so
we can't average them with the slopes of the other line segments. For
simplicity's sake, I chose to just ignore them. As vertical lines are not very
common, doing so does not affect the overall performance of the lane
detection algorithm. Alternatively, one could flip the X and Y coordinates of the
image, so that vertical lines have a slope of zero and could be included in the
average; horizontal line segments would then have a slope of infinity, but that
would be extremely rare, since the dashcam generally points in the same
direction as the lane lines, not perpendicular to them. Another alternative is to
represent the line segments in polar coordinates and then average the angles
and distances to the origin.
6.6 Motion Planning: Steering
Now that we have the coordinates of the lane lines, we need to steer the car so
that it stays within the lane lines; even better, we should try to keep it in the
middle of the lane. Basically, we need to compute the steering angle of the car,
given the detected lane lines.
Two Detected Lane Lines:
This is the easy scenario: we can compute the heading direction by simply
averaging the far endpoints of both lane lines. The red line shown below is the
heading. Note that the lower end of the red heading line is always in the middle of
the bottom of the screen; that's because we assume the dashcam is installed in the
middle of the car and points straight ahead.
One Detected Lane Line:
If we detect only one lane line, this is a bit trickier, as we can no longer average
two endpoints. But observe that when we see only the left (or right) lane line, it
means the car needs to steer hard towards the right (or left) to continue following
the lane. One solution is to set the heading line to have the same slope as the only
detected lane line, as shown below.
Steering Angle:
Now that we know where we are headed, we need to convert that into a steering
angle so that we can tell the car to turn. Remember that for this PiCar a steering angle
of 90 degrees means heading straight, 45–89 degrees means turning left, and 91–135
degrees means turning right. Below is some trigonometry to convert a heading coordinate
into a steering angle in degrees. Note that PiCar was created for the common man, so it
uses degrees and not radians; but all the trig math is done in radians.
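A sketch of that trigonometry (same convention as above: 90 = straight, less than 90 = left, more than 90 = right):

import math

def compute_steering_angle(frame, lane_lines):
    # convert the detected lane lines into a steering angle in degrees
    if len(lane_lines) == 0:
        return 90                           # nothing detected: keep going straight

    height, width, _ = frame.shape
    if len(lane_lines) == 1:
        # only one lane line: follow its slope
        x1, _, x2, _ = lane_lines[0][0]
        x_offset = x2 - x1
    else:
        # two lane lines: head towards the midpoint of their far endpoints
        _, _, left_x2, _ = lane_lines[0][0]
        _, _, right_x2, _ = lane_lines[1][0]
        x_offset = (left_x2 + right_x2) / 2 - width / 2
    y_offset = int(height / 2)

    angle_to_mid_radian = math.atan(x_offset / y_offset)    # trig is done in radians
    angle_to_mid_deg = int(angle_to_mid_radian * 180.0 / math.pi)
    return angle_to_mid_deg + 90                             # 90 = straight ahead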
Displaying the Heading Line:
We have shown several pictures above with the heading line. Here is the code that
renders it; the input is the steering angle.
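A sketch of that rendering code (colour and line width are illustrative defaults):

import math
import cv2
import numpy as np

def display_heading_line(frame, steering_angle, color=(0, 0, 255), line_width=5):
    # draw the heading line from the bottom-centre of the frame,
    # in the direction implied by the steering angle
    heading_image = np.zeros_like(frame)
    height, width, _ = frame.shape

    steering_angle_radian = steering_angle / 180.0 * math.pi
    x1, y1 = int(width / 2), height
    x2 = int(x1 - height / 2 / math.tan(steering_angle_radian))
    y2 = int(height / 2)

    cv2.line(heading_image, (x1, y1), (x2, y2), color, line_width)
    return cv2.addWeighted(frame, 0.8, heading_image, 1, 1)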
Stabilization:
Initially, when I computed the steering angle from each video frame, I simply told the
PiCar to steer at that angle. However, during actual road testing, I found that the
PiCar sometimes bounced left and right between the lane lines like a drunk driver, and
sometimes went completely out of the lane. I then found that this was caused by the
steering angles computed from consecutive video frames not being very stable. (Run
your car in the lane without the stabilization logic to see what I mean.) Sometimes the
steering angle may be around 90 degrees (heading straight) for a while, but, for
whatever reason, the computed steering angle could suddenly jump wildly, to say
120 degrees (sharp right) or 70 degrees (sharp left). As a result, the car would jerk left
and right within the lane. Clearly, this is not desirable.
We need to stabilize the steering. Indeed, in real life we have a steering wheel: if we
want to steer right, we turn the steering wheel in a smooth motion, and the steering
angle is sent as a continuous sequence of values to the car, namely 90, 91, 92, ..., 132,
133, 134, 135 degrees, not 90 degrees in one millisecond and 135 degrees in the next.
So my strategy for a stable steering angle is the following: if the new angle is more
than max_angle_deviation degrees away from the current angle, just steer up to
max_angle_deviation degrees in the direction of the new angle.
We used two flavors of max_angle_deviation: 5 degrees if both lane lines are
detected, which means we are more confident that our heading is correct, and 1 degree
if only one lane line is detected, which means we are less confident. These are
parameters one can tune for one's own car.
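A sketch of that rule:

def stabilize_steering_angle(curr_angle, new_angle, num_lane_lines):
    # limit how far the steering angle may move between two frames:
    # 5 degrees when both lane lines are visible, 1 degree with only one
    max_deviation = 5 if num_lane_lines == 2 else 1
    deviation = new_angle - curr_angle
    if abs(deviation) > max_deviation:
        # move only max_deviation degrees in the direction of the new angle
        return int(curr_angle + max_deviation * deviation / abs(deviation))
    return new_angle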
6.7 Lane Keeping via Deep Learning
So far we hand-engineered all the steps required to navigate the car, i.e. color isolation,
edge detection, line segment detection, steering angle computation, and steering
stabilization. Moreover, there were quite a few parameters to hand-tune, such as the
upper and lower bounds of the color blue, the Hough Transform parameters for
detecting line segments, and the maximum steering deviation during stabilization. If we
didn't tune all these parameters correctly, the car wouldn't run smoothly. Moreover,
every time we had new road conditions we would have to think of new detection
algorithms and program them into the car, which is very time-consuming and hard to
maintain. In the era of AI and machine learning, a more scalable approach is to let a
neural network learn to steer directly from the camera images.
The Nvidia Model:
At a high level, the inputs to the Nvidia model are video images from dashcams
onboard the car, and the output is the steering angle of the car. The model takes the
video images, extracts information from them, and tries to predict the car's steering
angle. This is known as supervised machine learning, where video images (called
features) and steering angles (called labels) are used in training. Because the steering
angles are numerical values, this is a regression problem, instead of a classification
problem where the model would need to predict whether the image contains a dog or
a cat, or which type of flower is in the image.
At the core of the Nvidia model there is a Convolutional Neural Network (CNN,
not the cable network). CNNs are used prevalently in image recognition deep
learning models. The intuition is that a CNN is especially good at extracting visual
features from images through its various layers (a.k.a. filters). For example, in a facial
recognition CNN model, the earlier layers would extract basic features such as lines
and edges, the middle layers would extract more advanced features such as eyes,
noses, ears and lips, and the later layers would extract parts of, or entire, faces.
Figure 6.9 CNN architecture. The network has about 27 million connections and
250 thousand parameters.
The above diagram is from Nvidia's paper. It contains about 30 layers in total, which is
not a very deep model by today's standards. The input image to the model (bottom of
the diagram) is a 66x200-pixel image, which is a pretty low resolution. The image is
first normalized, then passed through 5 groups of convolutional layers, and finally
through 4 fully connected layers, arriving at a single output: the model's predicted
steering angle for the car.
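A sketch of this architecture in Keras, close to what the Colab notebook in section 6.8 builds (the exact filter counts and ELU activations follow the Nvidia paper; the optimizer settings are typical defaults, not the notebook's exact values):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Dense, Dropout, Flatten
from tensorflow.keras.optimizers import Adam

def nvidia_model():
    model = Sequential([
        # 5 convolutional "groups" on a 66x200 RGB input (normalized beforehand)
        Conv2D(24, (5, 5), strides=(2, 2), activation='elu', input_shape=(66, 200, 3)),
        Conv2D(36, (5, 5), strides=(2, 2), activation='elu'),
        Conv2D(48, (5, 5), strides=(2, 2), activation='elu'),
        Conv2D(64, (3, 3), activation='elu'),
        Conv2D(64, (3, 3), activation='elu'),
        Dropout(0.2),
        Flatten(),
        # fully connected layers down to a single steering-angle output
        Dense(100, activation='elu'),
        Dense(50, activation='elu'),
        Dense(10, activation='elu'),
        Dense(1),                   # regression output: steering angle
    ])
    model.compile(optimizer=Adam(learning_rate=1e-3), loss='mse')
    return model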
Figure 6.10 Training method.
This predicted angle is then compared with the desired steering angle for the given
video image, and the error is fed back into the CNN training process via backpropagation.
As seen in the diagram above, this process is repeated in a loop until the error (a.k.a.
the loss, here the Mean Squared Error) is low enough, meaning the model has learned
how to steer reasonably well. Indeed, this is a pretty typical image recognition training
process, except that the predicted output is a numerical value (regression) instead of
the type of an object (classification).
Adapting the Nvidia Model for DeepPiCar:
Other than in size, our DeepPiCar is very similar to the car Nvidia used, in that it
has a dashcam and it can be controlled by specifying a steering angle. Nvidia
collected its inputs by having its drivers drive a combined 70 hours of highway
miles, in various states and multiple cars. So we need to collect some video footage
from our DeepPiCar and record the correct steering angle for each video frame.
Data Acquisition:
We wrote a remote-control program so that we could remotely steer the PiCar and
have it save each video frame along with the car's steering angle at that frame. This is
probably the best approach, since it simulates a real person's driving behavior.
Figure 6.11 Steering angle distribution from data acquisition.
Here is the code to take a video file and save the individual video frames for
training. For simplicity, the steering angle is embedded in the image file name, so there
is no need to maintain a mapping file between image names and steering angles.
Training/Deep Learning:
Now that we have the features (video images) and labels (steering angles), it is time
to do some deep learning! Even though deep learning is all the hype these days, it is
important to note that it is just a small part of the whole engineering project. Most of
the time and work is actually spent on hardware engineering, software engineering,
data gathering and cleaning, and finally wiring up the predictions of the deep learning
models to production systems (like a running car).
To train the deep learning model, we can't use the Raspberry Pi's CPU; we need some
GPU muscle. Yet we are on a shoestring budget, so we don't want to pay for an
expensive machine with the latest GPU, or rent GPU time from the cloud. Luckily,
Google offers GPU and even TPU power for free on a site called Google Colab. Kudos
to Google for giving machine learning enthusiasts a great playground to learn in.
Split into Train/Test Sets
We split the training data into training and validation sets with an 80/20 split using
sklearn's train_test_split method.
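In essence (image_paths and steering_angles are the lists built when loading the data; the random seed is illustrative):

from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(
    image_paths, steering_angles, test_size=0.2, random_state=42)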
Image Augmentation:
The sample training data set only has about 200 images. Clearly that's not enough
to train a deep learning model. However, we can employ a simple technique called
image augmentation. Some of the common augmentation operations are zooming,
panning, changing exposure values, blurring, and image flipping. By randomly applying
any or all of these five operations to the original images, we can generate a lot more
training data from our original 200 images, which makes the final trained model much
more robust.
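A sketch of a few of these operations (the probability, brightness range and kernel sizes are illustrative; note that flipping must also mirror the steering angle around 90 degrees):

import random
import cv2
import numpy as np

def random_flip(image, steering_angle):
    # horizontally flip the image half of the time and mirror the angle
    if random.random() < 0.5:
        image = cv2.flip(image, 1)
        steering_angle = 180 - steering_angle
    return image, steering_angle

def random_brightness(image):
    # scale the V channel in HSV space to simulate different lighting
    hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[:, :, 2] *= random.uniform(0.5, 1.5)
    hsv = np.clip(hsv, 0, 255).astype(np.uint8)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)

def random_blur(image):
    kernel = random.choice([1, 3, 5])        # small odd kernel sizes
    return cv2.blur(image, (kernel, kernel))

def augment(image, steering_angle):
    image = random_brightness(image)
    image = random_blur(image)
    return random_flip(image, steering_angle)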
6.8 Google Colab for Training:
1- Mount Google Drive in Colab to import our training and testing data:
Figure 6.12 Mount Drive in Colab.
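In essence, the cell in Figure 6.12 runs:

from google.colab import drive
drive.mount('/content/drive')   # Colab asks for an authorization code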
2- Import the packages we need:
Figure 6.13 Python packages for training.
3- Load the data from Drive:
Figure 6.14 Loading our data from Drive.
4- Training and testing data distribution:
Figure 6.15 Train and test data.
5- Prepare the Nvidia model:
Figure 6.16 Load model.
Figure 6.17 Model summary.
6- Evaluate the Trained Model:
After training for about 30 minutes, the model finishes its 10 epochs. Now it is time to see how well
the training went. The first thing to do is to plot the loss function of both the training and validation
sets. It is good to see that both training and validation losses declined rapidly together and then
stayed very low after epoch 5. There didn't seem to be any overfitting issue, as the validation loss
stayed low along with the training loss.
Figure 6.18 Graph of training and validation loss.
Figure 6.19 Results of our model on our data.
System Integration
Chapter 7
7.1 Introduction:
Now we need to connect our subsystems together. First, we make a connection
between the Raspberry Pi and the laptop to send the images taken by the Pi camera
to the laptop, which processes them and makes a decision. Second, we make a
connection between the Raspberry Pi and the Arduino to send the decision to the
Arduino, which carries it out with the other components.
Figure 7-1 Diagram of the connections (laptop, Raspberry Pi, Arduino).
7.2 Connection between the laptop and the Raspberry Pi:
We need a stable, high-speed wireless connection, so we chose TCP. TCP has several
advantages we need: it provides extensive error-checking mechanisms, flow control
and acknowledgment of data, and sequencing of data, which means that packets
arrive in order at the receiver. We cannot use UDP because we need every packet;
TCP's acknowledgments ensure that each packet sent to the laptop is received, so the
laptop can reassemble the whole image we sent.
7.2.1 TCP connection:
7.2.1.1 Connection establishment
To establish a connection, TCP uses a three-way handshake. Before a client
attempts to connect with a server, the server must first bind to and listen at a
port to open it up for connections: this is called a passive open. Once the
passive open is established, a client may initiate an active open.
Figure 7.2 Server.
Figure 7.3 Client.
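The exact code is in Figures 7.2 and 7.3; a minimal sketch of the two endpoints, assuming the laptop (server) listens on port 8000 and the Pi (client) connects to it (IP and port are placeholders):

import socket

# --- server side (laptop) ---
server_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server_socket.bind(('0.0.0.0', 8000))               # passive open: bind and listen
server_socket.listen(1)
connection, client_addr = server_socket.accept()    # wait for the Pi to connect

# --- client side (Raspberry Pi) ---
client_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client_socket.connect(('192.168.1.10', 8000))       # active open towards the laptop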
Before that, we record the stream from the Raspberry Pi into the connection file
(client side); the laptop (server side) reads the recording, splits it into frames, and
processes it frame by frame.
Figure 7.4 Recording the stream to the connection file.
Figure 7.5 Reading from the connection file and splitting frames on the server side.
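One common pattern for this kind of frame-by-frame streaming (shown only as an illustration; the project's exact code is in Figures 7.4 and 7.5) is for the Pi to send each JPEG-encoded frame prefixed with its length, and for the server to read that length, then the frame, and decode it:

import struct
import cv2
import numpy as np

stream = connection.makefile('rb')          # `connection` from the server sketch above
while True:
    header = stream.read(4)
    if len(header) < 4:
        break                               # client closed the connection
    frame_len = struct.unpack('<L', header)[0]
    data = stream.read(frame_len)
    frame = cv2.imdecode(np.frombuffer(data, dtype=np.uint8), cv2.IMREAD_COLOR)
    # ... run the lane-detection / object-detection models on `frame` here ...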
We then send every frame to the machine learning models, obtain the decision, and
send it back to the Raspberry Pi; finally, we terminate the connection.
7.2.1.2 Connection termination
The connection termination phase uses a four-way handshake, with each side
of the connection terminating independently. When an endpoint wishes to
stop its half of the connection, it transmits a FIN packet, which the other end
acknowledges with an ACK. Therefore, a typical tear-down requires a pair of
FIN and ACK segments from each TCP endpoint. After both FIN/ACK
exchanges are concluded, the side which sent the first FIN before receiving
one waits for a timeout before finally closing the connection, during which
time the local port is unavailable for new connections; this prevents
confusion due to delayed packets being delivered during subsequent
connections.
A connection can be “half-open”, in which case one side has terminated its
end, but the other has not. The side that has terminated can no longer send
any data into the connection, but the other side can. The terminating side
should continue reading the data until the other side terminates as well. It is
also possible to terminate the connection by a 3-way handshake, when host
A sends a FIN and host B replies with a FIN & ACK (merely combines 2 steps
into one) and host A replies with an ACK. This is perhaps the most common
method.
It is possible for both hosts to send FINs simultaneously then both just have
to ACK. This could possibly be considered a 2-way handshake since the
FIN/ACK sequence is done in parallel for both directions.
Some host TCP stacks may implement a half-duplex close sequence, as Linux
or HP-UX do. If such a host actively closes a connection but still has not read
all the incoming data the stack has already received from the link, it sends
a RST instead of a FIN (Section 4.2.2.13 in RFC 1122). This allows a TCP
application to be sure the remote application has read all the data the former
sent, waiting for the FIN from the remote side when it actively closes the
connection. However, the remote TCP stack cannot distinguish between a
Connection Aborting RST and this Data Loss RST; both cause the remote stack
to throw away all the data it has received but that the application has not yet
read.
Figure 7.6 Connection termination on both sides.
Figure 7.7 Operation of TCP.
7.3 Connection between the Raspberry Pi and the Arduino:
Here we need a simple connection that uses as few wires as possible, so
we use the I²C (I-squared-C) protocol. This protocol needs just two
wires to make a connection and has hardware acknowledgment to
ensure the data is received.
7.3.1 I²C (I-squared-C):
The Inter-Integrated Circuit bus (a.k.a. I²C, pronounced I-squared-C or, rarely,
I-two-C) is a hardware specification and protocol developed by the
semiconductor division of Philips (now NXP Semiconductors) back in 1982. It
is a multi-slave, half-duplex, single-ended, 8-bit-oriented serial bus
specification, which uses only two wires to interconnect a given number of
slave devices to a master.
Until October 2006, the development of I²C-based devices was subject to the
payment of royalty fees to Philips, but this restriction has since been lifted.
Figure 7.8 Graphical representation of the I²C bus.
In the I²C protocol, all transactions are always initiated and completed by
the master. This is one of the few rules of this communication protocol to
keep in mind while programming (and, especially, debugging) I²C devices.
All messages exchanged over the I²C bus are broken up into two types of
frame: an address frame, where the master indicates to which slave the
message is being sent, and one or more data frames, which are 8-bit data
messages passed from master to slave or vice versa. Data is placed on the
SDA line after SCL goes low, and it is sampled after the SCL line goes high.
The time between clock edges and data read/write is defined by the devices on
the bus and varies from chip to chip. As said before, both SDA and SCL are
bidirectional lines, connected to a positive supply voltage via a current
source or pull-up resistors (see Figure 7.8). When the bus is free, both lines
are HIGH. The output stages of devices connected to the bus must have an
open-drain or open-collector to perform the wired-AND function. The bus
capacitance limits the number of interfaces connected to the bus. For a
single-master application, the master's SCL output can be a push-pull driver
design if there are no devices on the bus that would stretch the clock (more
about this later). We are now going to analyze the fundamental steps of an
I²C communication.
Figure 7.9 Structure of a basic I²C message.
Start and stop condition:
All transactions begin with a START and are terminated by a STOP (see
Figure 7.9). A HIGH to LOW transition on the SDA line while SCL is HIGH
defines a START condition. A LOW to HIGH transition on the SDA line
while SCL is HIGH defines a STOP condition.
START and STOP conditions are always generated by the master. The bus is
considered to be busy after the START condition. The bus is considered to
be free again a certain time after the STOP condition. The bus stays busy if a
repeated START (also called RESTART condition) is generated instead of a
STOP condition (more about this soon). In this case, the START and RESTART
conditions are functionally identical.
Byte format:
Every word transmitted on the SDA line must be eight bits long, and this also
includes the address frame as we will see in a while. The number of bytes
that can be transmitted per transfer is unrestricted. Each byte must be
followed by an Acknowledge (ACK) bit. Data is transferred with the Most
Significant Bit (MSB) first (see Figure 7.9). If a slave cannot receive or transmit
another complete byte of data until it has performed some other function,
for example servicing an internal interrupt, it can hold the clock line SCL LOW
to force the master into a wait state. Data transfer then continues when the
slave is ready for another byte of data and releases clock line SCL.
Address frame:
The address frame is always first in any new communication sequence.
For a 7-bit address, the address is clocked out most significant bit (MSB)
first, followed by a R/W bit indicating whether this is a read (1) or write
(0) operation
In a 10-bit addressing system, two frames are required to transmit the
slave address.
The first frame will consist of the code 1111 0XXD, where XX are the two MSBs
of the 10-bit slave address and D is the R/W bit as described above. The
first frame's ACK bit will be asserted by all slaves matching the first two bits of
the address. As with a normal 7-bit transfer, another transfer begins
immediately, and this transfer contains bits [7:0] of the address. At this point,
the addressed slave should respond with an ACK bit. If it doesn't, the failure
mode is the same as in a 7-bit system.
Note that 10-bit address devices can coexist with 7-bit address devices, since
the leading 11110 part of the address is not a part of any valid 7-bit
addresses.
Acknowledge (ACK) and Not Acknowledge (NACK):
The ACK takes place after every byte. The ACK bit allows the receiver to
signal the transmitter that the byte was successfully received and
another byte may be sent. The master generates all clock pulses on the
SCL line, including the ninth (ACK) clock pulse.
The ACK signal is defined as follows: the transmitter releases the SDA line
during the acknowledge clock pulse so that the receiver can pull the SDA
line LOW and it remains stable LOW during the HIGH period of this clock
pulse. When SDA remains HIGH during this ninth clock pulse, this is
defined as the Not Acknowledge (NACK) signal. The master can then
generate either a STOP condition to abort the transfer, or a RESTART
condition to start a new transfer. There are five conditions leading to the
generation of a NACK:
1. No receiver is present on the bus with the transmitted address, so there
is no device to respond with an acknowledge.
2. The receiver is unable to receive or transmit because it is performing
some real-time function and is not ready to start communication with
the master.
3. During the transfer, the receiver gets data or commands that it does
not understand.
4. During the transfer, the receiver cannot receive any more data bytes.
5. A master-receiver must signal the end of the transfer to the
slave transmitter.
Data Frames:
After the address frame has been sent, data can begin being transmitted.
The master will simply continue generating clock pulses on SCL at a
regular interval, and the data will be placed on SDA by either the master
or the slave, depending on whether the R/W bit indicated a read or write
operation.
Usually, the first or the first two bytes contains the address of the slave
register to write to/read from.
For example, for I²C EEPROMs the first two bytes following the address
frame represent the address of the memory location involved in the
transaction.
Depending on the R/W bit, the successive bytes are filled by the slave (if
the R/W bit is set to 1, a read) or by the master (if the R/W bit is 0, a write).
The number of data frames is arbitrary, and most slave devices will
auto-increment the internal register address, meaning that subsequent reads
or writes will come from the next register in line. This mode is also called
sequential or burst mode, and it is a way to speed up transfers.
Implementation in code:
As mentioned above in this chapter, we get the decision from the laptop and
send it to the Raspberry Pi; then we need to send the decision on to the
Arduino using the I²C protocol.
Figure 7-10 I²C code on the Raspberry Pi.
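The exact code is in Figure 7-10; a minimal sketch of the Pi side, assuming the Arduino is configured as an I²C slave at address 0x08 (the address and the command byte are placeholders):

from smbus2 import SMBus        # or the older `smbus` module

ARDUINO_ADDR = 0x08             # assumed slave address of the Arduino

with SMBus(1) as bus:           # I2C bus 1 on the Raspberry Pi
    bus.write_byte(ARDUINO_ADDR, 0x01)   # e.g. 0x01 = "go forward" (illustrative)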
Figure 7.11 I²C code on the Arduino.
Software for connections
Chapter 8
8.1 Introduction:
Self-driving is one of the most trending technologies and is already deployed in
Tesla cars; with this project you can learn about the technology by building a small
version of such a system. For this we use OpenCV and machine learning. The system
uses the Raspberry Pi as its core and provides functionality such as traffic light
detection, vehicle detection, pedestrian detection, and road sign detection to make
the car autonomous. Every process runs on the Raspberry Pi with Python
programming.
BLOCK DIAGRAM
Figure 8.1 Block diagram of system
8.2 Programs setup and analysis:
8.2.1 Raspberry-pi:
The Raspberry Pi is a low cost, credit-card sized computer that plugs into a
computer monitor or TV, and uses a standard keyboard and mouse. It is a
capable little device that enables people of all ages to explore computing, and
to learn how to program in languages like Scratch and Python. It is capable of
doing everything you would expect a desktop computer to do, from browsing
the internet and playing high-definition video, to making spreadsheets, word-
processing, and playing games.
Figure 8.2 (Raspberry pi).
8.2.2 component of raspberry pi:
• 4 USB ports.
• 40 GPIO pins.
• Full HDMI port.
• Ethernet port (10/100 base Ethernet socket).
• Combined 3.5mm audio jack and composite video.
• Camera interface (CSI).
• Display interface (DSI).
• Micro-SD card slot.
• Video Core IV 3D graphics core.
8.2.3 Hardware interfaces:
These include a UART, an I²C bus, an SPI bus with two chip selects, I²S audio,
3V3, 5V, and ground. The number of GPIOs can theoretically be expanded
indefinitely by making use of the I²C bus.
8.2.4 Required software:
The Raspberry Pi runs off an operating system based on Linux called Raspbian.
For this project, the most recent Raspbian Jessie Lite image was installed. The
Raspberry Pi website has installation instructions but basically you just
download the image and use a program to copy that image onto your SD card.
Once the operating system image is copied onto the SD card, you should be able
to insert the SD card into the Raspberry Pi, power it on, and get to the initial
desktop.
8.2.4.1 Raspberry Pi installation:
We can remotely access the Raspberry Pi in more than one way: we can give it a
fixed IP address and access it through SSH via the PuTTY software, or via VNC
(Virtual Network Computing) software. We can give the Raspberry Pi a static IP by
editing the cmdline file in the main directory of the SD card, and then give the
controlling machine an IP within the same subnet to complete the process.
Figure 8.3 Use PuTTY or VNC to access the Pi and control it remotely from its terminal.
Figure 8.4 VNC viewer
Figure 8.5 putty viewer
8.2.4.2 Arduino:
Arduino refers to an open-source electronics platform or board and the
software used to program it. Arduino is designed to make electronics more
accessible to artists, designers, hobbyists and anyone interested in creating
interactive objects or environments.
An Arduino is a microcontroller board. A microcontroller is a simple
computer that can run one program at a time, over and over again, and it is very
easy to use. Arduino boards come with lots of different I/O and other
interface configurations. The Arduino UNO runs comfortably on just a few
milliamps. The Arduino can be programmed in C and helps in projects that
interact directly with sensors and motor drivers; the board is shown in the
following figure.
Figure 8.6 Arduino board
8.3 Network:
This part of the project is responsible for connecting the different parts of the
project together, i.e. connecting the main server on the PC with the mobile
application, and also connecting the server with the Raspberry Pi that controls
the robot. This task is accomplished using WLAN technology via an access
point, which offers Wi-Fi coverage, together with Wi-Fi adapters in the connected
devices. In short, this part comes down to two devices: the access point and the
Wi-Fi adapter (dongle) in the Raspberry Pi.
8.4 Connection between the Raspberry Pi and the laptop:
8.4.1 Access Point:
We used a TP-Link access point, which provides 50 Mbps as a maximum data
rate over a range of about 50 m. The configuration is simple: we just activate the
DHCP server on the access point and determine the range of IPs we need. All the
devices connected to the access point are then on the same wireless LAN and can
communicate with each other. Note that we have to give the main devices in the
project (such as the server and the Raspberry Pi) fixed IPs to facilitate
communication between them.
Figure 8.7 access point
8.4.2 Socket server:
A socket is one of the most fundamental technologies of computer networking.
Sockets allow applications to communicate using standard mechanisms built
into network hardware and operating systems. Many of today's most popular
software packages including Web browsers, instant messaging applications and
peer to peer file sharing systems rely on sockets. A socket represents a single
connection between exactly two pieces of software. More than two pieces of
software can communicate in client/server or distributed systems (for example,
many Web browsers can simultaneously communicate with a single Web
server) but multiple sockets are required to do this. Socket based Control
System software usually runs on two separate computers on the network, but
sockets can also be used to communicate locally (inter-process) on a single
computer.
Sockets are bidirectional, meaning that either side of the connection is capable
of both sending and receiving data. Sometimes the one application that initiates
communication is termed the client and the other application the server.
Programmers access sockets using code libraries packaged with the operating
system. The stream socket, the most commonly used type of socket, requires
that the two communicating parties first establish a socket connection,
after which any data passed through that connection is guaranteed to
arrive in the same order in which it was sent.
8.4.3 Configuring the Raspberry Pi on the SD card:
Now we're ready to configure the SD card so that, on boot, the Raspberry Pi will
connect to a Wi-Fi network. Once the Raspberry Pi is connected to a network, we
can then access its terminal via SSH.
When we insert the SD card into the laptop, we see a /boot folder show up. We
then create a file named wpa_supplicant.conf in the /boot folder.
Information like accepted networks and pre-configured network keys (such as a
Wi-Fi password) can be stored in the wpa_supplicant.conf text file. The file also
configures wpa_supplicant, the software responsible for making login requests on
the wireless network. So creating the wpa_supplicant.conf file configures how the
Raspberry Pi connects to the internet.
The contents of the wpa_supplicant.conf file should look something like this:
ctrl_interface=DIR=/var/run/wpa_supplicant GROUP=netdev
update_config=1
country=US

network={
    ssid="YOURSSID"
    psk="YOURPASSWORD"
    scan_ssid=1
}
The first line means "give the group 'netdev' permission to configure network
interfaces", i.e. any user who is part of the netdev group will be able to update the
network configuration options. The ssid should be the name of your Wi-Fi
network, and the psk should be the Wi-Fi password.
After creating and updating the wpa_supplicant.conf file, add an empty file named
ssh in /boot. This file should not have any file extension. When the Raspberry Pi
boots up, it will look for the ssh file; if it finds one, SSH will be enabled. Having this
file essentially says, "On boot, enable SSH," and SSH allows us to access the
Raspberry Pi's terminal over the local network.
8.4.4 Connecting to the Raspberry Pi with SSH:
We should make sure the laptop is on the same network as the Raspberry Pi (the
network in the wpa_supplicant.conf file). Next, we want to get the IP address of the
Raspberry Pi on the network. Running arp -a shows the IP addresses of other devices
on the network; it gives a list of devices with their corresponding IP and MAC
addresses, and we should see our Raspberry Pi listed with its IP address.
We then connect to the Raspberry Pi by running ssh pi@[the Pi's IP address].
8.5 Connection between the Arduino and the Raspberry Pi:
There are only a limited number of GPIO pins available on the Raspberry Pi, so it is
useful to expand the input and output pins by linking the Raspberry Pi with an
Arduino. In this section we discuss how we can connect the Pi with a single Arduino
board, or with several of them. There are several options to connect them; we use
I²C.
8.5.1 Setup:
Figure 8.8 Hardware schematics
first step: Linking the GND of the Raspberry Pi to the GND of the Arduino.
Second step: Connecting the SDA (I2C data) of the Pi (pin 2) to the Arduino SDA.
Third step: Connecting the SCL (I2C clock) of the Pi (pin 3) to the Arduino SCL.
Important note: the Raspberry Pi (all models up to the Pi 4) runs its GPIO at 3.3V, while the
Arduino Uno runs at 5V!
We should pay close attention whenever we connect pins between these two boards.
Usually we would have to use a level converter between 3.3V and 5V. In this
specific case, however, we can avoid it: if the Raspberry Pi is configured as the
master and the Arduino as the slave on the I2C bus, then we can connect the SDA
and SCL pins directly. Put simply, in this scenario the Raspberry Pi
imposes 3.3V levels on the bus, which is not a problem for the Arduino pins.
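The project's actual I2C code is shown in Figures 7-10 and 7-11. Purely as a minimal sketch of the master side, assuming the Arduino sketch joins the bus as a slave at address 0x08 (via Wire.begin(0x08)) and that the smbus2 Python package is installed on the Pi, the exchange could look like this:

from smbus2 import SMBus
import time

ARDUINO_ADDR = 0x08    # assumed slave address; must match Wire.begin(0x08) on the Arduino
STEERING_ANGLE = 90    # example payload: a steering command in degrees

# Bus 1 is the I2C bus exposed on the Pi's GPIO2 (SDA) and GPIO3 (SCL) pins.
with SMBus(1) as bus:
    bus.write_byte(ARDUINO_ADDR, STEERING_ANGLE)   # send one byte to the Arduino
    time.sleep(0.01)                               # give the slave time to respond
    reply = bus.read_byte(ARDUINO_ADDR)            # read one byte back (e.g. an acknowledgment)
    print("Arduino replied:", reply)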
References:
Chapter 1:
1. The Pathway to Driverless Cars, Claire Perry.
2. wordpress.com
3. theguardian.com
Chapter 2:
1. https://www.raspberrypi.org/help/what-is-a-raspberry-pi/
2. https://www.arduino.cc/en/guide/introduction
Chapter 3:
1. D. Cheng, Y. Gong, S. Zhou, J. Wang, N. Zheng, "Person re-identification by multi-channel parts-based CNN with improved triplet loss function," Proc. of IEEE Conference on Computer Vision and Pattern Recognition (27-30 June 2016), 10.1109/CVPR.2016.149
2. P. Viola, M. Jones, "Rapid object detection using a boosted cascade of simple features," Proc. of IEEE Conference on Computer Vision and Pattern Recognition (8-14 Dec. 2001), 10.1109/CVPR.2001.990517
3. N. Dalal, B. Triggs, "Histograms of oriented gradients for human detection," Proc. of IEEE Conference on Computer Vision and Pattern Recognition (20-25 June 2005), 10.1109/CVPR.2005.177
4. G. Csurka, C. Dance, L. Fan, J. Willamowski, C. Bray, "Visual categorization with bags of keypoints," Proc. of ECCV Workshop on Statistical Learning in Computer Vision (2004)
5. D.G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vis., 60 (2004), pp. 91-110
6. Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, 86 (1998), pp. 2278-2324
7. S. Ren, K. He, R. Girshick, J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, 39 (6) (1 June 2017), pp. 1137-1149, 10.1109/TPAMI.2016.2577031
8. J. Redmon, S. Divvala, R. Girshick, A. Farhadi, "You only look once: unified, real-time object detection," Proc. of IEEE Conference on Computer Vision and Pattern Recognition (27-30 June 2016), 10.1109/CVPR.2016.91
9. J. Long, E. Shelhamer, T. Darrell, "Fully convolutional networks for semantic segmentation," Proc. of IEEE Conference on Computer Vision and Pattern Recognition (7-12 June 2015), 10.1109/CVPR.2015.7298965
10. M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, B. Schiele, "The Cityscapes dataset for semantic urban scene understanding," Proc. of IEEE Conference on Computer Vision and Pattern Recognition (27-30 June 2016), 10.1109/CVPR.2016.350
11. A. Moujahid, M.E. Tantaoui, M.D. Hina, A. Soukane, A. Ortalda, A. ElKhadimi, A. Ramdane-Cherif, "Machine learning techniques in ADAS: a review," Proc. of 2018 International Conference on Advances in Computing and Communication Engineering (22-23 June 2018), 10.1109/ICACCE.2018.8441758
12. J. Levinson, J. Askeland, J. Becker, J. Dolson, D. Held, S. Kammel, J.Z. Kolter, D. Langer, O. Pink, V.R. Pratt, M. Sokolsky, G. Stanek, D.M. Stavens, A. Teichman, M. Werling, S. Thrun, "Towards fully autonomous driving: systems and algorithms," Intelligent Vehicles Symposium (2011), pp. 163-168
Chapter 4:
1. https://www.dlology.com/blog/how-to-train-an-object-detection-model-easy-for-free/
2. https://arxiv.org/abs/1806.08342
3. https://machinelearningmastery.com/object-recognition-with-deep-learning/
4. https://arxiv.org/abs/1512.02325
5. https://arxiv.org/abs/1506.02640
6. https://medium.com/@prvnk10/object-detection-rcnn-4d9d7ad55067, https://arxiv.org/pdf/1807.05511.pdf
7. https://arxiv.org/abs/1504.08083
8. https://arxiv.org/abs/1506.01497
Chapter 5:
1. Deep Learning for Vision Systems, Mohamed Elgendy.
2. https://manalelaidouni.github.io/manalelaidouni.github.io/Evaluating-Object-Detection-Models-Guide-to-Performance-Metrics.html
3. Deep Learning for Vision Systems, Mohamed Elgendy.
Chapter 6:
[1] David Tian, "Deep Pi Car" [online]. Available: https://towardsdatascience.com/deeppicar-part-1-102e03c83f2c
[2] Tony Fan, Gene Yeau-Jian Liao, Chih-Ping Yeh, Chung-Tse Michael Wu, Jimmy Ching-Ming Chen, "Lane Keeping System by Visual Technology" [online]. Available: https://peer.asee.org/lane-keeping-system-by-visual-technology.pdf
[3] Guido Albertengo, "Adaptive Lane Keeping Assistance System design based on driver's behavior" [online]. Available: https://webthesis.biblio.polito.it/11978/1/tesi.pdf
[4] David Tian, "End-to-End Lane Navigation Model via Nvidia Model" [online]. Available: https://colab.research.google.com/drive/14WzCLYqwjiwuiwap6aXf9imhyctjRMQp
Chapter 7:
[1] Carmine Noviello, Mastering STM32: A step-by-step guide to the most complete ARM Cortex-M platform, using a free and powerful development environment based on Eclipse and GCC.
[2] "TCP Connection Establish and Terminate" [online]. Available: https://www.vskills.in/certification/tutorial/information-technology/basic-network-support-professional/tcp-connection-establish-and-terminate/
[3] anonymous007, ashushrma378, abhishek_paul, "Differences between TCP and UDP" [online]. Available: https://www.geeksforgeeks.org/differences-between-tcp-and-udp/

Graduation project Book (Self-Driving Car)

  • 1.
    SELF-DRIVING CAR Faculty ofengineering-Suhag university
  • 2.
    1 | Pa g e Suhag University Faculty of Engineering Communication and Electronics Department Graduation Project Self-Driving Car BY: Khaled Mohamed Ahmed Mohamed Ali Mohamed Rana Mohamed Abdelbaset Abdelmoaty Ahmed Rashad Supervised by: DrAhmed Soliman DrMostafa salah
  • 3.
    2 | Pa g e First and foremost, we thank Allah Almighty who paved path for us, in achieving the desiredgoal.Wewouldliketoexpressoursinceregratitudetoourmentor DrAhmed Soliman& MostafaSalah for the continuous support of our studies and research, for his patience, motivation, enthusiasm, and immense knowledge. His guidancehelped usinallthetimeforachievingthegoalsofourgraduationproject.Wecouldnothave imagined having a better advisor and mentor for our graduation project. Our thanks andappreciationsalsogotoourcolleaguesindevelopingtheprojectandpeoplewho havewillinglyhelpedusoutwiththeirabilities.Finally,anhonorablementiongoesto ourparents,brothers,sistersandfamilies.Wordscannotexpresshowgratefulweare. Yourprayerforuswaswhatsustainedusthusfar.Withouthelpsoftheparticularthat mentionedabove,wewouldfacemanydifficultieswhiledoingthis. ACKNOWLEDGMENT
  • 4.
    3 | Pa g e Whetheryoucallthemself-driving,driverless,automated,autonomous,thesevehicles are on the move. Recent announcements by Google (which drove over 500,000 miles on its original prototype vehicles) and other major automakersindicate the potential for development in this area. Driverless cars are often discussed as “disruptive technology”withtheabilitytotransformtransportationinfrastructure,expandaccess, and deliver benefits to a variety of users. Some observers estimatelimited availability of driverless cars by 2020 with wide availability to the public by 2040. The following sectionsdescribethedevelopmentandimplementationofanautonomouscarmodel withsomefeatures.Itprovidesahistoryofautonomouscarsanddescribestheentire developmentprocessofsuchcars. Thedevelopmentofourprototypewascompleted through the use of two controllers;Raspberry pi and Arduino. The main parts of our model includethe Raspberry pi, Arduino controller board, motors, Ultrasonic sensors, Infraredsensors,opticalencoder,X-beemodule,andlithium-ionbatteries. Italsodescribesspeedcontrolofthecarmotionbythemeansofaprocessknownas PIDtuningtomakecorrectadjustmentstothebehaviorof theVehicle. ABSTRAC T
  • 5.
    4 | Pa g e Chapter 1: Introduction 1.1 History. 1.2 Why autonomous car is important. 1.3 What Are Autonomous and Automated Vehicles. 1.4 Advanced Driver Assistance System (ADAS). 1.5 Project description. 1.6 Related work. Chapter 2: Car Design and Hardware 2.1 System Design. 2.2 Chassis. 2.3 Raspberry Pi 3 Model B. 2.4 Arduino Uno. 2.5 Ultrasonic Sensor. 2.6 servo Motor. 2.7 L298N Dual H-Bridge Motor Driver and DC Motors. 2.8 the Camera Module. Chapter 3: Deep learning 3.1 Introduction. 3.2 What is machine learning. 3.3 Representation of Neural Networks. 3.4 Training a neural network. 3.6 Problem setting in image recognition. 3.7 Convolutional Neural Network (CNN). 3.8 Deep learning-based autonomous driving. 3.9 Conclusion. Table of content
  • 6.
    5 | Pa g e Chapter 4: Object Detection 4.1 Introduction. 4.2 General object detection framework. 4.3 Region proposals. 4.4 Steps of how the NMS algorithm works. 4.5 Precision-Recall Curve (PR Curve). 4.6 Conclusion. 4.7 Region-Based Convolutional Neural Networks (R-CNNs) [high mAP andlow FPS]. 4.8 Fast R-CNN 4.9 Faster R-CNN 4.10 Single Shot Detection (SSD) [Detection Algorithm Used In Our Project] 4.11 Base network. 4.12 Multi-scale feature layers. 4.13 (YOLO) [high speed but low mAP][5]. 4.14 What Colab Offers You? Chapter 5: Transfer Machine Learning 5.1 Introduction. 5.1.1 Definition and why transfer learning? 5.1.2 How transfer learning works. 5.1.3 Transfer learning approaches. 5.2 Detecting traffic signs and pedestrians. 5.2.1 Model selection (most bored). 5.3 Google’s edge TPU. What? How? why? Chapter 6: Lane Keeping System 6.1 introduction 6.2 timeline of available systems
  • 7.
    6 | Pa g e 6.3 current lane keeping system in market 6.4 overview of lane keeping algorithms 6.5 perception: lane Detection 6.6 motion planning: steering 6.7 lane keeping via deep learning 6.8 Google Colab for training Chapter 7: System Integration 7.1 introduction. 7.2 connection between laptop and raspberry-pi. 7.3 connection between raspberry-pi and Arduino. Chapter 8: software for connections 8.1 Introduction 8.2 Network 8.3 laptop to Raspberry-pi 8.4 Arduino to Raspberry-pi References
  • 8.
    7 | Pa g e Figure 1-1 Google car. Figure 1-2 progression of automated vehicle technologies. Figure 2.1 System block Figure 2.2 Top View of Chassis. Figure 2.3 Raspberry Pi 3 model B. Figure 2.4 Arduino Uno Board Figure 2.5 connection of Arduino and Ultrasonic Figure 2.6 Servo motor connection with Arduino Figure 2.5 connection of Arduino and Ultrasonic Figure 2.6 Turn Robot Right Figure 2.7 Rurn Robot left Figure 2.7 Pulse Width Modulation. Figure 2.8 Controlling DC motor using MOSFET. Figure 2.9 H-Bridge DC Motor Control. Figure 2.10 L298N Motor Driver Specification. Figure 2.11 L298N Motor Driver. Figure 2.12 L298 control pins. Figure 2.13 Arduino and L298N connection. Figure 2.14 Camera module Figure 2.15 Raspberry pi. Figure3.1 Learning Algorithm. Figure3-2 General Object Recognition Figure3-3 Conventional machine learning and deep learning. Figure3-4 Basic structure of CNN. Figure3-5 Network structure of AlexNet and Kernels. Figure3-6 Application of CNN to each image recognitiontask. Figure3-7 Faster R-CNN structure. Figure3-8 YOLO structure and examples of multiclass object detection. Figure3-9 Fully Convolutional Network (FCN) Structure. Figure3-10 Example of PSPNet-based Semantic Segmentation Results (cited from Reference). Figure3-11 attention maps of CAM and Grad-CAM. (cite from reference). Figure3-12 Regression-type Attention Branch Network. (cite from reference). Figure3-13 Attention map-based visual explanation for self-driving. Figure4-1 Classification and Object Detection. Figure4-2 Low and Hight objectness score. Figure4.3 An example of selective search applied to an image. A threshold can be tuned in the SS algorithm to generate more or fewer proposals. Figure4.4 Class prediction. Figure4-5 Predictions before and after NMS. Figure4-6 Intersection Over Union (EQU). Figure4-7 Ground-truth and predicted box. Figure4-8 PR Curve. Figure4-9 Regions with CNN features. Figure4-10 Input. List of Figures
  • 9.
    8 | Pa g e Figure4-11 Output. Figure4.12 Fast R-CNN. Figure4-14 The RPN classifier predicts the objectness score which is the probability of an image containing an object (foreground) or a background. Figure 4-15 Anchor boxes. Figure4-16 R-CNN, Fast R-CNN, Faster R-CNN. Figure4-17 Comparison between R-CNN, Fast R-CNN, Faster R-CNN. Figure4-18 SSD architecture. Figure4.19 SSD Base Network looks at the anchor boxes to find features of a boat. Green (solid) boxes indicate that the network has found boat features. Red (dotted) boxes indicate no boat features. Figure4-20 Right image - lower resolution feature maps detect larger scale objects. Left image – higher resolution feature maps detect smaller scale objects. Figure4-21 the accuracy with different number of feature map layers. Figure4-22 Architecture of the multi-scale layers. Figure4-23 YOLO splits the image into grids, predicts objects for eachgrid, then use NMS to finalize predictions. Figure4-24 YOLOv3 workflow. Figure4-25 YOLOV3 Output bounding boxes. Figure4-26 neural network architecture. Figure4-27 Python. Figure4-28 Open CV. Figure4-29 TensorFlow. Figure5-1 traditional ML VS TrasferLearning. Figure5-2 Extracted features. Figure5-3 Feature maps. Figure5-4 CNN Architecture Diagram, Hierarchical Feature Extraction in stages. Figure5-5 features start to be more specific. Figure 5-6 Dataset that is different from the source dataset. Figure5-7 ImageNet Challenge top error. Figure5-8 Tensorflow detection model zoo. Figure5-9 COCO-trained models. Figure5-10 Download pre-trained model. Figure5-11 Part of config file the contains information about image resizer to make image suitable to CNN make it (300x300) and architecture of box predictor CNN which include regularization and drop out to avoid overfitting. Figure5-11 Part of config file indicate the batch size, optimizer type and learning rate (which is vary in this case). Figure5-12 Part of config file the contains information about image resizer to make image suitable to CNN make it (300x300) and architecture of box predictor CNN which include regularization and drop out to avoid overfitting. Figure5-13 mAP (top left), a measure of precision, keeps on increasing. Figure5-14 Google edge TPU. Figure5-15 Quantization. Figure5-16 Accuracy of non-quantized model’s vs quantized models. Figure6.1 the lane departure warning algorithm. Figure6.2 Image in HSV Color. Figure6.4 Hue in 0-360 degrees scale. Figure6.3 OpenCV command. Figure6.5 code to lift Blue out via OpenCV, and rendered mask image. Figure6.6 OpenCV recommends. Figure6.7 Edges of all Blue Areas. Figure6.8 Cropped Edges.
  • 10.
    9 | Pa g e Figure6.9 CNN architecture. Figure6.10 Method Training. Figure6.11 Angles distribution from Data acquisition. Figure 6.12 mount Drive to colab Figure 6.13 python package for training. Figure 6.14 Loading our data from drive. Figure 6.15 train and test data. Figure 6.16 load model. Figure 6.17 Summary Model. Figure6.18 Graph of training and validation loss. Figure6.19 Result of our model on our data. Figure7-1 Diagram for connections. Figure7.2 Server. Figure7.3 Client. Figure7.4 Recording stream on connection file. Figure7.5 Reading from connection file and Split frames on server side. Figure7.6 Terminate connection in two sides. Figure7.7 Operation of TCP. Figure7.8 Graphical representation of the i2c bus. Figure7.9 Structure of a base i2c message. Figure7-10 I2c code in raspberry-pi. Figure7.11 I2c code in Arduino. Figure 8.1 Block diagram of system. Figure 8.2 Raspberry-pi. Figure 8.3 Then use putty or VNC to access the pi and control it remotely from its terminal. Figure 8.4 VNC Viewer. Figure 8.5 Putty Viewer. Figure 8.6 Arduino board. Figure 8.7 Access point. Figure 8.8 Hardware schematics.
  • 11.
    10 | Pa g e N Name C 1 4 Wheel Robot Smart Car Chassis Kits Car Model with Speed Encoder for Arduino. 1 2 Servo Motor Standard (120) 6 kg.cm Plastic Gears "FS5106B" 1 3 RS Raspberry pi Camera V2 Module Board 8MP Webcam Video. 1 4 Arduino UNO Microcontroller Development Board + USB Cable 1 5 Raspberry Pi 3 Model B+ RS Version UK Version 1 6 Micro SD 16GB-HC10 With Raspbian OS for Raspberry PI 1 7 USB Cable to Micro 1.5m 1 8 Motor Driver L298N 1 9 17 Values 1% Resistor Kit Assortment, 0 Ohm-1M Ohm 1 10 Mixed Color LEDs Size 3mm 10 11 Ultrasonic Sensor HC-04 + Ultrasonic Sensor holder 3 12 9V Battery Energizer Alkaline 1 13 9V Battery Clip with DC Plug 1 14 Lipo Battery 11.1V-5500mAh – 35C 1 15 Wires 20 cm male to male 20 16 Wires 20 cm male to female 20 17 Wires 20 cm female to female 20 18 Breadboard 400 pin 1 19 Breadboard 170 pin (White color) 1 20 Rocker switch on/off with lamp red ( KDC 2 ) 3 21 Power bank 10000 mA h 1 List of components
  • 12.
    11 | Pa g e Introduction Chapter 1
  • 13.
    12 | Pa g e 1.1 History 1930s An early representation of the autonomous car was Norman Bell Geddes's Futurama exhibit sponsored by General Motors at the 1939 World's Fair, which depicted electric cars powered by circuits embedded in the roadway and controlled by radio. 1950s In 1953, RCA Labs successfully built a miniature car that was guided and controlled by wires that were laid in a pattern on a laboratory floor. The system sparked the imagination of Leland M. Hancock, traffic engineer in the Nebraska Department of Roads, and of his director, L. N. Ress, state engineer. The decision was made to experiment with the system in actual highway. installations. In 1958, a full size system was successfully demonstrated by RCA Labs and the State of Nebraska on a 400-foot strip of public highway just outside Lincoln, Neb. 1980s In the 1980s, a vision-guided Mercedes-Benz robotic van, designed by Ernst Dickmanns and his team at the Bundeswehr University Munich in Munich, Germany, achieved a speed of 39 miles per hour (63 km/h) on streets without traffic. Subsequently, EUREKA conducted the €749 million Prometheus Project on autonomous vehicles from 1987 to 1995. 1990s In 1991, the United States Congress passed the ISTEA Transportation Authorization bill, which instructed USDOT to "demonstrate an automated vehicle and highway system by 1997." The Federal Highway Administration took on this task, first with a
  • 14.
    13 | Pa g e series of Precursor Systems Analyses and then by establishing the National Automated Highway System Consortium (NAHSC). This cost-shared project was led by FHWA and General Motors, with Caltrans, Delco, Parsons Brinkerhoff, Bechtel,UC- Berkeley, Carnegie Mellon University, and Lockheed Martin as additionalpartners. Extensive systems engineering work and research culminated in Demo '97 on I-15 in San Diego, California, in which about 20 automated vehicles, including cars, buses, and trucks, were demonstrated to thousands of onlookers, attracting extensive media coverage. The demonstrations involved close-headway platooning intended to operate in segregated traffic, as well as "free agent" vehicles intended to operate in mixed traffic. 2000s The US Government funded three military efforts known as Demo I (US Army), Demo II (DARPA), and Demo III (US Army). Demo III (2001) demonstrated the ability of unmanned ground vehicles to navigate miles of difficult off-road terrain, avoiding obstacles such as rocks and trees. James Albus at the National Institute for Standards and Technology provided the Real-Time Control System which is a hierarchical control system. Not only were individual vehicles controlled (e.g. Throttle, steering, and brake), but groups of vehicles had their movements automatically coordinated in response to high level goals. The Park Shuttle, a driverless public road transport system, became operational in the Netherlands in the early 2000s.In January 2006, the United Kingdom's 'Foresight' think-tank revealed a report which predicts RFID- tagged driverless cars on UK's roads by 2056 and the Royal Academy of Engineering claimed that driverless trucks could be on Britain's motorways by 2019. Autonomous vehicles have also been used in mining. Since December 2008, Rio Tinto Alcan has been testing the Komatsu Autonomous Haulage System – the world's first commercial autonomous mining haulage system – in the Pilbara iron ore minein Western Australia. Rio Tinto has reported benefits in health, safety, and productivity. In November 2011, Rio Tinto signed a deal to greatly expand its fleet of driverless trucks. Other autonomous mining systems include Sandvik Automine’s underground loaders and Caterpillar Inc.'s autonomous hauling. In 2011 the Freie Universität Berlin developed two autonomous cars to drive in the inner city traffic of Berlin in Germany. Led by the AUTONOMOS group, the two vehicles Spirit of Berlin and made in Germany handled intercity traffic, traffic lights and roundabouts between International Congress Centrum and Brandenburg Gate. It was the first car licensed for autonomous driving on the streets and highways in Germany
  • 15.
    14 | Pa g e and financed by the German Federal Ministry of Education and Research. The 2014 Mercedes S-Class has options for autonomous steering, lane keeping, acceleration/braking, parking, accident avoidance, and driver fatigue detection, in both city traffic and highway speeds of up to 124 miles (200 km) per hour. Released in 2013, the 2014 Infiniti Q50 uses cameras, radar and other technology to deliver various lane-keeping, collision avoidance and cruise control features. One reviewer remarked, "With the Q50 managing its own speed and adjusting course, I could sit back and simply watch, even on mildly curving highways, for three or more miles at a stretch adding that he wasn't touching the steering wheel or pedals. Although as of 2013, fully autonomous vehicles are not yet available to the public, many contemporary car models have features offering limited autonomous functionality. These include adaptive cruise control, a system that monitors distances to adjacent vehicles in the same lane, adjusting the speed with the flow of traffic lane which monitors the vehicle's position in the lane, and either warns the driver when the vehicle is leaving its lane, or, less commonly, takes corrective actions, and parking assist, which assists the driver in the task of parallel parking In 2013 on July 12, VisLab conducted another pioneering test of autonomous vehicles, during which a robotic vehicle drove in downtown Parma with no human control, successfully navigating roundabouts, traffic lights, pedestrian crossings and other common hazards. 1.2 Why autonomous car is important 1.2.1 Benefits of Self-Driving Cars 1. Fewer accidents The leading cause of most automobile accidents today is driver error. Alcohol, drugs, speeding, aggressive driving, over-compensation, inexperience, slow reaction time, inattentiveness, and ignoring road conditions are all contributing factors. Given some 40 percent of accidents can be traced to the abuse of drugs and or alcohol, self-driving cars would practically eliminate those accidents altogether.
  • 16.
    15 | Pa g e 2. Decreased (or Eliminated) Traffic Congestion One of the leading causes of traffic jams is selfish behavior among drivers. It has been shown when drivers space out and allow each other to move freely between lanes on the highway, traffic continues to flow smoothly,regardless of the number of cars on the road. 3. Increased Highway Capacity There is another benefit to cars traveling down the highway and communicating with one another at regularly spaced intervals. More cars could be on the highway simultaneously because they would need to occupy less space on the highway 4. Enhanced Human Productivity Currently, the time spent in our cars is largely given over to simply gettingthe car and us from place to place. Interestingly though, even doing nothing at all would serve to increase human productivity. Studies have shown taking short breaks increase overall productivity. You can also finish up a project, type a letter, monitor the progress of your kids, schoolwork, return phone calls, take phone calls safely, textuntil your heart’s content, read a book, or simply relax and enjoy the ride. 5. Hunting for Parking Eliminated Self-driving cars can be programmed to let you off at the front door of your destination, park themselves, and come back to pick you up when you summon them. You’re freed from the task of looking for a parking space, because the car can do it all. 6. Improved Mobility for Children, The Elderly, And the Disabled Programming the car to pick up people, drive them to their destination and Then Park by themselves, will change the lives of the elderly and disabled by providing them with critical mobility. 7. Elimination of Traffic Enforcement Personnel If every car is “plugged” into the grid and driving itself, then speeding, — along with stop sign and red light running will be eliminated. The cop on the side of the road measuring the speed of traffic for enforcement purposes? Yeah, they’re gone. Cars won’t speed anymore. So, no need to Traffic Enforcement Personnel. 8. Higher Speed Limits Since all cars are in communication with one another, and they’re all programmed to maintain a specific interval between one another, and they all know when to expect each other to stop and start, the need to accommodate human reflexes on the highway will be eliminated. Thus, cars can maintain
  • 17.
    16 | Pa g e higher average speeds. 9. Lighter, More Versatile Cars The vast majority of the weight in today’s cars is there because of the need to incorporate safety equipment. Steel door beams, crumple zones and the need to build cars from steel in general relate to preparedness for accidents. Self- driving cars will crash less often, accidents will be all but eliminated, and so the need to build cars to withstand horrific crashes will be reduced. This means cars can be lighter, which will make them more fuel-efficient. 1.3 What Are Autonomous and Automated Vehicles Technological advancements are creating a continuum between conventional, fully human-driven vehicles and automated vehicles, which partially or fully drive themselves and which may ultimately require no driver at all. Within this continuum are technologies that enable a vehicle to assist and make decisions for a human driver. Such technologies include crash warning systems, adaptive cruise control (ACC), lane keeping systems, and self-parking technology. •Level 0 (no automation): The driver is in complete and sole control of the primary vehicle functions (brake, steering, throttle, and motive power) at all times, and is solely responsible for monitoring the roadway and for safe vehicle operation. •Level 1 (function-specific automation): Automation at this level involves one or more specific control functions; if multiple functions are automated, they operate independently of each other. The driver has overall control, and is solely responsible for safe operation, but can choose to cede limited authority over a primary control (as in ACC); the vehicle can automatically assume limited authority over a primary control (as in electronic stability control); or the automated system can provide added control to aid the driver in certain normal driving or crash-imminent situations (e.g., dynamic brake support in emergencies). •Level 2 (combined-function automation): This level involves automation of at least two primary control functions designed to work in unison to relieve the driver of controlling those functions. Vehicles at this level of automation can utilize shared authority when the driver cedes active primary control in certain limited driving situations. The driver is still responsible for monitoring the roadway and safe operation, and is expected to be available for control at all times and on short notice. The system can relinquish control with no advance warning and the driver must be ready to control the vehicle safely.
  • 18.
    17 | Pa g e •Level 3 (limited self-driving automation): Vehicles at this level of automation enable the driver to cede full control of all safety-critical functions under certain traffic or environmental conditions, and in those conditions to rely heavily on the vehicle to monitor for changes in those conditions requiring transition back to driver control. The driver is expected to be available for occasional control, but with sufficiently comfortable transition time •Level 4 (full self-driving automation): The vehicle is designed to perform all safety-critical driving functions and monitor roadway conditions for an entire trip. Such a design anticipates that the driver will provide destination or navigation input, but is not expected to be available for control at any time during the trip. This includes both occupied and unoccupied vehicles Our project can be considered as prototype systems of level 4 or level 5 1.4 Advanced Driver Assistance System (ADAS) A rapid growth has been seen worldwide in the development of AdvancedDriver Assistance Systems (ADAS) because of improvements in sensing, communicating and computing technologies. ADAS aim to support drivers by either providing warning to reduce risk exposure, or automating some of the control tasks to relieve a driver from manual control of a vehicle. From an operational point of view, such systems are a clear departure from a century of automobile development where drivers have had control of all driving tasks at all times. ADAS could replace some of the human driver decisions and actions with precise machine tasks, making it possible to eliminate many of the driver errors which could lead to accidents, and achieve more regulated and smoother vehicle control with increased capacity and associated energy and environmental benefits. Autonomous ADAS systems use on-board equipment, such as ranging sensors and machine/computer vision, to detect surrounding environment. The main advantages of such an approach are that the system operation does not rely on other parties and that the system can be implemented on the current road infrastructure. Now many systems have become available on the market including
  • 19.
    18 | Pa g e Adaptive Cruise Control (ACC), Forward Collision Warning (FCW) and Lane Departure Warning systems, and many more are under development. Currently, radar sensors are widely used in the ADAS applications for obstacle detection. Compared with optical or infrared sensors, the main advantage of radar sensors is that they perform equally well during day time and night time, and in most weather conditions. Radar can be used for target identification by making use of scattering signature information. It is widely used in ADAS for supporting lateral control such as lane departure warning systems and lane keeping systems. Currently computer vision has not yet gained a large enough acceptance in automotive applications. Applications of computer vision depend much on the capability of image process and pattern recognition (e.g. artificial intelligence). The fact that computer vision is based on a passive sensory principle creates detection difficulties in conditions with adverse lighting or in bad weather situations. 1.5 Project description 1.5.1 Auto-parking The aim of this function is to design and implement self-parking car system that moves a car from a traffic lane into a parking spot through accurate and realistic steps which can be applied on a real car. 1.5.2 Adaptive cruise control (ACC) Also radar cruise control, or traffic-aware cruise control is an optional cruise control system for road vehicles that automatically adjusts the vehicle speed to maintain a safe distance from vehicles ahead. It makes no use of satellite or roadside infrastructures nor of any cooperative support from other vehicles. Hence control is imposed based on sensor information from on-board sensors only. 1.5.3 Lane Keeping Assist It is a feature that in addition to Lane Departure Warning System automatically takes steps to ensure the vehicle stays in its lane. Some vehicles combine adaptive cruise control with lane keeping systems to provide additional safety. A lane keeping assist mechanism can either reactively turn a vehicle back into the lane if it starts to leave or proactively keep the vehicle in the center of the lane. Vehicle companies often use the term "Lane Keep(ing) Assist" to refer to both reactive Lane Keep Assist (LKA) and proactive Lane Centering Assist (LCA) but the terms are beginning to be differentiated. 1.5.4 Lane departure
  • 20.
    19 | Pa g e Our car moves using adaptive cruise control according to distance of front vehicle . If front vehicle is very slow and will cause our car to slow down the car will start to check the lane next to it and then depart to the next lane in order to speed up again. 1.5.5 Indoor Positioning system An indoor positioning system (IPS) is a system to locate objects or people inside a building using radio waves, magnetic fields, acoustic signals, or other sensory information collected by mobile devices. There are several commercial systems on the market, but there is no standard for an IPS system. IPS systems use different technologies, including distance measurement to nearby anchor nodes (nodes with known positions, e.g., Wi-Fi access points), magnetic positioning, dead reckoning. They either actively locate mobile devices and tags or provide ambient location or environmental context for devices to get sensed. The localized nature of an IPS has resulted in design fragmentation, with systems making use of various optical, radio, or even acoustic technologies. 1.6 Related work The appearance of driverless and automated vehicle technologies offers enormous opportunities to remove human error from driving. It will make driving easier, improve road safety, and ease congestion. It will also enable drivers to choose to do other things than driving during the journey. It is the first driverless electric car prototype built by Google to test self-driving car project. It looks like a Smart car, with two seats and room enough for a small amount of luggage Figure 1-1 Google car.
  • 21.
    20 | Pa g e It operates in and around California, primarily around the Mountain View area where Google has its headquarters. It moves two people from one place to another without any user interaction. The car is called by a smartphone for pick up at the user’s location with the destination set. There is no steering wheel or manual control, simply a start button and a big red emergency stop button. In front of the passengers there is a small screen showing the weather and the current speed. Once the journey is done, the small screen displays a message to remind you to take your personal belongings. Seat belts are also provided in car to protect the passengers from the primary systems fails; plus, that emergency stop button that passengers can hit at any time. Powered by an electric motor with around a 100-mile range, the car uses a combination of sensors and software to locate itself in the real world combined with highly accurate digital maps. A GPS is used, just like the satellite navigation systems in most cars, lasers and cameras take over to monitor the world around the car, 360- degrees. The software can recognize objects, people, cars, road marking, signs and traffic lights, obeying the rules of the road. It can even detect road works and safely navigate around them. The new prototype has more sensors fitted to it that can see further (up to 600 feet in all directions) The simultaneous development of a combination of technologies has brought about this opportunity. For example, some current production vehicles now feature adaptive cruise control and lane keeping technologies which allow the automated control of acceleration, braking and steering for periods of time on motorways, major A-roads and in congested traffic. Advanced emergency braking Systems automatically apply the brakes to help drivers avoid a collision. Self-parking systems allow a vehicle to parallel or Reverse Park completely hands free. Developments in vehicle automation technology in the short and medium term will move us closer to the ultimate scenario of a vehicle which is completely “driverless”.
  • 22.
    21 | Pa g e Figure 1-2 progression of automated vehicle technologies VOLVO autonomous CAR semi-autonomous driving features: sensors can detect lanes and a car in front of it. Button in the steering wheel to let the system know I want it to use Adaptive Cruise Control with Pilot Assist. If the XC90 lost track of the lanes, it would ask the driver to handle steering duties with a ping and a message in the dashboard. This is called the Human-machine interface. BMW autonomous CAR A new i-Series car will include forms of automated driving and digital connectivity most likely Wi-Fi, high-definition digital maps, sensor technology, cloud technology and artificial intelligence.
  • 23.
    22 | Pa g e Nissan autonomous CAR Nissan vehicles in the form of Nissan’s Safety Shield-inspired technologies. These technologies can monitor a nearly 360-degree view around a vehicle for risks, offering warnings to the driver and taking action to help avoid crashes if necessary.
  • 24.
    23 | Pa g e Car Design and Hardware Chapter 2
  • 25.
    24 | Pa g e 2.1 System Design Figure 2.1 System block Figure 2.1 shows the block diagram of the system. The ultrasonic sensor, servo motor and L298N motor driver are connected to Arduino Uno, while the Raspberry Pi Camera was connected to the camera module port in Raspberry Pi 3. The Laptop and Raspberry Pi are connected together via TCP protocol. The Raspberry Pi and Arduino are connected via I2 C protocol (at chapter 6). Raspberry Pi 3 and Arduino Uno was powered by a 5V power bank and the DC motor powered up by a 7.2V battery. The DC motor was controlled by L298N motor driver but Arduino send the control signal to the L298N motor driver and control the DC motor to turn clockwise or anti clockwise. The servo motor is powered up by 5V on board voltage regulator (red arrow) at L298N motor driver from 7.2V battery. The ultrasonic sensor is powered up by 5V from Arduino 5V output pin. The chassis is the body of the RC car where all the components, webcam, battery, power bank, servo motor, Raspberry Pi 3, L298N motor driver and ultrasonic sensor are mounted on the chassis. The chassis is two plates of were cut by laser cutting machine. We added 2 DC motors which attached to the back wheels and Servo attached to the front wheels will use to turn the direction of the RC car. 2.2 Chassis
  • 26.
    25 | Pa g e Figure 2.2 Top View of Chassis. 2.3 Raspberry Pi 3 Model B Figure 2.3 Raspberry Pi 3 model B. The Raspberry Pi is a low cost, credit-card sized computer that plugs into a computer
  • 27.
    26 | Pa g e monitor or TV, and uses a standard keyboard and mouse. It is a capable little device that enables people of all ages to explore computing, and to learn how to program in languages like Scratch and Python. It’s capable of doing everything you’d expect a desktop computer to do, from browsing the internet and playing high-definition video, to making spreadsheets, word-processing, and playing games. The Raspberry Pi has the ability to interact with the outside world, and has been used in a wide array of digital maker projects, from music machines and parent detectors to weather stations and tweeting birdhouses with infra-red cameras. We want to see the Raspberry Pi being used by kids all over the world to learn to program and understand how computers work (1). For more details visit www.raspberrypi.org . In our project, raspberry-pi is used to connect the camera, Arduino and laptop together. Raspberry-pi sends the images taken by raspberry-pi camera to laptop to make processing on it to make a decision second, then sends orders to Arduino to control the servo and motors. The basic specification of Raspberry-pi 3 model B in the next Table: Raspberry Pi 3 Model B Processor Chipset Broadcom BCM2837 64Bit ARMv7 Quad Core Processor powered Single Board Computer running at 1250MHz GPU VideoCore IV Processor Speed QUAD Core @1250 MHz RAM 1GB SDRAM @ 400 MHz Storage MicroSD USB 2.0 4x USB Ports Power Draw/ voltage 2.5A @ 5V GPIO 40 pin Ethernet Port Yes Wi-Fi Built in
  • 28.
    27 | Pa g e Bluetooth LE Built in Figure 2.3 Raspberry-pi 3 model B pins layout. 2.4 Arduino Uno Figure 2.4 Arduino Uno Board Arduino is an open-source electronics platform based on easy-to-use hardware and software. Arduino boards are able to read inputs - light on a sensor, a finger on a button, or a Twitter message - and turn it into an output - activating a motor, turning on an LED, publishing something online. You can tell your board what to do by sending a set of instructions to the microcontroller on the board. To do so you use the Arduino programming language (based on Wiring), and the Arduino Software (IDE), based on
  • 29.
    28 | Pa g e Processing. (2) Arduino Uno is a microcontroller board based on 8-bit ATmega328P microcontroller. Along with ATmega328P, it consists other components such as crystal oscillator, serial communication, voltage regulator, etc. to support the microcontroller. Arduino Uno has 14 digital input/output pins (out of which 6 can be used as PWM outputs), 6 analog input pins, a USB connection, A Power barrel jack, an ICSP header and a reset button. The next Taple contains the Technicak Specifications on Arduino Uno: Arduino Uno Technical Specifications Microcontroller ATmega328P – 8 bit AVR family microcontroller Operating Voltage 5V Recommended Input Voltage 7-12V Input Voltage Limits 6-20V Analog Input Pins 6 (A0 – A5) Digital I/O Pins 14 (Out of which 6 provide PWM output) DC Current on I/O Pins 40 mA DC Current on 3.3V Pin 50 mA Flash Memory 32 KB (0.5 KB is used for Bootloader) SRAM 2 KB EEPROM 1 KB Frequency (Clock Speed) 16 MHz In our project, Raspberry-pi send the decision to Arduino to control the DC motors, Servo motor and Ultrasonic sensor ( figure 2.1 ) . Arduino IDE (Integrated Development Environment) is required to program the Arduino Uno board. For more details you can visit www.arduino.cc .
  • 30.
    29 | Pa g e 2.5 Ultrasonic Sensor Ultrasonic Sensor HC-SR04 is a sensor that can measure distance. It emits an ultrasound at 40 000 Hz (40kHz) which travels through the air and if there is an object or obstacle on its path It will bounce back to the module. Considering the travel time and the speed of the sound you can calculate the distance. The configuration pin of HC-SR04 is VCC (1), TRIG (2), ECHO (3), and GND (4). The supply voltage of VCC is +5V and you can attach TRIG and ECHO pin to any Digital I/O in your Arduino Board. Figure 2.5. connection of Arduino and Ultrasonic In order to generate the ultrasound, we need to set the Trigger Pin on a High State for 10 µs. That will send out an 8-cycle sonic burst which will travel at the speed sound and it will be received in the Echo Pin. The Echo Pin will output the time in microseconds the sound wave traveled. For example, if the object is 20 cm away from the sensor, and the speed of the sound is 340 m/s or 0.034 cm/µs the sound wave will need to travel about 588 microseconds. But what you will get from the Echo pin will be double that number because the sound wave needs to travel forward and bounce backward. So, in order to get the distance in cm we need to multiply the received travel time value from the echo pin by 0.034 and divide it by2. The Arduino code is written below: ultrasound
  • 31.
    30 | Pa g e Results: After uploading the code, display the data with Serial Monitor. Now try to give an object in front of the sensor and see the measurement.
  • 32.
    31 | Pa g e 2.6 servo Motor Servo motors are great devices that can turn to a specified position. they have a servo arm that can turn 180 degrees. Using the Arduino, we can tell a servo to go to a specified position and it will go there. We used Servo motors in our project to control the steering of the car. A servo motor has everything built in: a motor, a feedback circuit, and a motor driver. It just needs one power line, one ground, and one control pin. Figure 2.6 Servo motor connection with Arduino Following are the steps to connect a servo motor to the Arduino: 1. The servo motor has a female connector with three pins. The darkest or even black one is usually the ground. Connect this to the Arduino GND. 2. Connect the power cable that in all standards should be red to 5V on the Arduino. 3. Connect the remaining line on the servo connector to a digital pin on the Arduino.
  • 33.
    32 | Pa g e The following code will turn a servo motor to 0 degrees, wait 1 second, then turn it to 90, wait one more second, turn it to 180, and then go back. The servo motor used to turn the robot right or left Arduino. Depending on the angle from Figure 2.6 Turn Robot Right Figure 2.6 Turn Robot left
  • 34.
    33 | Pa g e 2.7 L298N Dual H-Bridge Motor Driver and DC Motors DC Motor: A DC motor (Direct Current motor) is the most common type of motor. DC motors normally have just two leads, one positive and one negative. If you connect these two leads directly to a battery, the motor will rotate. If you switch the leads, the motor will rotate in the opposite direction. PWM DC Motor Control: PWM, or pulse width modulation is a technique which allows us to adjust the average value of the voltage that’s going to the electronic device by turning on and off the power at a fast rate. The average voltage depends on the duty cycle, or the amount of time the signal is ON versus the amount of time the signal is OFF in a single period of time. We can control the speed of the DC motor by simply controlling the input voltage to the motor and the most common method of doing that is by using PWM signal. So, depending on the size of the motor, we can simply connect an Arduino PWM output to the base of transistor or the gate of a MOSFET and control the speed of the motor by controlling the PWM output. The low power Arduino PWM signal switches on and off the gate at the MOSFET through which the high-power motor is driven. Note that Arduino GND and the motor power supply GND should be connected Figure 2.7 Pulse Width Modulation.
  • 35.
    34 | Pa g e Figure 2.8 Controlling DC motor using MOSFET. Figure 2.9 H-Bridge DC Motor Control.
  • 36.
    35 | Pa g e H-Bridge DC Motor Control: For controlling the rotation direction, we just need to inverse the direction of the current flow through the motor, and the most common method of doing that is by using an H-Bridge. An H-Bridge circuit contains four switching elements, transistors or MOSFETs, with the motor at the center forming an H-like configuration. Byactivating two particular switches at the same time we can change the direction of the current flow, thus change the rotation direction of the motor. So, if we combine these two methods, the PWM and the H-Bridge, we can have a complete control over the DC motor. There are many DC motor drivers that have these features and the L298N is one of them. L298N Driver: The L298N is a dual H-Bridge motor driver which allows speed and direction control of two DC motors at the same time. The module can drive DC motors that have voltages between 5 and 35V, with a peak current up to 2A. L298N module has two screw terminal blocks for the motor A and B, and another screw terminal block for the Ground pin, the VCC for motor and a 5V pin which can either be an input or output. This depends on the voltage used at the motors VCC. The module have an onboard 5V regulator which is either enabled or disabled using a jumper. If the motor supply voltage is up to 12V we can enable the 5V regulator and the 5V pin can be used as output, for example for powering our Arduino board. But if the motor voltage is greater than 12V we must disconnect the jumperbecause those voltages will cause damage to the onboard 5V regulator. In this case the 5V pin will be used as input as we need connect it to a 5V power supply in order the IC to work properly. But this IC makes a voltage drop of about 2V.
  • 37.
    36 | Pa g e L298N Motor Driver Specification Operating Voltage 5V ~ 35V Logic Power Output Vss 5V ~ 7V Logical Current 0 ~36mA Drive Current 2A (Max single bridge) Max Power 25W Controlling Level Low: -0.3V ~ 1.5V High: 2.3V ~ Vss Enable Signal Low: -0.3V ~ 1.5V High: 2.3V ~ Vss Table 2.10 L298N Motor Driver Specification. Figure 2.11 L298N Motor Driver. So, for example, if we use a 12V power supply, the voltage at motors terminals will be about 10V, which means that we won’t be able to get the maximum speed out of our 12V DC motor.
  • 38.
    37 | Pa g e Next are the logic control inputs. The Enable A and Enable B pins are used for enabling and controlling the speed of the motor. If a jumper is present on this pin, the motor will be enabled and work at maximum speed, and if we remove the jumper we can connect a PWM input to this pin and in that way control the speed of the motor. If we connect this pin to a Ground Figure 2.12 L298 control pins. the motor will be disabled. The Input 1 and Input 2 pins are used for controlling the rotation direction of the motor A, and the inputs 3 and 4 for the motor B. Using these pins, we actually control the switches of the H-Bridge inside the L298N IC. If input 1 is LOW and input 2 is HIGH the motor will move forward, and if input 1 is HIGH and input 2 is LOW the motor will move backward. In case both inputs are same, either LOW or HIGH the motor will stop. The same applies for the inputs 3 and 4 and the motor B. Arduino and L298N For example, we will control the speed of the motor using a potentiometer and change the rotation direction using a push button. Figure 2.13 Arduino and L298N connection
  • 39.
    38 | Pa g e Here’s the Arduino code: Figure2-14 Arduino code. we check whether we have pressed the button, and if that’s true, we will change the rotation direction of the motor by setting the Input 1 and Input 2 states inversely. The push button will work as toggle button and each time we press it, it will change the rotation direction of the motor.
  • 40.
    39 | Pa g e 2.8 the Camera Module The Raspberry Pi Camera Module is used to take pictures, record video, and apply image effects. In our project we used it for computer vision to detect the lanes of the road and detect the signs and cars. Figure 2.14 Camera module. Figure 2.15 Raspberry pi. Camera. Figure 2.16 Camera connection with raspberry. The Python picamera library allows us to control the Camera Module and create amazing projects. A small example to use the camera on raspberry pi that allow us to start camera preview for 5 seconds:
  • 41.
    40 | Pa g e You can go to system integration chapter to see the usage of the camera in our project, also you can visit raspberrypi website to see many projects on the raspberry pi camera.
  • 42.
    41 | Pa g e Deep Learning Chapter 3
  • 43.
    42 | Pa g e 3.2 Introduction: 1) Grew out of work in AI 2) New capability for computers Examples: 1) Database mining Large datasets from growth of automation/web. E.g., Web click data, medical records, biology, engineering 2) Applications can’t program by hand. E.g., Autonomous helicopter, handwriting recognition, most of Natural Language Processing (NLP), Computer Vision. 3.2 What is machine learning: Arthur Samuel (1959): Machine Learning: Field of study that gives computers the ability to learn without being explicitly programmed. Tom Mitchell (1998): Well-posed Learning Problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. 3.2.3 Machine learning algorithms: 1) Supervised learning 2) Unsupervised learning Others: Reinforcement learning, recommender systems.
  • 44.
    43 | Pa g e Training Set Learning Algorithm Input h Output Supervised Learning: (Given the “right answer”) for each example in the data. Regression: Predict continuous valuedoutput. Classification: Discrete valued output (0 or1). Figure3.1 Learning Algorithm. 3.2.4 Classification: Multiclass classification: 1) Email foldering, tagging: Work, Friends, Family, Hobby. 2)Medical diagrams: Not ill, Cold, Flu. 3)Weather: Sunny, Cloudy, Rain, Snow.
  • 45.
    44 | Pa g e x2 Binary classification: Multi-class classification: x1 Regularization x1
  • 46.
    45 | Pa g e 3.2.3 The problem of overfitting: Overfitting: If we have too many features, the learned hypothesis may fit the training set very well but fail to generalize to new examples (predict prices on new examples). 3.3 Representation of Neural Networks: What is this? You see this: But the camera sees this:
  • 47.
    46 | Pa g e Cars Not a car Learning Algorithm pixel2 3.3Computer Vision: Car detection Testing: What is this? pixel1
  • 48.
    47 | Pa g e pixel 1 intensity pixel 2 intensity pixel 2500 intensity 50 x 50-pixel images→ 2500 pixels, (7500 if RGB) Quadratic features ( ): ≈3 million features That’s why we need neural network. Neural Network Multi-class classification: Pedestri Car Motorcy Truck
  • 49.
    48 | Pa g e Multiple output units: One-vs-all. when pedestrian when car when motorcycle 3.4 Training a neural network: 1) Pick a network architecture (connectivity pattern between neurons). No. of input units: Dimension of features. No. output units: Number of classes.
  • 50.
    49 | Pa g e Reasonable default: 1 hidden layer, or if >1 hidden layer, have same no. of hidden units in every layer (usually the more the better) 2)Randomly initialize weights. 3)Implement forward propagation to get h(theta) for any x(i). 4)Implement code to compute cost function J(theta). 5)Implement backprop to compute partial derivatives. 3.4.1 Advice for applying machine learning: Debugging a learning algorithm: Suppose you have implemented regularized linear regression to predict housing prices. However, when you test your hypothesis in a new set of houses, you find that it makes unacceptably large errors in its prediction. What should you try next? 1)Get more training examples 2)Try smaller sets of features 3)Try getting additional features 4)Try adding polynomialfeatures 5)Try decreasing lambda 6)Try increasing lambda 3.4.2 Neural networks and overfitting: “Small” neural network (fewer parameters; more prone to underfitting) Computationally cheaper “Large” neural network (more parameters; more prone to overfitting)
  • 51.
    50 | Pa g e Computationally more expensive. Use regularization (lambda) to address overfitting. 3.5 Neural Networks and Introduction to Deep Learning: Deep learning is a set of learning methods attempting to model data with complex architectures combining different non-linear transformations. The elementary bricks of deep learning are the neural networks, that are combined to form the deep neural networks. These techniques have enabled significant progress in the fields of sound and image processing, including facial recognition, speech recognition, computer vision, automated language processing, text classification (for example spam recognition). Potential applications are very numerous. A spectacularly example is the AlphaGo program, which learned to play the go game by the deep learning method, and beated the world champion in 2016. There exist several types of architectures for neural networks: • The multilayer perceptrons: that are the oldest and simplest ones. • The Convolutional Neural Networks (CNN): particularly adapted forimage processing. • The recurrent neural networks: used for sequential data such as textor times series. They are based on deep cascade of layers. They need clever stochastic optimization algorithms, and initialization, and also a clever choice of the
  • 52.
    51 | Pa g e structure. They lead to very impressive results, although very few theoretical foundations are available till now. Neural networks: An artificial neural network is an application, nonlinear with respect to its parameters θ that associates to an entry x an output y = f(x; θ). For the sake of simplicity, we assume that y is unidimensional, but it could also be multidimensional. This application f has a particular form that we will precise. The neural networks can be used for regression or classification. As usual in statistical learning, the parameters θ are estimated from a learning sample. The function to minimize is not convex, leading to local minimizers. The success of the method came from a universal approximation theorem due to Cybenko (1989) and Horik (1991). Moreover, Le Cun (1986) proposed an efficient way to compute the gradient of a neural network, called backpropagation of the gradient, that allows to obtain a local minimizer of the quadratic criterion easily. Artificial Neuron: An artificial neuron is a function fj of the input x = (x1; : : : ; xd) weighted by a vector of connection weights wj = (wj;1; : : : ; wj;d), completed by a neuron bias bj, and associated to an activation function φ, namely yj = fj(x) = φ(hwj; xi + bj): Several activation functions can be considered. Convolutional neural networks: For some types of data, especially for images, multilayer perceptrons are not well adapted. Indeed, they are defined for vectors as input data, hence, to apply them to images, we should transform the images into vectors, losing by the way the spatial information contained in the images, such as forms. Before the development of deep learning for computer vision, learning was based on the extraction of variables of interest, called features, but these methods need a lot of experience for image processing. The convolutional neural networks (CNN) introduced by LeCun have revolutionized image processing,and removed the manual extraction of features. CNN act directly on matrices, or even on tensors for images with three RGB color channels. CNN are now widely used for image classification, image segmentation, object recognition, face recognition. Layers in a CNN: A Convolutional Neural Network is composed by several kinds of layers,
that are described in this section: convolutional layers, pooling layers, and fully connected layers. After several convolution and pooling layers, the CNN generally ends with several fully connected layers. The tensor at the output of these layers is flattened into a vector, and then several perceptron layers are added.
Deep learning-based image recognition for autonomous driving: Prior to 2010, various image recognition tasks were handled by combining image local features manually designed by researchers (called handcrafted features) with machine learning methods. Since 2010, however, many image recognition methods that use deep learning have been proposed. The image recognition methods using deep learning are far superior, in general object recognition competitions, to the methods used before the appearance of deep learning. Hence, this chapter explains how deep learning is applied to the field of image recognition, and also explains the latest trends in deep learning-based autonomous driving. In the late 1990s, it became possible to process a large amount of data at high speed thanks to the evolution of general-purpose computers. The mainstream method was to extract a feature vector (called image local features) from the image and apply a machine learning method to perform image recognition. Supervised machine learning requires a large amount of class-labeled training samples, but it does not require researchers to design rules as in the case of rule-based methods, so versatile image recognition can be realized. In the 2000s, handcrafted features such as the scale-invariant feature transform (SIFT) and the histogram of oriented gradients (HOG), designed based on the knowledge of researchers, were actively studied as image local features. By combining these image local features with machine learning, practical applications of image recognition technology advanced, as represented by face detection. Later, in the 2010s, deep learning, which performs the feature extraction process through learning, came under the spotlight. A handcrafted feature is not necessarily optimal because it extracts and expresses feature values using an algorithm designed based on the knowledge of researchers. Deep learning is an approach that can automate the feature extraction process and is effective for image recognition. Deep learning has accomplished impressive results in general object recognition competitions, and its use for the image recognition required in autonomous driving (such as object detection and semantic segmentation) is in progress. This chapter explains how deep learning is applied to each task in image recognition and how each task is solved, and describes the trend of deep learning-based autonomous driving and related problems.
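Before moving on to the problem settings, the artificial neuron defined above, yj = φ(⟨wj, x⟩ + bj), can be evaluated with a few lines of NumPy. This is only an illustrative sketch: the weights, bias, and input values below are arbitrary numbers, not parameters from any trained network.

```python
import numpy as np

def neuron(x, w, b, phi=np.tanh):
    """Single artificial neuron: y = phi(<w, x> + b)."""
    return phi(np.dot(w, x) + b)

# Illustrative numbers only
x = np.array([0.5, -1.2, 3.0])     # input vector (x1, ..., xd)
w = np.array([0.8, 0.1, -0.4])     # connection weights (wj,1, ..., wj,d)
b = 0.2                            # neuron bias
print(neuron(x, w, b))             # scalar output yj
```

Stacking many such neurons in layers, and choosing the activation function φ (tanh, sigmoid, ReLU, ...), gives the multilayer networks discussed in this chapter.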
3.6 Problem setting in image recognition:
In conventional machine learning (defined here as the methods used before deep learning gained attention), it is difficult to directly solve general object recognition tasks from the input image. The problem is instead broken down into the tasks of image verification, image classification, object detection, scene understanding, and specific object recognition. Definitions of and approaches to each task are described below.
Image verification: Image verification is the problem of checking whether the object in the image is the same as a reference pattern. In image verification, the distance between the feature vector of the reference pattern and the feature vector of the input image is calculated. If the distance is below a certain threshold, the images are judged to show the same object; otherwise they are judged to be different. Fingerprint, face, and person identification are tasks of this type, where it must be decided whether a given image shows the same person as the reference. In deep learning, the problem of person identification is solved by designing a loss function (the triplet loss function) that makes the distance between two images of the same person small and the distance to another person's image large [1].
Figure 3-2 General Object Recognition.
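The triplet-loss idea mentioned above can be sketched as follows. This is a generic PyTorch illustration, not the exact loss of the cited work; the embedding size, margin, and random inputs are arbitrary placeholders.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Pull same-person embeddings together, push different-person
    embeddings at least `margin` farther away."""
    d_pos = F.pairwise_distance(anchor, positive)   # same person
    d_neg = F.pairwise_distance(anchor, negative)   # different person
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()

# Illustrative usage with random 128-D embeddings (batch of 4 triplets)
a, p, n = (torch.randn(4, 128) for _ in range(3))
print(triplet_loss(a, p, n))
```

During training, the embedding network is updated so that this loss decreases, which is exactly the "same person close, different person far" behavior described above.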
Object detection: Object detection is the problem of finding the location of an object of a certain category in the image. Practical face detection and pedestrian detection belong to this task. Face detection classically uses a combination of Haar-like features [2] and AdaBoost, and pedestrian detection uses HOG features [3] with a support vector machine (SVM). In conventional machine learning, object detection is achieved by training a two-class classifier for a given category and raster scanning it over the image. In deep learning-based object detection, multiclass detection covering several categories can be achieved with a single network. A minimal sketch of the classical HOG + SVM pipeline is given below.
Image classification: Image classification is the problem of finding the category, among predefined categories, to which an object in an image belongs. In conventional machine learning, an approach called bag-of-features (BoF) has been used: the image local features are vector-quantized and the features of the whole image are expressed as a histogram. Deep learning, however, is particularly well suited to image classification, and it gained wide attention in 2015 by achieving an accuracy exceeding human recognition performance on the 1000-class image classification task.
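The classical HOG + SVM pedestrian pipeline referred to above can be sketched as follows, assuming scikit-image and scikit-learn are available. The window size, HOG parameters, and random "windows" are placeholders for illustration, not values from the referenced papers.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_descriptor(window):
    """HOG descriptor for one grayscale window (e.g. 128x64 pixels)."""
    return hog(window, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm='L2-Hys')

# Placeholder training data: positive (pedestrian) and negative windows
pos_windows = [np.random.rand(128, 64) for _ in range(10)]
neg_windows = [np.random.rand(128, 64) for _ in range(10)]

X = np.array([hog_descriptor(w) for w in pos_windows + neg_windows])
y = np.array([1] * len(pos_windows) + [0] * len(neg_windows))

clf = LinearSVC()        # the two-class classifier raster-scanned over the image
clf.fit(X, y)
print(clf.decision_function(X[:2]))   # detection scores for two windows
```

In a real detector, the trained classifier is slid (raster scanned) over many windows of the image at several scales, which is exactly why this conventional approach is slow compared with the single-network deep learning detectors described later.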
Scene understanding (semantic segmentation): Scene understanding is the problem of understanding the scene structure in an image. In particular, semantic segmentation, which finds the object category of each pixel in an image, had been considered difficult to solve using conventional machine learning. It was therefore regarded as one of the ultimate problems of computer vision, but it has been shown to be solvable by applying deep learning.
Specific object recognition: Specific object recognition is the problem of finding a particular, named object. Because the objects are identified by proper nouns, specific object recognition is defined as a subtask of the general object recognition problem. It is achieved by detecting feature points in the image using SIFT and applying a voting process based on the distances to the feature points of reference patterns. Machine learning is not used directly here, but the learned invariant feature transform (LIFT), proposed in 2016, improved performance by learning and replacing each step of SIFT with deep learning.
Deep learning-based image recognition: Image recognition prior to deep learning is not always optimal because image features are extracted and expressed using an algorithm designed based on the knowledge of researchers, which is called a handcrafted feature. The convolutional neural network (CNN) [7], which is one type of deep learning, is an approach that learns both the classifier and the feature extraction from training samples, as shown in Fig. 3-3. The rest of this chapter describes CNN, focuses on object detection and scene understanding (semantic segmentation), and describes their application to image recognition and its trends.
Figure 3-3 Conventional machine learning and deep learning.
3.7 Convolutional Neural Network (CNN):
As shown in Fig. 3-4, a CNN computes the feature map corresponding to each kernel by convolving the kernel (weight filter) over the input image. Since there are multiple kernels, a feature map is computed for each kernel type. Next, the size of the feature map is reduced by pooling. As a result, it is possible to absorb geometrical variations such as slight translation and rotation of the input image. The convolution and pooling processes are applied repeatedly to extract the feature maps. The extracted feature maps are input to fully connected layers, and the probability of each class is finally output. The input layer thus has units corresponding to the image pixels and the output layer has units corresponding to the number of classes.
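The structure just described (repeated convolution and pooling followed by fully connected layers) and one training step by backpropagation, which is described next, can be sketched in PyTorch as follows. The layer sizes, image size, and hyperparameters are arbitrary illustrative choices; this is not the network used in this project.

```python
import torch
import torch.nn as nn
import torch.optim as optim

class SmallCNN(nn.Module):
    """Toy CNN: convolution + pooling blocks, then fully connected layers."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # pooling absorbs small shifts
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 8 * 8, 64), nn.ReLU(),  # assumes 32x32 input images
            nn.Linear(64, num_classes),            # one unit per class
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SmallCNN()
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# One training step on a dummy batch (random data stands in for real images)
images = torch.randn(8, 3, 32, 32)
labels = torch.randint(0, 10, (8,))
logits = model(images)               # forward propagation
loss = criterion(logits, labels)     # error between predictions and labels
optimizer.zero_grad()
loss.backward()                      # back propagation: gradients per parameter
optimizer.step()                     # update kernels and FC weights
print(loss.item())
```

Repeating this forward/backward/update loop over the training set is what "training a CNN" means in the description that follows.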
Figure 3-4 Basic structure of CNN.
Training of a CNN is achieved by updating the parameters of the network with the backpropagation method. The parameters of a CNN are the kernels of the convolutional layers and the weights of the fully connected layers. The process flow of the backpropagation method is as follows. First, training data is input to the network using the current parameters to obtain predictions (forward propagation). The error is calculated from the predictions and the training labels; the update amount of each parameter is obtained from this error, and each parameter in the network is updated from the output layer toward the input layer (back propagation). Training a CNN means repeating these steps to acquire good parameters that can recognize the images correctly.
3.7.1 Advantages of CNN compared to conventional machine learning:
Fig. 3-5 shows some visualization examples of kernels at the first convolution layer of AlexNet, which was designed for the 1000-class object classification task at ILSVRC (ImageNet Large Scale Visual Recognition Challenge) 2012.
AlexNet consists of five convolutional layers and three fully connected layers, and its output layer has 1,000 units corresponding to the number of classes. We see that AlexNet has automatically acquired various filters that extract edge, texture, and color information with directional components. The effectiveness of the CNN filters as local image features was investigated by comparing them with HOG on the human detection task. The detection miss rate for the CNN filters is 3%, while that of HOG is 8%. Although the AlexNet kernels were not trained for the human detection task, the detection accuracy improved over the HOG feature, the traditional handcrafted feature.
Figure 3-5 Network structure of AlexNet and Kernels.
As shown in Fig. 3-6, a CNN can perform not only image classification but also object detection and semantic segmentation by designing the output layer according to each image recognition task. For example, if the output layer is designed to output a class probability and a detection region for each grid cell, it becomes a network that can perform object detection. For semantic segmentation, the output layer should be designed to output the class probability for each pixel. The convolution and pooling layers can be used as common modules for these tasks. In the conventional machine learning approach, on the other hand, it was necessary to design image local features for each task and combine them with machine learning. CNN has the flexibility to be applied to various tasks by changing the network
structure, and this property is a great advantage in achieving image recognition.
Figure 3-6 Application of CNN to each image recognition task.
3.7.2 Application of CNN to the object detection task:
Conventional machine learning-based object detection is an approach that raster scans a two-class classifier over the image. In this case, because the aspect ratio of the object to be detected is fixed, only the single category learned as the positive sample can be detected. In object detection using CNN, on the other hand, object proposal regions with different aspect ratios are detected by the CNN, and multiclass object detection becomes possible using the Region Proposal approach, which performs multiclass classification with a CNN for each detected region. Faster R-CNN introduces the Region Proposal Network (RPN), as shown in Fig. 3-7, and simultaneously detects object candidate regions and recognizes the object classes in those regions. First, convolution processing is performed on the entire input image to obtain a feature map. In the RPN, objects are detected by raster scanning a detection window over the obtained feature map. In this raster scan, k detection windows with different shapes are applied, centered on positions
known as anchors. The region specified by an anchor is input to the RPN, which outputs an objectness score and the detected coordinates on the input image. In addition, the region specified by the anchor is also input to another fully connected network, and object recognition is performed when the RPN determines that it contains an object. The number of units in the output layer is therefore the number of classes plus ((x, y, w, h) × number of classes) for each rectangle. These Region Proposal methods have made it possible to detect multiple classes of objects with different aspect ratios.
Figure 3-7 Faster R-CNN structure.
In 2016, the single-shot approach was proposed as a new multiclass object detection method. It detects multiple objects simply by giving the whole image to a CNN, without raster scanning the image. YOLO (You Only Look Once) is a representative method in which an object rectangle and an object category are output for each local region of a 7 × 7 grid, as shown in Fig. 3-8. First, feature maps are generated through convolution and pooling of the input image. The position (i, j) of each channel of the obtained feature map (7 × 7 × 1024) acts as a region feature corresponding to grid cell (i, j) of the input image, and this feature map is fed into fully connected layers. The output values obtained from the fully connected layers are the object category scores (20 categories) at each grid position and the position, size, and confidence of two object rectangles. The number of units in the output layer is therefore 1470: the position, size, and confidence of two rectangles ((x, y, w, h, confidence) × 2) added to the number of categories (20), multiplied by the number of grid cells (7 × 7). In YOLO, it is not necessary to detect object region candidates as in Faster R-CNN; therefore, object detection can be performed in real time. Fig. 3-8 shows an
example of YOLO-based multiclass object detection.
Figure 3-8 YOLO structure and examples of multiclass object detection.
3.7.3 Application of CNN to semantic segmentation:
Semantic segmentation is a difficult task and has been studied for many years in the field of computer vision. However, as with other tasks, deep learning-based methods have been proposed and have achieved much higher performance than conventional machine learning methods. The fully convolutional network (FCN) is a method that enables end-to-end learning and can obtain segmentation results using only a CNN. The structure of FCN is shown in Fig. 3-9. The FCN has a network structure without fully connected layers. The size of the generated feature map is reduced by repeatedly applying convolutional and pooling layers to the input image. To make it the same size as the original image, the feature map is enlarged 32 times in the final layer, and
convolution processing is performed. This is called deconvolution. The final layer outputs a probability map for each class. The model is trained so that the probability of the correct class is obtained at each pixel, and the output of the end-to-end segmentation model has size (w × h × number of classes). In general, feature maps in the middle layers of a CNN capture more detailed information the closer they are to the input layer, and the pooling process integrates this information, resulting in the loss of detail. When such a reduced feature map is expanded, only coarse segmentation results are obtained. Therefore, high accuracy is achieved by integrating and using the feature maps of the middle layers. To this end, FCN integrates feature maps in the middle of the network: mid-level feature maps are concatenated in the channel direction, convolution is applied, and segmentation results of the same size as the original image are output.
Figure 3-9 Fully Convolutional Network (FCN) Structure.
When expanding the feature map obtained on the encoder side, PSPNet can capture information at different scales by using the Pyramid Pooling Module. On the encoder side, where the vertical and horizontal sizes of the original image have been reduced to 1/8, the Pyramid Pooling Module pools the feature map to sizes of 1 × 1, 2 × 2, 3 × 3, and 6 × 6. Convolution is then performed on each pooled feature map. Next, the feature maps are expanded back to the same size and concatenated, convolution is applied, and the probability maps of each class are output. PSPNet is the method that won the “Scene parsing” category of ILSVRC 2016. High accuracy has also been achieved on the Cityscapes dataset captured with a dashboard camera. Fig. 3-10 shows the result of PSPNet-based semantic segmentation.
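The Pyramid Pooling Module idea can be sketched in PyTorch as follows. The bin sizes (1, 2, 3, 6) follow the description above, while the channel counts and feature-map size are arbitrary illustrative choices, not PSPNet's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """Pool the encoder feature map to 1x1, 2x2, 3x3 and 6x6 bins,
    convolve each, upsample back, and concatenate with the input."""
    def __init__(self, in_channels=512, bins=(1, 2, 3, 6)):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Sequential(
                nn.AdaptiveAvgPool2d(b),
                nn.Conv2d(in_channels, in_channels // len(bins), kernel_size=1),
            )
            for b in bins
        )

    def forward(self, x):
        h, w = x.shape[2:]
        pooled = [F.interpolate(stage(x), size=(h, w), mode='bilinear',
                                align_corners=False) for stage in self.stages]
        return torch.cat([x] + pooled, dim=1)

# Example: a 1/8-resolution encoder feature map for a 512x512 input image
feat = torch.randn(1, 512, 64, 64)
print(PyramidPooling()(feat).shape)   # torch.Size([1, 1024, 64, 64])
```

Concatenating the multi-scale pooled maps with the original feature map is what lets the decoder see both local detail and global scene context before producing the per-pixel class probabilities.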
Figure 3-10 Example of PSPNet-based Semantic Segmentation Results (cited from Reference [11]).
3.7.4 CNN for ADAS applications:
Machine learning techniques can be used to implement the system intelligence in ADAS (Advanced Driver Assistance Systems). The goal of ADAS is to provide the driver with up-to-date information about the surroundings obtained from sonar, radar, and cameras. Although ADAS typically relies on radar and sonar for long-range detection, CNN-based systems have recently come to play a significant role in pedestrian detection, lane detection, and redundant object detection at moderate distances. For autonomous driving, the core components can be grouped into perception (including localization), planning, and control. Perception refers to understanding the environment, such as where obstacles are located, detecting road signs and markings, and categorizing objects by their semantic labels such as pedestrians, bikes, and vehicles. Localization refers to the ability of the autonomous vehicle to determine its position in the environment. Planning refers to the process of making decisions in order to achieve the vehicle's goals, typically bringing the vehicle from a start location to a goal location while avoiding obstacles and optimizing the trajectory. Finally, control refers to the vehicle's ability to execute the planned actions. CNN-based object detection is well suited to perception because it can handle multi-class objects. Semantic segmentation also provides useful information for making decisions in
planning, for example to avoid obstacles by referring to the pixels categorized as road.
3.8 Deep learning-based autonomous driving:
This section introduces end-to-end learning, which can infer the control values of the vehicle directly from the input image, as a use of deep learning for autonomous driving, and describes the visual explanation of judgment grounds, which is a known problem of deep learning models, together with future challenges.
3.8.1 End-to-end learning-based autonomous driving:
In most research on autonomous driving, the environment around the vehicle is understood using a dashboard camera and Light Detection and Ranging (LiDAR), an appropriate traveling position is determined by motion planning, and then the control values of the vehicle are determined. Autonomous driving based on these three processes is common, and the deep learning-based object detection and semantic segmentation introduced earlier in this chapter are beginning to be used to understand the surrounding environment. On the other hand, with the progress in CNN research, end-to-end learning-based methods have been proposed that can infer the control values of the vehicle directly from the input image. In these methods, the network is trained using the dashboard camera images recorded while a person drives, with the vehicle control value of each frame as the training data. End-to-end learning-based autonomous driving control has the advantage of a simplified system configuration, because the CNN learns automatically and consistently without an explicit understanding of the surrounding environment or motion planning. Along these lines, Bojarski et al. proposed an end-to-end learning method for autonomous driving in which dashboard camera images are input to a CNN that outputs the steering angle directly. Following this work, several studies have been conducted: a method considering the temporal structure of the dashboard camera video, and a method that trains the CNN in a driving simulator and uses the trained network to control a vehicle in the real environment. These methods basically control only the steering angle, while the throttle (i.e., accelerator and brake) is controlled by a human. The autonomous driving model in Reference infers not only the steering but also the throttle as control values of the vehicle. The network structure is composed of five convolutional layers with pooling, followed by three fully connected layers. In addition, the inference takes the vehicle's own state into account by feeding the vehicle speed into the fully connected layers in addition to the dashboard images, since it is necessary to infer the change of speed
of one's own vehicle for throttle control. In this manner, high-precision control of steering and throttle can be achieved in various driving scenarios.
3.8.2 Visual explanation of end-to-end learning:
CNN-based end-to-end learning has the problem that the basis of the output control value is not known. To address this, research is being conducted on presenting the judgment grounds (such as turning the steering wheel to the left or right, or stepping on the brakes) in a form that can be understood by humans. The common approach to clarifying the reason for the network's decision-making is visual explanation. A visual explanation method outputs an attention map that visualizes, as a heat map, the regions on which the network focused. Based on the obtained attention map, we can analyze and understand the reason for the decision-making. To obtain clearer and more explainable attention maps for efficient visual explanation, a number of methods have been proposed in the computer vision field. Class activation mapping (CAM) generates attention maps by weighting the feature maps obtained from the last convolutional layer in a network. Gradient-weighted class activation mapping (Grad-CAM) is another common method, which generates an attention map using the gradient values calculated during backpropagation. This method is widely used for general analysis of CNNs because it can be applied to any network. Fig. 3-11 shows example attention maps of CAM and Grad-CAM.
Figure 3-11 Attention maps of CAM and Grad-CAM (cited from reference [22]).
VisualBackProp was developed to visualize the intermediate values in a CNN; it accumulates the feature maps of each convolutional layer into a single map.
This enables us to understand where in the input image the network responds strongly. Reference proposes a Regression-type Attention Branch Network in which a CNN is divided into a feature extractor and a regression branch, as shown in Fig. 3-12, with an attention branch inserted that outputs an attention map serving as a visual explanation. By providing the vehicle speed to the fully connected layers and training each branch of the Regression-type Attention Branch Network end to end, control values for steering and throttle can be output for various scenes, together with an attention map that indicates the locations in the input image from which the control values were derived. Fig. 3-13 shows an example of visualizing the attention map during Regression-type Attention Branch Network-based autonomous driving. S and T in the figure are the steering value and throttle value, respectively. Fig. 3-13 (a) shows a scene where the road curves to the right; there is a strong response to the center line of the road, and the steering output is a positive value indicating the right direction. Fig. 3-13 (b), on the other hand, is a scene where the road curves to the left; the steering output is a negative value indicating the left direction, and the attention map responds strongly to the white line on the right. By visualizing the attention map in this way, it can be seen that the center line of the road and the position of the lane are observed when estimating the steering value. Also, in the scene where the car stops, shown in Fig. 3-13 (c), the attention map responds strongly to the brake lamp of the vehicle ahead. The throttle output is 0, which indicates that neither the accelerator nor the brake is pressed. Therefore, it is understood that the state of the vehicle ahead is closely watched when determining the throttle. In addition, the night driving scenario in Fig. 3-13 (d) shows a scene of following a car ahead, and the attention map responds strongly to the car ahead because the road shape ahead is unknown. In this way, the judgment grounds can be visually explained through the output of the attention map.
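As an illustration of how such gradient-weighted attention maps are computed, the following PyTorch sketch applies the generic Grad-CAM recipe to a stand-in network. The model, target layer, and dummy input are placeholders (this is not the Regression-type Attention Branch Network itself), and it assumes a recent PyTorch/torchvision.

```python
import torch
import torch.nn.functional as F
import torchvision

model = torchvision.models.resnet18(weights=None)   # stand-in CNN
target_layer = model.layer4                          # last convolutional block
activations, gradients = {}, {}

def fwd_hook(_, __, output):
    activations['value'] = output
def bwd_hook(_, grad_in, grad_out):
    gradients['value'] = grad_out[0]

target_layer.register_forward_hook(fwd_hook)
target_layer.register_full_backward_hook(bwd_hook)

image = torch.randn(1, 3, 224, 224)   # dummy dashboard frame
output = model(image)
# For a steering regressor this would be the steering output; here we simply
# take the first output unit as the scalar whose supporting evidence we want.
output[0, 0].backward()

weights = gradients['value'].mean(dim=(2, 3), keepdim=True)  # average gradients
cam = F.relu((weights * activations['value']).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=image.shape[2:], mode='bilinear', align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)     # normalize to [0, 1]
print(cam.shape)   # heat map with the same spatial size as the input image
```

Overlaying this normalized map on the input frame gives the kind of heat-map visualization shown in Fig. 3-13.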
Figure 3-12 Regression-type Attention Branch Network (cited from reference [17]).
Figure 3-13 Attention map-based visual explanation for self-driving.
3.8.3 Future challenges:
Visual explanations enable us to analyze and understand the internal state of deep neural networks, which is useful for engineers and researchers. One of the future challenges is explanation for end users, i.e., passengers in a self-driving vehicle. In fully autonomous driving, for instance, when the lane is suddenly changed even though there are no vehicles ahead or to the side, the passenger may wonder why the lane was changed. In such cases, the attention map visualization introduced in Section 3.8.2 enables people to understand the reason for changing lanes. However, visualizing the attention map in a fully automated vehicle does not make sense unless a person in the autonomous vehicle is always watching it. A person in an autonomous car, that is, a person who receives the full benefit of AI, needs to be informed of the judgment grounds in the form of text or voice stating, “Changing to the left lane as a vehicle is approaching from the rear
with speed.” Transitioning from recognition results and visual explanation to verbal explanation will be a challenge to confront in the future. Although several attempts have been made for this purpose, they do not yet achieve sufficient accuracy or flexible verbal explanations. In the more distant future, such verbal explanation functions may eventually no longer be needed. At first, people who receive the full benefit of autonomous driving may find it difficult to accept, but a sense of trust will gradually be created by repeated verbal explanations. Once confidence is established between the autonomous driving AI and the person, the verbal explanation functions will no longer be required, and AI-based autonomous driving can be expected to become widely and generally accepted.
3.9 Conclusion:
This chapter explained how deep learning is applied to image recognition tasks and introduced the latest image recognition technology using deep learning. Image recognition with deep learning is the problem of finding an appropriate mapping function from a large amount of data and training labels. Further, it is possible to solve several problems simultaneously by using multitask learning. Future prospects include not only “recognition” of input images but also high expectations for the development of end-to-end learning and deep reinforcement learning technologies for the “judgment” and “control” of autonomous vehicles. Moreover, providing judgment grounds for the output of deep learning and deep reinforcement learning is a major challenge for practical application, and it is desirable to expand from visual explanation to verbal explanation through integration with natural language processing.
Chapter 4: Object Detection
4.1 Introduction:
Object detection is a computer vision task that involves two main subtasks:
1) Localizing one or more objects within an image, and
2) Classifying each object in the image.
This is done by drawing a bounding box around each identified object together with its predicted class. This means that the system doesn’t just predict the class of the image, as in image classification tasks; it also predicts the coordinates of the bounding box that fits the detected object. It is a challenging computer vision task because it requires both successful object localization, in order to locate and draw a bounding box around each object in an image, and object classification, to predict the correct class of the object that was localized. Figure 4-1 contrasts the image classification and object detection tasks.
Figure 4-1 Classification and Object Detection.
Object detection is widely used in many fields. For example, in self-driving technology, we need to plan routes by identifying the locations of vehicles, pedestrians, roads, and obstacles in the captured video image. Robots often perform this type of task to detect targets of interest, and systems in the security field need to detect abnormal targets such as intruders or bombs. Now that you understand what object detection is and what differentiates it from image classification tasks, let’s take a look at the general framework of object detection projects.
1. First, we will explore the general framework of object detection algorithms.
2. Then, we will dive deep into three of the most popular detection algorithms:
a. The R-CNN family of networks
b. SSD
c. The YOLO family of networks [3]
4.2 General object detection framework:
Typically, there are four components of an object detection framework.
4.2.1 Region proposal: An algorithm or a deep learning model is used to generate regions of interest (RoI) to be further processed by the system. These region proposals are regions that the network believes might contain an object; this step outputs a large number of bounding boxes, each with an objectness score. Boxes with a high objectness score are then passed along to the next network layers for further processing.
4.2.2 Feature extraction and network predictions: Visual features are extracted for each of the bounding boxes; they are evaluated, and it is determined whether and which objects are present in the proposals based on those visual features (i.e., an object classification component).
4.2.3 Non-maximum suppression (NMS): At this step, the model has likely found multiple bounding boxes for the same object. Non-maximum suppression helps avoid repeated detection of the same instance by combining overlapping boxes into a single bounding box for each object.
4.2.4 Evaluation metrics: Similar to the accuracy, precision, and recall metrics in image classification tasks, object detection systems have their own metrics to evaluate detection performance. In this section we will explain the most popular metrics: mean average precision (mAP), the precision-recall curve (PR curve), and intersection over union (IoU).
Now, let’s dive one level deeper into each of these components to build an intuition about their goals.
4.3 Region proposals:
In this step, the system looks at the image and proposes regions of interest for further analysis. The regions of interest (RoI) are regions that the system believes have a high likelihood of containing an object, quantified by an objectness score. Regions with a high objectness score are passed to the next steps, whereas regions with a low score are abandoned.
Figure 4-2 Low and high objectness scores.
4.3.1 Approaches to generate region proposals:
Originally, the ‘selective search’ algorithm was used to generate object proposals. Other approaches use more complex visual features extracted from the image by a deep neural network to generate regions (for example, based on the features from a deep learning model). This step produces a lot (thousands) of bounding boxes to be further analyzed and classified by the network. If the objectness score is above a certain threshold, then the region is considered foreground and pushed forward in the network. Note that this threshold is configurable based on your problem. If the threshold is low, your network will exhaustively generate all possible proposals and you will have a better chance of detecting all objects in the image. On the flip side, this is very computationally expensive and will slow down detection. So the trade-off made in region proposal generation is the number of regions vs. the computational complexity, and the right approach is to use problem-specific information to reduce the number of RoIs.
Figure 4-3 An example of selective search applied to an image. A threshold can be tuned in the SS algorithm to generate more or fewer proposals.
Network predictions: This component includes the pretrained CNN that is used for feature extraction: it extracts features from the input image that are representative of the task at hand and uses these features to determine the class of the image. In object detection frameworks, people typically use pretrained image classification models to extract visual features, as these tend to generalize fairly well. For example, a model trained on the MS COCO or ImageNet dataset is able to extract fairly generic features. In this step, the network analyzes all the regions that have been identified as having a high likelihood of containing an object and makes two predictions for each region:
Bounding box prediction: the coordinates that locate the box surrounding the object. The bounding box coordinates are represented as the tuple (x, y, w, h), where x and y are the coordinates of the center point of the bounding box and w and h are the width and height of the box.
Class prediction: the classic softmax function that predicts the class probability for each object. Since there are thousands of regions proposed, each object will always have multiple bounding boxes surrounding it with the correct classification. For most problems we just need one bounding box per object. What if we are building a system to count dogs in an image? Our current system would count 5 dogs. We don’t want that. This is where the non-maximum suppression technique comes in handy.
Figure 4-4 Class prediction: an object detector predicting 5 bounding boxes for the dog in the image.
4.3.2 Non-maximum suppression (NMS):
One of the problems with object detection algorithms is that they may find multiple detections of the same object. So, instead of creating only one bounding box around the object, they draw multiple boxes for the same object. Non-maximum suppression (NMS) is a technique used to make sure that the detection algorithm detects each object only once. As the name implies, the NMS technique looks at all the boxes surrounding an object, finds the box that has the maximum prediction probability, and suppresses or eliminates the other boxes, hence the name non-maximum suppression.
Figure 4-5 Predictions before and after NMS.
4.4 Steps of how the NMS algorithm works:
1) Discard all bounding boxes with predictions below a certain threshold, called the confidence threshold. This threshold is tunable. This means that a box is suppressed if its prediction probability is less than the set threshold.
2) Look at all the remaining boxes and select the bounding box with the highest probability.
3) Then calculate the overlap of the remaining boxes that have the same class prediction. Bounding boxes that have high overlap with each other and predict the same class are averaged together. This overlap metric is called Intersection Over Union (IOU).
4) The algorithm then suppresses any box whose IOU with the selected box is greater than a certain threshold (called the NMS threshold). Usually the NMS threshold is equal to 0.5, but it is tunable as well if you want to output fewer or more bounding boxes.
NMS techniques are typically standard across the different detection frameworks.
4.4.1 Object detector evaluation metrics:
When evaluating the performance of an object detector, we use two main evaluation metrics:
FPS (FRAMES PER SECOND) TO MEASURE THE DETECTION SPEED: The most common metric used to measure detection speed is the number of frames per second (FPS). For example, Faster R-CNN operates at only 7
FPS, whereas SSD operates at 59 FPS.
MEAN AVERAGE PRECISION (mAP): The most common evaluation metric used in object recognition tasks is ‘mAP’, which stands for mean average precision. It is a percentage from 0 to 100, and higher values are typically better, but its value has a different meaning from the accuracy metric in classification. To understand mAP, we need to understand Intersection Over Union (IOU) and the Precision-Recall Curve (PR Curve).
4.4.1.3 Intersection Over Union (IOU): It is a measure that evaluates the overlap between two bounding boxes: the ground-truth bounding box Bground truth (the hand-labeled box we feed to the network during training) and the predicted bounding box Bpredicted (the output of the network). By applying the IOU, we can tell whether a detection is valid (True Positive) or not (False Positive).
Figure 4-6 Intersection Over Union (IOU) equation.
Figure 4-7 Ground-truth and predicted box.
The intersection over union value ranges from 0, meaning no overlap at all, to 1, which means that the two bounding boxes overlap 100%. The higher the overlap between the two bounding boxes (the IOU value), the better. To calculate the IoU of a prediction, we need: the ground-truth bounding box (Bground truth), the hand-labeled bounding box created during the labeling process, and the predicted bounding box (Bpredicted) from our model. IoU is used to define a “correct” prediction: a correct prediction (True Positive) is one whose IoU is greater than some threshold. This threshold is a tunable value depending on the challenge, but 0.5 is a standard value. For example, some challenges like MS COCO use mAP@0.5, meaning an IOU threshold of 0.5, or mAP@0.75, meaning an IOU threshold of 0.75. This means that if the IoU is above this threshold the detection is considered a True Positive (TP), and if it is below it is considered a False Positive (FP).
4.5 Precision-Recall Curve (PR Curve):
Precision: a metric that quantifies the number of correct positive predictions made. It is calculated as the number of true positives divided by the total number of true positives and false positives.
Precision = TruePositives / (TruePositives + FalsePositives)
Recall: a metric that quantifies the number of correct positive predictions made out of all positive predictions that could have been made. It is calculated as the number of true positives divided by the total number of true positives and false negatives (i.e., it is the true positive rate).
Recall = TruePositives / (TruePositives + FalseNegatives)
The result is a value between 0.0 for no recall and 1.0 for full or perfect recall. Both precision and recall focus on the positive class (the minority class) and are unconcerned with the true negatives (the majority class).
PR Curve: a plot of Recall (x-axis) vs Precision (y-axis). A model with perfect skill is depicted as a point at the coordinate (1,1). A skillful model is represented by a curve that bows towards
the coordinate (1,1). A no-skill classifier will be a horizontal line on the plot with a precision equal to the fraction of positive examples in the dataset. For a balanced dataset this will be 0.5.
Figure 4-8 PR Curve.
A detector is considered good if its precision stays high as recall increases, which means that if you vary the confidence threshold, the precision and recall will still be high. On the other hand, a poor detector needs to increase the number of FPs (lowering precision) in order to achieve a high recall. That is why the PR curve usually starts with high precision values that decrease as recall increases. Now that we have the PR curve, we calculate the AP (Average Precision) as the Area Under the Curve (AUC). Finally, mAP for object detection is the average of the APs calculated over all classes. It is also important to note that some papers use AP and mAP interchangeably.
4.6 Conclusion:
To recap, the mAP is calculated as follows:
1. Each bounding box has an associated objectness score (the probability of the box containing an object).
2. Precision and recall are calculated.
3. The precision-recall curve (PR curve) is computed for each class by varying the score threshold.
4. Calculate the average precision (AP): it is the area under the PR curve. In this step, the AP is computed for each class.
5. Calculate mAP: the average AP over all the different classes.
Most deep learning object detection implementations handle computing mAP for you. A minimal sketch of the IoU and NMS computations described above is shown below.
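The following NumPy sketch implements the IoU measure and the standard greedy form of NMS described above. Note that the boxes here use corner coordinates (x1, y1, x2, y2) rather than the center format, the thresholds are the usual defaults, and the sample boxes and scores are made-up values for illustration.

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, conf_thresh=0.5, nms_thresh=0.5):
    """Greedy NMS: drop low-confidence boxes, keep the highest-scoring box,
    suppress boxes that overlap it by more than nms_thresh, and repeat."""
    order = [i for i in np.argsort(scores)[::-1] if scores[i] >= conf_thresh]
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= nms_thresh]
    return keep

boxes = np.array([[10, 10, 60, 60], [12, 12, 58, 62], [100, 100, 150, 160]], float)
scores = np.array([0.9, 0.75, 0.8])
print(nms(boxes, scores))   # -> [0, 2]: the near-duplicate of box 0 is suppressed
```

The same two functions underlie the mAP computation: IoU decides which detections count as true positives at a given threshold, and NMS removes duplicate detections before precision and recall are measured.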
Now that we understand the general framework of object detection algorithms, let’s dive deeper into three of the most popular detection algorithms.
4.7 Region-Based Convolutional Neural Networks (R-CNNs) [high mAP and low FPS]:
R-CNN was developed by Ross Girshick et al. in 2014 in their paper “Rich feature hierarchies for accurate object detection and semantic segmentation”. The R-CNN family then expanded to include Fast R-CNN and Faster R-CNN, which came out in 2015 and 2016.
R-CNN: The R-CNN is the least sophisticated region-based architecture in its family, but it is the basis for understanding how this whole family of object recognition algorithms works. It was one of the first large and successful applications of convolutional neural networks to the problem of object detection and localization, and it paved the way for the other, more advanced detection algorithms.
Figure 4-9 Regions with CNN features.
The R-CNN model is comprised of four components:
Extract regions of interest (RoI) - also known as extracting region proposals. These are regions that have a high probability of containing an object. This is done by using an algorithm called Selective Search to scan the input image, find regions that contain blobs, and propose them as regions of interest to be processed by the next modules in the pipeline. The proposed regions of interest are then warped to a fixed size, because they usually vary in size and CNNs require a fixed input image size.
What is Selective Search? Selective search is a greedy search algorithm that is used to provide region proposals that potentially contain objects. It tries to find the areas that might contain an object by combining similar pixels and textures into several rectangular boxes. Selective Search combines the strengths of both an exhaustive search algorithm (which examines all possible locations in the image) and a bottom-up segmentation algorithm (which hierarchically groups similar regions) to capture all possible object locations. A minimal usage sketch of selective search is shown after this component list.
Figure 4-9 Input. Figure 4-10 Output.
Feature Extraction module - we run a pretrained convolutional network on top of the region proposals to extract features from each candidate region. This is the typical CNN feature extractor.
Classification module - we train a classifier such as a Support Vector Machine (SVM), a traditional machine learning algorithm, to classify candidate detections based on the features extracted in the previous step.
Localization module - also known as the bounding box regressor. Let’s take a step back to understand regression. Machine learning problems are categorized as classification and regression problems. Classification algorithms output discrete, predefined classes (dog, cat, elephant), whereas regression algorithms output continuous-value predictions. In this module, we want to predict the location and size of the bounding box that surrounds the object. The bounding box is represented by four values: the x and y coordinates of the box’s origin (x, y), and the width and height of the box (w, h). Putting this together, the regressor predicts the four real-valued numbers that define the bounding box as the tuple (x, y, w, h).
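As a concrete illustration of the selective search step mentioned above, the following sketch uses the OpenCV contrib implementation. It assumes the opencv-contrib-python package is installed, and the image path is a placeholder rather than a file from this project.

```python
import cv2

# Selective Search region proposals via OpenCV's contrib module.
image = cv2.imread('frame.jpg')                      # hypothetical input frame
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(image)
ss.switchToSelectiveSearchFast()                     # a slower 'Quality' mode also exists
rects = ss.process()                                 # (x, y, w, h) proposals
print(f'{len(rects)} region proposals')

# In R-CNN, each proposal would then be cropped, warped to a fixed size,
# and passed through the CNN feature extractor; here we just keep the top 200.
proposals = rects[:200]
```

The thousands of rectangles returned by this step are exactly what makes the original R-CNN so slow: every one of them has to be pushed through the CNN and the SVM.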
Figure 4-11 R-CNN architecture. Each proposed RoI is passed through the CNN to extract features and then through an SVM.
4.7.1 HOW DO WE TRAIN R-CNNS?
R-CNNs are composed of four modules: the selective search region proposal, the feature extractor, the classifier, and the bounding box regressor. All of the R-CNN modules need to be trained except for the selective search algorithm. So, in order to train R-CNNs, we need to train the following modules:
1. Feature extractor CNN: this is a typical CNN training process. Here, we either train a network from scratch, which rarely happens, or fine-tune a pretrained network as we learned in the transfer learning part.
2. SVM classifier: the Support Vector Machine is a traditional machine learning classifier, but it is no different from deep learning classifiers in the sense that it needs to be trained on labeled data.
3. Bounding box regressors: another model that outputs four real-valued numbers for each of the K object classes to tighten the region bounding boxes.
Looking through the R-CNN learning steps, you can easily see that training an R-CNN model is expensive and slow. The training process involves training three separate modules without much shared computation. This multi-stage pipeline training is one of the disadvantages of R-CNNs, as we will see next.
4.7.2 WHAT ARE THE DISADVANTAGES OF R-CNN?
1. Very slow object detection: the selective search algorithm proposes about 2,000 regions of interest per image to be examined by the entire pipeline (CNN feature extractor + classifier). This is very computationally expensive because a ConvNet forward pass is performed for each object proposal without sharing computation, which makes it incredibly slow given that the Selective Search algorithm extracts thousands of regions that need to be investigated. This high computational cost makes R-CNN a poor fit for many applications, especially real-time applications that require very fast inference, such as self-driving cars.
2. Training is a multi-stage pipeline: as discussed earlier, R-CNN requires the training of three modules: the CNN feature extractor, the SVM classifier, and the bounding-box regressors. This makes the training process very complex and not end-to-end.
3. Training is expensive in space and time: when training the SVM classifier and the bounding-box regressor, features are extracted from each object proposal in each image and written to disk. With very deep networks, such as VGG16, the training process for a few thousand images takes days using GPUs. The training process is expensive in space as well, because the extracted features require hundreds of gigabytes of storage.
4.8 Fast R-CNN:
Fast R-CNN resembles the R-CNN technique in many ways, but it improved on R-CNN's detection speed while also increasing detection accuracy through two main changes:
1. Instead of starting with the region proposal module and then the feature extraction module like R-CNN, Fast R-CNN applies the CNN feature extractor first to the entire input image and then proposes regions. This way we run only one ConvNet over the entire image instead of 2,000 ConvNets over 2,000 overlapping regions.
2. Extend the ConvNet to do the classification part as well by replacing the traditional machine learning algorithm, the SVM, with a softmax layer. This way we have only one model performing both tasks: 1) feature extraction and 2) object classification.
4.8.1 FAST R-CNN ARCHITECTURE:
Instead of training many different SVMs to classify each object class, there is a single softmax layer that outputs the class probabilities directly. Now there is one neural net to train, as opposed to one neural net and many SVMs. The architecture of Fast R-CNN consists of the following modules (a small RoI pooling sketch follows this list):
1. Feature extractor module: the network starts with a ConvNet to extract features from the full image.
2. Regions of Interest (RoI) extractor: a selective search algorithm that proposes about 2,000 region candidates per image.
3. RoI Pooling layer - a new component introduced in the Fast R-CNN architecture to extract a fixed-size window from the feature map before feeding the RoIs to the fully connected layers. It uses max pooling to convert the features inside any valid region of interest into a small feature map with a fixed spatial extent of Height × Width (H×W). The RoI pooling layer is explained in more detail in the Faster R-CNN section, but for now, understand that it is applied to the last feature map extracted by the CNN and its goal is to extract fixed-size regions of interest to feed into the FC layers and then the output layers.
4. Two-head output layer: the model branches into two heads:
a. A softmax classifier layer that outputs a discrete probability distribution per RoI
b. A bounding-box regressor layer that predicts offsets relative to the original RoI
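The RoI pooling operation can be illustrated with torchvision's built-in roi_pool. The feature-map size, stride, and box coordinates below are arbitrary example values chosen only to show the fixed-size output.

```python
import torch
from torchvision.ops import roi_pool

# Backbone feature map for one image: 1 x 256 channels x 50 x 50,
# corresponding (for illustration) to an 800x800 input downsampled by 16.
feature_map = torch.randn(1, 256, 50, 50)

# Two RoIs in input-image coordinates, each given as
# (batch_index, x1, y1, x2, y2); the values here are arbitrary examples.
rois = torch.tensor([[0, 100.0, 120.0, 300.0, 360.0],
                     [0, 400.0, 200.0, 620.0, 500.0]])

# Pool every RoI to a fixed 7x7 window, regardless of its original size.
pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)   # torch.Size([2, 256, 7, 7]) -> ready for the FC layers
```

Because every region comes out with the same 7 × 7 spatial size, all proposals can share the same fully connected classification and regression heads.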
4.8.2 DISADVANTAGES OF FAST R-CNN:
The selective search algorithm for generating region proposals is very slow, and the proposals are generated separately by another model.
4.9 Faster R-CNN:
As in Fast R-CNN, the image is provided as input to a convolutional network which produces a convolutional feature map. Instead of using a selective search algorithm on the feature map to identify the region proposals, a network called the Region Proposal Network (RPN) is used to predict the region proposals as part of the training process. The predicted region proposals are then reshaped using an RoI pooling layer and used to classify the image within the proposed region and to predict the offset values for the bounding boxes. These improvements both reduce the number of region proposals and accelerate the test-time operation of the model to near real-time with then state-of-the-art performance.
4.9.1 FASTER R-CNN ARCHITECTURE:
The architecture of Faster R-CNN can be described by two main networks:
1. Region Proposal Network (RPN): selective search is replaced by a ConvNet that proposes regions of interest (RoI) from the last feature maps of the feature extractor, to be considered for further investigation. The RPN has two outputs: the “objectness score” (object or no object) and the box location.
2. Fast R-CNN: the second network consists of the typical components of Fast R-CNN:
a. Base network / feature extractor: a typical pre-trained CNN model to extract features from the input image.
b. RoI pooling layer: to extract fixed-size regions of interest.
c. Output layer: contains two fully connected layers: 1) a softmax classifier to output the class probability, and 2) a bounding-box regressor to produce the bounding box predictions.
Figure 4-12 Fast R-CNN.
To summarize, the Faster R-CNN architecture has two main components:
a. A region proposal network (RPN), which identifies regions that may contain objects of interest and their approximate location.
b. A Fast R-CNN network, which classifies objects and refines their location, defined using bounding boxes.
The two components share the convolutional layers of the pre-trained VGG-16. As you can see in the Faster R-CNN architecture diagram, the input image is presented to the network and its features are extracted via a pre-trained CNN. These features are sent, in parallel, to two different components of the Faster R-CNN architecture:
1. The RPN, to determine where in the image a potential object could be. At this point we do not know what the object is, just that there is potentially an object at a certain location in the image.
2. RoI pooling, to extract fixed-size windows of features.
This architecture achieves an end-to-end trainable, complete object detection pipeline where all the required components live inside the network, including:
1. Base network feature extractor
2. Region proposal
3. RoI pooling
4. Object classification
5. Bounding box regressor
BASE NETWORK TO EXTRACT FEATURES: Similar to Fast R-CNN, the first step is to take a pretrained CNN and slice off its classification part. The base network is used to extract features from the input image. In this component, you can use any of the popular CNN architectures based on the problem you are trying to solve. For example, MobileNet, a smaller and more efficient network architecture optimized for speed, has approximately 3.3M parameters, while ResNet-152 (152 layers), once the state of the art in the ImageNet classification competition, has around 60M. More recently, new architectures like DenseNet have been improving results while lowering the number of parameters.
REGION PROPOSAL NETWORK (RPN): The region proposal network identifies regions that could potentially contain objects of interest, based on the last feature map of the pre-trained convolutional neural network. The RPN is also known as the ‘attention network’ because it guides the network's attention to interesting regions in the image. Faster R-CNN uses the Region Proposal Network (RPN) to bake the region proposal directly into the R-CNN architecture instead of running a Selective Search algorithm to extract regions of interest.
Figure 4-14 The RPN classifier predicts the objectness score, which is the probability of a region containing an object (foreground) or background.
Fully convolutional networks (FCN): One important aspect of object detection networks is that they should be fully convolutional. A fully convolutional neural network is one that does not contain any fully connected (FC) layers, which are typically found at the end of a network prior to making the output predictions. In the context of image classification, removing the fully connected layers is normally accomplished by applying average pooling across the entire volume prior to a single dense softmax classifier used to output the final predictions. A fully convolutional network (FCN) has two main benefits:
1. It is faster, because it contains only convolution operations and no FC layers.
2. It can accept images of any spatial resolution (width and height), provided that the image and network fit into memory, of course.
Note: Being an FCN makes the network invariant to the size of the input image. However, in practice, we might want to stick to a constant input size due to various problems that only show up when we implement the algorithm. A big one among these problems is that if we want to process our images in batches (images in batches can be processed in parallel by the GPU, leading to speed boosts), we need all images to have a fixed height and width.
HOW DOES THE REGRESSOR PREDICT THE BOUNDING BOX? To answer this question, let’s first define the bounding box. It is the box that surrounds the object and is identified by the tuple (x, y, w, h), where x and y are the image coordinates of the center of the bounding box and w and h are its width and height. Researchers found that directly predicting the (x, y) coordinates of the center point can be challenging, because rules would have to be enforced to make sure the network predicts values inside the boundaries of the image. Instead, we can create reference boxes called anchor boxes in the image and make the regression layer predict offsets from these boxes, called deltas (Δx, Δy, Δw, Δh), which adjust the anchor boxes to better fit the object and give the final proposals.
Figure 4-15 Anchor boxes at the center of each sliding window. IoU is calculated to select the bounding box that overlaps the most with the ground truth.
Anchor boxes: using a sliding-window approach, the RPN generates k regions for each location in the feature map. These regions are represented as boxes called anchor boxes. The anchors are all centered in the middle of their corresponding sliding window and differ in scale and aspect ratio to cover a wide variety of objects. These are fixed bounding boxes placed throughout the image, used as references when first predicting object locations. In their paper, Ross Girshick et al. generated 9 anchor boxes, all with the same center but with 3 different aspect ratios and 3 different scales, as illustrated in the sketch below.
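The 3 × 3 anchor construction can be sketched as follows. The base size, aspect ratios, and scales below are commonly used values shown only for illustration; the exact configuration depends on the implementation.

```python
import numpy as np

def generate_anchors(base_size=16, ratios=(0.5, 1.0, 2.0), scales=(8, 16, 32)):
    """Generate k = len(ratios) * len(scales) anchor boxes (x1, y1, x2, y2)
    centered on (0, 0); shifting them to every feature-map location
    produces the full anchor grid used by the RPN."""
    anchors = []
    for ratio in ratios:              # ratio = height / width
        for scale in scales:
            area = (base_size * scale) ** 2
            w = np.sqrt(area / ratio)   # width and height chosen so that
            h = w * ratio               # h / w == ratio and w * h == area
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

anchors = generate_anchors()
print(anchors.shape)          # (9, 4): 3 ratios x 3 scales
print(anchors[:3].round(1))   # the three scales of the wide (1:2) anchors
```

The regression layer then only has to predict the small deltas (Δx, Δy, Δw, Δh) that warp one of these reference boxes onto the actual object.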
91 | Pa g e HOW DO WE TRAIN THE RPN?
The RPN is trained using bounding boxes labeled by human annotators; each labeled box is called a ground truth. For each anchor box, an overlap value (p) is computed which indicates how much the anchor overlaps with the ground-truth bounding boxes. If an anchor has high overlap with a ground-truth bounding box, then it is likely that the anchor box includes an object of interest, and it is labeled as positive with respect to the object-versus-no-object classification task. The overlap is measured with the Intersection over Union (IoU); a small sketch is given below.
Nowadays, ResNet architectures have mostly replaced VGG as the base network for extracting features. The obvious advantage of ResNet over VGG is that it has many more layers (it is deeper), giving it more capacity to learn very complex features. This is true for the classification task and should be equally true in the case of object detection. ResNet also makes it easy to train deep models thanks to residual connections and batch normalization, which had not been invented when VGG was first released.
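As a reminder of how that overlap is computed, here is a minimal IoU sketch (our own illustration, assuming corner-format boxes (x1, y1, x2, y2)):

def iou(box_a, box_b):
    """Intersection over Union for two boxes given as (x1, y1, x2, y2)."""
    # coordinates of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# an anchor that overlaps a ground-truth box fairly well
print(iou((50, 50, 150, 150), (60, 60, 160, 160)))   # ~0.68 -> labeled as a positive anchor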
92 | Pa g e Figure 4-16: R-CNN, Fast R-CNN, and Faster R-CNN.
93 | Pa g e Comparison:
Figure 4-17: Comparison between R-CNN, Fast R-CNN, and Faster R-CNN.
94 | Pa g e 4.10 Single Shot Detection (SSD) [Detection Algorithm Used In Our Project]:
The paper "SSD: Single Shot MultiBox Detector" was released in 2016 by C. Szegedy et al. The Single Shot Detection network reached new records in terms of performance and precision for object detection tasks, scoring over 74% mAP (mean Average Precision) at 59 frames per second (FPS) on standard datasets such as Pascal VOC and MS COCO.
4.10.1 Very important note:
The most common metric used to measure detection speed is the number of frames per second (FPS). For example, Faster R-CNN operates at only 7 frames per second (FPS). There have been many attempts to build faster detectors by attacking each stage of the detection pipeline, but so far, significantly increased speed has come only at the cost of significantly decreased detection accuracy. In this section you will see why single-stage networks like SSD can achieve faster detections that are more suitable for real-time applications. For benchmarking, SSD300 achieves 74.3% mAP at 59 FPS while SSD512 achieves 76.8% mAP at 22 FPS, which outperforms Faster R-CNN (73.2% mAP at 7 FPS). SSD300 refers to an input image size of 300x300 and SSD512 refers to an input image size of 512x512.
4.10.2 SSD IS A SINGLE-STAGE DETECTOR:
The R-CNN family are multi-stage detectors: the network first predicts the objectness score of a bounding box and then passes this box through a classifier to predict the class probability. In single-stage detectors like SSD and YOLO (explained later), the convolutional layers make both predictions directly in one shot, hence the name Single Shot Detector. The image is passed once through the network and the objectness score for each bounding box is predicted using logistic regression, indicating the level of overlap with the ground truth. If the bounding box overlaps 100% with the ground truth, the objectness score is 1, and if there is no overlap, the objectness score is 0. We then set a threshold value (0.5) that says: if the objectness score is above 50%, this bounding box likely contains an object of interest and we keep the prediction; if it is below 50%, we ignore the prediction. A small sketch of this filtering step follows.
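The thresholding step described above can be illustrated with a few lines of Python (our own sketch; the scores and the 0.5 threshold are just the example values used in this section):

import numpy as np

def filter_by_objectness(boxes, scores, threshold=0.5):
    """Keep only boxes whose objectness score is above the threshold."""
    keep = scores > threshold
    return boxes[keep], scores[keep]

boxes = np.array([[10, 10, 60, 60], [30, 30, 90, 90], [200, 120, 260, 180]])
scores = np.array([0.92, 0.35, 0.71])          # hypothetical objectness scores
kept_boxes, kept_scores = filter_by_objectness(boxes, scores)
print(kept_boxes)                               # the 0.35 box is discarded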
95 | Pa g e 4.10.3 High-level SSD architecture:
Figure 4-18: SSD architecture.
The name "single shot" comes from the fact that SSD is a single-stage detector: it does not follow the R-CNN approach of having two separate stages for region proposal and detection. The SSD approach is based on a feed-forward convolutional network that produces a fixed-size collection of bounding boxes and scores for the presence of object class instances in those boxes, followed by a non-maximum suppression step to produce the final detections. The architecture of the SSD model is composed of three main parts:
1. Base network to extract feature maps: a standard pretrained network used for high-quality image classification and truncated before any classification layers. In their paper, the authors used the VGG16 network. Other networks like VGG19 and ResNet can be used and should also produce good results.
2. Multi-scale feature layers: a series of convolution filters added after the base network. These layers decrease in size progressively to allow predictions of detections at multiple scales.
96 | Pa g e 3. Non-maximum suppression: to eliminate overlapping boxes and keep only one box for each detected object.
What does the output prediction look like? For each feature, the network predicts the following:
● 4 values that describe the bounding box (x, y, w, h)
● + 1 value for the objectness score
● + C values that represent the probability of each class.
4.11 Base network:
As you can see from the SSD diagram, the SSD architecture builds on the VGG16 architecture after slicing off its fully connected classification layers (VGG16 is explained in detail earlier). The reason VGG16 was used as the base network is its strong performance in high-quality image classification tasks and its popularity for problems where transfer learning helps improve results.
4.11.1 HOW DOES THE BASE NETWORK MAKE PREDICTIONS?
Consider the following example. Suppose you have the image below (figure) and the network's job is to draw bounding boxes around all the boats in the image. The process goes as follows:
1. Similar to the anchors concept in R-CNN, SSD overlays a grid of anchors on the image and, for each anchor, the network creates bounding boxes at its center. In SSD, anchors are called priors.
2. The base network looks at each bounding box as a separate image. Within each bounding box, the network asks the question: is there a boat in this box? In other words: did I extract any features of a boat in this box?
3. When the network finds a bounding box that contains boat features, it sends its coordinate predictions and object classification to the non-maximum suppression layer.
4. Non-maximum suppression then eliminates all the boxes except the one that overlaps the most with the ground-truth bounding box.
A final note on the base network: the authors used VGG16 because of its strong performance in complex image classification tasks. You can use other networks like the deeper VGG19 or ResNet for the base network and they should perform as well if not better in accuracy, but they could be slower if you choose a deeper network. MobileNet is a good choice if you want to balance a complex, high-performing deep network with speed.
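Following the counting above (4 box values + 1 objectness score + C class probabilities per box), a quick back-of-the-envelope calculation shows how many numbers one feature map produces. This is our own illustration; the grid size, boxes per cell, and class count are example values, not the exact SSD configuration:

def predictions_per_feature_map(grid_size, boxes_per_cell, num_classes):
    """Total predicted values for one square feature map."""
    values_per_box = 4 + 1 + num_classes        # box coords + objectness + classes
    return grid_size * grid_size * boxes_per_cell * values_per_box

# e.g. a hypothetical 19x19 feature map, 6 boxes per cell, 7 classes
print(predictions_per_feature_map(19, 6, 7))    # 25,992 values for this layer alone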
97 | Pa g e Figure 4-19: The SSD base network looks at the anchor boxes to find features of a boat. Green (solid) boxes indicate that the network has found boat features. Red (dotted) boxes indicate no boat features.
4.12 Multi-scale feature layers:
These are convolutional feature layers added to the end of the truncated base network. They decrease in size progressively to allow predictions of detections at multiple scales. As you can see, the base network might be able to detect the horses' features in the background, but it might fail to detect the horse that is closest to the camera. Can you see horse features in this bounding box in the figure? No. To deal with objects of different scales in an image, some methods suggest preprocessing the image at different sizes and combining the results afterwards. However, by using different convolutional layers that vary in size, we can utilize feature maps from several layers of a single network for prediction, mimicking the same effect while also sharing parameters across all object scales. As a CNN gradually reduces the spatial dimension, the resolution of the feature maps also decreases. SSD uses lower-resolution layers to detect larger-scale objects. For example, the 4x4 feature maps are used for larger-scale objects.
98 | Pa g e Figure 4-20: Right image – lower-resolution feature maps detect larger-scale objects. Left image – higher-resolution feature maps detect smaller-scale objects.
The multi-scale feature layers resize the image dimensions while keeping the bounding box sizes, so that they can fit the larger horse. In reality, convolutional layers do not literally reduce the size of the image; this is just an illustration to help us intuitively understand the concept. The image is not simply resized, it actually goes through the convolutional process, so it won't look anything like itself anymore. It will look like a completely random image, but it preserves its features. Using multi-scale feature maps improves the network accuracy significantly; the table below shows a decrease in accuracy with fewer layers. Here is the accuracy with different numbers of feature map layers used for object detection.
99 | Pa g e Figure 4-21: Accuracy with different numbers of feature map layers.
4.12.1 ARCHITECTURE OF THE MULTI-SCALE LAYERS:
The authors decided to add 6 convolutional layers that decrease in size. This was done with a lot of tuning and trial and error until they produced the best results.
Figure 4-22: Architecture of the multi-scale layers.
4.12.2 Non-maximum suppression:
Given the large number of boxes generated per class by the detection layer during a forward pass of SSD at inference time, it is essential to prune most of the bounding boxes by applying the non-maximum suppression technique: boxes with a confidence score below a certain threshold are discarded, boxes that overlap too much with a higher-confidence box of the same class are suppressed, and only the top N predictions are kept. This ensures that only the most likely predictions are retained by the network, while the noisier ones are removed. SSD sorts the predictions by their confidence scores. Starting from the top-confidence prediction, SSD checks whether any previously kept boundary box of the same class has an IoU higher than 0.45 with the current prediction; if so, the current prediction is suppressed (the threshold value of 0.45 is the one set by the authors of the original paper). A minimal sketch of this procedure is given below.
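The following is a minimal, self-contained sketch of that greedy NMS loop (our own illustration, not the SSD authors' implementation), using corner-format boxes and the 0.45 IoU threshold mentioned above:

import numpy as np

def nms(boxes, scores, iou_threshold=0.45):
    """Greedy non-maximum suppression; boxes are (x1, y1, x2, y2) arrays."""
    order = np.argsort(scores)[::-1]          # highest confidence first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        # IoU of the best box with all remaining boxes
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_best = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
        area_rest = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_best + area_rest - inter)
        # keep only the boxes that do not overlap the best box too much
        order = rest[iou <= iou_threshold]
    return keep

boxes = np.array([[100, 100, 210, 210], [105, 105, 215, 215], [300, 300, 380, 380]])
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))   # -> [0, 2]: the second box is suppressed as a duplicate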
100 | Pa g e 4.13 (YOLO) [high speed but low mAP]:
The YOLO family of models is a series of end-to-end deep learning models designed for fast object detection, developed by Joseph Redmon et al., and is considered one of the first attempts to build a fast real-time object detector. It is one of the fastest object detection algorithms out there. Though it is no longer the most accurate object detection algorithm, it is a very good choice when you need real-time detection without losing too much accuracy. The creators of YOLO took a different approach than the previous networks: YOLO does not go through a region proposal step like the R-CNNs. Instead, it predicts over a limited number of bounding boxes by splitting the input into a grid of cells, where each cell directly predicts a bounding box and an object classification. The result is a large number of candidate bounding boxes that are consolidated into a final prediction using non-maximum suppression.
4.13.1 YOLO versions:
101 | Pa g e Figure 4-23: YOLO splits the image into a grid, predicts objects for each grid cell, then uses NMS to finalize predictions.
Although the accuracy of these models is close to but not as good as that of Region-Based Convolutional Neural Networks (R-CNNs), they are popular for object detection because of their detection speed, often demonstrated in real time on video or camera feeds.
• YOLOv1: "You Only Look Once: Unified, Real-Time Object Detection". It is called unified because it is a single detection network that unifies the two components of a detector: the object detector and the class predictor.
• YOLOv2: "YOLO9000 - Better, Faster, Stronger", capable of detecting over 9,000 object categories, hence the name YOLO9000. It was trained on the ImageNet and MS COCO datasets and achieved 16% mean Average Precision (mAP), which is not great, but it was very fast at test time.
• YOLOv3: "An Incremental Improvement". YOLOv3 is significantly larger than previous models and achieved an mAP of 57.9%, the best result yet from the YOLO family of object detectors.
4.13.2 How YOLOv3 works:
• The YOLO network splits the input image into a grid of S×S cells. If the center of a ground-truth box falls into a cell, that cell is responsible for detecting the existence of that object.
• Each grid cell predicts B bounding boxes with their objectness scores along with their class predictions, as follows:
102 | Pa g e 1. Coordinates of the B bounding boxes: similar to previous detectors, YOLO predicts 4 coordinates for each bounding box (bx, by, bw, bh), where bx and by are offsets from the cell location.
2. Objectness score (P0): indicates the probability that the cell contains an object. The objectness score is passed through a sigmoid function to be treated as a probability with a value between 0 and 1. The objectness score is calculated as follows: P0 = Pr(containing an object) x IoU(pred, truth).
3. Class prediction: if the bounding box contains an object, the network predicts the probability of each of the K classes, where K is the total number of classes in your problem.
Figure 4-24: YOLOv3 workflow.
103 | Pa g e 4.13.3 Prediction across different scales:
• YOLOv3 has 9 anchors, allowing prediction at 3 different scales per cell. The detection layer makes detections at feature maps of three different sizes, with strides 32, 16, and 8 respectively. This means that, with an input image of size 416 x 416, detections are made on scales of 13 x 13, 26 x 26, and 52 x 52. The 13 x 13 layer is responsible for detecting large objects, the 26 x 26 layer detects medium objects, and the 52 x 52 layer detects the smaller objects.
• This results in the prediction of 3 bounding boxes for each cell (B = 3). That is why in the figure you see the prediction feature map predicting box1, box2, and box3. The bounding box responsible for detecting the dog will be the one whose anchor has the highest IoU with the ground-truth box.
• Detections at different layers help address the issue of detecting small objects, which was a frequent complaint about YOLOv2. The upsampling layers help the network preserve and learn fine-grained features, which are instrumental for detecting small objects.
Figure 4-25: YOLOv3 output bounding boxes.
• For an input image of size 416 x 416, YOLO predicts ((52 x 52) + (26 x 26) + (13 x 13)) x 3 = 10,647 bounding boxes. That is a huge number of boxes for an output.
• In our dog example, we have only one object, and we want only one bounding box around it. How do we reduce the boxes from 10,647 down to 1?
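Before answering, here is a quick sanity check of where the 10,647 figure comes from (a small arithmetic sketch of our own):

# three detection scales for a 416x416 input, 3 boxes predicted per cell
scales = (13, 26, 52)
boxes_per_cell = 3

total_boxes = sum(s * s * boxes_per_cell for s in scales)
print(total_boxes)   # (169 + 676 + 2704) * 3 = 10647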
104 | Pa g e • First, we filter boxes based on their objectness score. Generally, boxes with scores below a threshold are ignored.
• Second, we use non-maximum suppression (NMS). NMS is intended to cure the problem of multiple detections of the same object.
• For example, all 3 bounding boxes of the red grid cell at the center of the image may detect a box, or the adjacent cells may detect the same object.
4.13.4 YOLOv3 architecture:
• YOLO is a single neural network that unifies object detection and classification into one end-to-end network. The neural network architecture was inspired by the GoogLeNet model (Inception) for feature extraction. Instead of the inception modules used by GoogLeNet, YOLO uses 1x1 reduction layers followed by 3x3 convolutional layers. The authors called this architecture Darknet.
Figure 4-26: YOLOv3 neural network architecture.
4.13.5 Comparisons:
After exploring the different techniques used in object detection, we will discuss the tools that we used in our project and the chosen model. First, we will talk about Google Colab and TensorFlow.
Google is quite aggressive in AI research. Over many years, Google developed an AI framework called TensorFlow and a development tool called Colaboratory. Today TensorFlow is open-sourced and, since 2017, Google has made Colaboratory free for public use. Colaboratory is now known as Google Colab or simply Colab. Another attractive feature that Google offers to developers is the use of GPUs. Colab supports GPUs and it is totally free. The reasons for making it free for the public
105 | Pa g e could be to make its software a standard in academia for teaching machine learning and data science. It may also have the long-term goal of building a customer base for Google Cloud APIs, which are sold on a per-use basis. Irrespective of the reasons, the introduction of Colab has eased the learning and development of machine learning applications. So, let us get started with Colab.
4.14 What Colab Offers You?
• Write and execute code in Python.
• Document your code with support for mathematical equations.
• Create/Upload/Share notebooks.
• Import/Save notebooks from/to Google Drive.
• Import/Publish notebooks from GitHub.
• Import external datasets, e.g. from Kaggle.
• Integrate PyTorch, TensorFlow, Keras, and OpenCV.
• Free cloud service with a free GPU, CPU, or TPU.
These advantages helped us a lot in training the model, especially the free GPU, which reduced the training time considerably, so that we could tune hyperparameters such as the number of iterations (training steps) and also try more than a single model until reaching the model that satisfied our requirements. We used the following:
1. TensorFlow
106 | Pa g e 2. Python
3. OpenCV
Figure 4-27: Python. Figure 4-28: OpenCV. Figure 4-29: TensorFlow.
4.14.1 Why TensorFlow?
1) TensorFlow is an end-to-end platform that makes it easy for you to build and deploy ML models.
2) It is open source and has a large community, which makes it easier to solve any problems you may face.
107 | Pa g e 3) Easy model building.
4) TensorFlow offers multiple levels of abstraction so you can choose the right one for your needs. Build and train models using the high-level Keras API, which makes getting started with TensorFlow and machine learning easy. If you need more flexibility, eager execution allows for immediate iteration and intuitive debugging. For large ML training tasks, use the Distribution Strategy API for distributed training on different hardware configurations without changing the model definition.
5) Robust ML production anywhere.
6) TensorFlow has always provided a direct path to production. Whether it's on servers, edge devices, or the web, TensorFlow lets you train and deploy your model easily, no matter what language or platform you use. Use TensorFlow Extended (TFX) if you need a full production ML pipeline. For running inference on mobile and edge devices, use TensorFlow Lite. Train and deploy models in JavaScript environments using TensorFlow.js.
7) Powerful experimentation for research.
8) Build and train state-of-the-art models without sacrificing speed or performance. TensorFlow gives you flexibility and control with features like the Keras Functional API and the Model Subclassing API for the creation of complex topologies. For easy prototyping and fast debugging, use eager execution.
9) TensorFlow also supports an ecosystem of powerful add-on libraries and models to experiment with, including Ragged Tensors, TensorFlow Probability, Tensor2Tensor, and BERT.
4.14.2 OpenCV:
OpenCV (Open Source Computer Vision Library) is a library of programming functions mainly aimed at real-time computer vision. Originally developed by Intel, it was later supported by Willow Garage and then Itseez (which was later acquired by Intel). The library is cross-platform and free for use under the open-source BSD license.
Applications: OpenCV's application areas include:
• 2D and 3D feature toolkits.
• Egomotion estimation.
• Facial recognition systems.
108 | Pa g e • Gesture recognition.
• Human–computer interaction (HCI).
• Mobile robotics.
• Motion understanding.
• Object identification.
• Segmentation and recognition.
• Stereopsis (stereo vision): depth perception from 2 cameras.
• Structure from motion (SFM).
• Motion tracking.
• Augmented reality.
To support some of the above areas, OpenCV includes a statistical machine learning library that contains:
• Boosting.
• Decision tree learning.
• Gradient boosting trees.
• Expectation-maximization algorithm.
• k-nearest neighbor algorithm.
• Naive Bayes classifier.
• Artificial neural networks.
• Random forest.
• Support vector machine (SVM).
• Deep neural networks (DNN).
This makes it very helpful for projects that use deep learning. The programming language we used is Python, which is an open-source programming language. Python is widely considered the best programming language for AI and ML: these fields are being applied across various channels and industries, big corporations invest in them, and the demand for experts in ML and AI grows accordingly. Jean Francois Puget, from IBM's machine learning department, expressed the opinion that Python is the most popular language for AI and ML, based on a trend search on indeed.com. Why?
1) A great library ecosystem:
A great choice of libraries is one of the main reasons Python is the most popular programming language used for AI. A library is a module or a group of modules published by different sources (like PyPI) which include a pre-written piece of code
109 | Pa g e that allows users to reach some functionality or perform different actions. Python libraries provide base-level items so developers don't have to code them from scratch every time. ML requires continuous data processing, and Python's libraries let you access, handle, and transform data. These are some of the most widespread libraries you can use for ML and AI:
• Scikit-learn: for handling basic ML algorithms like clustering, linear and logistic regression, classification, and others.
• Pandas: for high-level data structures and analysis. It allows merging and filtering of data, as well as gathering it from other external sources like Excel, for instance.
• Keras: for deep learning. It allows fast calculations and prototyping, as it can use the GPU in addition to the CPU of the computer.
• TensorFlow: for working with deep learning by setting up, training, and utilizing artificial neural networks with massive datasets.
• Matplotlib: for creating 2D plots, histograms, charts, and other forms of visualization.
• Scikit-image: for image processing.
• PyBrain: for neural networks, unsupervised learning, and reinforcement learning.
• Caffe: for deep learning; it allows switching between the CPU and the GPU and can process 60+ million images a day on a single NVIDIA K40 GPU.
• Statsmodels: for statistical algorithms and data exploration.
110 | Pa g e 2) A low entry barrier:
The Python programming language resembles everyday English, which makes the learning process easier. Its simple syntax allows you to work comfortably with complex systems, ensuring clear relations between the system elements.
3) Flexibility:
Python is a great choice for machine learning because the language is very flexible:
• It offers the option to use either OOP or scripting.
• There is no need to recompile the source code; developers can implement changes and quickly see the results.
• Programmers can combine Python with other languages to reach their goals.
4) Platform independence:
Python is not only comfortable to use and easy to learn but also very versatile. Python for machine learning development can run on any platform, including Windows, macOS, Linux, UNIX, and twenty-one others. To transfer a program from one platform to another, developers only need to implement some small-scale changes and modify a few lines of code to create an executable form of the code for the chosen platform. Developers can use packages like PyInstaller to prepare their code for running on different platforms.
111 | Pa g e Chapter 5: Transfer Learning
112 | Pa g e 5.1 Introduction
Figure 5-1: Traditional ML vs. transfer learning.
When you're building a computer vision application, you can build your ConvNets as we learned and start the training from scratch, and that is an acceptable approach. A much faster approach is to download a neural network that someone else has already built and trained on a large dataset in a certain domain and use this pretrained network as a starting point to train the network on your new task. This approach is called transfer learning. Transfer learning is one of the most important techniques of deep learning. When building a vision system to solve a specific problem, you usually need to collect and label a huge amount of data to train your network. But what if we could use an existing neural network that someone else has tuned and trained, and use it as a starting point for our new task? Transfer learning allows us to do just that. We can download an open-source model that someone else has already trained and tuned for weeks and use their optimized parameters (weights) as a starting point to train
    113 | Pa g e our model just a little bit more on a smaller dataset that we have for a given task. This way we can train our network a lot faster and achieve very high results. Deep learning researchers and practitioners have posted a lot of research papers and open source projects of their trained algorithms that they have worked on for weeks and months and trained on many GPUs to get state-of-the- art results on many problems. The fact that someone else has done this work and gone through the painful high-performance research process, means that you can often download open source architecture and weights that took someone else many weeks or months to build and tune and use that as a very good start for your own neural network. This is transfer learning. It is referring to the knowledge transfer from pretrained network in one domain to your own problem in a different domain. Note: When we say train the model from scratch, we mean that the model starts with zero knowledge of the world and the structure and the parameters of the model begin as random guesses. Practically speaking, this means that the weights of the model are randomly initialized and they need to go through a training process to be optimized. 5.1.1 Definition and why transfer learning? transfer learning means transferring what a neural network has learned from being trained on a specific dataset to another related problem. Problems transfer learning solve: 1) Data problem: it requires a lot of data to be able to get decent results which is not very feasible in most cases. It is relatively rare to have a dataset of sufficient size to solve your problem. It is also very expensive to acquire and label data which is mostly a manual process that has to be done by humans capturing images and labeling them one-by-one which makes it a non-trivial, very expensive task. 2) Computation problem: even if you are able to acquire hundreds of thousands of images for your problem, it is computationally very expensive to train a deep neural network
    114 | Pa g e on millions of images. The training process of a deep neural network from scratch is very expensive because it usually requires weeks of training on multiple GPUs. Also keep in mind that the neural network training process is an iterative process. So, even if you happen to have the computing power that is needed to train complex neural networks, having to spend a few weeks experimenting different hyperparameters in each training iteration will make the project very expensive until you finally reach satisfactory results. Additionally, one very important benefit of using transfer learning is that it helps the model generalize its learnings and avoid overfitting. Figure5-2 Extracted features. To train an image classifier that will achieve near or above human level accuracy on image classification, we’ll need massive amounts of data, large computepower, and lots of time on our hands. Knowing this would be a problem for people with little or no resources, researchers built state-of-the-art models that were trained on large image datasets like ImageNet, MS COCO, Open Images, etc. and decided to share their models to the general public for reuse. Even if that is the case, you might be better off using transfer learning to fine-tune the pretrained network on your large dataset. “In transfer learning, we first train a base network on a base dataset and task, and then we repurpose the learned features, or transfer them to a second target network to be trained on a target dataset and task. This process will tend to work if the features are general, meaning suitable to both base and target tasks, instead of specific to the base task.” First, we need to find a dataset that has similar features to our problem at hand.
115 | Pa g e This involves spending some time exploring different open-source datasets to find the closest one to our problem. Next, we need to choose a network that has been trained on such a dataset (ImageNet, for example) and achieved good results, for example VGG16. To adapt the VGG16 network to our problem, we are going to download the VGG16 network with its pretrained weights, remove the classifier part, add our own classifier, and then retrain the new network. This is called using a pretrained network as a feature extractor. A pretrained model is a network that has been previously trained on a large dataset, typically on a large-scale image classification task. We can either 1) directly use the pretrained model as it is to run our predictions, or 2) use the pretrained feature extraction part of the network and then add our own classifier. The classifier here could be one or more dense layers or even traditional machine learning algorithms like Support Vector Machines (SVM).
Figure 5-3: Example of applying transfer learning to the VGG16 network. We freeze the feature extraction part of the network.
To understand transfer learning deeply, let's implement an example in Keras:
1. Download the open-source code of the VGG16 network and its weights to create our base model, and remove the classification layers from the VGG network (FC_4096 > FC_4096 > Softmax_1000). Luckily, Keras has
116 | Pa g e a set of pretrained networks that are ready for us to download and use.
2. When you print a summary of the base model, you will notice that we downloaded the exact VGG16 architecture. This is a fast way to obtain popular networks that are supported by the deep learning library you are using. Alternatively, you can build the network yourself like we did earlier and download the weights separately. But for now, let's look at the base_model summary that we just downloaded:
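The original listing for these steps was shown as a screenshot; the following is a minimal Keras sketch of what it does (downloading VGG16 without its classifier, printing the summary, and later freezing its layers as described in step 5). The input size and arguments are our assumptions based on the summary shown below.

from keras.applications.vgg16 import VGG16

# step 1: download VGG16 with ImageNet weights, without the top classifier layers
base_model = VGG16(weights='imagenet',
                   include_top=False,          # drop FC_4096 > FC_4096 > Softmax_1000
                   input_shape=(224, 224, 3))

# step 2: inspect the downloaded architecture
base_model.summary()

# step 5 (done after inspecting the summary): freeze the feature extraction layers
for layer in base_model.layers:
    layer.trainable = False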
117 | Pa g e 3. Notice that the downloaded architecture does not contain the classifier part (the 3 FC layers) at the top of the network, because we set the include_top argument to False.
4. More importantly, notice the number of trainable and non-trainable parameters in the summary. The downloaded network, as it is, makes all the network parameters trainable. As you can see above, our base model has more than 14 million trainable parameters. Now we want to freeze all the downloaded layers and add our own classifier. Let's do that next.
5. Freeze the feature extraction layers that have been trained on the ImageNet dataset. Freezing layers means freezing their trained weights to prevent them from being retrained when we run our training. The model summary is omitted in this case for brevity, as it is similar to the model summary on the previous page. The difference is that all the weights have been frozen, the trainable parameters are now equal to zero, and all the parameters of the frozen layers are non-trainable.
6. Add our own classification dense layer. Here, we will just add a softmax layer with 7 units because we have 7 classes in our problem.

from keras.layers import Dense, Flatten
from keras.models import Model

# use the "get_layer" method to get the last layer of the network
last_layer = base_model.get_layer('block5_pool')

# save the output of the last layer to be the input of the next layer
last_output = last_layer.output

# flatten the classifier input, which is the output of the last layer of the VGG16 model
x = Flatten()(last_output)
118 | Pa g e # add our new softmax output layer
x = Dense(6, activation='softmax', name='softmax')(x)

# instantiate a new_model using Keras's Model class
new_model = Model(inputs=base_model.input, outputs=x)

# print the new_model summary
new_model.summary()

Layer (type)                 Output Shape              Param #
input_1 (InputLayer)         (None, 224, 224, 3)       0
block1_conv1 (Conv2D)        (None, 224, 224, 64)      1792
block1_conv2 (Conv2D)        (None, 224, 224, 64)      36928
block1_pool (MaxPooling2D)   (None, 112, 112, 64)      0
block2_conv1 (Conv2D)        (None, 112, 112, 128)     73856
block2_conv2 (Conv2D)        (None, 112, 112, 128)     147584
block2_pool (MaxPooling2D)   (None, 56, 56, 128)       0
block3_conv1 (Conv2D)        (None, 56, 56, 256)       295168
block3_conv2 (Conv2D)        (None, 56, 56, 256)       590080
block3_conv3 (Conv2D)        (None, 56, 56, 256)       590080
block3_pool (MaxPooling2D)   (None, 28, 28, 256)       0
block4_conv1 (Conv2D)        (None, 28, 28, 512)       1180160
block4_conv2 (Conv2D)        (None, 28, 28, 512)       2359808
block4_conv3 (Conv2D)        (None, 28, 28, 512)       2359808
block4_pool (MaxPooling2D)   (None, 14, 14, 512)       0
block5_conv1 (Conv2D)        (None, 14, 14, 512)       2359808
block5_conv2 (Conv2D)        (None, 14, 14, 512)       2359808
block5_conv3 (Conv2D)        (None, 14, 14, 512)       2359808
block5_pool (MaxPooling2D)   (None, 7, 7, 512)         0
flatten_1 (Flatten)          (None, 25088)             0
softmax (Dense)              (None, 6)                 150534
119 | Pa g e Total params: 14,865,222
Trainable params: 150,534
Non-trainable params: 14,714,688

7. Build your new model, which takes the input of the base model as its input and the output of your last softmax layer as its output. The new model is composed of all the feature extraction layers of VGGNet with their pretrained weights plus our new, untrained softmax layer. In other words, when we train the model, we are only going to train the softmax layer, in this example to detect the specific features of our new problem: red sign, car, person, stop sign, etc. Training the new model will be a lot faster than training the network from scratch. To verify that, look at the number of trainable params in this model (~150k) compared to the number of non-trainable params in the network (~14M). These "non-trainable" parameters are already trained on a large dataset, and we froze them to use the extracted features in our problem.
5.1.2 How transfer learning works:
What is really being learned by the network during training? The short answer is: feature maps.
Figure 5-3: Feature maps.
How are these features learned? During the backpropagation process, the weights are updated until we get to the "optimized weights" that minimize the error function. What is the relationship between features and weights? A feature map is the result of passing a weights filter over the input image during the convolution process.
    120 | Pa g e So, what is really being transferred from one network to another? To transfer features, we download the optimized weights of the pretrained network. These weights are then re-used as the starting point for the training process and retrained to adapt to the new problem. When training is complete, we output two main items: 1) The network architecture, and 2) the trained weights. Figure5-4 CNN Architecture Diagram, Hierarchical Feature Extraction in stages. The neural network learns the features in your dataset step-by-step in an increasing level of complexity layer after the other. These are called feature maps. The deeper you go through the network layers, the more image specific features are learned. The first layer detects low level features such as edges and curves. The outputof the first layer becomes input to the second layer which produces higherlevel features, like semi-circles and squares. The next layer assembles the output of the previous layer to parts of familiar objects, and a subsequent layer detects the objects. As we go through more layers, the network yields an activation map that represents more and more complex features. The deeper you go into the network; the filters begin to be more responsive to a larger region of the pixel space. Higher level layers amplify aspects of the received inputs that are important for discrimination and suppress irrelevant variations. Note: The earlier layer’s features are very similar for all models. The lower level features are almost always transferable from one task to another because they contain
    121 | Pa g e generic information like the structure and the nature of how images look. Transferring information like lines, dots, curves, and small parts of the objects is very valuable for the network to learn faster and with less data on the new task. The deeper we go in to the network, we notice that the features start to be more specific until the network overfits its training data and it becomes harder to generalize to different tasks. Figure5-5 features start to be more specific. What about the transferability of features extracted at later layers in the network? The transferability of features that are extracted at later layers depends on the similarity of the original and new datasets. The idea here is that all images must have shapes and edges so the early layers are usually transferable between different domains. Based on the similarity of the source and target domains, we can decide whether to transfer only the low-level features from the source domain or all the high level features or somewhere in between. Source Domain: the original dataset that the pretrained network is trained on. Target Domain: the new dataset that we want to train the network on. 5.1.3 Transfer learning approaches: Depend on the similarity between datasets that the model trained first time and the data that model will deal with in the project. There are three major transfer learning approaches as follows: 1. Pretrained network as a classifier. 2. Pretrained network as feature extractor.
122 | Pa g e 3. Fine-tuning.
Each approach can be effective and save significant time in developing and training a deep convolutional neural network model. We should choose the appropriate approach for each application.
1) Pretrained network as a classifier: The pretrained model is used directly to classify new images, with no changes applied to it and no extra training. All we do here is download the network architecture and its pretrained weights, then run the predictions directly on our new data. In this case, we are saying that the domain of our new problem is very similar to the one the pretrained network was trained on (the original dataset already contains the objects that will be detected), and the network is ready to just be "deployed". So no training is done here. Using a pretrained network as a classifier doesn't involve any layer freezing or extra model training; it is just taking a network that was trained on a problem similar to yours and deploying it directly to your task.
2) Pretrained network as a feature extractor: We take a CNN pretrained on a large dataset such as ImageNet, freeze its feature extraction part, remove the classifier part, and add our own new dense classifier layers. We usually go with this scenario when our new task is similar to the original dataset that the pretrained network was trained on.
123 | Pa g e This means that we can utilize the high-level features that were extracted from the ImageNet dataset in this new task. To do that, we freeze all the layers of the pretrained network and only train the classifier part that we just added, on the new dataset. This approach is called "using a pretrained network as a feature extractor" because we freeze the feature extractor part to transfer all the learned feature maps to our new problem. We only add a new classifier, which will be trained from scratch, on top of the pretrained model so that we can repurpose the feature maps learned previously for our dataset. The reason we remove the classification part of the pretrained network is that it is often very specific to the original classification task, and subsequently specific to the set of classes on which the model was trained.
3) Fine-tuning: So far, we've seen two basic approaches of using a pretrained network in transfer learning: 1) pretrained network as a classifier, and 2) pretrained network as a feature extractor. We usually use these two approaches when the target domain is somewhat similar to the source domain. Transfer learning works well even when the domains are very different; we just need to extract the correct feature maps from the source domain and "fine tune" them to fit the target domain. Fine-tuning is when you decide to freeze only part of the feature extraction part of the network, not all of it. We can decide to freeze the network at the appropriate feature map level:
1. If the domains are similar, we might want to freeze the whole network up to the last feature map level.
2. If the domains are very different, we might decide to freeze the pretrained network after the first feature maps only and retrain all the remaining layers.
Between these two options lies a range of fine-tuning levels that we can apply (a minimal Keras sketch of partial freezing is given below). We typically decide the appropriate level of fine-tuning by trial and error, but there are guidelines that we can follow to intuitively decide on the fine-tuning level of the pretrained network. The decision is a function of two factors: 1) the amount of data that we have, and 2) the level of similarity between the source and target domains.
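The following is a minimal sketch (our own illustration, not the project's exact code) of what partial freezing looks like in Keras, continuing the VGG16 example from earlier: the first blocks stay frozen while the later blocks are fine-tuned together with the new classifier, using a small learning rate. The split point, class count, and learning rate are example values.

from keras.applications.vgg16 import VGG16
from keras.layers import Dense, Flatten
from keras.models import Model
from keras.optimizers import Adam

base_model = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

# freeze everything up to (and including) block 3; fine-tune blocks 4 and 5
for layer in base_model.layers:
    layer.trainable = layer.name.startswith(('block4', 'block5'))

# add a new classifier on top, as before (6 output classes assumed here as an example)
x = Flatten()(base_model.output)
x = Dense(6, activation='softmax', name='softmax')(x)
model = Model(inputs=base_model.input, outputs=x)

# a small learning rate so the pretrained weights are not distorted too quickly
model.compile(optimizer=Adam(lr=1e-5),
              loss='categorical_crossentropy',
              metrics=['accuracy'])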
124 | Pa g e What is fine-tuning? The formal definition of fine-tuning is: freezing a few of the network layers that are used for feature extraction, and jointly training both the non-frozen layers and the newly added classifier layers of the pretrained model. It is called fine-tuning because when we retrain the feature extraction layers, we "fine tune" the higher-order feature representations to make them more relevant for the new task's dataset.
Why is fine-tuning better than training from scratch? When we train a network from scratch, we usually randomly initialize the weights and apply a gradient descent optimizer to find the set of weights that optimizes our error function. Since these weights start with random values, there is no guarantee that they begin close to the desired optimal values, and if the initial values are far from the optimal ones, the optimizer will take a long time to converge. This is where fine-tuning can be very useful. The pretrained network's weights have already been optimized to learn from its dataset, which means that when we use this network for our problem, we start with the weight values it ended with. This makes the network converge much faster than starting from randomly initialized weights. This is what the term "fine-tuning" refers to: we are fine-tuning the already-optimized weights to fit our new problem instead of training the entire network from scratch with random weights.
Use a smaller learning rate when fine-tuning: it's common to use a smaller learning rate for the ConvNet weights that are being fine-tuned, in comparison to the (randomly initialized) weights of the new linear classifier that computes the class scores for your new dataset. This is because we expect the ConvNet weights to be relatively good already, so we don't wish to distort them too quickly or too much (especially while the new classifier above them is being trained from random initialization).
Choose the appropriate level of transfer learning: choosing the appropriate level of transfer learning is a function of two important factors:
1) The size of the target dataset (small or large): when we have a small dataset, there is probably not much information that the network would learn from training
125 | Pa g e more layers, so it will tend to overfit the new data. In this case, we probably want to do less fine-tuning and rely more on the source dataset.
2) Domain similarity of the source and target datasets: how similar is your new problem to the domain of the original dataset? For example, if your problem is to classify cars and boats, ImageNet could be a good option because it contains a lot of images with similar features. On the other hand, if your problem is to classify chest cancer on X-ray images, this is a completely different domain that will likely require a lot of fine-tuning.
These two factors lead to the four major scenarios below:
1. Target dataset is small and similar to the source dataset.
2. Target dataset is large and similar to the source dataset.
3. Target dataset is small and very different from the source dataset.
4. Target dataset is large and very different from the source dataset.
Scenario #1: target dataset is small and similar to the source dataset: Since the original dataset is similar to our new dataset, we can expect the higher-level features in the pretrained ConvNet to be relevant to our dataset as well. It might then be best to freeze the feature extraction part of the network and only retrain the classifier. If you have a small amount of data, be careful of overfitting when you fine-tune your pretrained network.
Scenario #2: target dataset is large and similar to the source dataset: Since both domains are similar, we can freeze the feature extraction part and retrain the classifier, similar to what we did in scenario #1. But since we have more data in the new domain, we can get a performance boost from fine-tuning through all or part of the pretrained network, with more confidence that we won't overfit. Fine-tuning through the entire network is not really needed because the higher-level features are related (since the datasets are similar). So, a good start is to freeze approximately 60%-80% of the pretrained network and retrain the rest on the new data.
Scenario #3: target dataset is small and different from the source dataset: Since the dataset is different, it might not be best to freeze the higher-level features of the pretrained network, because they contain more dataset-specific features. Instead, it would work better to retrain layers from somewhere earlier in the network, or even to freeze no layers and fine-tune the entire network. However, since you have a small dataset, fine-tuning the entire network on your
126 | Pa g e small dataset might not be a good idea because it makes the model prone to overfitting. A mid-way solution would work better in this case. So, a good start is to freeze approximately the first third or half of the pretrained network. After all, the early layers contain very generic feature maps that will be useful for your dataset even if it is very different.
Scenario #4: target dataset is large and different from the source dataset: Since the new dataset is large, you might be tempted to just train the entire network from scratch and not use transfer learning at all. However, in practice it is often still very beneficial to initialize with the weights of a pretrained model, as it makes the model converge faster. In this case, we have a large dataset that gives us the confidence to fine-tune through the entire network without having to worry about overfitting.
Summary:
127 | Pa g e Figure 5-6: Dataset that is different from the source dataset.
After explaining all the theory that we used in the object detection part, we will now discuss the implementation and provide a summary of all the previous discussions.
5.2 Detecting traffic signs and pedestrians:
Note: this feature is not available in any 2019 vehicles, except maybe Tesla. We will use transfer learning to adapt a pretrained MobileNet SSD (quantized) deep learning model to detect traffic signs and pedestrians. We will train the car to identify and respond to (miniaturized) traffic signs and pedestrians in real time. We first need to detect what is in front of the car; then we can use this information to tell the car to stop, go, turn, or change its speed, etc. The model mainly consists of two parts. First, a base neural network: a CNN that extracts features from an image, from low-level features such as lines, edges, or circles to higher-level features such as a face, a person, a traffic light, or a stop sign. A few well-known base neural networks are LeNet, InceptionNet (aka GoogLeNet), ResNet, VGGNet, AlexNet, and MobileNet.
128 | Pa g e Figure 5-7: ImageNet Challenge top error.
Second, a detection neural network is attached to the end of the base neural network and used to simultaneously identify multiple objects from a single image with the help of the extracted features. Some of the popular detection networks are SSD (Single Shot MultiBox Detector), R-CNN (Region with CNN features), Faster R-CNN, and YOLO (You Only Look Once).
Note: An object detection model is usually named as a combination of its base network type and its detection network type. For example, a "MobileNet SSD" model, an "Inception SSD" model, or a "ResNet Faster R-CNN" model, to name a few. Lastly, for pre-trained detection models, the model name also includes the image dataset it was trained on. A few well-known datasets used in training image classifiers and detectors are the COCO dataset (about 100 common household objects), the Open Images dataset (about 20,000 types of objects), and the iNaturalist dataset (about 200,000 types of animal and plant species). For example, the ssd_mobilenet_v2_coco model uses the 2nd version of MobileNet to extract features, SSD to detect objects, and was pre-trained on the COCO dataset.
129 | Pa g e Keeping track of all these model combinations is no easy task. But thanks to Google, which published a list of pre-trained models with TensorFlow (called the Model Zoo; indeed, it is a zoo of models out there), you can just download the one that suits your needs and use it directly in your projects for detection inference.
Figure 5-8: TensorFlow detection model zoo.
130 | Pa g e Figure 5-9: COCO-trained models.
We used a MobileNet SSD model pre-trained on the COCO dataset, and we applied transfer learning (the fine-tuning approach).
Transfer learning: We want to detect traffic signs and pedestrians, which differ from the COCO classes, so we cannot use the first or second approach of transfer learning. We also want to benefit from fine-tuning to accelerate the training process, improve the accuracy of the results, and avoid overfitting. Therefore, we use the fine-tuning approach of transfer learning, which starts with the parameters of a pre-trained model, supplies it with only 100-200 of our own images and labels, and spends only a few hours training parts of the detection neural network (or a few minutes when using Google Colab). The intuition is that in a pre-trained model, the base CNN layers are already good at extracting features from
131 | Pa g e images, since these models are trained on a vast number and a large variety of images. The distinction is that we now have a different set of object types (7) than the pre-trained models (~100-100,000 types).
Model training steps:
1. Image collection and labeling.
2. Model selection.
3. Transfer learning / model training.
4. Save the model output in Edge TPU format and in normal format (to work on a normal laptop).
5. Run model inferences on the Raspberry Pi.
Image collection and labeling: We have 7 object types, namely Red Light, Green Light, Stop Sign, 40 Mph Speed Limit, 25 Mph Speed Limit, car, and a few Lego figurines as pedestrians. We took about 200 photos similar to the one above and placed the objects randomly in each image. Then we labeled each image with the bounding box for each object on the
132 | Pa g e image. There is a free tool called labelImg (for Windows/Mac/Linux) which made this daunting task feel like a breeze. All we had to do was point labelImg to the folder where the training images are stored and, for each image, drag a box around each object and choose an object type (if it was a new type, we could quickly create one). Afterward, we simply split the images (along with their label XML files) randomly into train and test folders.
133 | Pa g e 5.2.1 Model selection (the most tedious part):
On a Raspberry Pi, since we have limited computing power, we have to choose a model that runs both relatively fast and accurately. After experimenting with a few models, we settled on the MobileNet v2 SSD COCO model as the optimal balance between speed and accuracy.
Note: we tried faster_rcnn_inception_v2_coco; it was very accurate but very slow. We also tried YOLOv3; it was very fast but had very low accuracy, and its training process is very complex.
Furthermore, for our model to work on the Edge TPU accelerator, we had to choose the quantized MobileNet v2 SSD COCO model. Quantization is a way to make model inference run faster by storing the model parameters not as floating-point values but as integer values, which decreases the required memory and computational power with very little degradation in prediction accuracy [2]. The Edge TPU hardware is optimized for, and can only run, quantized models. Also, running this quantized model on a PC or laptop increases the speed, which is an important factor in real-time object detection.
134 | Pa g e Our model is:
Model name: 'ssd_mobilenet_v2_quantized_300x300_coco_2019_01_03'
pipeline_file: 'ssd_mobilenet_v2_quantized_300x300_coco.config'
batch_size: 12
5.2.2 Transfer learning / model training / testing:
For this step, we used Google Colab again. This section is based on Chengwei's excellent tutorial "How to train an object detection model easy for free". We present the key parts of our Jupyter notebook below.
Section 1: Mount Google Drive. We mount Google Drive and save the modeling output files (.ckpt) there, so that they won't be wiped out when the Colab virtual machine restarts (it has an idle timeout of 90 minutes and a maximum daily usage of 12 hours). Google will ask for an authentication code when you run the mounting code; just follow the link in the output and allow access. You can put the model_dir anywhere in your Google Drive, but you should create this path in your Drive first or you will get an error.
Section 2: Configs and hyperparameters. A variety of models are supported; you can find more pretrained models in the TensorFlow detection model zoo (COCO-trained models), as well as their pipeline config files in object_detection/samples/configs/. We have already discussed what "ssd_mobilenet_v2_quantized" refers to. "300x300" refers to the input image size, so we will need to resize images when using the model for testing or detection after training. "coco_2019_01_03" refers to the dataset it was originally trained on.
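The notebook cells for these two sections were included as screenshots; a minimal sketch of what they do is shown below. The drive-mount call is the standard Colab API; the directory path and the configuration variables are our own assumptions, not the exact notebook contents.

# Section 1: mount Google Drive so checkpoints survive VM restarts
from google.colab import drive
drive.mount('/content/gdrive')

# hypothetical output directory inside Drive; create it beforehand to avoid errors
model_dir = '/content/gdrive/My Drive/object_detection/training'

# Section 2: configs and hyperparameters for the selected model
MODEL = 'ssd_mobilenet_v2_quantized_300x300_coco_2019_01_03'
pipeline_file = 'ssd_mobilenet_v2_quantized_300x300_coco.config'
batch_size = 12
num_steps = 2000          # example number of training steps; tuned by experiment
num_classes = 7           # our 7 object types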
135 | Pa g e The pipeline file contains hyperparameter values such as the type of optimizer, the learning rate, etc.; we will see its content later.
Section 3: Set up the training environment. Install the required packages:
136 | Pa g e These packages are the modules and libraries that will be used in the training process.
Prepare the tfrecord files: After running this step, you will have two files, train.record and test.record. Both are binary files, each containing the encoded JPEG images and the bounding-box annotation information for the corresponding train/test set, so that TensorFlow can process them quickly. The TFRecord file format is easier to use and faster to load during the training phase compared to storing each image and annotation separately. There are two steps in doing so:
• Converting the individual *.xml files into a unified *.csv file for each set (train/test).
• Converting the annotation *.csv and image files of each set (train/test) into *.record files (TFRecord format).
Use the following scripts to generate the tfrecord files as well as the label_map.pbtxt file; a small sketch of the first conversion step is shown below.
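As an illustration of the first step (the actual scripts in the notebook follow Chengwei's tutorial and are not reproduced here), the following sketch parses the Pascal VOC XML files produced by labelImg and collects one CSV row per bounding box. The folder layout and column names are our own assumptions.

import glob
import os
import xml.etree.ElementTree as ET
import pandas as pd

def xml_to_csv(xml_dir):
    """Collect every bounding box from labelImg XML files into one DataFrame."""
    rows = []
    for xml_file in glob.glob(os.path.join(xml_dir, '*.xml')):
        root = ET.parse(xml_file).getroot()
        filename = root.find('filename').text
        width = int(root.find('size/width').text)
        height = int(root.find('size/height').text)
        for obj in root.findall('object'):
            box = obj.find('bndbox')
            rows.append({
                'filename': filename, 'width': width, 'height': height,
                'class': obj.find('name').text,
                'xmin': int(box.find('xmin').text), 'ymin': int(box.find('ymin').text),
                'xmax': int(box.find('xmax').text), 'ymax': int(box.find('ymax').text),
            })
    return pd.DataFrame(rows)

# hypothetical folder layout: one XML per image in data/train and data/test
for split in ('train', 'test'):
    xml_to_csv(os.path.join('data', split)).to_csv(f'{split}_labels.csv', index=False)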
137 | Pa g e Download the pre-trained model:
Figure 5-10: Download the pre-trained model.
The above code downloads the pre-trained model files for the ssd_mobilenet_v2_quantized_300x300_coco_2019_01_03 model; we only use the model.ckpt file, from which we apply transfer learning.
Section 4: Transfer learning training: configuring the training pipeline. To do the transfer learning training, we first downloaded the pre-trained model weights/checkpoints and then configured the corresponding pipeline.config file to tell the trainer the following information:
• the pre-trained model checkpoint path (fine_tune_checkpoint),
• the path to the two tfrecord files,
    138 | Pa g e • path to the label_map.pbtxt file(label_map_path),
• the training batch size (batch_size),
• the number of training steps (num_steps),
• the number of classes of unique objects (num_classes),
• the type of optimizer,
• the learning rate.
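As a sketch of this configuration step, in the spirit of the tutorial referenced above rather than the project's exact notebook, the snippet below rewrites the relevant fields of pipeline.config with simple regular expressions; all paths and numbers are placeholder values.

import re

pipeline_fname = 'ssd_mobilenet_v2_quantized_300x300_coco.config'
fine_tune_checkpoint = 'ssd_mobilenet_v2_quantized_300x300_coco_2019_01_03/model.ckpt'
num_classes, batch_size, num_steps = 5, 12, 20000

with open(pipeline_fname) as f:
    cfg = f.read()

cfg = re.sub(r'fine_tune_checkpoint: ".*?"',
             f'fine_tune_checkpoint: "{fine_tune_checkpoint}"', cfg)
# The zoo configs list the train input_path first and the eval input_path second.
paths = iter(['train.record', 'test.record'])
cfg = re.sub(r'input_path: ".*?"', lambda m: f'input_path: "{next(paths)}"', cfg)
cfg = re.sub(r'label_map_path: ".*?"', 'label_map_path: "label_map.pbtxt"', cfg)
cfg = re.sub(r'num_classes: \d+', f'num_classes: {num_classes}', cfg)
cfg = re.sub(r'batch_size: \d+', f'batch_size: {batch_size}', cfg)
cfg = re.sub(r'num_steps: \d+', f'num_steps: {num_steps}', cfg)

with open(pipeline_fname, 'w') as f:
    f.write(cfg)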
Figure5-11 Part of the config file showing the batch size, optimizer type and learning rate (which varies over training in this case).
Figure5-12 Part of the config file containing the image-resizer settings that make the image suitable for the CNN (300x300), and the architecture of the box-predictor CNN, which includes regularization and dropout to avoid overfitting.
During training, we can monitor the progression of loss and precision via TensorBoard. For the test dataset we can see that the loss kept dropping and the precision kept increasing throughout training, which is a good sign that training is working as expected. Total loss (lower right) keeps dropping.
Figure5-13 mAP (top left), a measure of precision, keeps on increasing.

Test the Trained Model
After training, we ran a few images from the test dataset through the new model. As expected, almost all the objects in the images were identified with relatively high confidence. In a few images, objects that were farther away were not detected. That is fine for our purpose, because we only want to detect nearby objects so we can respond to them; farther objects become larger and easier to detect as the car approaches them. The accuracy was about 92%.
Code and results of testing:
The printed number represents the distance of the object from the camera, which is used to decide when to act on the detection.

Problems: We used our laptops to run the model on the images received from the Pi camera, but there was some delay due to the transmission from the Pi to the laptop over the socket on the LAN. To solve this problem we would need an Edge TPU, but this component was not available in Egypt.
Figure5-14 Google Edge TPU.
Next we take a look at the Edge TPU and explain why it is important and how it helps.

5.3 Google's Edge TPU: What? How? Why?
The Edge TPU is basically the Raspberry Pi of machine learning: a device that performs inference at the edge with its TPU.

Cloud vs Edge: Running code in the cloud means that you use the CPUs, GPUs and TPUs of a company that makes them available to you via your browser. The main advantage of running code in the cloud is that you can assign the necessary amount of computing power for that specific code (training large models can take a lot of computation). The edge is the opposite of the cloud: it means you are running your code on premise, on a device you can physically touch. The main advantage of running code on the edge is that there is no network latency. As IoT devices usually generate frequent data, running code on the edge is a good fit for IoT-based solutions.
Figure5-15 CPU vs GPU vs TPU.
A TPU (Tensor Processing Unit) is another kind of processing unit, like a CPU or a GPU. There are, however, some big differences. The biggest is that a TPU is an ASIC (Application-Specific Integrated Circuit), a chip optimized to perform one specific kind of task; for a TPU this task is the multiply-add operation that dominates neural networks. CPUs and GPUs are not optimized for one specific application, so they are not ASICs. A CPU performs the multiply-add operation by reading each input and weight from memory, multiplying them in its ALU (the calculator in the figure above), writing the result back to memory, and finally adding up all the multiplied values. Modern CPUs are strengthened by a massive cache, branch prediction and a high clock rate on each core, which all contribute to lower latency. A GPU does the same thing but has thousands of ALUs to perform its calculations; a calculation can be parallelized over all the ALUs. This is called SIMD, and a perfect example is the multiply-add operation in neural networks. A GPU, however, does not use the latency-lowering features mentioned above, and it also needs to orchestrate its thousands of ALUs, which further increases latency. In short, a GPU drastically increases its throughput by parallelizing its computation in exchange for an increase in latency. A TPU, on the other hand, operates very differently: its ALUs are directly connected to each other without going through memory, so they can pass results along directly, which drastically decreases latency.
Performance
As a comparison, consider this:
• A CPU can handle tens of operations per cycle.
• A GPU can handle tens of thousands of operations per cycle.
• A TPU can handle up to 128,000 operations per cycle.

Purpose
• Central Processing Unit (CPU): a processor designed to solve every computational problem in a general fashion; its cache and memory design are optimized for any general programming problem.
• Graphics Processing Unit (GPU): a processor designed to accelerate the rendering of graphics.
• Tensor Processing Unit (TPU): a co-processor designed to accelerate deep learning tasks developed using TensorFlow (a programming framework). Compilers that would let the TPU be used for general-purpose programming have not been developed, so it requires significant effort to do general programming on a TPU.

Usage
• Central Processing Unit (CPU): general-purpose programming problems.
• Graphics Processing Unit (GPU): graphics rendering, machine learning model training and inference, problems with parallelization scope, general-purpose programming problems.
• Tensor Processing Unit (TPU): machine learning model training and inference (TensorFlow models only).

Manufacturers
• Central Processing Unit (CPU): Intel, AMD, Qualcomm, NVIDIA, IBM, Samsung, Hewlett-Packard, VIA, Atmel and many others.
• Graphics Processing Unit (GPU): NVIDIA, AMD, Broadcom Limited, Imagination Technologies (PowerVR).
• Tensor Processing Unit (TPU): Google.
Figure5-14 Google Edge TPU.
Quantization
A last important note on TPUs is quantization. Since Google's Edge TPU uses 8-bit weights to do its calculations, while 32-bit weights are typically used, we need to be able to convert weights from 32 bits to 8 bits. This process is called quantization. Quantization basically rounds the more accurate 32-bit number to the nearest representable 8-bit number, as shown visually in the figure below.
Figure5-15 Quantization.
Rounding numbers decreases accuracy. However, neural networks generalize very well (dropout is one example of this robustness) and therefore do not take a big hit when quantization is applied, as shown in the figure below.
Figure5-16 Accuracy of non-quantized models vs quantized models.
The advantages of quantization are significant: it reduces computation and memory needs, which leads to more energy-efficient computation.
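To make the rounding concrete, the toy example below applies the basic affine quantization arithmetic (a scale and a zero point) to a few float32 weights and converts them back; it only illustrates the idea, not the Edge TPU's internal implementation, and the scale/zero-point values are arbitrary examples.

import numpy as np

def quantize(x, scale, zero_point):
    # Map float values onto 8-bit integers: round(x / scale) + zero_point, clipped to [0, 255].
    return np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

weights = np.array([-0.51, 0.003, 0.27, 0.49], dtype=np.float32)
scale, zero_point = 1.0 / 255.0, 128          # example parameters covering roughly [-0.5, 0.5]
q = quantize(weights, scale, zero_point)
print(q, dequantize(q, scale, zero_point))     # small rounding error vs. the originals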
Chapter 6: Lane Keeping System
6.1 Introduction
Currently, there are a few 2018–2019 cars on the market that have two of these features onboard, namely Adaptive Cruise Control (ACC) and some form of Lane Keep Assist System (LKAS). Adaptive cruise control uses radar to detect and keep a safe distance from the car in front; this feature has been around since about 2012–2013. Lane Keep Assist is a relatively new feature which uses a windshield-mounted camera to detect lane lines and steers so that the car stays in the middle of the lane. This is an extremely useful feature when driving on a highway, both in bumper-to-bumper traffic and on long drives. The author of the DeepPiCar tutorial that we follow in this chapter describes a 35-hour family drive from Chicago to Colorado during which a Volvo XC90, which has both ACC and LKAS (Volvo calls it Pilot Assist), did an excellent job on the highway: about 95% of the long and boring highway miles were driven by the car. The driver only had to keep a hand on the steering wheel and watch the road; there was no need to steer, brake, or accelerate when the road curved and wound, or when the car in front slowed down or stopped, not even when a car cut in from another lane. The few hours it couldn't drive itself were during a snowstorm, when the lane markers were covered by snow. Curious how this works, we set out to replicate it on a smaller scale and build LKAS into our DeepPiCar. Implementing ACC requires a radar, which our PiCar doesn't have; an ultrasonic sensor, which like radar measures distance but at closer range, is a suitable substitute for a small-scale robotic car.

6.2 Timeline of available systems
1992: Mitsubishi Motors began offering a camera-assisted lane-keeping support system on the Mitsubishi Debonair sold in Japan.
2001: Nissan Motors began offering a lane-keeping support system on the Cima sold in Japan.
2002: Toyota introduced its Lane Monitoring System on models such as the Cardina and Alphard sold in Japan; this system warns the driver if it appears the vehicle is beginning to drift out of its lane. In 2004, Toyota added a Lane Keeping Assist feature to the Crown Majesta which can apply a small counter-steering force to aid in keeping the vehicle in its lane. In 2006, Lexus introduced a multi-mode Lane Keeping Assist system on the LS
    154 | Pa g e 460, which utilizes stereo cameras and more sophisticated object- and pattern- recognition processors. This system can issue an audiovisual warning and also (using the Electric Power Steering or EPS) steer the vehicle to hold its lane. It also applies counter-steering torque to help ensure the driver does not over-correct or "saw" the steering wheel while attempting to return the vehicle to its proper lane. If the radar cruise control system is engaged, the Lane Keep function works to help reduce the driver's steering-input burden by providing steering torque; however, the driver must remain active or the system will deactivate. 2003: Honda launched its Lane Keep Assist System (LKAS) on the Inspire. It provides up to 80% of steering torque to keep the car in its lane on the highway. It is also designed to make highway driving less cumbersome, by minimizing the driver's steering input. A camera, mounted at the top of the windshield just above the rear- view mirror, scans the road ahead in a 40-degree radius, picking up the dotted white lines used to divide lane boundaries on the highway. The computer recognizes that the driver is "locked into" a particular lane, monitors how sharp a curve is and uses factors such as yaw and vehicle speed to calculate the steering input required. 2004: In 2004, the first passenger-vehicle system available in North America was jointly developed by Iteris and Valeo for Nissan on the Infiniti FX and (in 2005) the M vehicles. In this system, a camera (mounted in the overhead console above the mirror) monitors the lane markings on a roadway. A warning tone is triggered to alert the driver when the vehicle begins to drift over the markings. 2005: Citroën became the first in Europe to offer LDWS on its 2005 C4 and C5 models, and its C6. This system uses infrared sensors to monitor lane markings on the road surface, and a vibration mechanism in the seat alerts the driver of deviations. 2007: In 2007, Audi began offering its Audi Lane Assist feature for the first time on the Q7. This system, unlike the Japanese "assist" systems, will not intervene in actual driving; rather, it will vibrate the steering wheel if the vehicle appears to be exiting its lane. The LDW System in Audi is based on a forward-looking video- camera in its visible range, instead of the downward-looking infrared sensors in the Citroën. Also, in 2007, Infiniti offered a newer version of its 2004 system, which it called the Lane Departure Prevention (LDP) system. This feature utilizes the vehicle stability control system to help assist the driver maintain lane position by applying gentle brake pressure on the appropriate wheels.
    155 | Pa g e 2008: General Motors introduced Lane Departure Warning on its 2008 model-year Cadillac STS, DTS and Buick Lucerne models. The General Motors system warns the driver with an audible tone and a warning indicator on the dashboard. BMW also introduced Lane Departure Warning on the 5 series and 6 series, using a vibrating steering wheel to warn the driver of unintended departures. In late 2013 BMW updated the system with Traffic Jam Assistant appearing first on the redesigned X5, this system works below 25mph. Volvo introduced the Lane Departure Warning system and the Driver Alert Control on its 2008 model-year S80, the V70 and XC70 executive cars. Volvo's lane departure warning system uses a camera to track road markings and sound an alarm when drivers depart them. lane without signaling. The systems used by BMW, Volvo and General Motors are based on core technology from Mobileye. 2009: Mercedes-Benz began offering a Lane Keeping Assist function on the new E- class. This system warns the driver (with a steering-wheel vibration) if it appears the vehicle is beginning to leave its lane. Another feature will automatically deactivate and reactivate if it ascertains the driver is intentionally leaving his lane (for instance, aggressively cornering). A newer version will use the braking system to assist in maintaining the vehicle's lane. In 2013 on the redesigned S-class Mercedes began Distronic Plus with Steering Assist and Stop &Go Pilot. 2010: Kia Motors offered the 2011 Cadenza premium sedan with an optional Lane Departure Warning System (LDWS) in limited markets. This system uses a flashing dashboard icon and emits an audible warning when a white lane marking is being crossed, and emits a louder audible warning when a yellow-line marking is crossed. This system is canceled when a turn signal is operating, or by pressing a deactivation switch on the dashboard; it works by using an optical sensor on both sides of the car. Fiat is also launching its Lane Keep Assist feature based on TRW's lane keeping assist system (also known as the Haptic Lane Feedback system). This system integrates the lane- detection camera with TRW's electric power-steering system; when an unintended lane departure is detected (the turn signal is not engaged to indicate the driver's desire to change lanes), the electric power- steering system will introduce a gentle torquethat will help guide the driver back toward the center of the lane. Introduced on the Lancia Delta in 2008, this system earned the Italian Automotive Technical Association's Best Automotive Innovation of the Year Award for 2008. Peugeot introduced the same system as Citroën in its new 308.
    156 | Pa g e 6.3 Current lane keeping system in market Many automobile manufacturers provide optional lane keeping systems including Nissan, Toyota, Honda, General Motors, Ford, Tesla, and many more. However, these systems require human monitoring and acceleration/deceleration inputs are not completely automatic. Ford’s system4 uses a single camera mounted behind the windshield’s rear-view mirror to monitor the road lane markings. The system can only be used when driving above 40 mph and is detecting at least one lane marking. When the system is active, it will alert the driver if they are drifting out of lane or provide some steering torque towards the lane center. If thesystem detects no steering activity for a short period, the system will alert the driver to put their hands on the steering wheel. The lane keeping system can also be temporarily suppressed by certain actions such as quick braking, fast acceleration, use of the turn signal indicator, or an evasive steering maneuver. Ford’s system also allows the choice between alerting, assisting, or both when active. All these systems use similar strategies in aiding a human driver to stay in lane, but do not allow full autonomous driving4-7. GM, in particular, warns that their lane keeping system should not be used while towing a trailer or on slippery roads, as it could cause loss of control of the vehicle and a crash5. 6.4 overview of lane keeping algorithms The camera and radar system detect the relationship between the vehicle position and the lane mark and then send this information to the lane departure warning algorithm. The algorithm integrates sensors’ information, the GPS position information and the vehicle state information. Most of the published literature on LDWS and LKAS uses visual sensors to obtain lane line information, and combined with warning decision algorithms to identify whether the vehicle has a tendency to departure
    157 | Pa g e from the original lane. The lane departure warning algorithms used by various research institutions are basically divided into two categories, one is the combination of road structure model and image information, and the other is only using image information. There are eight types of departure warning algorithms that are currently used commonly: TLC algorithm, FOD algorithm, CCP algorithm, instantaneous lateral displacement algorithm, lateral velocity algorithm, Edge Distribution Function (EDF) algorithm, and Time to Trajectory Divergence (TTD) algorithm and Road Rumble Strips (RRS) algorithm. The RRS algorithm belongs to the combination of road structure and image information, which requires the installation of vibration bands or constructing new roads. In the existing road, a 15cm~45cm groove is placed on the shoulder of the road. If the vehicle deviates from the lane and enters the groove, the tire will rub against the groove due to contact, and the sound of the friction will remind the driver the departure from the original lane. The seven departure warning algorithms except RRS belong to the algorithm using only image information. In order to clearly understand the advantages and disadvantages of various algorithms, and to guide the study of the lane departure warning algorithm, the comparative analysis of the above eight warning algorithms is shown in table:
Algorithm | Pros | Cons
TLC | Long warning time | Fixed parameters
FOD | Multiple warning thresholds | Limited reaction time for the driver
CCP | Based on real-time position | Requires a high-precision sensor
Lateral velocity | Easy to define | High false-alarm rate
TTD | Always follows the lane center | Complex algorithm; works poorly in bends
Instantaneous lateral displacement | Simple algorithm, easy to realize | Ignores the vehicle trajectory; relatively high false-alarm rate
EDF | No need for a camera | Complex algorithm
RRS | Effective alert | High cost

Figure6.1 Comparison of the lane departure warning algorithms.

From the above analysis it is clear that each algorithm has different advantages and disadvantages. The eight common departure warning algorithms all have certain limitations, and a given algorithm does not change once it is chosen. But factors such as age, gender, and driving experience mean that almost every driver has their own driving habits. Therefore, an efficient and practical warning algorithm must not only have high precision, but should also adapt to the driving habits of different types of drivers. Among the above methods, TLC has simple usage conditions and high precision, and is the most widely used in LDWS- and LKAS-related products. FOD considers driving habits when setting the virtual lane boundary line, but its accuracy is limited. Therefore, in order to make LKAS adapt to different types of
drivers to the maximum extent, this work improves the existing TLC and FOD algorithms, establishing TLC and FOD algorithms with selectable modes and multiple working conditions, and based on them proposes the concept of dynamic warning boundaries and warning parameters. The FOD algorithm originally matched driver habits by setting different virtual lane boundaries; the design considered here also takes the surrounding traffic into account, and is therefore more adaptive to diverse driving habits. The driver can choose the appropriate LKAS working mode and warning boundary based on personal driving habits and experience with LKAS.

6.5 Perception: Lane Detection
A lane keep assist system has two components: perception (lane detection) and path/motion planning (steering). Lane detection's job is to turn a video of the road into the coordinates of the detected lane lines. One way to achieve this is via the OpenCV computer vision package. But before we can detect lane lines in a video, we must be able to detect lane lines in a single image; once we can do that, detecting lane lines in a video is simply repeating the same steps for every frame. There are several steps.

1- Isolate the Color of the Lane:
Our lane lines are marked with blue painter's tape, because blue is a distinctive color in the test environment and the tape does not leave permanent sticky residue on the floor. The first thing to do is to isolate all the blue areas in the image. To do this, we first convert the image from the RGB (Red/Green/Blue) color space into the HSV (Hue/Saturation/Value) color space. The main idea is that in an RGB image different parts of the blue tape may be lit differently, making them appear as darker or lighter blue, whereas in HSV color space the Hue component renders the entire blue tape as one color regardless of its shading. This is best illustrated with the following image; notice both lane lines are now roughly the same magenta color.
Figure6.2 Image in HSV color space.
Below is the OpenCV command to do this.
Figure6.3 OpenCV command.
Note that we use a BGR-to-HSV transformation, not RGB-to-HSV. This is because OpenCV, for legacy reasons, reads images into the BGR (Blue/Green/Red) color space by default instead of the more commonly used RGB color space; the two are essentially equivalent, just with the order of the colors swapped. Once the image is in HSV, we can "lift" all the blueish colors from the image by specifying a range for the color blue. In the Hue color space, blue lies roughly in the 120–300 degree range on a 0–360 degree scale. You can specify a tighter range for blue, say 180–300 degrees, but it doesn't matter too much.
Figure6.4 Hue on a 0–360 degree scale.
Here is the code to lift the blue out via OpenCV, and the rendered mask image.
Figure6.5 Code to lift the blue out via OpenCV, and the rendered mask image.
Blue area mask.
Note that OpenCV uses a Hue range of 0–180 instead of 0–360, so the blue range we need to specify in OpenCV is 60–150 (instead of 120–300). These are the first elements of the lower- and upper-bound arrays. The second (Saturation) and third (Value) parameters are not as important; a 40–255 range works reasonably well for both Saturation and Value. Note that this technique is exactly what movie studios and weather forecasters use every day: they usually use a green screen as a backdrop so that they can swap the green color with a thrilling video of a T-Rex charging towards us (for a movie), or the live doppler radar map (for the weather forecast).

2- Detecting Edges of Lane Lines:
Next we need to detect edges in the blue mask so that we have a few distinct lines that represent the blue lane lines. The Canny edge detection function is a powerful command that detects edges in an image. In the code below, the first parameter is the blue mask from the previous step; the second and third parameters are the lower and upper thresholds for edge detection, which OpenCV recommends to be (100, 200) or (200, 400). We use (200, 400).
Figure6.6 OpenCV recommendation.
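A minimal OpenCV sketch of the two steps just described, using the Hue range 60–150 and the (200, 400) Canny thresholds quoted above (the exact code used in the project was shown as figures); `frame` is assumed to be a BGR image from the camera.

import cv2
import numpy as np

def detect_edges(frame):
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)       # OpenCV loads images as BGR
    lower_blue = np.array([60, 40, 40])
    upper_blue = np.array([150, 255, 255])
    mask = cv2.inRange(hsv, lower_blue, upper_blue)    # keep only the blueish pixels
    edges = cv2.Canny(mask, 200, 400)                  # edges of the blue areas
    return edges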
Figure6.7 Edges of all blue areas.
3- Isolate the Region of Interest:
From the image above, we see that we detected quite a few blue areas that are NOT our lane lines. A closer look reveals that they are all in the top half of the screen. When doing lane navigation we only care about detecting lane lines that are closer to the car, i.e. at the bottom of the screen, so we simply crop out the top half. The result is two clearly marked lane lines, as seen in the image on the right.
Figure6.8 Cropped edges.
Here is the code to do this. We first create a mask for the bottom half of the screen, then merge the mask with the edges image to get the cropped-edges image on the right.
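A sketch of this cropping step, assuming the edge image comes from the detect_edges function above.

import cv2
import numpy as np

def region_of_interest(edges):
    height, width = edges.shape
    mask = np.zeros_like(edges)
    # Keep only the bottom half of the image.
    polygon = np.array([[(0, height // 2), (width, height // 2),
                         (width, height), (0, height)]], np.int32)
    cv2.fillPoly(mask, polygon, 255)            # white where we want to keep edges
    return cv2.bitwise_and(edges, mask)         # cropped edges (bottom half only)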
    163 | Pa g e 4- Detect Line Segments: In the cropped edges image above, to us humans, it is pretty obvious that we found four lines, which represent two lane lines. However, to a computer, they are just a bunch of white pixels on a black background. Somehow, we need to extract the coordinates of these lane lines from these white pixels. Luckily, OpenCV contains a magical function, called Hough Transform, which does exactly this. Hough Transform is a technique used in image processing to extract features like lines, circles, and ellipses. We will use it to find straight lines from a bunch of pixels that seem to form a line. The function Hough Lines essentially tries to fit many lines through all the white pixels and return the most likely set of lines, subject to certain minimum threshold constraints. (Read here for an in-depth explanation of Hough LineTransform.) Here is the code to detect line segments. Internally, Hough Line detects lines using Polar Coordinates. Polar Coordinates (elevation angle and distance from the origin) is superior to Cartesian Coordinates (slope and intercept), as it can represent any lines, including vertical lines which Cartesian Coordinates cannot because the slope of a vertical line is infinity. Hough Line takes a lot of parameters: 1) rho is the distance precision in pixel. We will use one pixel. 2) angle is angular precision in radian. (Quick refresher on Trigonometry: radian is another way to express the degree of angle. i.e. 180 degrees in radian is 3.14159, which is π) We will use one degree. 3) Min threshold is the number of votes needed to be considered a line
segment. If a line gets more votes, Hough Transform considers it more likely to be a real line segment.
4) minLineLength is the minimum length of the line segment in pixels. Hough Transform won't return any line segments shorter than this minimum length.
5) maxLineGap is the maximum gap in pixels between two line segments that can still be treated as a single line segment. For example, if we had dashed lane markers, by specifying a reasonable max line gap, Hough Transform would consider the entire dashed lane line as one straight line, which is desirable.
Setting these parameters is really a trial-and-error process. Below are values of the kind that work well for a robotic car with a 320x240 resolution camera running between solid blue lane lines; of course, they would need to be re-tuned for a full-sized car with a high-resolution camera running on a real road with white/yellow dashed lane lines.
Figure6.9 Line segments detected by Hough Transform.
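A sketch of the Hough call with the parameter choices discussed above; the threshold, minimum-length and maximum-gap values here are illustrative assumptions and would need tuning for a different camera or track.

import cv2
import numpy as np

def detect_line_segments(cropped_edges):
    rho = 1                      # distance precision: 1 pixel
    angle = np.pi / 180          # angular precision: 1 degree, expressed in radians
    min_threshold = 10           # minimum number of votes to accept a line
    line_segments = cv2.HoughLinesP(cropped_edges, rho, angle, min_threshold,
                                    np.array([]), minLineLength=8, maxLineGap=4)
    return line_segments         # each segment is [[x1, y1, x2, y2]]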
Combine Line Segments into Two Lane Lines:
Now that we have many small line segments with their endpoint coordinates (x1, y1) and (x2, y2), how do we combine them into just the two lines that we really care about, namely the left and right lane lines? One way is to classify the line segments by their slopes. We can see from the picture above that all line segments belonging to the left lane line should be upward sloping and on the left side of the screen, whereas all line segments belonging to the right lane line should be downward sloping and on the right side of the screen. Once the line segments are classified into two groups, we just take the average of the slopes and intercepts of the line segments in each group to get the slope and intercept of the left and right lane lines. The average_slope_intercept function below implements the above logic.
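A sketch of average_slope_intercept (and its make_points helper, described next), consistent with the logic above; the 1/3-of-the-width boundary used to decide which side of the screen a segment belongs to is an illustrative choice.

import numpy as np

def make_points(frame, line):
    height, width, _ = frame.shape
    slope, intercept = line
    y1 = height                       # bottom of the frame
    y2 = int(y1 * 1 / 2)              # draw the lane line up to the middle of the frame
    x1 = max(-width, min(2 * width, int((y1 - intercept) / slope)))
    x2 = max(-width, min(2 * width, int((y2 - intercept) / slope)))
    return [[x1, y1, x2, y2]]

def average_slope_intercept(frame, line_segments):
    lane_lines = []
    if line_segments is None:
        return lane_lines
    height, width, _ = frame.shape
    left_fit, right_fit = [], []
    boundary = 1 / 3
    left_region = width * (1 - boundary)   # left-lane segments should lie in the left 2/3
    right_region = width * boundary        # right-lane segments should lie in the right 2/3
    for segment in line_segments:
        for x1, y1, x2, y2 in segment:
            if x1 == x2:
                continue                   # skip vertical segments (infinite slope)
            slope, intercept = np.polyfit((x1, x2), (y1, y2), 1)
            # Image y grows downward, so the visually upward-sloping left lane has negative slope.
            if slope < 0 and x1 < left_region and x2 < left_region:
                left_fit.append((slope, intercept))
            elif slope > 0 and x1 > right_region and x2 > right_region:
                right_fit.append((slope, intercept))
    if len(left_fit) > 0:
        lane_lines.append(make_points(frame, np.average(left_fit, axis=0)))
    if len(right_fit) > 0:
        lane_lines.append(make_points(frame, np.average(right_fit, axis=0)))
    return lane_lines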
make_points is a helper function for the average_slope_intercept function; it takes a line's slope and intercept and returns the endpoints of the corresponding line segment. Other than the logic described above, there are a couple of special cases worth discussing.
1. One lane line in the image: In normal scenarios we expect the camera to see both lane lines. However, there are times when the car starts to wander out of the lane, maybe due to flawed steering logic or because the lane bends too sharply; at such times the camera may capture only one lane line. That is why the code above needs to check len(right_fit) > 0 and len(left_fit) > 0.
2. Vertical line segments: vertical line segments are detected occasionally as the car is turning. Although they are not erroneous detections, vertical lines have an infinite slope, so we can't average them with the slopes of the other line segments. For simplicity's sake, we chose to just ignore them; as vertical lines are not very common, doing so does not noticeably affect the overall performance of the lane detection algorithm. Alternatively, one could flip the X and Y coordinates of the image, so that vertical lines have a slope of zero and can be included in the average; then horizontal line segments would have infinite slope instead, but that would be extremely rare, since the dashcam generally points in the same direction as the lane lines, not perpendicular to them. Another alternative is to represent the line segments in polar coordinates and average the angles and distances to the origin.
6.6 Motion Planning: Steering
Now that we have the coordinates of the lane lines, we need to steer the car so that it stays within the lane lines, and even better, we should try to keep it in the middle of the lane. Basically, we need to compute the steering angle of the car given the detected lane lines.

Two Detected Lane Lines: This is the easy scenario, as we can compute the heading direction by simply averaging the far endpoints of both lane lines. The red line shown below is the heading. Note that the lower end of the red heading line is always in the middle of the bottom of the screen, because we assume the dashcam is installed in the middle of the car and points straight ahead.

One Detected Lane Line: If we only detect one lane line, this is a bit trickier, as we can't average two endpoints anymore. But observe that when we see only the left (or right) lane line, it means we need to steer hard towards the right (or left) to continue following the lane. One solution is to set the heading line to the same slope as the only detected lane line, as shown below.
Steering Angle: Now that we know where we are headed, we need to convert that heading into a steering angle so we can tell the car to turn. Remember that for this PiCar a steering angle of 90 degrees means heading straight, 45–89 degrees means turning left, and 91–135 degrees means turning right. Below is some trigonometry to convert a heading coordinate to a steering angle in degrees. Note that the PiCar API works in degrees, but all the trigonometric math is done in radians.

Displaying the Heading Line: We have shown several pictures above with the heading line. Here is the code that renders it; the input is actually the steering angle.
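A sketch of both pieces described above: converting the detected lane lines into a steering angle (45–135 degrees, 90 = straight) and rendering the heading line. The lane_lines format matches the averaging sketch earlier; the drawing details are illustrative assumptions.

import math
import cv2

def compute_steering_angle(frame_height, frame_width, lane_lines):
    if len(lane_lines) == 0:
        return 90                                     # nothing detected: keep straight
    if len(lane_lines) == 2:
        _, _, left_x2, _ = lane_lines[0][0]
        _, _, right_x2, _ = lane_lines[1][0]
        x_offset = (left_x2 + right_x2) / 2 - frame_width / 2   # average of the far endpoints
    else:
        x1, _, x2, _ = lane_lines[0][0]
        x_offset = x2 - x1                            # follow the slope of the single lane line
    y_offset = frame_height / 2                       # heading line spans the lower half
    angle_to_mid = math.degrees(math.atan(x_offset / y_offset))
    return int(angle_to_mid) + 90                     # 90 = straight, <90 left, >90 right

def display_heading_line(frame, steering_angle, color=(0, 0, 255)):
    height, width, _ = frame.shape
    angle_rad = math.radians(steering_angle)
    x1, y1 = width // 2, height                       # middle of the bottom edge
    x2 = int(x1 - height / 2 / math.tan(angle_rad))
    y2 = height // 2
    overlay = frame.copy()
    cv2.line(overlay, (x1, y1), (x2, y2), color, 5)   # draw the red heading line
    return overlay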
Stabilization: Initially, the steering angle computed from each video frame was sent directly to the PiCar. However, during actual road testing, we found that the car sometimes bounced left and right between the lane lines like a drunk driver, and sometimes went completely out of the lane. The cause is that the steering angles computed from one video frame to the next are not very stable (you can run the car in the lane without the stabilization logic to see the effect). Sometimes the steering angle may be around 90 degrees (heading straight) for a while, but, for whatever reason, the computed steering angle can suddenly jump wildly, to say 120 degrees (sharp right) or 70 degrees (sharp left). As a result, the car jerks left and right within the lane, which is clearly not desirable; we need to stabilize the steering. Indeed, in real life we have a steering wheel: if we want to steer right we turn the wheel in a smooth motion, and the steering angle is sent to the car as a continuous sequence, namely 90, 91, 92, ..., 132, 133, 134, 135 degrees, not 90 degrees in one millisecond and 135 degrees in the next. So our strategy for a stable steering angle is the following: if the new angle is more than max_angle_deviation degrees away from the current angle, just steer up to max_angle_deviation degrees in the direction of the new angle. We used two flavors of max_angle_deviation: 5 degrees if both lane lines are detected, which means we are more confident our heading is correct, and 1 degree if only one lane line is detected, which means we are less confident. These are parameters one can tune for one's own car.
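A sketch of this stabilization rule; the two deviation limits (5 and 1 degrees) are the values mentioned above.

def stabilize_steering_angle(curr_angle, new_angle, num_lane_lines,
                             max_dev_two_lines=5, max_dev_one_line=1):
    # Never let the commanded angle move more than max_dev away from the current angle.
    max_dev = max_dev_two_lines if num_lane_lines == 2 else max_dev_one_line
    deviation = new_angle - curr_angle
    if abs(deviation) > max_dev:
        return int(curr_angle + max_dev * (1 if deviation > 0 else -1))
    return new_angle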
6.7 Lane Keeping via Deep Learning
So far we hand-engineered all the steps required to navigate the car: color isolation, edge detection, line-segment detection, steering-angle computation, and steering stabilization. Moreover, there were quite a few parameters to hand-tune, such as the upper and lower bounds of the color blue, the parameters for detecting line segments via the Hough Transform, and the maximum steering deviation during stabilization. If we don't tune all these parameters correctly, the car doesn't run smoothly, and every time we meet new road conditions we have to think of new detection algorithms and program them into the car, which is time-consuming and hard to maintain. In the era of AI and machine learning there is an alternative: let a model learn the behavior from data.

The Nvidia Model: At a high level, the inputs to the Nvidia model are video images from dashcams onboard the car, and the outputs are the steering angles of the car. The model takes the video images, extracts information from them, and tries to predict the car's steering angles. This is a supervised machine learning problem, where video images (the features) and steering angles (the labels) are used in training. Because the steering angles are numerical values, this is a regression problem, not a classification problem where the model needs to predict, say, whether a dog or a cat, or which type of flower, is in the image. At the core of the Nvidia model is a Convolutional Neural Network (CNN, not the cable network). CNNs are used prevalently in image-recognition deep learning models; the intuition is that a CNN is especially good at extracting visual features from images through its successive layers (filters). For example, in a facial-recognition CNN, the earlier layers extract basic features such as lines and edges, middle layers extract more advanced features such as eyes, noses, ears and lips, and later layers extract parts of, or entire, faces.
Figure6.9 CNN architecture.
The network has about 27 million connections and 250 thousand parameters. The above diagram is from Nvidia's paper. It contains about 30 layers in total, not a very deep model by today's standards. The input image to the model (bottom of the diagram) is a 66x200 pixel image, which is a fairly low resolution. The image is first normalized, then passed through 5 groups of convolutional layers, and finally passed through 4 fully connected layers to arrive at a single output: the model's predicted steering angle for the car.
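A sketch of this architecture written with tf.keras. The layer sizes follow Nvidia's paper; input normalization is assumed to happen in preprocessing, and the project's exact model may differ in details such as the dropout rate.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Flatten, Dense, Dropout

def nvidia_model():
    model = Sequential([
        Conv2D(24, (5, 5), strides=(2, 2), activation='elu', input_shape=(66, 200, 3)),
        Conv2D(36, (5, 5), strides=(2, 2), activation='elu'),
        Conv2D(48, (5, 5), strides=(2, 2), activation='elu'),
        Conv2D(64, (3, 3), activation='elu'),
        Conv2D(64, (3, 3), activation='elu'),
        Dropout(0.2),
        Flatten(),
        Dense(100, activation='elu'),
        Dense(50, activation='elu'),
        Dense(10, activation='elu'),
        Dense(1),                                     # regression output: the steering angle
    ])
    model.compile(optimizer='adam', loss='mse')       # mean squared error, as described above
    return model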
Figure6.10 Method of training.
This predicted angle is then compared with the desired steering angle for the video image, and the error is fed back into the CNN training process via backpropagation. As seen in the diagram above, this process is repeated in a loop until the error (the loss, here Mean Squared Error) is low enough, meaning the model has learned how to steer reasonably well. Indeed, this is a pretty typical image-recognition training process, except that the predicted output is a numerical value (regression) instead of the type of an object (classification).

Adapting the Nvidia Model for DeepPiCar: Other than in size, our DeepPiCar is very similar to the car that Nvidia uses: it has a dashcam, and it can be controlled by specifying a steering angle. Nvidia collected its inputs by having its drivers drive a combined 70 hours of highway miles, in various states and multiple cars. We likewise need to collect video footage of our DeepPiCar and record the correct steering angle for each video frame.

Data Acquisition: We wrote a remote-control program so that we can remotely steer the PiCar and have it save the video frames together with the car's steering angle at each frame. This is probably the best way to collect data, since it simulates a real person's driving behavior.
Figure6.11 Distribution of steering angles from data acquisition.
Here is the code to take a recorded video file and save the individual video frames for training. For simplicity, we embed the steering angle in the image file name, so we don't have to maintain a mapping file between image names and steering angles.
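A sketch of that saving step; the steering_angles list and the file-name pattern used here are illustrative assumptions, not the project's exact format.

import cv2

def save_training_frames(video_file, steering_angles, out_dir='data'):
    cap = cv2.VideoCapture(video_file)
    i = 0
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok or i >= len(steering_angles):
            break
        # e.g. data/frame_000123_090.png  ->  frame 123, steering angle 90 degrees
        cv2.imwrite(f'{out_dir}/frame_{i:06d}_{steering_angles[i]:03d}.png', frame)
        i += 1
    cap.release()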
Training / Deep Learning:
Now that we have the features (video images) and labels (steering angles), it is time to do some deep learning. Even though deep learning is all the hype these days, it is important to note that it is just a small part of the whole engineering project; most of the time and work is actually spent on hardware engineering, software engineering, data gathering and cleaning, and finally wiring the predictions of the deep learning model into the production system (the running car). To train the deep learning model we can't use the Raspberry Pi's CPU; we need some GPU muscle. Yet we are on a shoestring budget, so we don't want to pay for an expensive machine with the latest GPU or rent GPU time from the cloud. Luckily, Google offers GPU and even TPU power for free on Google Colab; kudos to Google for giving machine learning enthusiasts a great playground to learn.

Split into Train/Test Set
We split the training data into training/validation sets with an 80/20 split using sklearn's train_test_split method.

Image Augmentation:
The sample training data set only has about 200 images, which is clearly not enough to train a deep learning model. However, we can employ a simple technique called image augmentation. Common augmentation operations are zooming, panning, changing exposure values, blurring, and image flipping. By randomly applying any or all of these operations to the original images, we can generate a lot more training data from the original 200 images, which makes the final trained model much more robust.
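A sketch of the split and of three of the augmentation operations listed above (blur, exposure change, horizontal flip). The probabilities and parameter ranges are illustrative; note that flipping the image horizontally also mirrors the steering label around 90 degrees.

import random
import cv2
from sklearn.model_selection import train_test_split

def augment(image, steering_angle):
    if random.random() < 0.5:
        image = cv2.blur(image, (3, 3))                                       # slight blur
    if random.random() < 0.5:
        image = cv2.convertScaleAbs(image, alpha=random.uniform(0.7, 1.3))    # exposure change
    if random.random() < 0.5:
        image = cv2.flip(image, 1)                                            # horizontal flip...
        steering_angle = 180 - steering_angle                                 # ...mirrors the angle around 90
    return image, steering_angle

# X: list of frames, y: list of steering angles read from the file names
# X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2)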
6.8 Google Colab for Training:
1- Mount Google Drive to Colab to import our training and testing data:
Figure 6.12 Mount Drive to Colab.
2- Import the packages we need:
Figure 6.13 Python packages for training.
3- Load the data from Drive:
Figure 6.14 Loading our data from Drive.
4- Training and testing data distribution:
Figure 6.15 Train and test data.
5- Prepare the Nvidia model:
Figure 6.16 Load model.
Figure 6.17 Model summary.
6- Evaluate the Trained Model:
After training for about 30 minutes, the model finishes its 10 epochs. Now it is time to see how well the training went. The first thing to do is to plot the loss of both the training and validation sets. It is good to see that both training and validation losses declined rapidly together and then stayed very low after epoch 5; there does not seem to be any overfitting issue, as the validation loss stayed as low as the training loss.
Figure6.18 Graph of training and validation loss.
Figure6.19 Results of our model on our data.
Chapter 7: System Integration
7.1 Introduction:
Now we need to connect our subsystems together. First, we need a connection between the Raspberry Pi and the laptop, to send the images taken by the Raspberry Pi camera to the laptop, where they are processed to make a decision. Second, we need a connection between the Raspberry Pi and the Arduino, to send that decision to the Arduino, which carries it out using the other components.
Figure7-1 Diagram of the connections (laptop, Raspberry Pi, Arduino).

7.2 Connection between laptop and Raspberry Pi:
We need a stable, high-speed wireless connection, so we chose TCP. TCP has several advantages we need: it provides extensive error checking through flow control and acknowledgment of data, and sequencing of data is a built-in feature, which means that packets arrive in order at the receiver. We cannot use UDP, because we need every packet in order to reconstruct the whole image on the laptop, and TCP's acknowledgments ensure that every packet we send is actually received.

7.2.1 TCP connection:
7.2.1.1 Connection establishment
To establish a connection, TCP uses a three-way handshake. Before a client attempts to connect with a server, the server must first bind to and listen at a port to open it up for connections; this is called a passive open. Once the passive open is established, a client may initiate an active open.
Figure7.2 Server.
Figure7.3 Client.
In our setup, the Raspberry Pi (client side) records the camera stream into the connection's file object, and the laptop (server side) reads that stream, splits it into frames, and processes it frame by frame.
Figure7.4 Recording the stream into the connection file.
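A hedged sketch of such a link, not the project's exact code (which records the camera stream into the connection's file object, as shown in the figures): here the laptop listens as the TCP server and the Pi connects as the client and streams length-prefixed JPEG frames, so the server can split the byte stream back into individual frames.

# --- laptop (server) side ---
import socket
import struct
import numpy as np
import cv2

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(('0.0.0.0', 8000))          # passive open: bind and listen on a port
server.listen(1)
conn, addr = server.accept()            # completes the three-way handshake

def recv_exact(sock, n):
    buf = b''
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError('client closed the connection')
        buf += chunk
    return buf

while True:
    (length,) = struct.unpack('>I', recv_exact(conn, 4))     # 4-byte frame-length prefix
    jpeg = recv_exact(conn, length)
    frame = cv2.imdecode(np.frombuffer(jpeg, np.uint8), cv2.IMREAD_COLOR)
    # ... run the detection / lane-keeping models on `frame`, send the decision back ...

# --- Raspberry Pi (client) side, sketched as comments ---
# sock = socket.socket(); sock.connect(('LAPTOP_IP', 8000))   # active open
# ok, jpeg = cv2.imencode('.jpg', frame)
# sock.sendall(struct.pack('>I', len(jpeg)) + jpeg.tobytes())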
Figure7.5 Reading from the connection file and splitting frames on the server side.
We then send every frame through the machine learning models, obtain the decision, and send it back to the Raspberry Pi; finally, we terminate the connection.

7.2.1.2 Connection termination
The connection termination phase uses a four-way handshake, with each side of the connection terminating independently. When an endpoint wishes to stop its half of the connection, it transmits a FIN packet, which the other end acknowledges with an ACK. Therefore, a typical tear-down requires a pair of FIN and ACK segments from each TCP endpoint. After both FIN/ACK exchanges are concluded, the side which sent the first FIN before receiving one waits for a timeout before finally closing the connection, during which time the local port is unavailable for new connections; this prevents confusion due to delayed packets being delivered during subsequent connections. A connection can be "half-open", in which case one side has terminated its end but the other has not; the side that has terminated can no longer send any data into the connection, but the other side can, and the terminating side should continue reading data until the other side terminates as well. It is also possible to terminate the connection with a three-way handshake, when host A sends a FIN and host B replies with a FIN & ACK (merely combining two steps into one) and host A replies with an ACK; this is perhaps the most common method. Both hosts may also send FINs simultaneously, in which case each just has to ACK; this could be considered a two-way handshake, since the FIN/ACK sequence is done in parallel in both directions. Some host TCP stacks may implement a half-duplex close sequence, as Linux or HP-UX do: if such a host actively closes a connection but has not yet read all the incoming data the stack has already received from the link, it sends a RST instead of a FIN (Section 4.2.2.13 in RFC 1122). This allows a TCP application to be sure the remote application has read all the data the former
sent, waiting for the FIN from the remote side when it actively closes the connection. However, the remote TCP stack cannot distinguish between a Connection Aborting RST and this Data Loss RST; both cause the remote stack to throw away all the data it has received but that the application has not yet read.
Figure7.6 Terminating the connection from both sides.
Figure7.7 Operation of TCP.

7.3 Connection between Raspberry Pi and Arduino:
Here we need a simple connection that uses the fewest wires, so we use the I2C (I-squared-C) protocol. This protocol needs just two wires and has hardware acknowledgment to ensure the data is received.

7.3.1 I2C (I-squared-C):
The Inter-Integrated Circuit bus (I²C, pronounced I-squared-C or, rarely, I-two-C) is a hardware specification and protocol developed by the semiconductor division of Philips (now NXP Semiconductors) back in 1982. It
is a multi-slave, half-duplex, single-ended, 8-bit-oriented serial bus specification, which uses only two wires to interconnect a given number of slave devices to a master. Until October 2006, the development of I²C-based devices was subject to the payment of royalty fees to Philips, but this limitation has since been lifted.
Figure7.8 Graphical representation of the I2C bus.
In the I²C protocol all transactions are always initiated and completed by the master. This is one of the few rules of this communication protocol to keep in mind while programming (and especially debugging) I²C devices. All messages exchanged over the I²C bus are broken up into two types of frame: an address frame, in which the master indicates which slave the message is being sent to, and one or more data frames, which are 8-bit data messages passed from master to slave or vice versa. Data is placed on the SDA line after SCL goes low, and it is sampled after the SCL line goes high. The time between clock edges and data read/write is defined by the devices on the bus and varies from chip to chip. As said before, both SDA and SCL are bidirectional lines, connected to a positive supply voltage via a current source or pull-up resistors (see Figure 7.8). When the bus is free, both lines are HIGH. The output stages of devices connected to the bus must have an open-drain or open-collector to perform the wired-AND function. The bus capacitance limits the number of interfaces connected to the bus. For a single-master application, the master's SCL output can be a push-pull driver design if there are no devices on the bus that would stretch the clock. We are now going to analyze the fundamental steps of an I²C communication.
    184 | Pa g e Figure7.9 Structure of a base i2 c message. Start and stop condition: All transactions begin with a START and are terminated by a STOP (see Figure 7.9). A HIGH to LOW transition on the SDA line while SCL is HIGH defines a START condition. A LOW to HIGH transition on the SDA line while SCL is HIGH defines a STOP condition. START and STOP conditions are always generated by the master. The bus is considered to be busy after the START condition. The bus is considered to be free again a certain time after the STOP condition. The bus stays busy if a repeated START (also called RESTART condition) is generated instead of a STOP condition (more about this soon). In this case, the START and RESTART conditions are functionally identical. Byte format: Every word transmitted on the SDA line must be eight bits long, and this also includes the address frame as we will see in a while. The number of bytes that can be transmitted per transfer is unrestricted. Each byte must be followed by an Acknowledge (ACK) bit. Data is transferred with the Most Significant Bit (MSB) first (see Figure 2). If a slave cannot receive or transmit another complete byte of data until it has performed some other function, for example servicing an internal interrupt, it can hold the clock line SCL LOW to force the master into a wait state. Data transfer then continues when the slave is ready for another byte of data and releases clock line SCL. Address frame: The address frame is always first in any new communication sequence. For a 7-bit address, the address is clocked out most significant bit (MSB) first, followed by a R/W bit indicating whether this is a read (1) or write (0) operation In a 10-bit addressing system, two frames are required to transmit the slave address.
The first frame will consist of the code 1111 0XXD, where XX are the two MSBs of the 10-bit slave address and D is the R/W bit as described above. The first frame's ACK bit will be asserted by all slaves matching the first two bits of the address. As with a normal 7-bit transfer, another transfer begins immediately, and this transfer contains bits [7:0] of the address. At this point, the addressed slave should respond with an ACK bit; if it doesn't, the failure mode is the same as in a 7-bit system. Note that 10-bit address devices can coexist with 7-bit address devices, since the leading 11110 part of the address is not part of any valid 7-bit address.

Acknowledge (ACK) and Not Acknowledge (NACK):
The ACK takes place after every byte. The ACK bit allows the receiver to signal the transmitter that the byte was successfully received and another byte may be sent. The master generates all clock pulses on the SCL line, including the ninth (ACK) clock pulse. The ACK signal is defined as follows: the transmitter releases the SDA line during the acknowledge clock pulse so that the receiver can pull the SDA line LOW, and it remains stably LOW during the HIGH period of this clock pulse. When SDA remains HIGH during this ninth clock pulse, this is defined as the Not Acknowledge (NACK) signal. The master can then generate either a STOP condition to abort the transfer, or a RESTART condition to start a new transfer. There are five conditions leading to the generation of a NACK:
1. No receiver is present on the bus with the transmitted address, so there is no device to respond with an acknowledge.
2. The receiver is unable to receive or transmit because it is performing some real-time function and is not ready to start communication with the master.
3. During the transfer, the receiver gets data or commands that it does not understand.
4. During the transfer, the receiver cannot receive any more data bytes.
5. A master-receiver must signal the end of the transfer to the slave transmitter.
Data Frames: After the address frame has been sent, data transmission can begin. The master simply continues generating clock pulses on SCL at a regular interval, and the data is placed on SDA by either the master or the slave, depending on whether the R/W bit indicated a write or a read operation. Usually the first one or two bytes contain the address of the slave register to write to or read from; for example, for I²C EEPROMs the first two bytes following the address frame represent the address of the memory location involved in the transaction. Depending on the R/W bit, the successive bytes are filled by the master (if the R/W bit is set to 0, a write) or by the slave (if the R/W bit is 1, a read). The number of data frames is arbitrary, and most slave devices will auto-increment their internal register pointer, meaning that subsequent reads or writes will come from the next register in line. This mode is also called sequential or burst mode, and it is a way to speed up transfers.

Implementation in code: As mentioned above in this chapter, the Raspberry Pi receives the decision from the laptop and then sends that decision to the Arduino using the I2C protocol.
Figure7-10 I2C code on the Raspberry Pi.
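Since the figure itself is not reproduced here, the snippet below is a hedged sketch of the Pi side using the smbus library; the Arduino's slave address (0x08) and the meaning of the command byte are example assumptions, not the project's exact values.

from smbus import SMBus

ARDUINO_ADDR = 0x08
bus = SMBus(1)                      # I2C bus 1 (GPIO2 = SDA, GPIO3 = SCL on recent Pis)

def send_decision(decision):
    # decision: small integer command, e.g. 0 = stop, 1 = forward, 2 = left, 3 = right
    bus.write_byte(ARDUINO_ADDR, decision)

send_decision(1)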
Figure7.11 I2C code on the Arduino.
Chapter 8: Software for Connections
8.1 Introduction:
Self-driving is a heavily trending technology, already deployed in Tesla cars, and a system like ours is a good way to start learning about it. We use OpenCV and machine learning technology. The system has the Raspberry Pi as its core and provides functionality such as traffic-light detection, vehicle detection, pedestrian detection and road-sign detection to make the car autonomous. Every process is done on the Raspberry Pi with Python programming.

BLOCK DIAGRAM
Figure 8.1 Block diagram of the system.
    190 | Pa g e 8.2 Programs setup and analysis: 8.2.1 Raspberry-pi: The Raspberry Pi is a low cost, credit-card sized computer that plugs into a computer monitor or TV, and uses a standard keyboard and mouse. It is a capable little device that enables people of all ages to explore computing, and to learn how to program in languages like Scratch and Python. It is capable of doing everything you would expect a desktop computer to do, from browsing the internet and playing high-definition video, to making spreadsheets, word- processing, and playing games. Figure 8.2 (Raspberry pi). 8.2.2 component of raspberry pi: • 4 USB ports. • 40 GPIO pins. • Full HDMI port. • Ethernet port (10/100 base Ethernet socket). • Combined 3.5mm audio jack and composite video. • Camera interface (CSI). • Display interface (DSI). Micro-SD card slot. • Video Core IV 3D graphics core. 8.2.3 Hardware interfaces: Include a UART, an I2C bus, and a SPI bus with two chip selects, I2S audio, 3V3, 5V, and ground. The maximum number of GPIOs can theoretically be indefinitely expanded by making use of the I2C.
    191 | Pa g e 8.2.4 Required software: The Raspberry Pi runs off an operating system based on Linux called Raspbian. For this project, the most recent Raspbian Jessie Lite image was installed. The Raspberry Pi website has installation instructions but basically you just download the image and use a program to copy that image onto your SD card. Once the operating system image is copied onto the SD card, you should be able to insert the SD card into the Raspberry Pi, power it on, and get to the initial desktop. 8.2.4.1 raspberry-pi installation: We can remotely access the Raspberry Pi by more than one method, We can give it a fixed IP address and access it through SSH via putty software or VNC (virtual network connection) software. We can give Raspberry Pi a static IP by editing the file in cmdline in the main directory of the SD card on the board. And then give the controlling machine an IP within the same subnet to complete this process. Figure 8.3 Then use putty or VNC to access the pi and control it remotely from its terminal.
    192 | Pa g e Figure 8.4 VNC viewer Figure 8.5 putty viewer 8.2.4.2 Arduino: Arduino refers to an open-source electronics platform or board and the software used to program it. Arduino is designed to make electronics more accessible to artists, designers, hobbyists and anyone interested in creating interactive objects or environments. An Arduino is a microcontroller motherboard. A microcontroller is a simple computer that can run one program at a time, over and over again. It is very easy to use. You can get Arduino board with lots of different I/O and other interface configurations. The Arduino UNO runs comfortably on just a few
    193 | Pa g e milliamps. Arduino can be programmed in C and can help in the projects which directly interacts with sensors and motor drivers, we can see Arduino board in the following figure. Figure 8.6 Arduino board 8.3 Network: This part of project is responsible for connecting the different parts of the project together, i.e. connecting the main server on PC with the mobile application and also connecting the server with Raspberry Pi which controlling the robot. This task is accomplished by using WLAN technology via an access point which offers a Wi-Fi coverage and Wi-Fi adapters in the connected devices, we can conclude this work in this part to two devices access point and Wi-Fi dongle (adapter) in the Raspberry Pi. 8.4 Connection between Raspberry-pi and laptop: 8.4.1 Access Point: We used a TP-Link access point which provides 50 Mbps as a maximum data rate in range of about 50m. Configuration made for this point is simple, we just have to activate the DHCP server on the access point and determine the range of IPs we need. Then all the devices connected to the access point are in wireless LAN and can communicate with each other. Note, we have to give the main devices in the project a fixed ip to facilitate the communication between main devices like the server and the Raspberry Pi.
Figure 8.7 Access point

8.4.2 Socket server:
A socket is one of the most fundamental technologies of computer networking. Sockets allow applications to communicate using standard mechanisms built into network hardware and operating systems. Many of today's most popular software packages, including web browsers, instant-messaging applications, and peer-to-peer file-sharing systems, rely on sockets. A socket represents a single connection between exactly two pieces of software. More than two pieces of software can communicate in client/server or distributed systems (for example, many web browsers can simultaneously communicate with a single web server), but multiple sockets are required to do this. Socket-based control-system software usually runs on two separate computers on the network, but sockets can also be used to communicate locally (inter-process) on a single computer. Sockets are bidirectional, meaning that either side of the connection can both send and receive data. The application that initiates the communication is usually termed the client and the other application the server. Programmers access sockets through code libraries packaged with the operating system. The stream socket is the most commonly used type of socket: a "stream" requires the two communicating parties to first establish a socket connection, after which any data passed through that connection is guaranteed to arrive in the same order in which it was sent. A minimal sketch of such a stream-socket link is given below.
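The following is a minimal illustrative sketch of a TCP stream-socket link in Python, not the exact code used in the project; the port number 5005, the example IP address, and the simple one-message exchange are assumptions made for the illustration.

# server.py - runs on the PC; accepts one connection and acknowledges each message.
import socket

HOST = ""                # listen on all local interfaces
PORT = 5005              # example port

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)   # TCP stream socket
srv.bind((HOST, PORT))
srv.listen(1)
conn, addr = srv.accept()                                  # block until a client connects
print("connected:", addr)
while True:
    data = conn.recv(1024)                                 # read up to 1 KB
    if not data:                                           # empty bytes => client closed
        break
    conn.sendall(b"ACK " + data)                           # acknowledge the message
conn.close()
srv.close()

# client.py - runs on the Raspberry Pi; sends one command and prints the reply.
import socket

cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(("192.168.1.51", 5005))      # example server IP on the wireless LAN
cli.sendall(b"forward")
print(cli.recv(1024))                    # expected: b"ACK forward"
cli.close()

Because the link is a stream socket, the bytes of each message arrive in the order they were sent, which matches the guarantee described above.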
8.4.3 Configuring the Raspberry Pi on the SD card:
Now we are ready to configure the SD card so that, on boot, the Raspberry Pi connects to a Wi-Fi network. Once the Raspberry Pi is on the network, we can access its terminal via SSH. When the SD card is inserted into the laptop, a /boot folder shows up. We then create a file named wpa_supplicant.conf in the /boot folder. Information such as accepted networks and pre-configured network keys (for example a Wi-Fi password) is stored in the wpa_supplicant.conf text file. The file also configures wpa_supplicant, the software responsible for making login requests on the wireless network. So creating the wpa_supplicant.conf file configures how the Raspberry Pi connects to the internet. The contents of the wpa_supplicant.conf file should look something like this:

ctrl_interface=DIR=/var/run/wpa_supplicant GROUP=netdev
update_config=1
country=US

network={
    ssid="YOURSSID"
    psk="YOURPASSWORD"
    scan_ssid=1
}

The first line means "give the group 'netdev' permission to configure network interfaces"; any user who is part of the netdev group will be able to update the network configuration options. The ssid should be the name of your Wi-Fi network, and the psk should be the Wi-Fi password. After creating and updating the wpa_supplicant.conf file, add an empty file named ssh in /boot. This ssh file should not have any file extension. When the Raspberry Pi boots up, it looks for the ssh file; if it finds one, SSH is enabled. Having this file essentially says, "On boot, enable SSH," which allows access to the Raspberry Pi terminal over the local network.
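For example, if the SD card's boot partition shows up on the laptop at /media/boot (the mount point is only an assumption and differs between machines), the empty file can be created from a terminal with:

touch /media/boot/ssh

On Windows, creating a new empty file named ssh (with no extension) in the boot drive achieves the same thing.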
8.4.4 Connecting to the Raspberry Pi with SSH:
We should make sure that the laptop is on the same network as the Raspberry Pi (the network named in the wpa_supplicant.conf file). Next, we want to find the IP address of the Raspberry Pi on that network. Running arp -a lists the other devices on the network together with their IP and MAC addresses, and the Raspberry Pi should appear in this list with its IP address. We then connect to the Raspberry Pi by running ssh pi@[the Pi's IP address].

8.5 Connection between Arduino and Raspberry Pi:
Only a limited number of GPIO pins are available on the Raspberry Pi, so it is useful to expand the inputs and outputs by linking the Raspberry Pi with an Arduino. In this section we discuss how to connect the Pi to one or more Arduino boards. There is more than one way to connect them (for example over USB serial, the UART pins, or I2C); in this project we use I2C.

8.5.1 Setup:
Figure 8.8 Hardware schematics
First step: link the GND of the Raspberry Pi to the GND of the Arduino.
Second step: connect the SDA (I2C data) pin of the Pi (pin 2) to the Arduino SDA pin.
Third step: connect the SCL (I2C clock) pin of the Pi (pin 3) to the Arduino SCL pin.
Important note: the Raspberry Pi 4 (and earlier) runs at 3.3 V, while the Arduino Uno runs at 5 V, so we really have to pay attention when connecting pins between the two boards. Normally a level converter between 3.3 V and 5 V would be required, but in this specific case it can be avoided: if the Raspberry Pi is configured as the master and the Arduino as the slave on the I2C bus, the SDA and SCL pins can be connected directly. To keep it simple, in this scenario the Raspberry Pi imposes 3.3 V levels, which is not a problem for the Arduino pins. A short sketch of the Pi-side master is given below.
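As an illustration of the Pi acting as the I2C master, here is a minimal Python sketch using the smbus library; the slave address 0x08 and the one-byte command/reply exchange are assumptions made for the example, not the project's actual protocol (on the Arduino side, the Wire library would be configured as an I2C slave at the same address).

# i2c_master.py - minimal Pi-side I2C master (slave address 0x08 is an example).
import time
from smbus import SMBus

ARDUINO_ADDR = 0x08          # I2C address configured on the Arduino slave
bus = SMBus(1)               # bus 1 corresponds to pins 2 (SDA) and 3 (SCL)

while True:
    bus.write_byte(ARDUINO_ADDR, 0x01)      # send a one-byte command to the Arduino
    reply = bus.read_byte(ARDUINO_ADDR)     # read a one-byte reply (e.g. a sensor value)
    print("Arduino replied:", reply)
    time.sleep(1)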
References:

Chapter 1:
1. Claire Perry, "The Pathway to Driverless Cars."
2. wordpress.com
3. theguardian.com

Chapter 2:
1. https://www.raspberrypi.org/help/what-is-a-raspberry-pi/
2. https://www.arduino.cc/en/guide/introduction

Chapter 3:
1. D. Cheng, Y. Gong, S. Zhou, J. Wang, N. Zheng, "Person re-identification by multi-channel parts-based CNN with improved triplet loss function," Proc. of IEEE Conference on Computer Vision and Pattern Recognition (27-30 June 2016), 10.1109/CVPR.2016.149.
2. P. Viola, M. Jones, "Rapid object detection using a boosted cascade of simple features," Proc. of IEEE Conference on Computer Vision and Pattern Recognition (8-14 Dec. 2001), 10.1109/CVPR.2001.990517.
3. N. Dalal, B. Triggs, "Histograms of oriented gradients for human detection," Proc. of IEEE Conference on Computer Vision and Pattern Recognition (20-25 June 2005), 10.1109/CVPR.2005.177.
4. G. Csurka, C. Dance, L. Fan, J. Willamowski, C. Bray, "Visual categorization with bags of keypoints," Proc. of ECCV Workshop on Statistical Learning in Computer Vision (2004).
5. D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vis., 60 (2004), pp. 91-110.
6. Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, 86 (1998), pp. 2278-2324.
7. S. Ren, K. He, R. Girshick, J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, 39 (6) (1 June 2017), pp. 1137-1149, 10.1109/TPAMI.2016.2577031.
8. J. Redmon, S. Divvala, R. Girshick, A. Farhadi, "You only look once: unified, real-time object detection," Proc. of IEEE Conference on Computer Vision and Pattern Recognition (27-30 June 2016), 10.1109/CVPR.2016.91.
9. J. Long, E. Shelhamer, T. Darrell, "Fully convolutional networks for semantic segmentation," Proc. of IEEE Conference on Computer Vision and Pattern Recognition (7-12 June 2015), 10.1109/CVPR.2015.7298965.
10. M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, B. Schiele, "The Cityscapes dataset for semantic urban scene understanding," Proc. of IEEE Conference on Computer Vision and Pattern Recognition (27-30 June 2016), 10.1109/CVPR.2016.350.
11. A. Moujahid, M. E. Tantaoui, M. D. Hina, A. Soukane, A. Ortalda, A. ElKhadimi, A. Ramdane-Cherif, "Machine learning techniques in ADAS: a review," Proc. of 2018 International Conference on Advances in Computing and Communication Engineering (22-23 June 2018), 10.1109/ICACCE.2018.8441758.
12. J. Levinson, J. Askeland, J. Becker, J. Dolson, D. Held, S. Kammel, J. Z. Kolter, D. Langer, O. Pink, V. R. Pratt, M. Sokolsky, G. Stanek, D. M. Stavens, A. Teichman, M. Werling, S. Thrun, "Towards fully autonomous driving: systems and algorithms," Intelligent Vehicles Symposium (2011), pp. 163-168.

Chapter 4:
1. https://www.dlology.com/blog/how-to-train-an-object-detection-model-easy-for-free/
2. https://arxiv.org/abs/1806.08342
3. https://machinelearningmastery.com/object-recognition-with-deep-learning/
4. https://arxiv.org/abs/1512.02325
5. https://arxiv.org/abs/1506.02640
6. https://medium.com/@prvnk10/object-detection-rcnn-4d9d7ad55067 and https://arxiv.org/pdf/1807.05511.pdf
7. https://arxiv.org/abs/1504.08083
8. https://arxiv.org/abs/1506.01497

Chapter 5:
1. Mohamed Elgendy, Deep Learning for Vision Systems.
2. https://manalelaidouni.github.io/manalelaidouni.github.io/Evaluating-Object-Detection-Models-Guide-to-Performance-Metrics.html
3. Mohamed Elgendy, Deep Learning for Vision Systems.

Chapter 6:
[1] David Tian, "Deep Pi Car" [online]. Available: https://towardsdatascience.com/deeppicar-part-1-102e03c83f2c
[2] Tony Fan, Gene Yeau-Jian Liao, Chih-Ping Yeh, Chung-Tse Michael Wu, Jimmy Ching-Ming Chen, "Lane Keeping System by Visual Technology" [online]. Available: https://peer.asee.org/lane-keeping-system-by-visual-technology.pdf
[3] Guido Albertengo, "Adaptive Lane Keeping Assistance System design based on driver's behavior" [online]. Available: https://webthesis.biblio.polito.it/11978/1/tesi.pdf
[4] David Tian, "End-to-End Lane Navigation Model via Nvidia Model" [online]. Available: https://colab.research.google.com/drive/14WzCLYqwjiwuiwap6aXf9imhyctjRMQp
Chapter 7:
[1] Carmine Noviello, Mastering STM32: A step-by-step guide to the most complete ARM Cortex-M platform, using a free and powerful development environment based on Eclipse and GCC.
[2] "TCP Connection Establish and Terminate" [online]. Available: https://www.vskills.in/certification/tutorial/information-technology/basic-network-support-professional/tcp-connection-establish-and-terminate/
[3] anonymous007, ashushrma378, abhishek_paul, "Differences between TCP and UDP" [online]. Available: https://www.geeksforgeeks.org/differences-between-tcp-and-udp/