5. Main scientific fields
Environment perception, Explainability, Hybrid reasoning, Distributed reasoning
Data/graph mining, knowledge engineering, machine learning, computer vision, robotics, multi-agent systems, logic programming, meta-heuristic or bio-inspired optimization, UX, virtual reality, natural language processing (NLP), software engineering, simulation, data visualization
6. Cross-cutting research axes
01 Veracity - Value: Perception and qualification of the veracity and value of knowledge in massive intelligent environments
02 Prescriptive simulation: Recommendation and prescriptive simulation for distributed or complex systems
03 Distributed reasoning: Distributed behaviors and reasoning in complex systems (e.g. cyber-physical systems)
10. An autonomous driving system consists of three main parts:
• Perception
• Planning
• Control
and each part includes different tasks that the system is expected to fully understand.
AUTONOMOUS VEHICLES
11. Sensors and Perception Models
Perception relies heavily on an extensive infrastructure of active and passive sensors.
PERCEPTION
13. PLANNING
Future trajectories are generated based on the results coming from the perception part.
An appropriate ego trajectory is chosen, and the driving behavior is created and planned.
14. CONTROL
• The control part is deeply coupled with the perception and planning parts.
• It guarantees that the vehicle follows the course set by the planning part and
controls the vehicle’s hardware (acceleration, braking, and steering using drivers
and actuators) for safe driving.
16. ENVIRONMENT PERCEPTION
The principal goal is:
• Designing computer systems that possess the ability to capture, understand, and interpret important visual information contained within images, videos, and other visual data.
• Then translate this data, using contextual
knowledge provided by human beings, into
insights used to drive decision making.
Computer Vision
Technology
17. ENVIRONMENT PERCEPTION
Training computer vision systems involves a learning process that goes all the way down to the smallest granular unit of visual data – the PIXEL.
• The system records and evaluates digital images on the basis of their raw data alone, where minute differences in pixel density, color saturation, and levels of lightness and darkness determine the structure, and therefore the identity, of the larger object.
Computer Vision
Technology
18. ENVIRONMENT PERCEPTION
Computer vision is one of the most remarkable technologies to come out of the deep learning and artificial intelligence world. The advancements that deep learning has contributed to computer vision have truly set the field apart.
Computer Vision
Technology
19. ENVIRONMENT PERCEPTION
Combining sensor & vision technologies leads to
Environment Perception → Scene Understanding
Scene Understanding: Analysis of a scene, considering the semantic and geometric
context of its contents and the internal relations between them.
Humans can classify, locate, segment, and identify objects and features at a glance.
20. ENVIRONMENT PERCEPTION
Humans can classify (object type and moving/static status), specify (spatial position), identify (motion, position, direction, and velocity), and track these objects in the driving scene.
In addition, humans focus their visual attention on important or purposeful elements and ignore unnecessary ones in their field of view.
Conferring these phenomenal abilities on machine-learning systems has been a long-standing goal in the field of computer vision.
21. ENVIRONMENT PERCEPTION
Numerous approaches and methods (classic algorithms and deep learning) have been
proposed to improve scene understanding and extract semantic information about the driving
environment from images and videos.
22. ENVIRONMENT PERCEPTION – Visual Attention
The ability to sense and perceive the driving environment is a key technology for ADAS and autonomous driving.
Humans' visual attention helps in:
• Predicting or locating potential risks
• Understanding the driving environment
• Quickly locating objects of interest
23. • “Visual attention” is the selection of relevant information and the filtering out of irrelevant information from cluttered visual scenes.
• Attention can operate on regions of space, on particular features of an object, or on entire objects.
• Attention can also be directed either overtly or covertly.
ENVIRONMENT PERCEPTION – Visual Attention
24. How should a machine learning system or an autonomous vehicle acquire this ability to direct attention for safe driving?
• To tackle this:
– Incorporate a saliency mechanism (salient object detection model) as the visual attention model.
– Saliency refers to unique features (pixels, resolution, etc.) of an image in the context of visual processing. These unique features depict the visually alluring locations in an image; a saliency map is a topographical representation of them (see the sketch below).
– Salient object detection models mimic the behaviour of human beings and capture the most salient region/object from images or scenes.
ENVIRONMENT PERCEPTION – Visual Attention
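As a quick illustration of what a saliency map is (not the model proposed later in this deck), the sketch below computes a bottom-up saliency map with OpenCV's spectral-residual detector. It assumes the opencv-contrib-python package is installed, and the image file name is a placeholder.

```python
import cv2

# Minimal sketch: spectral-residual saliency map of a driving scene.
# Requires opencv-contrib-python (the cv2.saliency module).
image = cv2.imread("driving_scene.jpg")                 # placeholder input image
saliency = cv2.saliency.StaticSaliencySpectralResidual_create()
ok, saliency_map = saliency.computeSaliency(image)      # float map in [0, 1]
if ok:
    gray = (saliency_map * 255).astype("uint8")
    # Otsu threshold keeps only the most salient regions
    _, salient_regions = cv2.threshold(gray, 0, 255,
                                       cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    cv2.imwrite("saliency_map.png", gray)
    cv2.imwrite("salient_regions.png", salient_regions)
```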
25. The concept of the saliency map was first proposed by Christof Koch and Shimon Ullman in 1985.
Feature Integration Theory (Treisman and Gelade, 1980) defines human visual search strategies:
“The salient areas in the visual scene are identified by combining or relating visual feature information such as color, orientation, spatial frequency, brightness, and direction of movement, which direct human attention.”
The visual attention methods using saliency are divided into two categories:
– Bottom-up: biologically inspired methods; image color and intensity are common examples (see the sketch below).
– Top-down: true computational methods; prior knowledge, memories, and goals are common factors.
ENVIRONMENT PERCEPTION – Visual Attention
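To make the bottom-up category concrete, here is a minimal Feature-Integration-style sketch that combines intensity and color-contrast maps into a single saliency map. The channel choices and Gaussian scales are illustrative assumptions, not any of the algorithms reviewed on the next slide.

```python
import cv2
import numpy as np

def bottom_up_saliency(bgr_image):
    """Very simplified bottom-up saliency: combine intensity and
    color-opponency contrast maps (loosely inspired by Feature Integration Theory)."""
    img = bgr_image.astype(np.float32) / 255.0
    b, g, r = cv2.split(img)
    intensity = (r + g + b) / 3.0
    rg = np.abs(r - g)                       # red-green opponency
    by = np.abs(b - (r + g) / 2.0)           # blue-yellow opponency

    def contrast(channel):
        # Center-surround contrast approximated by a difference of Gaussians
        center = cv2.GaussianBlur(channel, (0, 0), sigmaX=2)
        surround = cv2.GaussianBlur(channel, (0, 0), sigmaX=16)
        return np.abs(center - surround)

    maps = [contrast(intensity), contrast(rg), contrast(by)]
    # Normalize each conspicuity map to [0, 1] and average them
    maps = [(m - m.min()) / (m.max() - m.min() + 1e-8) for m in maps]
    return np.mean(maps, axis=0)
```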
26. • Saliency Algorithms
We review and test some well-known saliency algorithms, both classic and deep-learning based.
Fig: Different saliency algorithms tested on driving scene images
ENVIRONMENT PERCEPTION – Visual Attention
27. Existing works
– Human eye tracking in the process; adding object detection
– In-lab simulation; real driving
– Berkeley DeepDrive Attention, Dr(eye)VE, SAGE
ENVIRONMENT PERCEPTION – Visual Attention
28. The research shows that these models have contributed a lot to the deployment of attention and have made significant progress.
Some limitations and drawbacks:
– Complexity of capturing the true driver attention
– Fixations are subject to various characteristics of the driver: driving experience and habits, preferences and intentions, capabilities, culture, age, gender, etc.
– An eye tracker records a single gaze location at each moment, while the driver may be looking at multiple important objects in the scene.
– The computational cost of developing a dataset with saliency maps has been relatively high.
ENVIRONMENT PERCEPTION – Visual Attention
29. Our Aim: Develop a visual attention framework capable of predicting the important
objects (road context) simultaneously in a driving scene.
ENVIRONMENT PERCEPTION – Visual Attention
We came up with a new idea by shifting the problem from PREDICTION (what/where the driver is looking, or what most drivers would look at) to SELECTION (what the driver should/must look at while driving).
30. ENVIRONMENT PERCEPTION – Visual Attention
Saliency Heat-Map as Visual Attention for Autonomous Driving
using Generative Adversarial Network (GAN)
31. Our Approach
Generative Adversarial Network (GAN), introduced by Ian Goodfellow in 2014
– A GAN is a type of neural network architecture for generative modeling. It involves using a model to generate new examples that plausibly come from an existing distribution of samples, such as generating new photographs that are similar to, but specifically different from, a dataset of existing photographs.
Applications, to name a few:
Image-to-image translation, face frontal view generation, photos to emojis, photo inpainting
ENVIRONMENT PERCEPTION – Visual Attention
32. Framework
• We borrow the pix2pix GAN architecture, which is well suited to the image-to-image translation task and can be conditioned on the input image to generate the corresponding output image (its objective is sketched below).
ENVIRONMENT PERCEPTION – Visual Attention
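For readers unfamiliar with pix2pix, the sketch below shows the standard conditional-GAN objective it optimizes (an adversarial loss plus a weighted L1 term). This is a generic PyTorch illustration, not the authors' training code; G, D, and lambda_l1 are assumed placeholders for a U-Net generator, a PatchGAN discriminator, and the L1 weight.

```python
import torch
import torch.nn.functional as F

def pix2pix_losses(G, D, scene, target_heatmap, lambda_l1=100.0):
    """One training step's losses for a pix2pix-style conditional GAN.
    `scene` is the input driving image and `target_heatmap` the ground-truth
    saliency heat-map; both are (N, C, H, W) tensors."""
    fake = G(scene)

    # Discriminator: real (scene, target) pairs vs. fake (scene, G(scene)) pairs
    d_real = D(torch.cat([scene, target_heatmap], dim=1))
    d_fake = D(torch.cat([scene, fake.detach()], dim=1))
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) +
              F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))) / 2

    # Generator: fool the discriminator and stay close to the target in L1
    d_fake_for_g = D(torch.cat([scene, fake], dim=1))
    g_adv = F.binary_cross_entropy_with_logits(d_fake_for_g,
                                               torch.ones_like(d_fake_for_g))
    g_l1 = F.l1_loss(fake, target_heatmap)
    g_loss = g_adv + lambda_l1 * g_l1
    return d_loss, g_loss
```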
33. Data Collection
– The data used nowadays for developing saliency models are collected from human eye fixations or gaze. These data are turned into saliency maps (gray-scale or heat-map images) using a Gaussian probability function, which expresses the probability that each image pixel captures human attention (see the sketch below).
Examples of fixation selection – prediction points
(MIT dataset)
ENVIRONMENT PERCEPTION – Visual Attention
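The slide describes how fixation points are turned into a gray-scale saliency map with a Gaussian; a minimal sketch of that step is shown below. The smoothing scale sigma is an assumed parameter, not a value from the datasets mentioned.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def fixations_to_saliency_map(fixations, height, width, sigma=25.0):
    """Turn recorded eye-fixation points into a gray-scale saliency map by
    placing a Gaussian at each fixation. `fixations` is a list of (row, col)
    pixel coordinates; sigma (in pixels) is an assumed smoothing scale."""
    fixation_map = np.zeros((height, width), dtype=np.float32)
    for r, c in fixations:
        fixation_map[int(r), int(c)] += 1.0
    saliency = gaussian_filter(fixation_map, sigma=sigma)
    if saliency.max() > 0:
        saliency /= saliency.max()          # normalize to [0, 1]
    return saliency
```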
34. Data Collection
– We propose a different approach for data collection by taking advantage of
semantic label information from driving scenes datasets.
ENVIRONMENT PERCEPTION – Visual Attention
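The exact procedure is not detailed on this slide; as a hypothetical illustration of how semantic labels could yield saliency-style ground truth, the sketch below keeps only road-user classes from a per-pixel label map and blurs the binary mask into a heat-map. The class IDs are placeholders, not the actual dataset's IDs.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Hypothetical road-user class IDs in a semantic label map (placeholder values)
ROAD_USER_IDS = {11, 12, 13, 14}   # e.g. pedestrian, rider, car, truck

def labels_to_saliency_map(label_map, sigma=15.0):
    """Build a saliency-style heat-map from a per-pixel semantic label image
    by selecting road-user classes and smoothing the resulting binary mask."""
    mask = np.isin(label_map, list(ROAD_USER_IDS)).astype(np.float32)
    heatmap = gaussian_filter(mask, sigma=sigma)
    if heatmap.max() > 0:
        heatmap /= heatmap.max()            # normalize to [0, 1]
    return heatmap
```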
43. Limitations and Future Work:
• Not all of these objects demand attention all the time
• False regions are detected as salient
• False detections are triggered by direct sunlight or reflections of light
Incorporating depth, location, and motion information could help assign priorities among the detected objects.
ENVIRONMENT PERCEPTION – Visual Attention
44. ENVIRONMENT PERCEPTION
• Based on our research into visual attention, we have concluded that in addition to visual
attention, we require the incorporation of depth, location, and motion information, which
could aid in the assignment of priorities among the detected objects.
• It is critical to have a thorough understanding of each of the surrounding elements
(including the motion and geometry information).
• Road-users are critical for perception, planning and decision-making for both self-driving
cars and driver assistance systems.
• Some road-users, however, are more important for decision making than others because of their respective intentions, the ego-vehicle's intention, and their effects on each other.
46. Aim: To propose a framework that can extract motion and geometry related information (i.e.,
object class, status, position, movement, speed and distance information) to identify object
(road-user) characteristics in the urban driving scenario.
Using these and other semantic cues, we may be able to determine the most important (prior)
objects in a given scene while driving.
ENVIRONMENT PERCEPTION – Object Identification
47. Motion and Geometry-related Information Fusion through
a Framework for Object Identification from a moving
camera in Urban Driving Scenarios
ENVIRONMENT PERCEPTION – Object Identification
48. We have reviewed the contributions of works most related to ours,
– i.e., scene understanding for driving by combining motion and geometry-related information.
SMS-Net
ENVIRONMENT PERCEPTION – Object Identification
50. • The framework's components include:
– Disparity Estimation through the Semi-Global Matching
– Motion Estimation by Image Registration and Optical Flow
– Moving Object Detection (MOD)
– Information Extraction and Fusion
DISPARITY ESTIMATION
– We adopt the well-known Semi-Global Matching (SGM) algorithm (a sketch with one common implementation follows).
ENVIRONMENT PERCEPTION – Object Identification
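One widely available SGM-style implementation is OpenCV's StereoSGBM; the sketch below shows how a disparity map could be computed with it. The parameter values and file names are illustrative assumptions, not the framework's actual settings.

```python
import cv2

# Minimal sketch of semi-global (block) matching with OpenCV's StereoSGBM.
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)     # placeholder stereo pair
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

sgbm = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=128,          # must be a multiple of 16
    blockSize=5,
    P1=8 * 5 * 5,                # smoothness penalty for small disparity changes
    P2=32 * 5 * 5,               # smoothness penalty for large disparity changes
    uniquenessRatio=10,
    speckleWindowSize=100,
    speckleRange=2,
)
# compute() returns fixed-point disparities scaled by 16
disparity = sgbm.compute(left, right).astype("float32") / 16.0
```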
51. ENVIRONMENT PERCEPTION – Object Identification
MOTION ESTIMATION
– One of the most widely used methods is optical flow (OF) estimation.
– OF provides satisfactory results when the camera is fixed or carefully displaced.
– However, the optical flow from image sequences acquired by a moving camera encodes two
pieces of information.
• The motion of the surrounding objects
• The ego vehicle's motion
This results in significant motion vectors associated with static objects, leading to static objects being incorrectly perceived as moving objects.
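As a concrete example of dense optical flow estimation (here with OpenCV's Farnebäck method, one common choice), the sketch below computes per-pixel motion between two frames; with a moving camera, the resulting vectors mix object motion and ego-motion, as described above. File names and parameter values are placeholders.

```python
import cv2

# Minimal sketch of dense optical flow between two consecutive frames.
prev = cv2.imread("frame_t0.png", cv2.IMREAD_GRAYSCALE)   # placeholder file names
curr = cv2.imread("frame_t1.png", cv2.IMREAD_GRAYSCALE)

flow = cv2.calcOpticalFlowFarneback(
    prev, curr, None,
    pyr_scale=0.5, levels=3, winsize=15,
    iterations=3, poly_n=5, poly_sigma=1.2, flags=0)

# flow[..., 0] and flow[..., 1] hold per-pixel horizontal and vertical motion.
# With a moving camera, static background also shows large vectors (ego-motion).
magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
```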
52. MOTION ESTIMATION
Approach to motion compensation:
– Inspired by recent trends in aerial and medical imaging, we suggest using a procedure called image registration together with the optical flow method to overcome ego-motion and obtain a true estimate of the motion information (sketched below).
IMAGE REGISTRATION
Fig: Optical flow with and without image registration
ENVIRONMENT PERCEPTION – Object Identification
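A minimal sketch of this idea, assuming OpenCV's ECC registration as one possible registration method (the slide does not specify which algorithm is used): the previous frame is warped onto the current one before optical flow is computed, so most of the background motion induced by the ego-vehicle cancels out.

```python
import cv2
import numpy as np

prev = cv2.imread("frame_t0.png", cv2.IMREAD_GRAYSCALE)   # placeholder file names
curr = cv2.imread("frame_t1.png", cv2.IMREAD_GRAYSCALE)

# Estimate a homography that registers the previous frame to the current one
warp = np.eye(3, 3, dtype=np.float32)
criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 200, 1e-6)
_, warp = cv2.findTransformECC(curr, prev, warp,
                               cv2.MOTION_HOMOGRAPHY, criteria, None, 5)
prev_registered = cv2.warpPerspective(
    prev, warp, (curr.shape[1], curr.shape[0]),
    flags=cv2.INTER_LINEAR + cv2.WARP_INVERSE_MAP)

# Residual flow now mostly reflects independently moving objects
flow = cv2.calcOpticalFlowFarneback(prev_registered, curr, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)
```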
54. MOVING OBJECT DETECTION (MOD)
We followed a straightforward approach for moving object detection:
• Detect the objects of interest first
• Then identify the moving ones among the objects detected in two consecutive frames
OBJECT DETECTION: classification, localization, and segmentation of all the objects in the scene (current and previous frames).
IDENTIFY MOVING OBJECTS: recognition of the moving objects among the detected objects in the scene.
ENVIRONMENT PERCEPTION – Object Identification
55. MOVING OBJECT DETECTION (MOD)
Fig: Proposed MOD architecture – a segmentation network produces the segmentation mask, bounding boxes, and class information; an encoder-decoder network produces the moving-objects mask; the image frame is then superimposed with the corresponding moving object mask, and the Bbox and class information are integrated.
The proposed architecture incorporates object segmentation and binary pixel classification based on temporal information.
The object segmentation part is a pre-trained CenterMask-Lite segmentation network inference, which gives the bounding boxes, category probabilities, and segmentation mask for each object of interest.
The temporal processing part is an encoder-decoder network (EDNet) that identifies the moving objects using the segmented masks of consecutive frames (a minimal sketch follows).
ENVIRONMENT PERCEPTION – Object Identification
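The exact EDNet architecture is not given here; the PyTorch sketch below is a minimal encoder-decoder of the same flavor, taking the segmentation masks of two consecutive frames as a 2-channel input and predicting per-pixel moving/static logits. Layer counts and widths are assumptions.

```python
import torch
import torch.nn as nn

class EDNetSketch(nn.Module):
    """Minimal encoder-decoder sketch (not the authors' exact EDNet):
    input is the segmentation masks of two consecutive frames stacked as a
    2-channel tensor; output is a 1-channel logit map for moving vs. static."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(2, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),   # moving/static logits
        )

    def forward(self, prev_mask, curr_mask):
        x = torch.cat([prev_mask, curr_mask], dim=1)   # (N, 2, H, W)
        return self.decoder(self.encoder(x))

# Usage: per-pixel moving-object probability from two consecutive masks
# prob = torch.sigmoid(EDNetSketch()(prev_mask, curr_mask))
```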
56. MOVING OBJECT DETECTION (MOD)
The proposed model labels the moving objects (in white) and the static objects/background (in black) from pairs of sequential images.
ENVIRONMENT PERCEPTION – Object Identification
57. PROPOSED MOD DATASET
We developed a large dataset for moving object detection (from the KITTI and EU Long-term datasets) covering all the dynamic objects, such as all types of vehicles (bus, train, truck), pedestrians, cyclists, and motorcyclists.
ENVIRONMENT PERCEPTION – Object Identification
61. FUSION OF MOD, FCOF (FULLY COMPENSATED OPTICAL FLOW), AND DISPARITY
The results of each stage of the proposed framework, such as disparity, moving object
detection, and motion estimation, are fused to extract information such as object ID, static
or moving, distance, direction, position, and velocity.
– Labelling and Scaling
ENVIRONMENT PERCEPTION – Object Identification
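To illustrate the kind of information that can be extracted at this fusion stage, the sketch below derives an object's distance from disparity via the standard stereo relation Z = f·B / d, and a rough speed estimate from its optical-flow displacement. The focal length, baseline, and frame rate are assumed calibration values; this is not the authors' exact fusion procedure.

```python
import numpy as np

def object_distance_m(disparity_map, object_mask, focal_px, baseline_m):
    """Median distance (meters) of one detected object from the ego camera,
    using the standard stereo relation Z = f * B / d."""
    d = disparity_map[object_mask > 0]
    d = d[d > 0]                                   # keep valid disparities only
    return focal_px * baseline_m / np.median(d)

def object_speed_mps(flow, object_mask, depth_m, focal_px, fps):
    """Rough speed estimate: convert the object's median pixel displacement
    per frame into meters per second at its estimated depth (pinhole model)."""
    u = np.median(flow[..., 0][object_mask > 0])   # horizontal flow (px/frame)
    v = np.median(flow[..., 1][object_mask > 0])   # vertical flow (px/frame)
    px_per_frame = np.hypot(u, v)
    meters_per_px = depth_m / focal_px
    return px_per_frame * meters_per_px * fps
```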
70. CONCLUSION
• Proposal of a new framework for object identification (FOI) from a moving camera in complex urban
driving environment.
• The framework relies only upon the images captured from the stereo camera.
• It extracts the information related to the object, including class, status (moving/static), direction, velocity,
position, and distance from the ego vehicle.
• Other contributions relate to Moving Object Detection (MOD), as it is considered the critical task:
– A new dataset for moving object detection is built from the existing driving datasets KITTI and EU Long-term, covering dynamic objects like all types of vehicles, pedestrians, bicyclists, and motorcyclists.
– Propose to use image registration as a tool for ego-motion compensation for urban driving scenarios.
– A new model for moving object detection is developed by integrating an encoder-decoder network
with a segmentation model.
ENVIRONMENT PERCEPTION – Object Identification
71. LIMITATIONS: Through the experiments, failure cases were found when:
– Objects overlap or are very close to each other
– Object reflections occur
– An object's motion speed is the same as the ego-vehicle's speed
ENVIRONMENT PERCEPTION – Object Identification
72. FUTURE WORK
– More attention should be paid to improving the overall speed of the proposed framework.
– Build a new dataset of Fully Compensated Optical Flow color maps.
– Future studies will be devoted to developing a framework that can prioritize objects in a driving scene according to the situation.
– We will use the perception data (object identification) to plan secure and smooth trajectories for the objects of interest, using their dynamic limits, navigation comfort and safety, and the traffic rules.
– Lane detection, traffic sign detection, and live traffic light detection could be integrated into the framework, which would help the system in various tasks such as lane changes, obstacle avoidance, and combined maneuvers in critical driving situations.
ENVIRONMENT PERCEPTION – Object Identification
73. Road-user importance estimation during a left turn maneuver
• In real-world driving, there can be a variety of items in the immediate neighborhood of the
ego-vehicle at any given time.
• Some items have a direct impact on the behavior of the ego-vehicle (e.g., brake, steer),
while others have the potential to be a danger, and still others do not pose a danger
currently or soon.
74. • The ability to determine how important or relevant a given object is to the ego-vehicle's decision making is critical for both driver assistance systems and self-driving vehicles.
Establishing trust with human drivers or passengers
Demonstrating transparency to law enforcement
Promoting a human-centric thought process etc
Road-user importance estimation during a left turn maneuver