HiPEAC 2019 Workshop - Real-Time Modelling Visual Scenes with Biological Inspiration

Real-Time Modelling Visual
Scenes with Biological
Inspiration
Kofi Appiah
Sheffield Hallam University

AI now and before
• Computer Vision and natural language processing have improved
significantly over the past 10 years.
• Image recognition and classification systems
• Apple photo organiser, Facebook face recognition.
• Robot use in warehouses
• Amazon warehouse robots (https://www.youtube.com/watch?v=4sEVX4mPuto)
• Medical image analysis for healthcare
• non-invasive diagnosis
• Agriculture, sport, manufacturing, autonomous car technology.
• Crop yield, goal-line technology, defective products, people detection.
Human-level face recognition: Taigman et al. CVPR2014

Why AI acceleration
• Better algorithms that learn from examples not predefined rules
• Deep learning
• Neural networks
• Machine perception
• Availability of data – Big Data
• Internet images, YouTube videos, Facebook images
• High Performance Computing
• Field Programmable Gate Arrays (FPGAs)
• Graphics Processing Units (GPUs)
IEEE Spectrum

Key Achievements
• Visual recognition with high accuracies.
• 3D reconstruction of an environment
Mask R-CNN He et al. ICCV2017
Litjens et al. 2017
Johnson et al. CVPR2015
Driverless cars - Mathworks
Faster R-CNN TPAMI 2017

Where things fall apart
• March 18, 2018: Uber’s autonomous car hit and killed a 49-year-old
woman as she was walking her bike across the street.
• https://www.youtube.com/watch?v=7iTshCm41Ko
• Novel and imperfect system
• March 23, 2018: a Tesla on Autopilot slammed into a concrete barrier, killing the driver.
• A security robot attacking a child in a shopping area, July 2016.
• Robots failing to open different doors: which training mode?
• Reinforcement learning
• Supervised or unsupervised?

Why things go wrong
• For autonomous cars, the state of the art is good at providing
bounding boxes of objects in the scene.
• What is missing is an interpretation of the scene.
• No contextual reasoning.
• Robot navigation
• Decision making might be optimal but not feasible or safe.
• Modelling in a crowded scene to infer interaction
• Modelling very unusual situations with little or no data
• Things that humans are capable of, e.g. dealing with complex scenes
Fei-Fei Li

Unsupervised Background Subtraction
• Image segmentation separates moving
objects from the background.
• Background subtraction is a practical
approach when the image sensor is
stationary.
• Background Modelling techniques
- Unimodal
- Multimodal

W4 and Grimson’s Algorithm – 2000s
• Requires manual initialization of
the Maximum (M), Minimum (m)
& inter-frame difference (D)
• Pixel x of image I is foreground if
|m(x) - I_t(x)| > D(x) or
|M(x) - I_t(x)| > D(x)
• Detection, Motion & change
history maps used for outdoor
scene.
• Use of fixed-point update values.
• Bimodal can’t model problems like
moving foliage and lighting
changes.
• Mixture of Gaussians with
associated weights to model each
pixel.
• Parameters are updated as follows:
• The first B distributions, ordered
by weight, represent the
background
• Robust in modeling multimodal
background.
• Suffers from blending effect and
uses floating point in all updates
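The W4 foreground test on this slide, |m(x) - I_t(x)| > D(x) or |M(x) - I_t(x)| > D(x), can be sketched in a few lines. This is a minimal illustration only: the function name `w4_foreground` and the nested-list image layout are assumptions made for the example, and the per-pixel maps m, M and D are taken as already initialized.

```python
def w4_foreground(frame, m, M, D):
    """Return a boolean mask that is True where a pixel is foreground.

    frame, m, M and D are equally sized 2D lists of intensities:
    m = per-pixel minimum, M = per-pixel maximum,
    D = per-pixel maximum inter-frame difference.
    (Hypothetical helper, not taken from the W4 authors' code.)
    """
    rows, cols = len(frame), len(frame[0])
    mask = []
    for r in range(rows):
        mask_row = []
        for c in range(cols):
            value = frame[r][c]
            # Foreground test exactly as stated on the slide.
            is_fg = (abs(m[r][c] - value) > D[r][c]
                     or abs(M[r][c] - value) > D[r][c])
            mask_row.append(is_fg)
        mask.append(mask_row)
    return mask


# Toy 2x2 example with a background band of 90..110 and D = 20.
m = [[90, 90], [90, 90]]
M = [[110, 110], [110, 110]]
D = [[20, 20], [20, 20]]
frame = [[100, 200], [125, 95]]
mask = w4_foreground(frame, m, M, D)
```

Because only one minimum and one maximum are kept per pixel, a background that alternates between several appearances (moving foliage, for instance) cannot be represented, which is the limitation the slide attributes to the bimodal model.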

Efficient Hardware Implementation
• Maintains K clusters, each with weight w_k, central value c_k
and implied global range [c_k - 15, c_k + 15]
• Weights and central values of all clusters are initialized to 0,
and updated as follows:

\[ w_{k,t} = \begin{cases} \frac{63}{64} w_{k,t-1} + \frac{1}{64} & \text{for the matching cluster} \\ \frac{63}{64} w_{k,t-1} & \text{otherwise} \end{cases} \]

\[ c_{k,t,i,j} = \begin{cases} \frac{7}{8} c_{k,t-1,i,j} + \frac{1}{8} X_{i,j} & \text{matching cluster} \\ c_{k,t-1,i,j} & \text{otherwise} \end{cases} \]

• Uses both pixel and frame-level processing
• The first B distributions, ordered by weight, represent the
background

\[ B = \arg\min_b \left( \sum_{k=1}^{b} w_k > T \right) \]

Appiah et al., FPT 2005
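The cluster update on this slide (weights decayed by 63/64 with 1/64 added to the matching cluster, central values blended as 7/8 old plus 1/8 new pixel, background taken as the first B clusters by weight) can be sketched for a single greyscale pixel as below. Several details are assumptions made for illustration and are not stated on the slide: K = 3, T = 0.5, zero-weight clusters never match, an unmatched pixel recycles the lowest-weight cluster, and floating point is used where the hardware would use fixed-point arithmetic. The class name `PixelModel` is hypothetical.

```python
RANGE = 15  # half-width of the implied match range [c - 15, c + 15]


class PixelModel:
    """Cluster-based background model for one greyscale pixel (sketch)."""

    def __init__(self, k=3, threshold=0.5):
        self.k = k                    # number of clusters (assumed value)
        self.threshold = threshold    # T in the slide's formula (assumed value)
        self.weights = [0.0] * k      # initialized to 0, as on the slide
        self.centres = [0.0] * k

    def background_clusters(self):
        """Indices of the first B clusters, by weight, whose sum exceeds T."""
        order = sorted(range(self.k), key=lambda i: self.weights[i], reverse=True)
        chosen, total = [], 0.0
        for i in order:
            chosen.append(i)
            total += self.weights[i]
            if total > self.threshold:
                break
        return chosen

    def update(self, x):
        """Feed pixel value x; return True if x is classed as foreground."""
        match = None
        for i in range(self.k):
            # Assumption: an empty (zero-weight) cluster cannot match.
            if self.weights[i] > 0 and abs(x - self.centres[i]) <= RANGE:
                match = i
                break
        # Every cluster is decayed by 63/64.
        self.weights = [w * 63 / 64 for w in self.weights]
        if match is None:
            # Assumption: recycle the weakest cluster for the new value.
            weakest = min(range(self.k), key=lambda i: self.weights[i])
            self.centres[weakest] = float(x)
            self.weights[weakest] = 1 / 64
            return True
        self.weights[match] += 1 / 64
        self.centres[match] = 7 / 8 * self.centres[match] + 1 / 8 * x
        return match not in self.background_clusters()


model = PixelModel()
for _ in range(200):
    steady = model.update(100)      # a stable background value
first_new = model.update(200)       # sudden new value, no matching cluster
second_new = model.update(200)      # matches the new, still weak cluster
```

The divisors 64 and 8 are powers of two, so in hardware each update reduces to shifts and additions, which fits the slide's point about avoiding floating-point updates.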

TULIPP – The game changer!
• Tools that let real-time computer vision developers focus on core
application development by automating recurring, but critical,
tasks such as:
• Performance instrumentation
• Design space exploration
• Vendor tool configuration.
• Making it possible for the designer to reach the required
speed within power constraints without
having to worry too much about the architecture.

Imaging before Deep Learning
Before
• Standard feature detectors
• SIFT, HOG, LBP
• Different algorithms for object
detection
• Requires a small amount of data
• Useful for measurement and
labelling
After
• Features are learnt and stacked
according to data
• Same algorithm that adapts to the
data
• Requires a huge volume of data
• Useful for labelling
Mathworks; Dalal & Triggs; cc.gatech.edu

Deep CNN – Overview
• Uses convolution to preserve the spatial
structure of the input image
• Instead of a sigmoid activation function,
ReLU (rectified linear unit) is often used
• Encourages sparsity of synapses as
the value approaches zero (0).
Credit : Fei-Fei Li CS231n; Bala Amavasai – IEEE & M. Turner
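The contrast between the two activation functions named above can be shown with a small sketch. The sample activations are made up for illustration; the point is that ReLU maps every non-positive input to exactly zero, while the sigmoid never does.

```python
import math


def sigmoid(x):
    """Logistic sigmoid: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    """Rectified linear unit: zero for non-positive inputs, identity otherwise."""
    return max(0.0, x)


# Hypothetical pre-activation values for four units.
activations = [-2.0, -0.5, 0.0, 1.5]
relu_out = [relu(a) for a in activations]
sigmoid_out = [sigmoid(a) for a in activations]

# Fraction of units that are exactly zero after each activation.
relu_sparsity = relu_out.count(0.0) / len(relu_out)
sigmoid_sparsity = sigmoid_out.count(0.0) / len(sigmoid_out)
```

With these inputs three of the four ReLU outputs are exactly zero, whereas every sigmoid output stays strictly positive, which is the sparsity effect the slide refers to.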

Feature Maps
• Several feature maps are used to
identify various local features.
• Each convolution filter can be tuned
to edges of different
• Orientation, Frequency, Phase, Colour, etc
• Capture some aspects of neural response
• But neural data not used in training

Sparse local connectivity
• For an input image of size 7x7
• The convolution filter 3x3
• The output image will be 5x5
• (Image – Filter )/stride + 1
• A sample filter for horizontal and
vertical gradient.
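The output-size rule above, (Image - Filter)/stride + 1, and a gradient filter can be tried out with a short sketch. The Sobel kernels are an assumption chosen for the example, since the slide's sample filter is not reproduced in the text, and the helper names are hypothetical.

```python
def conv_output_size(image_size, filter_size, stride=1):
    """Output width/height of an unpadded convolution."""
    return (image_size - filter_size) // stride + 1


def correlate2d_valid(image, kernel, stride=1):
    """Slide the kernel over the image without padding (no kernel flip)."""
    out_rows = conv_output_size(len(image), len(kernel), stride)
    out_cols = conv_output_size(len(image[0]), len(kernel[0]), stride)
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            acc = 0
            # Each output value depends only on a local 3x3 patch.
            for i in range(len(kernel)):
                for j in range(len(kernel[0])):
                    acc += image[r * stride + i][c * stride + j] * kernel[i][j]
            row.append(acc)
        output.append(row)
    return output


# Assumed sample filters: Sobel kernels for horizontal and vertical change.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

# 7x7 test image whose intensity rises by 1 per column (a horizontal ramp).
image = [[col for col in range(7)] for _ in range(7)]
gx = correlate2d_valid(image, SOBEL_X)
gy = correlate2d_valid(image, SOBEL_Y)
```

Here `gx` and `gy` are both 5x5, matching (7 - 3)/1 + 1 = 5. On the ramp image the horizontal filter responds with a constant value and the vertical filter with zero.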

Way forward
• Computer Vision meets Cognitive Science and Neuroscience
Fei-Fei Li & Justin Johnson & Serena Yeung

The Challenge
• The success stories about the rise of Convolutional Neural Networks
(CNNs) capable of learning high-level features in object recognition
keep increasing
• due to the availability of large datasets like ImageNet
• However, performance at scene recognition has not attained the same
level of success.
• Yet large scene databases like SUN and Places do exist
• Maybe the current deep features trained from ImageNet are not
competitive enough for such tasks.
• But do primates and humans actually do a raster scan to understand a
scene?
• CNNs fail to capture human insensitivity to perturbations of an image

Possible Solution
• The accuracy of CNNs relies on a huge search space.
• The need for more biological guidance from the visual cortex
• Multi-disciplinary research in neuroscience, psychology,
physiology, shows that:
• object recognition in visual cortex is modulated via the ventral stream
• Neuronal signals from the retina are transformed into high-level
representation for object recognition.
• Computer scientists working with neuroscientists, psychologists,
etc. would have better models for understanding scenes.

Reported Successes
• A Biologically Inspired Deep CNN Model [Zhang et al. 2016]
• Simulates the V1, V2, V4 and IT layers of the human ventral stream
• Uses convolutional layers with varied sizes and complexities
• Increased concurrency for improved processing speed
• Outperformed seven other CNN techniques using four datasets.
• You Only Look Once (YOLOv2) [Redmon and Farhadi CVPR2017]
• Based on the assumption that humans glance at an image
• Does not rely on sliding window like other deep learning approaches
• Outperforms Deformable Part Models (DPM) and Region-based CNN (R-CNN).

Scene understanding with DNN
• Learning Deep Features for Scene Recognition using Places
Database [Zhou et al. NIPS2014]
• Uses CNN to learn features from the scene
• Combined various local and global features to understand the scene
• Presents scene categories where machines perform like humans.
• Humans, but Not Deep Neural Networks, Often Miss Giant
Targets in Scenes [Eckstein et al. Current Biology 2017]
• Humans often miss unusually sized targets during visual search
• Deep learning does not exhibit such a deficit with targets
• Is that a good thing or not?

Our motivation
• Missing giant targets is a functional brain strategy to discount
distractors
Eckstein et al. Current Biology 2017

Our Approach
• To understand how humans and primates recognise scenes
• Provide them with samples of indoor scenes
• Ask them to identify specific objects
• Observe their recall mechanism and whether spatial relationships play a role
• Model the scene to account for the experimental results
• Incorporate global and local descriptors
• Construct a relationship vector
Lunchroom image : PASSTA Dataset

Summary
• Computer vision and machine learning have improved over the
years, thanks to more data and processing power.
• Global scene understanding is still a challenge.
• A multi-disciplinary effort is required to take computer vision to the
next level, acceptable for applications like driverless cars.
• We aim to combine the strengths of CNNs with what humans are
good at for scene understanding.
• TULIPP offers a platform and toolchain to drive this agenda.
