Elderly Assistance- Deep Learning Theme detection

Video Analytics – Theme Detection
Comprehending Video Stream – A Deep Learning approach
Presented by:
Rishi Gargi (11010003)
Nitin Agarwal (11810058)
Shubhendra Vatsa (11810059)
Tanvi Mittal (11810129)
Certificate Programme in Business Analytics
Co2019 - Winter
Capstone Project Review – Dec 2019
Guided by:
Prof. V. Nagadevara
ISB, Hyderabad
Sponsored by:
Mr. Jatinder Kautish
Director (AI ML Labs)
Capgemini India Pvt. Ltd.

Today’s Objectives
Problem Summary and Key Objectives (3 minutes)
Approach and Methodology (20 minutes)
Results & Demonstration (7 minutes)
Q & A (10 minutes)

Problem Summary & Key Objective
 Take pre-processed videos as input and convert them to an ML-usable format
 Detect and categorize common objects and highlight major activity themes, e.g. Positive/Negative, Safe/Risk etc.
 Outline pose/gesture to draw sentiments
 Implement all of the above objectives in real time, so that the solution can be practically deployed
Video surveillance and anomaly detection are upcoming themes in machine learning vision for 2020, with an expected CAGR of 50%. We aimed to create an algorithm capable of processing videos and highlighting key themes along with possible anomalies.

Use case for proof-of-concept
Key Limitations:
 Lack of labelled data
 Computing power limitation – long training time
 Avoid re-inventing the wheel
 Data privacy & security limitations
Problem Summary
• Proof-of-concept
Multiple use cases
Proof-of-concept for Elderly Assistance
Why “Elderly Assistance”?
 Abundance of Training data
 Opportunity for applications from other advancements
 Simplistic yet robust solution thinking required
 No real data privacy or regulatory issues


Approach and Methodology – Datasets
Approach and Methodology
• Datasets
• Approach
• Project Process Flow
Across the project, we plan to leverage some robust, pre-processed, publicly available datasets to attain the final objective:
Global Video & Image Datasets
 Microsoft COCO dataset
• Training data for the YOLO_v3 object detection algorithm
• Robust dataset with 330K images and 1.5 million objects across 80 common categories
• Source: http://cocodataset.org
 CMU Panoptic Dataset
• Training data for OpenPose, the pose detection algorithm
• Currently, 65 sequences (5.5 hours) and 1.5 million 3D skeletons are available
• Source: https://cmu-perceptual-computing.org
 MPII human multi-person dataset
• Second training dataset, used for the MPII human pose model
• Contains 25K images covering 40K human bodies and 410 human activities
• Source: http://human-pose.mpi-inf.mpg.de/
 Kaggle Challenges in Representation Learning dataset
• Over 32K images of human faces with 7 different emotion labels
• Source: https://www.kaggle.com/facial-expression-recognition-challenge
Microsoft Azure Face Recognition API
 Azure Cognitive Services Face API
• Used to detect, recognize, and analyze human faces in video streams
• Source: https://docs.microsoft.com/en-in/azure/cognitive-services/face
Hand Labelled Dataset (based on images extracted from movie scenes)

To achieve a cohesive application, we integrated 4 state-of-the-art computer vision models to create a powerful image processor. The key questions we are trying to answer are:
Input: Raw/Live Video Feed
1. Possible objectionable objects?
- Guns & knives
- Police uniforms
- Mob/too many people
- Canes/sticks etc.
2. Unfamiliar faces in the surroundings?
- Family members
- Outsiders
3. Compromised situations?
- Falling down
- Slipping etc.
4. Alarming emotions?
- Crying
- Sad/Angry
Output: Stream Warning Score: 0-100

We envisioned dividing the project objective into 5 discrete yet conjoined processes, which come together as an ensemble/composite learning model to give robust results:
Input: Raw/Live Video Feed
1. Object Detection
• Detect and label common & uncommon objects in the image, including persons and objectionable objects (guns, knives etc.)
• Fine-tune the model to signal partial detections even if the probability is low
• Number of frames for which the same objectionable object is detected
2. Pose Estimation
• Estimate the pose of people identified in the scene to identify non-normal behavior
• Estimate personal interaction with objectionable objects by proximity and overlap
• Duration of pose across frames
3. Emotion Detection
• Estimate facial expressions and sentiments of the persons detected
• Combined sentiment score of the scene and variability in score across the scenes
4. Facial Recognition
• Identify familiar and unfamiliar faces in a scene to identify the presence of unidentified people
• Set a threshold based on the number of unfamiliar faces vs familiar faces, and track incoming and outgoing faces from the video stream
5. Overall warning score
• Generate a warning score for each scene and hence identify situations which require human intervention
We used transfer learning to use & fine-tune pre-trained models
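The "proximity and overlap" check between a person and an objectionable object can be illustrated with plain bounding-box geometry. The sketch below is only one possible reading of that step: the `Box` type, the helper names and both thresholds (`min_iou`, `max_distance`) are hypothetical and were not taken from the project.

```python
from dataclasses import dataclass
from math import hypot
from typing import Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixel coordinates (hypothetical helper type)."""
    x1: float
    y1: float
    x2: float
    y2: float

    @property
    def area(self) -> float:
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)

    @property
    def center(self) -> Tuple[float, float]:
        return ((self.x1 + self.x2) / 2, (self.y1 + self.y2) / 2)


def iou(a: Box, b: Box) -> float:
    """Overlap: intersection area divided by union area, in [0, 1]."""
    iw = min(a.x2, b.x2) - max(a.x1, b.x1)
    ih = min(a.y2, b.y2) - max(a.y1, b.y1)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (a.area + b.area - inter)


def center_distance(a: Box, b: Box) -> float:
    """Proximity: Euclidean distance between the two box centers, in pixels."""
    (ax, ay), (bx, by) = a.center, b.center
    return hypot(ax - bx, ay - by)


def is_interacting(person: Box, obj: Box,
                   min_iou: float = 0.05,
                   max_distance: float = 80.0) -> bool:
    """Flag an interaction if the boxes overlap enough or are close together.

    Both thresholds are assumed values for illustration only.
    """
    return iou(person, obj) >= min_iou or center_distance(person, obj) <= max_distance
```

For example, a person box `Box(0, 0, 100, 200)` and a knife box `Box(50, 50, 150, 150)` overlap with an IoU of 0.2, so `is_interacting` would flag them under the assumed thresholds.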

Model Architecture
Stage – 1 (Individual Model Tuning)
Pre-trained NN (Model X) + Transfer Learning: weights trained on the basis of the use case in consideration
Stage – 2 (Integration)
Video Feed → Integrated Model, Image Processing (2.5~2.7 FPS) → Warning Signals
1. Object Detection (YOLO v3)
2. Pose Estimation (VGG 16)
3. Sentiment Detection (Xception v3)
4. Facial Recognition (MS Cognitive API)
Stage – 3 (Top-up & Deployment)
Objects (Type/No.) | Human Poses (Scene Avg.) | Human Emotions (Scene Avg.) | Familiar/Non-familiar faces | Warning Score (0-100)
4 | 0.2 | 0.4 | 2 | 10
3 | 0.3 | 0.6 | 4 | 50
2 | 0.5 | 0.8 | 2 | 100
… | … | … | … | ---

Approach and Methodology – Object Detection
The team leveraged the state-of-the-art YOLO-v3 model, fine-tuned using transfer learning, for identification of key household objects. Out of the box, the algorithm recognizes 80 categories of common household objects; after transfer learning it also detects additional objects such as guns, knives, police uniforms, canes etc.
- YOLO itself provides real-time speed by using a single 106-layer neural network which divides the image into regions and predicts bounding boxes and probabilities for each region. These bounding boxes are weighted by the predicted probabilities to come up with labels.
Transfer learning flow: Input image → pre-trained network (all layers frozen) → features extracted for each image → CNN classifier → improved identification of guns & knives
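To make the weighting of boxes by predicted probabilities concrete, the sketch below scores each box as objectness times its best class probability, which is how YOLO-style detectors usually derive a label confidence. The second, lower threshold for "partial" detections of watch-list objects is an assumption that mirrors the goal of signalling low-probability detections. The type names, threshold values and watch list are all hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, List, NamedTuple


class RawBox(NamedTuple):
    objectness: float              # probability that the box contains any object
    class_probs: Dict[str, float]  # per-class probabilities for that box


class Detection(NamedTuple):
    label: str
    score: float
    partial: bool  # True when kept only because the label is on the watch list


def label_boxes(boxes: Iterable[RawBox],
                confident: float = 0.5,
                partial: float = 0.2,
                watch_list: FrozenSet[str] = frozenset({"gun", "knife"})) -> List[Detection]:
    """Turn raw box predictions into labelled detections.

    Boxes scoring at least `confident` are kept as normal detections. Boxes
    scoring between `partial` and `confident` are kept only for watch-list
    labels and are marked as partial. Threshold values are assumed.
    """
    detections = []
    for box in boxes:
        # Pick the most likely class, then weight it by the objectness score.
        label, prob = max(box.class_probs.items(), key=lambda kv: kv[1])
        score = box.objectness * prob
        if score >= confident:
            detections.append(Detection(label, score, False))
        elif score >= partial and label in watch_list:
            detections.append(Detection(label, score, True))
    return detections
```

A real pipeline would normally follow this with non-maximum suppression to merge duplicate boxes; that step is left out here to keep the sketch short.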

Approach and Methodology – Pose Estimation
The team leveraged pose estimation in a skeleton-based, real-time action recognition system that classifies and recognizes actions based on frame-wise joints. We used OpenPose, a real-time pose estimation architecture based on VGG 16/MobileNet, and the DeepSort algorithm to assist tracking and object identification.
OpenPose
OpenPose represents the first real-time multi-person system to jointly detect human body, hand, facial, and foot keypoints (135 keypoints in total) on single images.
DeepSort: Multi-person Tracking
This module from the DeepSort algorithm was used to assist in tracking: it locks onto every single object in the frame, uniquely identifies each one and tracks all of them until they leave the frame.
Action Recognition Using DNN
We used action recognition with a DNN for each person, based on the single-frame joints detected by OpenPose.
Figures: Initial Model Results; Integrated Model Results
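Once each tracked person has a per-frame action label, the duration of a pose across frames can be measured as the longest unbroken run of that label per track ID. The sketch below assumes the tracker and the action classifier have already produced, for every frame, a mapping from track ID to action label. The function names, the label strings and the `min_frames` cut-off are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping


def longest_action_streaks(frames: Iterable[Mapping[int, str]],
                           action: str) -> Dict[int, int]:
    """Return, per track ID, the longest run of consecutive frames with `action`.

    `frames` yields one mapping per frame: track ID -> predicted action label.
    """
    current = defaultdict(int)  # length of the streak that is still running
    longest = defaultdict(int)  # best streak seen so far
    for frame in frames:
        for track_id, label in frame.items():
            if label == action:
                current[track_id] += 1
                longest[track_id] = max(longest[track_id], current[track_id])
            else:
                current[track_id] = 0
        # A track that is missing from this frame breaks its streak.
        for track_id in list(current):
            if track_id not in frame:
                current[track_id] = 0
    return dict(longest)


def tracks_to_flag(frames: Iterable[Mapping[int, str]],
                   action: str, min_frames: int) -> List[int]:
    """Track IDs whose `action` lasted at least `min_frames` consecutive frames."""
    streaks = longest_action_streaks(frames, action)
    return sorted(t for t, n in streaks.items() if n >= min_frames)
```

Requiring several consecutive frames before raising a flag is one simple way to suppress single-frame misclassifications.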

Approach and Methodology – Facial Recognition
The team included a face recognition feature for security purposes, using the Microsoft face detection and identification algorithm. The API can detect, recognize and analyze human faces to differentiate known from unknown faces.
The Azure Cognitive Services Face API provides algorithms that are used to detect, recognize, and analyze human faces in images. We utilized face recognition and identification to differentiate between known and unknown faces.
The API is easy to train on the limited set of images available for the people in the surroundings.
Based on the number of known and unknown faces in a scene, the algorithm calculates the overall risk score. The system hence allows detection of the entry and exit of people in an image and hence calculation of the actual risk score.
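The slides do not give the risk formula, so the sketch below is an assumed one: the share of unfamiliar faces in a frame scaled to 0-100, plus a set difference between consecutive frames to detect entries and exits. It starts from face identifiers that some recognition step has already produced and makes no calls to the Face API. All names and identifiers are hypothetical.

```python
from typing import Set, Tuple


def face_risk(face_ids: Set[str], known_ids: Set[str]) -> float:
    """Share of unfamiliar faces in a frame, scaled to 0-100 (assumed formula)."""
    if not face_ids:
        return 0.0
    unknown = face_ids - known_ids
    return 100.0 * len(unknown) / len(face_ids)


def entries_and_exits(previous: Set[str], current: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Faces that appeared since the previous frame, and faces that left."""
    return current - previous, previous - current
```

With known faces `{"anita", "ravi"}`, a frame containing `{"anita", "visitor-1"}` would score 50.0, and comparing it with a previous frame containing only `{"anita"}` would report `"visitor-1"` as an entry.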

Approach and Methodology – Emotion Detection
The team leveraged the state-of-the-art Xception CNN model by Google for identification of facial emotions. The algorithm recognizes 7 categories of human emotions in real time. The model was trained on the Kaggle facial emotion recognition (FER) dataset with over 32k images.
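A combined sentiment score for a scene, and its variability, could be derived from the 7-class outputs roughly as follows. The label order is the one used by the FER dataset; the choice of which emotions count as alarming, and the use of mean and population standard deviation as the summary, are assumptions made for illustration only.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

# Label order used by the Kaggle FER dataset.
EMOTIONS = ("angry", "disgust", "fear", "happy", "sad", "surprise", "neutral")
# Which emotions are treated as alarming is an assumption of this sketch.
ALARMING = {"angry", "disgust", "fear", "sad"}


def alarm_score(probs: Sequence[float]) -> float:
    """Probability mass that one face puts on the alarming emotions."""
    return sum(p for emotion, p in zip(EMOTIONS, probs) if emotion in ALARMING)


def scene_summary(faces_per_frame: Sequence[Sequence[Sequence[float]]]) -> Tuple[float, float]:
    """Scene average of the per-frame alarm score and its variability.

    `faces_per_frame[i]` holds one 7-value probability vector per face detected
    in frame i. Frames without faces are skipped.
    """
    frame_scores: List[float] = [
        mean(alarm_score(face) for face in faces)
        for faces in faces_per_frame if faces
    ]
    if not frame_scores:
        return 0.0, 0.0
    return mean(frame_scores), pstdev(frame_scores)
```

A high standard deviation would indicate that the mood of the scene swings between frames, which is the kind of variability the approach mentions tracking.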

Approach and Methodology – Warning Model
The team leveraged a custom TensorFlow DNN model for identification of warning instances. The model takes inputs from all 4 models and predicts the final warning score from them (0 being a normal scene and 100 being highly vulnerable). The model was trained on self-labelled images from an Indian TV show based on crime incidents ("Saavdhaan India").
Hand Labelled Dataset (based on images extracted from movie scenes)
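The project used a TensorFlow DNN for this step. The dependency-free sketch below only illustrates the shape of such a model: four scene-level inputs (object count, average pose score, average emotion score, unfamiliar-face count) pass through one hidden layer to a single output squashed into 0-100. The layer size, the sigmoid scaling and the random, untrained weights are assumptions, so the scores it returns are meaningless until the weights are fitted to labelled scenes.

```python
import math
import random
from typing import List, Sequence, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weights, biases)


def init_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    """Random, untrained weights; a real model would learn these from labels."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x: Sequence[float], layer: Layer) -> List[float]:
    """Fully connected layer: one weighted sum plus bias per output unit."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def warning_score(features: Sequence[float], hidden: Layer, output: Layer) -> float:
    """Forward pass: ReLU hidden layer, then a sigmoid scaled to the 0-100 range."""
    h = [max(0.0, v) for v in dense(features, hidden)]
    (z,) = dense(h, output)
    return 100.0 / (1.0 + math.exp(-z))


rng = random.Random(0)
hidden_layer = init_layer(4, 8, rng)   # 4 inputs, 8 hidden units (assumed size)
output_layer = init_layer(8, 1, rng)
# Feature order: objects, pose avg., emotion avg., unfamiliar faces.
score = warning_score((3, 0.3, 0.6, 4), hidden_layer, output_layer)
```

Squashing the output with a sigmoid keeps every prediction inside the 0-100 range that the hand-labelled scores use.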

Project Process Flow & Progress
We leveraged a 4-step process to execute the project:
Stage -1: Data Collection & Knowledge Gathering
• We conducted a detailed literature review on speech and vision analysis, including research papers
• We familiarized ourselves with common datasets like COCO, KeyPoint etc.
Stage -2: Realizing pre-trained models (Transfer Learning)
• We leveraged the concept of transfer learning and reused popular pre-existing models (or their weights) to save on model re-training
• YOLO-v3, VGG19 and Xception are some of the models we used in the process
Stage -3: Model Integration
• We integrated these re-tuned models into a single uniform model which takes images as inputs and outputs a stream of detections based on the individually tuned models
• The input can be a live feed using OpenCV or a pre-recorded video stream
Stage -4: Warning score model & API
• The integrated model was used to process a series of pre-labelled images (hand-labelled between 0 and 100 as the overall warning score). This dataset was then used to train a final neural network, which was integrated into the final model to assess the final warning score of each scene
Transfer learning and semi-supervised learning saved the team a lot of training time and eliminated the requirement of special GPUs for the project
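The video input described under Model Integration, a live feed through OpenCV or a pre-recorded stream, could be read along the following lines. This is a sketch and not the project's code: it assumes the `opencv-python` package is installed, and the helper names and the 2.5 FPS target (taken from the processing speed quoted for the integrated model) are illustrative choices.

```python
from typing import Iterator, Union


def frame_stride(source_fps: float, target_fps: float) -> int:
    """How many source frames to advance per processed frame.

    Falls back to 1 when the source does not report a usable frame rate.
    """
    if source_fps <= 0 or target_fps <= 0:
        return 1
    return max(1, round(source_fps / target_fps))


def sampled_frames(source: Union[int, str], target_fps: float = 2.5) -> Iterator:
    """Yield frames from a camera index (e.g. 0) or a video file path.

    Only every `stride`-th frame is yielded, so that a model running at
    roughly `target_fps` can keep up with the stream.
    """
    import cv2  # imported lazily so frame_stride works without OpenCV installed

    capture = cv2.VideoCapture(source)
    try:
        stride = frame_stride(capture.get(cv2.CAP_PROP_FPS), target_fps)
        index = 0
        while True:
            ok, frame = capture.read()
            if not ok:  # end of file or camera failure
                break
            if index % stride == 0:
                yield frame
            index += 1
    finally:
        capture.release()
```

For a 25 FPS recording this would hand roughly every tenth frame to the integrated model, for example via `for frame in sampled_frames("clip.mp4"): ...`, where the file name is a placeholder.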


Time for a short DEMO!!
Results & Insights

Wrap-up & Future Scope
The key results we saw from the DEMO:
Objects
Known vs Unknown Faces
Emotions
Actions
Speed
Score
Future Scope & other potential applications:
 Fast and Accurate Results
 Easy to deploy and use
 Highly customizable
 Multiple Applications
Video Feed → Integrated Model, Image Processing (2.5~2.7 FPS) → Warning Signals
Pre-trained NN (Model X) + Transfer Learning: retrain the algorithm based on the intended application

Q & A (10 minutes)

We are open to any of your questions/queries
