This was a Capstone project for the AMPBA class of Winter 2019. It uses deep learning to analyse the theme of a video. It combines various pre-trained models, enhances them using transfer learning for the context of elderly assistance, and produces a Warning Score in real time for any suspicious activity.
1. Video Analytics – Theme Detection: Comprehending Video Stream – A Deep Learning approach
Presented by:
Rishi Gargi (11010003)
Nitin Agarwal (11810058)
Shubhendra Vatsa (11810059)
Tanvi Mittal (11810129)
Certificate Programme in Business Analytics
Co2019 - Winter
Capstone Project Review – Dec 2019
Guided by:
Prof. V. Nagadevara
ISB, Hyderabad
Sponsored by:
Mr. Jatinder Kautish
Director (AI ML Labs)
Capgemini India Pvt. Ltd.
2. Today’s Objectives
Problem Summary and Key Objectives (3 minutes)
Approach and Methodology (20 minutes)
Results & Demonstration (7 minutes)
Q & A (10 minutes)
3. Problem Summary & Key Objective
Ability to take pre-processed videos as input and convert them to an ML-usable format
Detect and categorize common objects and highlight major activity themes, e.g. Positive/Negative, Safe/Risk etc.
Outline pose/gesture to draw sentiments
Implement all the above objectives in a real-time fashion so that the solution can be practically deployed
Video surveillance and anomaly detection are upcoming themes in machine learning vision 2020, with an expected CAGR of 50%. We aimed to create an algorithm capable of processing videos and highlighting key themes along with possible anomalies.
4. Use case for proof-of-concept
Key Limitations:
Lack of labelled data
Computing power limitation – long training time
Avoid re-inventing the wheel
Data privacy & security limitations
Problem Summary
• Proof-of-concept
Multiple use cases
Proof-of-concept for Elderly Assistance
Why “Elderly Assistance”?
Abundance of Training data
Opportunity for applications from other advancements
Simplistic yet robust solution thinking required
No real data privacy or regulatory issues
6. Approach and Methodology – Datasets
Approach and Methodology
• Datasets
• Approach
• Project Process Flow
Across the project, we plan to leverage some robust, pre-processed, publicly available datasets to attain the final objective:
Global Video & Image Datasets
Microsoft COCO dataset
Training data for the YOLO_v3 object detection algorithm
Robust dataset with 330K images and 1.5 million objects across 80 common categories
Source: http://cocodataset.org
CMU Panoptic Dataset
Training data for OpenPose, the pose detection algorithm
Currently, 65 sequences (5.5 hours) and 1.5 million 3D skeletons are available
Source: https://cmu-perceptual-computing.org
MPII human multi-person dataset
Second training dataset, for the MPII human pose model
Contains 25K images covering 40K human bodies and 410 human activities
Source: http://human-pose.mpi-inf.mpg.de/
Kaggle Challenges in Representation Learning dataset
Over 32K images of human faces with 7 different labels for emotions
Source: https://www.kaggle.com/facial-expression-recognition-challenge
Microsoft Azure Face Recognition API
• Azure Cognitive Services Face API
• Used to detect, recognize, and analyze human faces in video streams
• Source: https://docs.microsoft.com/en-in/azure/cognitive-services/face
Hand Labelled Dataset (based on images extracted from movie scenes)
7. Approach and Methodology
To achieve a cohesive application, we have integrated 4 state-of-the-art computer vision models to create a powerful image processor. The key questions we are trying to answer are:
Raw/Live Video Feed
1. Possible objectionable objects?
- Guns & knives
- Police uniforms
- Mob/too many people
- Canes/sticks etc.
2. Unfamiliar faces in the surroundings?
- Family members
- Outsiders
3. Compromised situations?
- Falling down
- Slipping etc.
4. Alarming emotions?
- Crying
- Sad/Angry
Stream Warning Score: 0-100
8. Approach and Methodology
We envisioned dividing the project objective into 5 discrete yet conjoined processes, which can come together as an ensemble/composite learning model to give robust results:
Raw/Live Video Feed
We used transfer learning to use & fine-tune pre-trained models
1. Object Detection
• Detect and label common & uncommon objects in the image, including persons and objectionable objects (guns, knives etc.)
• Fine-tune the model to signal partial detections even if probability is low
• Number of frames for which the same objectionable object is detected
2. Pose Estimation
• Estimate pose of people identified in the scene to identify non-normal behavior
• Estimate personal interaction with objectionable objects by proximity and overlap
• Duration of pose across frames
3. Emotion Detection
• Estimate facial expressions and sentiments of the persons detected
• Combined sentiment score of the scene and variability in score across the scenes
4. Facial Recognition
• Identify familiar and unfamiliar faces in a scene to identify presence of unidentified people
• Set threshold based on number of unfamiliar faces vs familiar faces and track incoming and outgoing faces from the video stream
5. Overall warning score
• Generate a warning score for each scene and hence identify situations which require human intervention
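The "proximity and overlap" and "number of frames" ideas listed for this slide could be sketched roughly as follows. This is a minimal illustration, not the team's implementation: the `(x1, y1, x2, y2)` box format, the helper names `iou` and `longest_run`, and the sample boxes are all assumptions.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed pixel coordinates


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes (0.0 when disjoint)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def longest_run(flags: List[bool]) -> int:
    """Longest run of consecutive frames in which the object was detected."""
    best = current = 0
    for seen in flags:
        current = current + 1 if seen else 0
        best = max(best, current)
    return best


# Hypothetical detections for one frame and one short clip.
person = (100.0, 50.0, 200.0, 300.0)
knife = (180.0, 150.0, 230.0, 200.0)
overlap = iou(person, knife)  # greater than 0 means the boxes intersect
persistence = longest_run([False, True, True, True, False, True])
```

A non-zero `overlap` between a person box and an objectionable-object box, sustained over several frames, is one plausible signal of interaction; the cut-off values would need tuning on real footage.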
9. Model Architecture
Stage – 1 (Individual Model Tuning), Stage – 2 (Integration), Stage – 3 (Top-up & Deployment)
Pre-trained NN (Model X) + Transfer Learning: weights trained on basis of the use case in consideration
Video Feed -> Integrated Model: Image-Processing (2.5~2.7 FPS) -> Warning Signals
1. Object Detection (YOLO v3)
2. Pose Estimation (VGG 16)
3. Sentiment Detection (Xception v3)
4. Facial Recognition (MS Cognitive API)
Objects (Type/No.) | Human Poses (Scene Avg.) | Human Emotions (Scene Avg.) | Familiar/Non-familiar faces | Warning Score (0-100)
4 | 0.2 | 0.4 | 2 | 10
3 | 0.3 | 0.6 | 4 | 50
2 | 0.5 | 0.8 | 2 | 100
… | … | … | … | ---
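Each row of the table on this slide is a scene-level summary of the four model outputs. A rough sketch of how such a row might be assembled from per-frame outputs is given here; the field names, the per-frame dictionary keys, and the choice of max for counts and mean for scores are assumptions made for illustration only.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class SceneFeatures:
    object_count: int      # objectionable objects seen in the scene
    pose_score: float      # scene-average pose score, assumed in [0, 1]
    emotion_score: float   # scene-average emotion score, assumed in [0, 1]
    unfamiliar_faces: int  # faces not matched to a known person

    def as_vector(self) -> List[float]:
        """Flatten to the 4-value input a downstream warning model could take."""
        return [float(self.object_count), self.pose_score,
                self.emotion_score, float(self.unfamiliar_faces)]


def summarise_scene(frames: List[Dict[str, float]]) -> SceneFeatures:
    """Collapse hypothetical per-frame outputs into one feature row."""
    return SceneFeatures(
        object_count=int(max(f["objects"] for f in frames)),
        pose_score=mean(f["pose"] for f in frames),
        emotion_score=mean(f["emotion"] for f in frames),
        unfamiliar_faces=int(max(f["unfamiliar"] for f in frames)),
    )


frames = [
    {"objects": 3, "pose": 0.2, "emotion": 0.5, "unfamiliar": 2},
    {"objects": 4, "pose": 0.4, "emotion": 0.7, "unfamiliar": 4},
]
row = summarise_scene(frames)
```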
10. Approach and Methodology – Object Detection
Object Detection | Pose Estimation | Emotion Detection | Facial Recognition | Warning Model
The team leveraged the state-of-the-art YOLO-v3 model, fine-tuned using transfer learning, for identification of key household objects. The algorithm allows recognition of 80 categories of common household objects; after transfer learning it also detects additional objects like guns, knives, police uniforms, canes etc.
- YOLO itself provides real-time speed by using a single 106-layer NN framework which divides the image into regions and predicts bounding boxes and probabilities for each region. These bounding boxes are weighted by the predicted probabilities to come up with labels.
Input: Image -> (All layers frozen) -> (Features extracted for each image) -> (CNN Classifier) -> Improved identification for guns & knives
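To illustrate how bounding boxes weighted by predicted probabilities can be turned into labels, the following sketch filters candidate boxes by an objectness-weighted class score, with a lower bar for objectionable classes so that low-probability detections are still surfaced. The threshold values, class names and the `Detection` structure are hypothetical and are not taken from the project.

```python
from typing import Dict, List, NamedTuple


class Detection(NamedTuple):
    label: str
    objectness: float  # probability that the box contains any object
    class_prob: float  # probability of `label` given an object is present


# Hypothetical thresholds: a lower bar for objectionable classes.
DEFAULT_THRESHOLD = 0.5
CLASS_THRESHOLDS: Dict[str, float] = {"gun": 0.2, "knife": 0.2}


def keep_detections(dets: List[Detection]) -> List[Detection]:
    """Keep boxes whose objectness-weighted class score clears the class threshold."""
    kept = []
    for d in dets:
        score = d.objectness * d.class_prob
        if score >= CLASS_THRESHOLDS.get(d.label, DEFAULT_THRESHOLD):
            kept.append(d)
    return kept


candidates = [
    Detection("chair", 0.9, 0.8),  # high score, kept
    Detection("cup", 0.6, 0.5),    # below the default threshold, dropped
    Detection("knife", 0.5, 0.5),  # low score, kept at the lower threshold
]
kept_labels = [d.label for d in keep_detections(candidates)]
```

Lowering the threshold for a class trades more false alarms for fewer missed detections, which may be acceptable when the downstream warning model combines several signals.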
11. Approach and Methodology – Pose Estimation
The team leveraged pose estimation, a skeleton-based real-time action recognition system that classifies and recognizes actions based on framewise joints. We used OpenPose, a real-time pose estimation architecture based on VGG 16/MobileNet. We used the DeepSort algorithm to assist tracking and object identification.
OpenPose
OpenPose represents the first real-time multi-person system to jointly detect human body, hand, facial, and foot keypoints (in total 135 keypoints) on single images.
DeepSort: Multi-person Tracking
This module from the DeepSort algorithm was put to use to assist in tracking: locking onto every single object in the frame, uniquely identifying each one of them and tracking all of them until they leave the frame.
Action Recognition Using DNN
We used action recognition with a DNN for each person based on single framewise joints detected from OpenPose.
[Figures: Initial Model Results; Integrated Model Results]
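Action recognition on single framewise joints usually needs the joints expressed independently of where the person stands and how large they appear. A minimal sketch of one such normalisation is shown here; centring on the neck, scaling by the neck-to-hip distance, and the default indices 1 and 8 are assumptions about the keypoint layout, not a description of the team's pipeline.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def normalise_joints(joints: List[Point], neck: int = 1, hip: int = 8) -> List[float]:
    """Centre joints on the neck and scale by the neck-to-hip distance.

    The default indices are assumptions about the keypoint ordering.
    """
    nx, ny = joints[neck]
    hx, hy = joints[hip]
    torso = math.hypot(hx - nx, hy - ny) or 1.0  # avoid division by zero
    features: List[float] = []
    for x, y in joints:
        features.extend([(x - nx) / torso, (y - ny) / torso])
    return features


# Toy 3-joint skeleton (neck, chest, hip) stacked vertically.
toy_skeleton = [(100.0, 100.0), (100.0, 200.0), (100.0, 300.0)]
features = normalise_joints(toy_skeleton, neck=0, hip=2)
```

The resulting flat vector is the kind of input a small per-person classifier could take, one vector per tracked person per frame.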
12. Approach and Methodology – Facial Recognition
The team included a face recognition feature for security purposes using the Microsoft face detection and identification algorithm. The API makes it possible to detect, recognize and analyze human faces to differentiate known vs unknown faces.
The Azure Cognitive Services Face API provides algorithms that are used to detect, recognize, and analyze human faces in images. We utilized face recognition and identification to differentiate between known and unknown faces.
The API is easy to train based on the limited set of images available for persons in the surroundings.
Based on the number of known and unknown faces in a scene, the algorithm calculates the overall risk score. The system hence allows detection of entry and exit of persons in an image and hence calculation of the actual risk score.
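The slide does not give the risk formula, so the following is only one plausible sketch: the share of unknown faces as a risk value, plus set differences between consecutive frames to detect entries and exits. The function names and face identifiers are hypothetical, and no Face API call is shown.

```python
from typing import Set, Tuple


def face_risk(known: int, unknown: int) -> float:
    """Share of unknown faces in the scene, from 0.0 (all known) to 1.0."""
    total = known + unknown
    return unknown / total if total else 0.0


def entries_and_exits(previous: Set[str], current: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Face IDs that appeared since, or disappeared after, the previous frame."""
    return current - previous, previous - current


risk = face_risk(known=2, unknown=1)
entered, exited = entries_and_exits({"mother", "unknown-1"}, {"mother", "unknown-2"})
```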
13. Approach and Methodology – Emotion Detection
The team leveraged the state-of-the-art Xception CNN model by Google for identification of facial emotions. The algorithm allows recognition of 7 categories of human emotions in real time. The model has been trained on the Kaggle Facial Emotion Recognition (FER) dataset with over 32k images.
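A 7-way emotion classifier outputs one probability per class for each face. A rough sketch of collapsing these into a single scene-level score is given here; the label order follows the FER dataset, while the choice of which emotions count as alarming and the averaging rule are assumptions for illustration.

```python
from typing import List, Sequence

EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]
# Assumption: which classes count as alarming is a design choice.
ALARMING = {"angry", "disgust", "fear", "sad"}


def alarm_share(probs: Sequence[float]) -> float:
    """Probability mass the classifier puts on alarming emotions for one face."""
    return sum(p for label, p in zip(EMOTIONS, probs) if label in ALARMING)


def scene_emotion_score(faces: List[Sequence[float]]) -> float:
    """Average alarm share over all faces in the scene (0.0 if no faces)."""
    return sum(alarm_share(p) for p in faces) / len(faces) if faces else 0.0


faces = [
    [0.50, 0.0, 0.25, 0.0, 0.25, 0.0, 0.0],  # mostly angry
    [0.0, 0.0, 0.0, 0.75, 0.0, 0.0, 0.25],   # mostly happy
]
score = scene_emotion_score(faces)
```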
14. Approach and Methodology – Warning Model
The team leveraged a custom TensorFlow DNN model for identification of warning instances. The model takes inputs from all 4 models and predicts the final warning score based on them (0 being a normal scene and 100 being highly vulnerable). The model has been trained on self-labelled images from an Indian TV show based on crime incidents ("Saavdhaan India").
Hand Labelled Dataset (based on images extracted from movie scenes)
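To make the idea of mapping four model outputs to a 0-100 score concrete, here is a dependency-free stand-in for the forward pass of a very small network. It is not the team's TensorFlow model: the 4-2-1 layout, the weights and the sample inputs are made-up placeholders, and a real model would learn its weights from the hand-labelled scenes.

```python
import math
from typing import List


def dense(x: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """One fully connected layer: `weights` has one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def warning_score(features: List[float]) -> float:
    """Forward pass of a tiny 4-2-1 network, scaled to the 0-100 range.

    All weights below are placeholders chosen for illustration.
    """
    w1 = [[0.5, 1.0, 1.0, 0.5], [0.1, 2.0, 2.0, 0.1]]
    b1 = [-2.0, -1.0]
    hidden = [max(0.0, h) for h in dense(features, w1, b1)]  # ReLU
    (logit,) = dense(hidden, [[1.0, 1.0]], [-1.0])
    return 100.0 / (1.0 + math.exp(-logit))  # sigmoid scaled to 0-100


# Inputs: [objects, pose score, emotion score, unfamiliar faces]
calm = warning_score([0.0, 0.1, 0.1, 0.0])
tense = warning_score([3.0, 0.6, 0.8, 4.0])
```

The sigmoid output keeps every prediction inside the 0-100 range, which matches how the hand labels are expressed.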
15. Project Process Flow & Progress
We leveraged a 4-step process to execute the project:
Stage 1: Data Collection & Knowledge Gathering
• We conducted a detailed literature review on speech and vision analysis, including research papers
• Familiarized ourselves with common datasets like COCO, KeyPoint etc.
Stage 2: Realizing pre-trained models (Transfer Learning)
• We leveraged the concept of transfer learning and realized popular pre-existing models (or used their weights) to save on model re-training
• YOLO-v3, VGG19, Xception are some models which we have used/realized in the process
Stage 3: Model Integration
• We integrated these re-tuned models into a single uniform model which takes images as inputs and outputs a stream of detections based on individually tuned models
• The input can be a live feed using OpenCV or a pre-recorded video stream
Stage 4: Warning score model & API
• The integrated model was used to process a series of pre-labelled images (hand-labelled between 0 and 100 as overall warning score). This dataset was then used to train a final neural network, which was integrated into the final model to assess the final warning score of each scene
Transfer learning and semi-supervised learning saved a lot of training time for the team and eliminated the requirement of special GPUs for the project
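Since the integrated model is reported to run at roughly 2.5~2.7 FPS, a live or pre-recorded feed that arrives faster has to be subsampled. A minimal sketch of that step follows; the 25 FPS source rate and the `sample_frames` helper are assumptions, and the frames here are stand-in integers, not decoded images.

```python
from typing import Iterable, Iterator, Tuple, TypeVar

Frame = TypeVar("Frame")


def sample_frames(frames: Iterable[Frame], source_fps: float,
                  model_fps: float) -> Iterator[Tuple[int, Frame]]:
    """Yield (index, frame) pairs at roughly `model_fps` from a faster source."""
    step = max(1, round(source_fps / model_fps))
    for index, frame in enumerate(frames):
        if index % step == 0:
            yield index, frame


# Stand-in for decoded frames; a real feed would come from a video reader.
fake_feed = range(50)
picked = [i for i, _ in sample_frames(fake_feed, source_fps=25.0, model_fps=2.5)]
```

Skipping frames this way keeps the output stream close to real time at the cost of missing very short events between sampled frames.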
18. Wrap-up & Future Scope
Results & Insights
The key results we saw from the demo:
Objects
Known vs Unknown Faces
Emotions
Actions
Speed
Score
Future Scope & other potential applications:
Fast and Accurate Results
Easy to deploy and use
Highly customizable
Multiple Applications
Retrain algorithm based on intended application:
Pre-trained NN (Model X) + Transfer Learning: weights trained on basis of the use case in consideration
Video Feed -> Integrated Model: Image-Processing (2.5~2.7 FPS) -> Warning Signals
1. Object Detection (YOLO v3)
2. Pose Estimation (VGG 16)
3. Sentiment Detection (Xception v3)
4. Facial Recognition (MS Cognitive API)