Video Analytics – Theme Detection
Comprehending Video Stream – A Deep Learning Approach
Presented by:
Rishi Gargi (11010003)
Nitin Agarwal (11810058)
Shubhendra Vatsa (11810059)
Tanvi Mittal (11810129)
Certificate Programme in Business Analytics
Co2019 - Winter
Capstone Project Review – Dec 2019
Guided by:
Prof. V. Nagadevara
ISB, Hyderabad
Sponsored by:
Mr. Jatinder Kautish
Director (AI ML Labs)
Capgemini India Pvt. Ltd.
Today’s Objectives
Problem Summary and Key Objectives (3 minutes)
Approach and Methodology (20 minutes)
Results & Demonstration (7 minutes)
Q & A (10 minutes)
Problem Summary & Key Objectives
• Take pre-processed videos as input and convert them to an ML-usable format
• Detect and categorize common objects and highlight major activity themes, e.g. Positive/Negative, Safe/Risk
• Outline pose/gesture to infer sentiment
• Implement all of the above in real time, so that the solution can be practically deployed
Video surveillance and anomaly detection are upcoming themes in machine learning vision for 2020, with an expected CAGR of 50%. We aimed to create an algorithm capable of processing videos and highlighting key themes along with possible anomalies.
Use Case for Proof-of-Concept
Key limitations:
• Lack of labelled data
• Limited computing power, leading to long training times
• Need to avoid re-inventing the wheel
• Data privacy & security limitations
Multiple use cases
Proof-of-concept for Elderly Assistance
Why “Elderly Assistance”?
• Abundance of training data
• Opportunity for applications from other advancements
• Simple yet robust solution thinking required
• No real data privacy or regulatory issues
Approach and Methodology – Datasets
Approach and Methodology
• Datasets
• Approach
• Project Process Flow
Across the project, we plan to leverage some robust, pre-processed, publicly available datasets to attain the final objective:
Global Video & Image Datasets
• Microsoft COCO dataset
  - Training data for the YOLO_v3 object detection algorithm
  - Robust dataset with 330K images and 1.5 million objects across 80 common categories
  - Source: http://cocodataset.org
• CMU Panoptic Dataset
  - Training data for OpenPose, the pose detection algorithm
  - Currently, 65 sequences (5.5 hours) and 1.5 million 3D skeletons are available
  - Source: https://cmu-perceptual-computing.org
• MPII human multi-person dataset
  - Second training dataset, used for the MPII human pose model
  - Contains 25K images covering 40K human bodies and 410 human activities
  - Source: http://human-pose.mpi-inf.mpg.de/
• Kaggle Challenges in Representation Learning dataset
  - Over 32K images of human faces with 7 different emotion labels
  - Source: https://www.kaggle.com/facial-expression-recognition-challenge
Microsoft Azure Face Recognition API
• Azure Cognitive Services Face API
  - Used to detect, recognize, and analyze human faces in video streams
  - Source: https://docs.microsoft.com/en-in/azure/cognitive-services/face
Hand-labelled dataset (based on images extracted from movie scenes)
Approach and Methodology
To achieve a cohesive application, we have integrated 4 state-of-the-art computer vision models to create a powerful image processor. The key questions we are trying to answer are:
Raw/Live Video Feed
1. Possible objectionable objects?
- Guns & knives
- Police uniforms
- Mob/too many people
- Canes/sticks etc.
2. Unfamiliar faces in the surroundings?
- Family members
- Outsiders
3. Compromised situations?
- Falling down
- Slipping etc.
4. Alarming emotions?
- Crying
- Sad/Angry
Stream Warning Score: 0-100
Approach and Methodology
We envisioned dividing the project objective into 5 discrete yet connected processes, which come together as an ensemble/composite learning model to give robust results. We used transfer learning to reuse and fine-tune pre-trained models.
Raw/Live Video Feed
1. Object Detection
• Detect and label common and uncommon objects in the image, including persons and objectionable objects (guns, knives etc.)
• Fine-tune the model to signal partial detections even if the probability is low
• Number of frames for which the same objectionable object is detected
2. Pose Estimation
• Estimate the pose of people identified in the scene to identify non-normal behavior
• Estimate a person's interaction with objectionable objects by proximity and overlap
• Duration of pose across frames
3. Emotion Detection
• Estimate facial expressions and sentiments of the persons detected
• Combined sentiment score of the scene and variability in score across scenes
4. Facial Recognition
• Identify familiar and unfamiliar faces in a scene to identify the presence of unidentified people
• Set a threshold based on the number of unfamiliar vs familiar faces, and track incoming and outgoing faces in the video stream
5. Overall Warning Score
• Generate a warning score for each scene and hence identify situations which require human intervention
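Two of the per-frame signals listed above, overlap between a person and an objectionable object and the number of consecutive frames in which an object is seen, can be sketched in a few lines. This is a minimal illustration rather than the project's code: the box format, the function names and the sample values are all assumptions.

```python
from typing import Iterable, List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def overlap_fraction(person: Box, obj: Box) -> float:
    """Fraction of the object's box that lies inside the person's box (0.0 to 1.0)."""
    left = max(person[0], obj[0])
    top = max(person[1], obj[1])
    right = min(person[2], obj[2])
    bottom = min(person[3], obj[3])
    intersection = max(0.0, right - left) * max(0.0, bottom - top)
    obj_area = (obj[2] - obj[0]) * (obj[3] - obj[1])
    return intersection / obj_area if obj_area > 0 else 0.0


def longest_streak(labels_per_frame: Iterable[List[str]], target: str) -> int:
    """Longest run of consecutive frames in which `target` was detected."""
    best = current = 0
    for labels in labels_per_frame:
        current = current + 1 if target in labels else 0
        best = max(best, current)
    return best


# Hypothetical detections for five consecutive frames.
frames = [["person"], ["person", "knife"], ["person", "knife"], ["person"], ["knife"]]
print(longest_streak(frames, "knife"))                          # 2
print(overlap_fraction((0, 0, 100, 200), (50, 50, 150, 100)))   # 0.5
```

Measuring overlap relative to the object's area (rather than a symmetric IoU) is a design choice made here so that a small knife fully inside a large person box still scores 1.0.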
Model Architecture
Video Feed → Integrated Model (Image Processing, 2.5~2.7 FPS) → Warning Signals
Integrated Model: Pre-trained NN (Model X) + Transfer Learning, with weights trained on the basis of the use case in consideration
1. Object Detection (YOLO v3)
2. Pose Estimation (VGG 16)
3. Sentiment Detection (Xception v3)
4. Facial Recognition (MS Cognitive API)
Objects      Human Poses    Human Emotions   Familiar/Non-familiar   Warning Score
(Type/No.)   (Scene Avg.)   (Scene Avg.)     faces                   (0-100)
4            0.2            0.4              2                       10
3            0.3            0.6              4                       50
2            0.5            0.8              2                       100
…            …              …                …                       ---
Stage – 1 (Individual Model Tuning)
Stage – 2 (Integration)
Stage – 3 (Top-up & Deployment)
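Each row of the feature table on this slide can be read as one training example for the warning model: four inputs from the individual models and one hand-assigned score. The sketch below shows one way such rows might be held in code. The class name, field names and the numeric encoding of the faces column are assumptions; the slide does not specify them.

```python
from dataclasses import astuple, dataclass
from typing import List, Tuple


@dataclass
class SceneFeatures:
    """One row of the slide's feature table (field names are illustrative)."""
    objects: float    # Objects (Type/No.)
    poses: float      # Human Poses (Scene Avg.)
    emotions: float   # Human Emotions (Scene Avg.)
    faces: float      # Familiar/Non-familiar faces (exact encoding not stated in the slides)

    def as_vector(self) -> List[float]:
        """Flatten the row into the input vector a downstream model would consume."""
        return [float(value) for value in astuple(self)]


# The three example rows shown on the slide, paired with their warning scores.
LABELLED_ROWS: List[Tuple[SceneFeatures, int]] = [
    (SceneFeatures(4, 0.2, 0.4, 2), 10),
    (SceneFeatures(3, 0.3, 0.6, 4), 50),
    (SceneFeatures(2, 0.5, 0.8, 2), 100),
]

X = [features.as_vector() for features, _ in LABELLED_ROWS]
y = [score for _, score in LABELLED_ROWS]
print(X)
print(y)
```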
Approach and Methodology – Object Detection
Object Detection, Pose Estimation, Emotion Detection, Facial Recognition, Warning Model
The team leveraged the state-of-the-art YOLO-v3 model, fine-tuned using transfer learning, to identify key household objects. Out of the box, the algorithm recognizes 80 categories of common household objects; after transfer learning it also detects additional objects such as guns, knives, police uniforms and canes.
- YOLO itself provides real-time speed by using a single 106-layer neural network which divides the image into regions and predicts bounding boxes and probabilities for each region. These bounding boxes are weighted by the predicted probabilities to come up with labels.
Input: Image → pre-trained network (all layers frozen) → features extracted for each image → CNN classifier → improved identification of guns & knives
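The weighting of boxes by predicted probabilities can be illustrated with a small filter. In YOLO-style detectors the score of a box is commonly taken as its objectness confidence multiplied by the class probability. The sketch below applies that score with a lower cut-off for objectionable classes, so that weak gun or knife detections are still reported. The threshold values and names are hypothetical, not taken from the project.

```python
from typing import Dict, List, NamedTuple


class Detection(NamedTuple):
    label: str
    objectness: float   # confidence that the box contains any object
    class_prob: float   # probability of `label`, given that an object is present


# Hypothetical thresholds: the lower values let weak gun/knife detections through.
DEFAULT_THRESHOLD = 0.5
SENSITIVE_THRESHOLDS: Dict[str, float] = {"gun": 0.2, "knife": 0.2}


def score(det: Detection) -> float:
    """Box score used for filtering: objectness times class probability."""
    return det.objectness * det.class_prob


def keep_detections(dets: List[Detection]) -> List[Detection]:
    """Keep detections whose score reaches the threshold for their class."""
    kept = []
    for det in dets:
        threshold = SENSITIVE_THRESHOLDS.get(det.label, DEFAULT_THRESHOLD)
        if score(det) >= threshold:
            kept.append(det)
    return kept


candidates = [
    Detection("chair", 0.9, 0.8),   # score 0.72, kept
    Detection("chair", 0.6, 0.5),   # score 0.30, dropped
    Detection("knife", 0.5, 0.5),   # score 0.25, kept because of the lower threshold
]
print([d.label for d in keep_detections(candidates)])  # ['chair', 'knife']
```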
Approach and Methodology – Pose Estimation
The team leveraged pose estimation in the form of a skeleton-based, real-time action recognition system that classifies actions based on frame-wise joints. We used OpenPose, a real-time pose estimation architecture based on VGG 16/MobileNet, and the DeepSort algorithm to assist tracking and object identification.
OpenPose: OpenPose represents the first real-time multi-person system to jointly detect human body, hand, facial, and foot keypoints (135 keypoints in total) on single images.
DeepSort (multi-person tracking): This module from the DeepSort algorithm was used to assist in tracking. It locks onto every object in the frame, uniquely identifies each one, and tracks all of them until they leave the frame.
Action recognition using DNN: We used DNN-based action recognition for each person, based on the single-frame joints detected by OpenPose.
[Figures: Initial Model Results; Integrated Model Results]
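The idea of giving each person a stable identity until they leave the frame can be shown with a much simpler stand-in than DeepSort. The sketch below is a greedy nearest-centroid tracker, not DeepSort itself (which also uses motion prediction and appearance features). The class name, the distance limit and the sample coordinates are assumptions made for illustration.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


class CentroidTracker:
    """Greedy nearest-centroid tracker: a simplified stand-in, not DeepSort."""

    def __init__(self, max_distance: float = 50.0) -> None:
        self.max_distance = max_distance   # hypothetical matching radius in pixels
        self.tracks: Dict[int, Point] = {}
        self._next_id = 0

    def update(self, centroids: List[Point]) -> Dict[int, Point]:
        """Match this frame's centroids to existing tracks and return id -> position."""
        unmatched = dict(self.tracks)
        updated: Dict[int, Point] = {}
        for point in centroids:
            best_id = None
            best_dist = self.max_distance
            for track_id, previous in unmatched.items():
                dist = math.dist(point, previous)
                if dist <= best_dist:
                    best_id, best_dist = track_id, dist
            if best_id is None:
                # No track is close enough: treat this as a new person entering.
                best_id = self._next_id
                self._next_id += 1
            else:
                del unmatched[best_id]
            updated[best_id] = point
        self.tracks = updated   # tracks without a match have left the frame
        return updated


tracker = CentroidTracker()
print(tracker.update([(10, 10), (200, 50)]))   # two new ids are assigned
print(tracker.update([(205, 52), (14, 12)]))   # same ids, new positions
print(tracker.update([(14, 13)]))              # the second person has left
```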
Approach and Methodology – Facial Recognition
The team included a face recognition feature for security purposes, using the Microsoft face detection and identification algorithm. The API can detect, recognize and analyze human faces to differentiate known from unknown faces.
The Azure Cognitive Services Face API provides algorithms that are used to detect, recognize, and analyze human faces in images. We utilized face recognition and identification to differentiate between known and unknown faces.
The API is easy to train on the limited set of images available for the persons in the surroundings.
Based on the number of known and unknown faces in a scene, the algorithm calculates the overall risk score. The system thus detects the entry and exit of persons in an image and calculates the actual risk score.
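How a risk score and entry/exit events could be derived from recognized faces is sketched below. The slides do not give the actual formula, so the ratio used here is a hypothetical choice, and the face identifiers are made-up labels standing in for whatever the face API returns.

```python
from typing import Set, Tuple


def face_risk(known: int, unknown: int) -> float:
    """Share of unknown faces in the scene, scaled to 0-100 (hypothetical formula)."""
    total = known + unknown
    return 100.0 * unknown / total if total else 0.0


def entries_and_exits(previous: Set[str], current: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Return (faces that entered, faces that left) between two consecutive frames."""
    return current - previous, previous - current


# Made-up face identifiers for two consecutive frames.
frame_1 = {"family-1", "family-2"}
frame_2 = {"family-2", "unknown-1"}

entered, left = entries_and_exits(frame_1, frame_2)
print(entered, left)                      # {'unknown-1'} {'family-1'}
print(face_risk(known=1, unknown=1))      # 50.0
```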
Approach and Methodology – Emotion Detection
The team leveraged the state-of-the-art Xception CNN model by Google to identify facial emotions. The algorithm recognizes 7 categories of human emotion in real time. The model was trained on the Kaggle facial emotion recognition (FER) dataset, which has over 32K images.
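The per-face emotion probabilities still have to be reduced to a scene-level score and a measure of variability across scenes. One possible reduction is sketched below. The seven labels are those of the FER dataset; the alarm weights and function names are hypothetical and would need tuning for a real use case.

```python
from statistics import mean, pstdev
from typing import Dict, List

# The seven emotion labels of the FER dataset.
EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]

# Hypothetical weights: how alarming each emotion is, on a 0-1 scale.
ALARM_WEIGHT: Dict[str, float] = {
    "angry": 1.0, "disgust": 0.6, "fear": 1.0, "happy": 0.0,
    "sad": 0.8, "surprise": 0.4, "neutral": 0.0,
}


def face_alarm(probabilities: Dict[str, float]) -> float:
    """Weighted sum of one face's emotion probabilities (0 = calm, 1 = alarming)."""
    return sum(ALARM_WEIGHT[e] * probabilities.get(e, 0.0) for e in EMOTIONS)


def scene_score(faces: List[Dict[str, float]]) -> float:
    """Average alarm level over all faces detected in one scene."""
    return mean([face_alarm(p) for p in faces]) if faces else 0.0


def score_variability(scene_scores: List[float]) -> float:
    """Population standard deviation of the score across scenes."""
    return pstdev(scene_scores) if scene_scores else 0.0


scene = [{"happy": 1.0}, {"sad": 0.5, "neutral": 0.5}]
print(scene_score(scene))
print(score_variability([scene_score(scene), scene_score([])]))
```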
Approach and Methodology – Warning Model
The team leveraged a custom TensorFlow DNN model to identify warning instances. The model takes inputs from all 4 models and predicts the final warning score from them (0 being a normal scene and 100 being highly vulnerable). The model was trained on self-labelled images from an Indian TV show based on crime incidents ("Saavdhaan India").
Hand-labelled dataset (based on images extracted from movie scenes)
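To make the shape of such a warning model concrete, the sketch below runs the forward pass of a tiny feed-forward network in plain Python: four inputs, one hidden layer, and a sigmoid output scaled to 0-100. It is not the team's TensorFlow model. The layer sizes and all weights are made-up placeholders, whereas a real model would learn them from the hand-labelled scenes.

```python
import math
from typing import List, Sequence

# Placeholder parameters (NOT trained): 4 inputs -> 3 hidden units -> 1 output.
W_HIDDEN: List[List[float]] = [
    [0.1, 0.8, 0.6, 0.3],
    [0.0, 1.0, 1.0, 0.0],
    [0.2, 0.0, 0.5, 0.4],
]
B_HIDDEN: List[float] = [0.0, -0.5, -0.2]
W_OUT: List[float] = [0.5, 1.5, 0.7]
B_OUT: float = -2.0


def relu(x: float) -> float:
    return max(0.0, x)


def warning_score(features: Sequence[float]) -> float:
    """Map [objects, poses, emotions, faces] to a score between 0 and 100."""
    if len(features) != 4:
        raise ValueError("expected exactly 4 features")
    hidden = [
        relu(sum(w * x for w, x in zip(row, features)) + bias)
        for row, bias in zip(W_HIDDEN, B_HIDDEN)
    ]
    logit = sum(w * h for w, h in zip(W_OUT, hidden)) + B_OUT
    return 100.0 / (1.0 + math.exp(-logit))   # sigmoid scaled to 0-100


print(warning_score([4, 0.2, 0.4, 2]))
```

Scaling a sigmoid output keeps the prediction inside the 0-100 range by construction, which is one simple way to match the scoring scale described above.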
Project Process Flow & Progress
We leveraged a 4-step process to execute the project:
Stage 1: Data Collection & Knowledge Gathering
• We conducted a detailed literature review on speech and vision analysis, including research papers
• We familiarized ourselves with common datasets like COCO, KeyPoint etc.
Stage 2: Realizing pre-trained models (Transfer Learning)
• We leveraged the concept of transfer learning and realized popular pre-existing models (or used their weights) to save on model re-training
• YOLO-v3, VGG19 and Xception are some of the models we used/realized in the process
Stage 3: Model Integration
• We integrated these re-tuned models into a single uniform model which takes images as inputs and outputs a stream of detections based on the individually tuned models
• The input can be a live feed using OpenCV or a pre-recorded video stream
Stage 4: Warning score model & API
• The integrated model was used to process a series of pre-labelled images (hand-labelled between 0 and 100 as the overall warning score). This dataset was then used to train a final neural network, which was integrated into the final model to assess the final warning score of each scene
Transfer learning and semi-supervised learning saved the team a lot of training time and eliminated the need for special GPUs for the project.
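The model integration step, in which every frame is passed through each tuned model and the results are emitted as one stream of detections, can be sketched as a small generator. The function name and the stub models below are assumptions for illustration. In a real deployment the frames would come from a live feed or video file (for example via OpenCV's `cv2.VideoCapture`) and the callables would wrap the actual models.

```python
from typing import Any, Callable, Dict, Iterable, Iterator

Frame = Any                      # e.g. an image array read from a camera or video file
Model = Callable[[Frame], Any]   # any callable that maps a frame to a detection result


def run_pipeline(frames: Iterable[Frame], models: Dict[str, Model]) -> Iterator[Dict[str, Any]]:
    """Run every model on every frame and yield one combined record per frame."""
    for index, frame in enumerate(frames):
        record: Dict[str, Any] = {"frame": index}
        for name, model in models.items():
            record[name] = model(frame)
        yield record


# Stub models standing in for the four tuned networks (hypothetical outputs).
stub_models: Dict[str, Model] = {
    "objects": lambda frame: ["person"],
    "pose": lambda frame: "standing",
    "emotion": lambda frame: "neutral",
    "faces": lambda frame: {"known": 1, "unknown": 0},
}

for detections in run_pipeline(["frame-0", "frame-1"], stub_models):
    print(detections)
```

Because the models are passed in as plain callables, the same loop works for a pre-recorded video and for a live feed; only the source of `frames` changes.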
Time for a short DEMO!!
Results & Insights
Wrap-up & Future Scope
The key results we saw from the DEMO:
• Objects
• Known vs Unknown Faces
• Emotions
• Actions
• Speed
• Score
Future scope & other potential applications:
• Fast and accurate results
• Easy to deploy and use
• Highly customizable
• Multiple applications
• Retrain the algorithm based on the intended application
We are open to any of your questions/queries

Elderly Assistance- Deep Learning Theme detection
