The document describes a project to perform object detection in videos. The team's scope was to identify, list, localize and bound objects in video frames using machine learning. They chose the MS-COCO dataset and the SSD model for its efficiency and speed at object detection. A comparative analysis found SSD_MOBILENET_V1_COCO to have the best balance of speed and accuracy. The team performed transfer learning to customize the model for new object types. They developed a web application using Flask that streams video frames from the client to perform object detection and returns bounding box coordinates.
2. Introduction
• A video is a sequence of frames, each of which can contain several
objects at any given moment. Machine learning can be used
to identify these objects and make them searchable using tags.
3. Our Scope
• Identify objects in videos
• List the objects
• Localize them per frame
• Bound them with boxes
4. Our Approach - Dataset
• We chose our dataset based on the mean number of objects per
image. Of the datasets we looked at, MS-COCO had the highest
mean.
5. Approach – Selecting Model
• Several Convolutional Neural Network (CNN) models are available
for object detection. Based on our research, Faster R-CNN and
SSD (Single-Shot Multibox Detector) are highly efficient at detecting
objects in frames.
• Based on comparative results, we decided to go with the SSD model,
trained on the COCO dataset.
6. Comparative Analysis
Model Name                       Speed (ms)   COCO mAP [^1]
ssd_mobilenet_v1_coco            30           21
ssd_inception_v2_coco            42           24
faster_rcnn_inception_v2_coco    58           28
faster_rcnn_resnet50_coco        89           30
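mAP is the mean average precision, the accuracy metric used here as the basis for comparing the models; higher is better. Speed is the time taken per image in milliseconds; lower is better.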
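To make the speed column easier to relate to video, the short sketch below converts each latency into an approximate frame rate. It assumes the listed time is the full cost of processing one frame, which ignores decoding, network transfer and rendering, so the results are upper bounds rather than measured throughput. The names `MODELS`, `frames_per_second` and `summarize` are illustrative only.

```python
# Speed (ms per image) and COCO mAP, copied from the comparison table.
MODELS = {
    "ssd_mobilenet_v1_coco": (30, 21),
    "ssd_inception_v2_coco": (42, 24),
    "faster_rcnn_inception_v2_coco": (58, 28),
    "faster_rcnn_resnet50_coco": (89, 30),
}


def frames_per_second(latency_ms):
    """Upper bound on frame rate, assuming inference is the only per-frame cost."""
    return 1000.0 / latency_ms


def summarize(models):
    """Map each model name to its approximate frames per second, rounded to 0.1."""
    return {name: round(frames_per_second(ms), 1) for name, (ms, _map) in models.items()}


if __name__ == "__main__":
    for name, fps in summarize(MODELS).items():
        print(f"{name}: about {fps} frames/s")
```

Under this assumption, only the fastest model in the table stays above 30 frames per second.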
After the comparative analysis, we decided to use SSD_MOBILENET_V1_COCO. Here are some details
about the model.
9. Single Shot Multibox Detection Specifics
• Takes 300x300 input images
• Training requires images and their ground-truth bounding boxes
• Performs non-maximum suppression internally
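Non-maximum suppression can be illustrated with a small greedy sketch: keep the highest-scoring box, drop any remaining box that overlaps it too much, and repeat. This is a generic textbook version, not the code that runs inside the SSD model; the box format `(xmin, ymin, xmax, ymax)` and the 0.5 overlap threshold are assumptions chosen for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes to keep, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(non_max_suppression(boxes, scores))
```

In the usage example, the second box overlaps the first heavily and has a lower score, so it is suppressed, while the third box is far away and survives.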
10. SSD v/s The Rest
This comparison is based on a different dataset, but the proportions stay the same with COCO.
11. Transfer Learning
• We performed transfer learning on the SSD model, using Python,
LXML, LabelImg, Paperspace and TensorFlow.
• The steps involved were:
  • Gathering images of the custom objects,
  • Drawing a bounding box on each image,
  • Generating an XML file with the dimensions of each bounding box,
  • Using TensorFlow to train the model on the new object.
• We used Paperspace for access to a GPU.
• We used TensorBoard to monitor accuracy at various iterations.
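As an illustration of the XML step, the sketch below reads one annotation in the Pascal VOC layout that LabelImg writes by default and flattens it into one record per labelled object. It uses the standard library's `xml.etree.ElementTree` so that it runs without extra packages; this is a stand-in, not the LXML code used in the project. The sample file name, label and coordinates are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation in the Pascal VOC layout that LabelImg saves by default.
SAMPLE_XML = """<annotation>
  <filename>example.jpg</filename>
  <size><width>640</width><height>480</height><depth>3</depth></size>
  <object>
    <name>custom_object</name>
    <bndbox><xmin>100</xmin><ymin>120</ymin><xmax>300</xmax><ymax>360</ymax></bndbox>
  </object>
</annotation>"""


def parse_annotation(xml_text):
    """Return one dict per labelled object in a single annotation document."""
    root = ET.fromstring(xml_text)
    filename = root.findtext("filename")
    width = int(root.findtext("size/width"))
    height = int(root.findtext("size/height"))
    rows = []
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        rows.append({
            "filename": filename,
            "width": width,
            "height": height,
            "label": obj.findtext("name"),
            "xmin": int(box.findtext("xmin")),
            "ymin": int(box.findtext("ymin")),
            "xmax": int(box.findtext("xmax")),
            "ymax": int(box.findtext("ymax")),
        })
    return rows


if __name__ == "__main__":
    for row in parse_annotation(SAMPLE_XML):
        print(row)
```

Records in this shape are a convenient intermediate form before converting the annotations into whatever input format the training pipeline expects.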
14. How we made it
• We started off using OpenCV to capture video and render it as
images.
• However, OpenCV was harder to configure on cloud platforms as an API for
accessing webcam footage, which was one of our goals.
• So here is the flow we followed instead.
[Flow diagram, boxes in order: Flask Application; Client Side; WebRTC Image Stream; Start Object Detection; Client Side; Classify and Box Images; Return Coordinates; Render on Browser using JS]
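The "Return Coordinates" step of this flow can be sketched as a small helper that turns raw detections into a JSON payload the browser can draw. The sketch assumes the model reports each box as normalized `[ymin, xmin, ymax, xmax]` values between 0 and 1, as the TensorFlow Object Detection API does. The function names, the payload fields and the 0.5 score cutoff are hypothetical, not taken from the project.

```python
import json


def to_pixel_boxes(boxes, scores, labels, frame_width, frame_height, min_score=0.5):
    """Convert normalized [ymin, xmin, ymax, xmax] boxes into pixel rectangles."""
    results = []
    for (ymin, xmin, ymax, xmax), score, label in zip(boxes, scores, labels):
        if score < min_score:
            continue  # Skip low-confidence detections.
        left = round(xmin * frame_width)
        top = round(ymin * frame_height)
        right = round(xmax * frame_width)
        bottom = round(ymax * frame_height)
        results.append({
            "label": label,
            "score": score,
            "x": left,
            "y": top,
            "width": right - left,
            "height": bottom - top,
        })
    return results


def build_response(boxes, scores, labels, frame_width, frame_height):
    """Serialize the detections as a JSON string to return to the client."""
    detections = to_pixel_boxes(boxes, scores, labels, frame_width, frame_height)
    return json.dumps({"detections": detections})


if __name__ == "__main__":
    print(build_response([[0.1, 0.2, 0.5, 0.6]], [0.9], ["person"], 640, 480))
```

A payload like this could be returned from a Flask route, and the `x`, `y`, `width` and `height` fields map directly onto a rectangle drawn over the video in the browser with JavaScript.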
16. Further down the line
• This application can be used for inventory management using
computer vision. We see segmentation as a possibility for bringing smart
checkouts to convenience stores that may not be as heavy on
infrastructure as Amazon or its competitors.
• Better performance could be achieved by pruning the model.
17. Work Allocation
• We split the work almost equally across all fields.
Priyesh Kaushik              Pranay Mankad
Implementation 50%           Implementation 50%
Model Training               Custom Object Training
Web Interfacing 25%          Web Interfacing 75%
Transfer Learning 75%        Transfer Learning 24%
Documentation 50%            Documentation 50%
Presentation 49%             Presentation 49%