Computer Vision Landscape : Present and Future

Sanghamitra Deb
Staff Data Scientist
Chegg Inc
Data Day Texas, 2023

Outline
• Images
• Enhanced Transcription
o Data Story
o Computer Vision model
o Metrics
o Deployment
• Computer Vision Landscape
• Image Embeddings

Images
Disclaimer: Images are replicas representing real scenarios

Enhanced Transcription
Computer Vision Model → Transcription Service
{"text": "Resonant ocean thicknesses at different forcing frequencies. (a) Location of Europa's first three largest resonant rotational-gravity modes as a function of forcing frequency and ocean thickness, for both zonal (m = 0) and sectoral (m = 2) degree-2 modes…"}
Reference paper: https://agupubs.onlinelibrary.wiley.com/doi/10.1029/2020GL088317

Data Story
Version 1
• Collect data on cropped images
• Build object detection model
• Measure performance
Performance: not good enough
Lessons learned: CV models cannot read. Unless objects are well defined and distinct, detection has a lot of errors.

Data Story
Version 2
Redefine problem: detect bounding boxes for
• Text
• Equations
• Diagrams and Charts
• UI Elements
• Tables
Performance was good.
This model is currently in production.

Data Story
Version 3
Redefine problem: downstream applications need the text that was getting cropped out.
• Header Region
• Side Region
• Footer Region
• Question Region
• UI Elements
• Text
• Equations
• Diagrams and Charts
• Tables

Enhanced Transcription: Version 2
We are extracting bounding boxes for:
• Text
• Equations
• Diagrams and Charts
• UI Elements
• Tables
[Example images: Tables, Text, Equations, UI Elements, Diagrams and Charts]

Building Object Detection Model: Training Pipeline

Metrics: Intersection over Union
Predictions: bounding boxes (BB) and classification labels. IoU is computed for each predicted bounding box against the ground-truth box.
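As a minimal sketch of the metric, IoU is the area where two boxes overlap divided by the area of their union. The function name `iou` and the `(x_min, y_min, x_max, y_max)` box layout are assumptions made for illustration, not the format used in the talk's pipeline.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes.

    Each box is (x_min, y_min, x_max, y_max); this layout is an assumption.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    predicted = (0, 0, 2, 2)
    ground_truth = (1, 1, 3, 3)
    print(iou(predicted, ground_truth))
```

An IoU of 1.0 means the predicted box matches the ground truth exactly, and 0.0 means the two boxes do not overlap.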

Metrics: mAP@iou=0.5
Metrics are computed for a given IoU threshold.
Changing the IoU threshold can change whether a prediction counts as a true positive or a false positive.
Average precision (AP) is computed for each class at an IoU threshold of 0.5. mAP is the mean of AP across all classes.
mAP@iou=0.5 >= 0.8
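The computation can be sketched as follows, assuming each detection has already been marked as a true or false positive by matching it to a ground-truth box of the same class at IoU >= 0.5. The function names and the input layout are hypothetical, and the area under the precision-recall curve uses all-point interpolation, which is one common convention rather than necessarily the one used in the talk.

```python
def average_precision(detections, num_ground_truth):
    """AP for one class.

    detections: list of (confidence, is_true_positive) pairs, where the flag
    was decided beforehand at a fixed IoU threshold (0.5 here, by assumption).
    num_ground_truth: number of ground-truth boxes of this class.
    """
    if num_ground_truth == 0:
        return 0.0

    # Walk through detections from most to least confident.
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    precisions, recalls = [], []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)

    # Make precision non-increasing from right to left (all-point interpolation).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum precision weighted by each step in recall.
    ap, prev_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


def mean_average_precision(per_class):
    """per_class: dict mapping class name -> (detections, num_ground_truth)."""
    aps = [average_precision(d, n) for d, n in per_class.values()]
    return sum(aps) / len(aps) if aps else 0.0
```

Computing AP per class first and then averaging means that a rare class such as Tables counts as much toward mAP as a frequent class such as Text.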

Collecting Training Data: Labelbox
• Retrieve archival images.
• Create annotation project.
• Write annotation guide. Make sure 5-10% of the data is reviewed for quality checks.
• Look for inter-annotator agreement on a small dataset.
• Collect labelled data.
• Do some spot checks for annotation quality.
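One way to quantify inter-annotator agreement is Cohen's kappa over the class labels that two annotators assigned to the same regions. This is a sketch under that assumption: the talk does not say which agreement measure was used, and `cohens_kappa` is a hypothetical helper, not part of Labelbox.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items")

    n = len(labels_a)
    # Fraction of items on which the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        return 1.0  # Both annotators used a single identical label throughout.
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    annotator_1 = ["Text", "Text", "Tables", "Equations"]
    annotator_2 = ["Text", "Text", "Tables", "Equations"]
    print(cohens_kappa(annotator_1, annotator_2))
```

A kappa near 1 suggests the annotation guide is being read consistently. A value near 0 means agreement is no better than chance, which is a signal to revise the guide before collecting more labels.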

Region-based Convolutional Neural Networks (R-CNN)
Cons: Very slow. Propagating thousands of region proposals (RPs) through the CNN and classifier takes a very long time.

Vanishing/Exploding Gradients
Operation: computing the gradients of the “front” (earliest) layers in an n-layer network multiplies n small or large numbers together.
When the network is deep, the product of n small numbers approaches zero (the gradient vanishes).
When the network is deep, the product of n large numbers becomes too large (the gradient explodes).
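The effect is easy to see numerically. The sketch below assumes, purely for illustration, that every layer contributes the same constant factor to the gradient; in a real network the factors vary per layer, and `gradient_scale` is a hypothetical helper.

```python
def gradient_scale(per_layer_factor, n_layers):
    """Product of n identical per-layer factors, as a stand-in for the
    chain-rule product that reaches the earliest layer."""
    scale = 1.0
    for _ in range(n_layers):
        scale *= per_layer_factor
    return scale


if __name__ == "__main__":
    # Factors below 1 shrink the gradient toward zero (about 9e-16 here).
    print(gradient_scale(0.5, 50))
    # Factors above 1 blow it up (about 6e8 here).
    print(gradient_scale(1.5, 50))
```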

ResNet (2015)
Right: regular CNN. Left: fit a residual instead of the desired function H(x) directly. A skip (shortcut) connection adds the input x to the output of a few weight layers.
Layers can be stacked up to about 150 layers deep.
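In equation form, the block learns a residual F(x) = H(x) - x and outputs F(x) + x. A minimal sketch on plain Python lists, where `residual_fn` is a hypothetical stand-in for the few weight layers:

```python
def residual_block(x, residual_fn):
    """Return F(x) + x, where F is the learned residual function.

    x: list of floats (the block input).
    residual_fn: callable mapping x to a list of the same length; it stands
    in for the weight layers and is an assumption for illustration.
    """
    fx = residual_fn(x)
    # The skip connection adds the untouched input back to the layers' output.
    return [f + xi for f, xi in zip(fx, x)]


if __name__ == "__main__":
    # If the layers learn F(x) = 0, the block reduces to the identity mapping.
    print(residual_block([1.0, 2.0], lambda v: [0.0 for _ in v]))
```

Because the identity path is always available, extra layers can fall back to passing their input through unchanged, which is part of what makes very deep stacks trainable.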

YOLO (You Only Look Once)
Unified Detection:
• Uses features from the entire image for prediction.
• Predicts bounding boxes across all classes simultaneously.
• Bounding boxes and classes are predicted in one shot, i.e., by the same network.
Divide input into a grid → class probability map → final detections
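The grid step can be sketched as follows. In the original YOLO formulation, the grid cell that contains an object's center is responsible for predicting that object. The function name, the box layout, and the default 7 x 7 grid are assumptions for illustration; later YOLO versions use different grid sizes and anchor schemes.

```python
def responsible_cell(box, image_width, image_height, grid_size=7):
    """Return (row, col) of the grid cell containing the box center.

    box: (x_min, y_min, x_max, y_max) in pixels (assumed layout).
    """
    x_min, y_min, x_max, y_max = box
    center_x = (x_min + x_max) / 2
    center_y = (y_min + y_max) / 2

    # Scale the center into grid units; clamp so a center on the right or
    # bottom edge still falls in the last cell.
    col = min(int(center_x / image_width * grid_size), grid_size - 1)
    row = min(int(center_y / image_height * grid_size), grid_size - 1)
    return row, col


if __name__ == "__main__":
    print(responsible_cell((0, 0, 64, 64), image_width=448, image_height=448))
```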

Why Yolo?
o Speed: YOLO runs faster than most other object detection algorithms. A smaller variant of the model can process 155 frames per second.
o Accuracy: State-of-the-art performance on several object detection datasets, including COCO.
o Open source code is available in multiple deep learning frameworks.
o Code is well developed and easy to use.
Limitations: recall is poor for small objects that are grouped together.

YOLOv5 PyTorch codebase
https://github.com/ultralytics/yolov5
Let's look into the repo.
Training (--weights yolov5s.pt selects the model size, --batch 16 sets the batch size):
python train.py --img 640 --batch 16 --epochs 3 --data coco128.yaml --weights yolov5s.pt
Inference:
python detect.py --weights yolov5s.pt --source image.jpg

Deployment
• Load PyTorch model and predict bounding boxes.
• Crop image with bounding box output.
• Send cropped image to transcription service.
• API output: {Transcribed text, Bounding box}
Version 2
Version 3
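The deployment flow above can be sketched as a small function. Everything here is hypothetical: `detect` stands in for the PyTorch model, `transcribe` stands in for the external transcription service, and the image is a plain list of pixel rows so the example runs without any imaging library.

```python
def crop(image, box):
    """Crop a list-of-rows image to box = (x_min, y_min, x_max, y_max) in pixels."""
    x_min, y_min, x_max, y_max = box
    return [row[x_min:x_max] for row in image[y_min:y_max]]


def enhanced_transcription(image, detect, transcribe):
    """Detect boxes, crop each one, transcribe the crop, and collect the results.

    detect: callable(image) -> list of boxes (stand-in for the detection model).
    transcribe: callable(cropped_image) -> str (stand-in for the external service).
    """
    results = []
    for box in detect(image):
        text = transcribe(crop(image, box))
        # Mirrors the API output on the slide: transcribed text plus its box.
        results.append({"text": text, "bounding_box": list(box)})
    return results


if __name__ == "__main__":
    image = [[row * 10 + col for col in range(4)] for row in range(4)]
    fake_detect = lambda img: [(1, 1, 3, 3)]
    fake_transcribe = lambda cropped: f"{len(cropped)}x{len(cropped[0])} crop"
    print(enhanced_transcription(image, fake_detect, fake_transcribe))
```

Passing the model and the service in as callables keeps the cropping logic testable without a GPU or network access, which helps when load testing the combined service.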

Measuring effectiveness of the Enhanced Transcription
Annotation Task: Labelbox
Which Transcription is better?

Improves Coverage
When the entire image was sent to the transcription service, more than 5% of the images returned “no content found”. Cropping the image using object detection removes low-quality surrounding elements, which recovers the transcription for 2.7% of images.

Diagram Embeddings
[Example diagram topics: Pulley diagram, Newton’s second law, Friction, Acceleration, Moment of Inertia]
Extract diagram embeddings from pre-trained models such as ResNet.
Use cases
• Similarity-based applications such as recommendation systems.
• Converting a general predictive model into a multimodal model with text, image, and structured data features.
• Categorizing diagrams and creating a diagram ontology to produce rich metadata.
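For the similarity use case, a common approach is to rank diagrams by the cosine similarity of their embeddings. The sketch below uses made-up three-dimensional vectors in place of real ResNet embeddings, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(query, catalog, top_k=2):
    """Names of the top_k catalog entries closest to the query embedding.

    catalog: dict mapping diagram name -> embedding vector.
    """
    ranked = sorted(
        catalog.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]


if __name__ == "__main__":
    # Toy embeddings, invented for illustration only.
    catalog = {
        "pulley": [1.0, 0.0, 0.0],
        "friction": [0.9, 0.1, 0.0],
        "moment_of_inertia": [0.0, 0.0, 1.0],
    }
    print(most_similar([1.0, 0.0, 0.0], catalog))
```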

Takeaways
o Computer Vision models can see, but they cannot read.
o Doing a deep dive on metrics before building the model is good practice.
o YOLO performs well out of the box. It's open source and readily available, with very low latency.
o Building a service that combines outputs from external vendors requires careful load testing.
o Having a vision beyond immediate deliverables creates avenues for overall enrichment of ML products.

Thank You
@sangha_deb
sdeb@chegg.com

References
• Computer Vision Models: https://medium.com/augmented-startups/top-6-object-detection-algorithms-b8e5c41b952f
• YOLO object detection: https://www.v7labs.com/blog/yolo-object-detection#h1
• mAP: https://towardsdatascience.com/map-mean-average-precision-might-confuse-you-5956f1bfa9e2
• R-FCN: https://arxiv.org/pdf/1605.06409.pdf
• YOLOv5: https://arxiv.org/pdf/2108.11539.pdf
