Computer Vision Landscape: Present and Future
Sanghamitra Deb
Staff Data Scientist
Chegg Inc
Data Day Texas, 2023
Outline
• Images
• Enhanced Transcription
o Data Story
o Computer Vision model
o Metrics
o Deployment
• Computer Vision Landscape
• Image Embeddings
Images
Disclaimer: Images are replicas representing real scenarios
Enhanced Transcription
Computer Vision Model
Transcription Service
{"text": "Resonant ocean thicknesses at different forcing frequencies. (a) Location of Europa's
first three largest resonant rotational-gravity modes as a function of forcing frequency and
ocean thickness, for both zonal (m = 0) and sectoral (m = 2) degree-2 modes…"}
Reference paper: https://agupubs.onlinelibrary.wiley.com/doi/10.1029/2020GL088317
Data Story
Version 1
• Collect data on cropped images
• Build object detection model
• Measure performance
Performance: Not good enough
Lessons learned: CV models cannot read. Unless objects are well defined and distinct, detection has a lot of errors.
Version 2
Redefine problem --- Detect bounding boxes for
• Text
• Equations
• Diagrams and Charts
• UI Elements
• Tables
Performance was good.
This model is currently in production.
Version 3
Redefine problem --- Downstream applications need the text that was getting cropped out.
• Header Region
• Side Region
• Footer Region
• Question Region
• UI Elements
• Text
• Equations
• Diagrams & Charts
• Tables
Enhanced Transcription: Version 2
We extract bounding boxes for:
• Text
• Equations
• Diagrams and Charts
• UI Elements
• Tables
[Figures: example detections for Tables, Text, Equations, UI Elements, and Diagrams and Charts]
Building Object Detection Model: Training Pipeline
What is Object Detection?
Metrics: Intersection over Union
Predictions: bounding boxes (BB) and classification labels. IoU is computed for each bounding box.
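The IoU computation for a single predicted box can be sketched as follows. This is a minimal illustration, assuming boxes are given as (x1, y1, x2, y2) pixel corners; the function name and the example coordinates are made up for this sketch and are not from the talk.

```python
def iou(box_a, box_b):
    """Intersection over Union for two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical example: two 10x10 boxes that overlap on half their width.
predicted = (0, 0, 10, 10)
ground_truth = (5, 0, 15, 10)
score = iou(predicted, ground_truth)  # intersection 50, union 150
```

An IoU of 1.0 means the predicted box matches the ground truth exactly, and 0.0 means no overlap at all.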
Metrics: mAP@iou=0.5
Metrics are computed for a given IoU threshold.
Changing the IoU threshold can change whether a prediction counts as a true positive or a false positive.
Average precision is computed for each class at an IoU threshold of 0.5. mAP is the mean across all classes.
mAP@iou=0.5 >= 0.8
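A rough sketch of how per-class average precision and mAP could be computed. It assumes each prediction has already been marked as a true or false positive by matching it to ground truth at IoU >= 0.5, and it uses all-point interpolation of the precision-recall curve, which is one common convention and not necessarily the one used in the talk. The function name, scores, and flags below are made-up values for illustration.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the precision-recall curve (all-point interpolation)."""
    if num_ground_truth == 0:
        return 0.0
    # Visit predictions from most to least confident.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))
    # Make precision non-increasing as recall grows (interpolation step).
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# Hypothetical predictions for two classes, flagged at IoU >= 0.5.
ap_per_class = {
    "Text": average_precision([0.9, 0.8, 0.7], [True, False, True], 2),
    "Tables": average_precision([0.95, 0.6], [True, True], 2),
}
mean_ap = sum(ap_per_class.values()) / len(ap_per_class)
```

In this toy data, the single false positive ranked above a true positive lowers the AP for "Text", while "Tables" has no false positives.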
Collecting Training Data: Labelbox
Retrieve archival images.
Create annotation project.
Write annotation guide. Make sure 5-10% of the data is reviewed for quality checks.
Look for inter-annotator agreement on a small dataset.
Collect labelled data.
Do some spot checks for annotation quality.
Object Detection Models
Region-based Convolutional Neural Networks (R-CNN)
Cons: Very slow --- propagating thousands of region proposals (RPs) through the CNN and classifier takes a very long time
Vanishing/Exploding Gradients
Operation --- computing gradients of the “front” layers in an n-layer network multiplies n small or large numbers.
When the network is deep, the product of n small numbers approaches zero (vanishes).
When the network is deep, the product of n large numbers becomes too large (explodes).
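The arithmetic behind this can be seen with a toy calculation. The sketch below simply multiplies the same factor once per layer; the factors 0.5 and 1.5 and the depth of 50 are arbitrary values chosen for illustration, and real per-layer gradient factors are matrices, not single numbers.

```python
def product_of_factors(factor, n_layers):
    """Multiply one per-layer gradient factor across n_layers layers."""
    result = 1.0
    for _ in range(n_layers):
        result *= factor
    return result


# Illustrative values only: 50 layers, each scaling the gradient by 0.5 or 1.5.
vanished = product_of_factors(0.5, 50)  # roughly 8.9e-16
exploded = product_of_factors(1.5, 50)  # roughly 6.4e8
```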
ResNet (2015)
Right: regular CNN. Left: fit a residual F(x) = H(x) - x instead of the desired function
H(x) directly. A skip / shortcut connection adds the input x to
the output of a few weight layers.
Layers can be stacked to about 150 layers deep
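A minimal sketch of a residual block in plain Python, assuming two fully connected weight layers with a ReLU between them (real ResNet blocks use convolutions and batch normalization). The helper names and the toy weights are made up for this illustration.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def linear(weights, v):
    """Matrix-vector product; weights is a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def residual_block(x, w1, w2):
    """y = relu(F(x) + x), where F is two weight layers with a ReLU between."""
    f_x = linear(w2, relu(linear(w1, x)))
    # The skip connection adds the unchanged input to the learned residual.
    return relu([f + xi for f, xi in zip(f_x, x)])


x = [1.0, 2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]
y = residual_block(x, zeros, zeros)  # F(x) = 0, so the block passes x through
```

With all weights at zero the block reduces to the identity for non-negative inputs, which is why adding more such blocks does not have to hurt a deep network: each block only needs to learn a correction to its input.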
Plain Network vs ResNet
YOLO (You Only Look Once)
Unified Detection ---
• Uses features from the entire image for prediction.
• Predicts bounding boxes across all classes simultaneously.
• Bounding boxes and classes are predicted in one shot, i.e., by the same network.
Figure: divide input into grids, class probability map, final detections
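The grid encoding described in the speaker notes, an S × S × (B ∗ 5 + C) tensor, can be sketched as below. The values S = 7, B = 2, C = 20 are the ones used in the original YOLO paper and are shown only as an example; later versions such as YOLOv5 use a different output layout. The function names are hypothetical.

```python
def yolo_output_shape(grid_size, boxes_per_cell, num_classes):
    """Shape of the prediction tensor in the original YOLO formulation."""
    # Each box carries 5 numbers: x, y, w, h, and a confidence score.
    return (grid_size, grid_size, boxes_per_cell * 5 + num_classes)


def responsible_cell(x_center, y_center, grid_size):
    """Grid cell (row, col) that contains a normalized box center in [0, 1]."""
    col = min(int(x_center * grid_size), grid_size - 1)
    row = min(int(y_center * grid_size), grid_size - 1)
    return row, col


shape = yolo_output_shape(grid_size=7, boxes_per_cell=2, num_classes=20)
cell = responsible_cell(0.5, 0.2, grid_size=7)
```

Here `shape` is (7, 7, 30), and the object centered at (0.5, 0.2) is assigned to the cell in row 1, column 3.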
Yolo v5 network
Why Yolo?
o Speed: YOLO runs faster than other object detection algorithms. The smaller model
can process 155 frames per second.
o Accuracy: State-of-the-art performance on several object detection datasets, including
COCO.
o Open source code is available in multiple deep learning frameworks.
o Code is well developed and easy to use.
Limitations: Recall is poor for small objects that are grouped together.
YOLOv5 PyTorch codebase
https://github.com/ultralytics/yolov5
Let's look into the repo
python train.py --img 640 --batch 16 --epochs 3 --data coco128.yaml --weights yolov5s.pt
• --weights yolov5s.pt: model size
• --batch 16: batch size
python detect.py --weights yolov5s.pt --source image.jpg
Deployment
• Load PyTorch model & predict bounding boxes
• Crop image with bounding box output
• Send cropped image to transcription service
• API output: {Transcribed text, Bounding box}
Version 2
Version 3
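The crop-and-transcribe flow above might look roughly like the sketch below. It assumes detections arrive as normalized (x_center, y_center, width, height) boxes, which is the format YOLOv5 writes to its label text files, and it represents the image as a nested list of pixel rows to stay dependency-free. `transcribe_stub`, the output keys, and all function names are hypothetical stand-ins, not the production service or its API.

```python
def xywhn_to_pixels(box, image_width, image_height):
    """Convert a normalized (x_center, y_center, w, h) box to pixel corners."""
    xc, yc, w, h = box
    x1 = max(0, round((xc - w / 2) * image_width))
    y1 = max(0, round((yc - h / 2) * image_height))
    x2 = min(image_width, round((xc + w / 2) * image_width))
    y2 = min(image_height, round((yc + h / 2) * image_height))
    return x1, y1, x2, y2


def crop(image, pixel_box):
    """Cut the box out of an image stored as a list of pixel rows."""
    x1, y1, x2, y2 = pixel_box
    return [row[x1:x2] for row in image[y1:y2]]


def transcribe_stub(cropped):
    # Hypothetical placeholder for the external transcription service call.
    return "<transcribed text>"


def enhanced_transcription(image, detections, transcribe=transcribe_stub):
    """Crop each detected region, transcribe it, and return text plus box."""
    height, width = len(image), len(image[0])
    results = []
    for label, box in detections:
        pixel_box = xywhn_to_pixels(box, width, height)
        text = transcribe(crop(image, pixel_box))
        results.append({"label": label, "text": text, "bounding_box": pixel_box})
    return results


image = [[0] * 8 for _ in range(4)]  # toy image: 8 px wide, 4 px tall
detections = [("Text", (0.5, 0.5, 0.5, 0.5))]
output = enhanced_transcription(image, detections)
```

Passing the transcription call in as a parameter keeps the cropping logic testable without the external vendor, which also makes it easier to check the empty-output case separately.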
Measuring effectiveness of the Enhanced Transcription
Annotation Task: Labelbox
Which Transcription is better?
Improves Coverage
If the entire image was sent to the
transcription service, more than 5% of the
images returned “no content found”.
Cropping the image using object detection
removes low-quality surrounding elements.
This facilitates recovery of transcription for
2.7% of images.
Computer Vision Landscape
Diagram Embeddings
Pulley diagram
Newton’s second law
Friction
acceleration
Moment of Inertia
Extract diagram embeddings from pre-trained models such as ResNet.
Use cases
• Similarity-based applications --- recommendation systems.
• Converting general predictive models into multimodal models with text, image, and structured data features.
• Categorizing diagrams and creating a diagram ontology to create rich metadata.
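For the similarity-based use case, a common approach is to compare embeddings with cosine similarity. The sketch below uses made-up 3-dimensional vectors as stand-ins; embeddings taken from a pre-trained ResNet would have far more dimensions. The catalog keys and function names are assumptions for illustration only.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query, catalog):
    """Name of the catalog entry whose embedding is closest to the query."""
    return max(catalog, key=lambda name: cosine_similarity(query, catalog[name]))


# Toy embeddings; real ones would come from a pre-trained model.
catalog = {
    "pulley": [0.9, 0.1, 0.0],
    "friction": [0.1, 0.9, 0.2],
    "moment_of_inertia": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]
best = most_similar(query, catalog)
```

The query vector points mostly along the first axis, so the "pulley" entry is returned as the nearest diagram.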
Takeaways
o Computer Vision models can see but they cannot read.
o Doing a deep dive on metrics ahead of building the model is a good practice.
o YOLO performs well out of the box. It's open source and readily available with
very low latency.
o Building a service that combines outputs from external vendors requires careful
load testing.
o Having a vision beyond immediate deliverables creates avenues for overall
enrichment of ML products.
Thank You
@sangha_deb
sdeb@chegg.com
References
• Computer Vision Models: https://medium.com/augmented-startups/top-6-object-detection-algorithms-b8e5c41b952f
https://www.v7labs.com/blog/yolo-object-detection#h1
• https://towardsdatascience.com/map-mean-average-precision-might-confuse-you-5956f1bfa9e2
• R-FCN: https://arxiv.org/pdf/1605.06409.pdf
• YOLOV5: https://arxiv.org/pdf/2108.11539.pdf


Editor's Notes

  • #15 Classification and Localization --- Done using regression
  • #20 Selective search extracts several thousand region proposals. Each region proposal (RP) is labeled with a class and a ground-truth bounding box. A pre-trained CNN extracts features for each region proposal through forward propagation. These features are used to predict the class and bounding box of the region proposal using SVMs and linear regression. ROI pooling is followed by fully connected (FC) layers for classification and bounding box regression. The FC layers after ROI pooling are not shared among different ROIs and take time. This makes R-CNN approaches slow, and the fully connected layers have a large number of parameters. Fast R-CNN performs the CNN forward propagation once on the entire image. Faster R-CNN reduces the total number of region proposals by using a region proposal network (RPN) instead of selective search to further improve speed.
  • #24 Yolo reasons globally about the full image … YOLO models treat object detection as a regression problem. It divides the image into an S × S grid and for each grid cell predicts B bounding boxes, confidence for those boxes, and C class probabilities. These predictions are encoded as an S × S × (B ∗ 5 + C) tensor.
  • #25 The details of the architecture are beyond the scope of this presentation. YOLO v5 has improvements in data augmentation compared to previous models. ResNet is one of the backbones used in the architecture for extracting features. Transformers are used in the prediction head. Predictions from multiple heads are ensembled using techniques such as non-max suppression to predict the bounding boxes. Additionally, a ResNet model is trained using image patches cropped from the training data as a classification training set.
  • #28 Test the I/O contract. Send images that have no text and check the output. Make sure there is logging.
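The note on the YOLO v5 architecture mentions non-max suppression for combining overlapping predictions. A minimal greedy version is sketched below, assuming boxes are (x1, y1, x2, y2) corners and an IoU threshold of 0.5; the example boxes and scores are invented, and production code would normally use a library implementation.

```python
def iou(a, b):
    """Intersection over Union for two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of boxes kept after greedy non-max suppression."""
    # Visit boxes from most to least confident.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not heavily overlap a box already kept.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Hypothetical detections: the first two boxes cover nearly the same region.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
```

The second box overlaps the first with an IoU of about 0.68, so it is suppressed, and the indices kept are 0 and 2.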