Millions of people all around the world learn with Chegg. Education at Chegg is powered by the depth and diversity of our content. A huge part of that content is in the form of images, uploaded either by students or by content creators. These images contain text that is extracted using a transcription service. Uploaded images are often noisy, which leads to irrelevant characters or words in the transcribed text. Using object detection techniques, we develop a service that extracts the relevant parts of the image and uses a transcription service to get clean text. In the first part of the presentation, I will talk about building an object detection model using YOLO for cropping and masking images to obtain cleaner text from transcription. YOLO is a deep learning object detection and recognition framework that produces highly accurate results with low latency. In the next part of my presentation, I will talk about building the Computer Vision landscape at Chegg. Starting from images of academic materials that are composed of elements such as text, equations, and diagrams, we create a pipeline for extracting these image elements. Using state-of-the-art deep learning techniques, we create embeddings for these elements to enhance downstream machine learning models such as content quality and similarity.
2. Outline
• Images
• Enhanced Transcription
o Data Story
o Computer Vision model
o Metrics
o Deployment
• Computer Vision Landscape
• Image Embeddings
4. Enhanced Transcription
Computer Vision Model → Transcription Service
{"text": "Resonant ocean thicknesses at different forcing frequencies. (a) Location of Europa's
first three largest resonant rotational-gravity modes as a function of forcing frequency and
ocean thickness, for both zonal (m = 0) and sectoral (m = 2) degree-2 modes….."}
Reference paper: https://agupubs.onlinelibrary.wiley.com/doi/10.1029/2020GL088317
10. Data Story
Version 1
• Collect data on cropped images
• Build object detection model
• Measure performance
Performance: Not good enough
Lessons learned: CV models cannot read. Unless objects are well defined and distinct, detection has a lot of errors.
Version 2
Redefine problem --- Detect bounding boxes for
• Text
• Equations
• Diagrams and Charts
• UI Elements
• Tables
Performance was good.
This model is currently in production.
Version 3
Redefine problem --- Downstream applications need the text that was getting cropped out.
• Header Region
• Side Region
• Footer Region
• Question Region
• UI Elements
• Text
• Equations
• Diagrams & Charts
• Tables
11. Enhanced Transcription: Version 2
We are extracting Bounding Boxes.
• Text
• Equations
• Diagrams and Charts
• UI Elements
• Tables
[Figure: example image with bounding boxes labeled "Tables" and "Text"]
15. Metrics: Intersection over Union
Predictions: bounding boxes (BB) and classification labels. IoU is computed for each predicted bounding box against a ground-truth box.
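As a minimal sketch of the metric named on this slide, IoU for two axis-aligned boxes can be computed as below. The `(x_min, y_min, x_max, y_max)` box format and the function name `iou` are assumptions made for illustration; the talk does not specify the representation used in the actual service.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap rectangle (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    # Union = sum of both areas minus the part counted twice.
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


# Two 2x2 boxes overlapping in a 1x1 square: 1 / (4 + 4 - 1).
example = iou((0, 0, 2, 2), (1, 1, 3, 3))
```

An IoU of 1.0 means the predicted box matches the ground truth exactly, and 0.0 means no overlap at all.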
16. Metrics: mAP@iou=0.5
Metrics are computed for a given IoU threshold.
Changing the IoU threshold can change whether a given prediction counts as a true positive or a false positive.
Average precision is computed for each class at an IoU threshold of 0.5. mAP is the mean across all classes.
mAP@iou=0.5 >= 0.8
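The following is a rough sketch of how average precision and mAP could be computed once each prediction has already been marked as a true or false positive at IoU >= 0.5. It uses the all-point area under the precision-recall curve, which is one common convention and not necessarily the one used in the talk. The function names and the input format are assumptions.

```python
def average_precision(predictions, num_ground_truth):
    """AP for one class.

    predictions: list of (confidence, is_true_positive) pairs, where the
    true/false positive flag was decided upstream at a fixed IoU threshold.
    num_ground_truth: number of ground-truth boxes for this class.
    """
    if num_ground_truth == 0:
        return 0.0

    # Rank predictions from most to least confident.
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)

    tp = fp = 0
    precisions, recalls = [], []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)

    # Make precision non-increasing as recall grows (precision envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum rectangle areas under the precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


def mean_average_precision(per_class):
    """per_class: dict mapping class name -> (predictions, num_ground_truth)."""
    aps = [average_precision(preds, n_gt) for preds, n_gt in per_class.values()]
    return sum(aps) / len(aps) if aps else 0.0
```

For example, a class with two ground-truth boxes and ranked predictions TP, FP, TP yields an AP of 5/6 under this convention.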
17. Collecting Training Data: Labelbox
Retrieve archival images.
Create annotation project.
Write annotation guide. Make sure 5-10% of the data is reviewed for quality checks.
Look for inter-annotator agreement on a small dataset.
Collect labelled data.
Do some spot checks for annotation quality.
19. Region-based Convolutional Neural Networks (R-CNN)
Cons: Very slow --- propagating thousands of region proposals (RPs) through the CNN and classifier takes a very long time.
20. Vanishing/Exploding Gradients
Operation --- multiplying n small / large numbers to compute gradients of the “front” layers in an n-layer network.
When the network is deep, the product of n small numbers shrinks toward zero (vanished).
When the network is deep, the product of n large numbers becomes too large (exploded).
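A small numeric illustration of the effect described on this slide: repeatedly multiplying by a factor below or above 1 either collapses or blows up the result. The factors 0.5 and 1.5 and the depth of 50 are arbitrary values chosen for the example, not numbers from the talk.

```python
def product_of_factors(factor, n_layers):
    """Multiply the same per-layer factor n_layers times, as in a chained gradient."""
    result = 1.0
    for _ in range(n_layers):
        result *= factor
    return result


# With 50 layers, a per-layer factor of 0.5 shrinks the gradient to almost nothing,
# while a per-layer factor of 1.5 grows it by many orders of magnitude.
vanished = product_of_factors(0.5, 50)
exploded = product_of_factors(1.5, 50)
print(f"0.5^50 = {vanished:.3e}, 1.5^50 = {exploded:.3e}")
```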
21. ResNet (2015)
Right: regular CNN. Left: fit a residual F(x) = H(x) - x instead of the desired function H(x) directly. A skip / shortcut connection adds the input x to the output of a few weight layers.
Layers can be stacked to be 150 layers deep.
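To make the skip connection concrete, here is a toy sketch in plain Python where the block output is F(x) + x. The `weight_layers` argument stands in for the stacked weight layers and is purely hypothetical; a real ResNet block would use convolutions, normalization, and nonlinearities.

```python
def residual_block(x, weight_layers):
    """Return F(x) + x, where F is the function computed by the weight layers."""
    fx = weight_layers(x)
    return [f + xi for f, xi in zip(fx, x)]


def zero_layers(x):
    """Weight layers that have learned nothing: F(x) = 0 for every input."""
    return [0.0 for _ in x]


def small_update(x):
    """Weight layers that learn a small correction to the input."""
    return [0.1 * value for value in x]


# If F(x) = 0, the block is exactly the identity, so stacking more blocks
# cannot make the signal worse than passing x straight through.
identity_out = residual_block([1.0, 2.0], zero_layers)
updated_out = residual_block([1.0, 2.0], small_update)
```

The design point is that the layers only need to learn the difference from the identity, which is what lets very deep stacks remain trainable.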
23. YOLO (You Only Look Once)
Unified Detection ---
• Uses features from the entire image for prediction.
• Predicts bounding boxes across all classes simultaneously.
• Bounding boxes and classes are predicted in one shot, i.e., by the same network.
Divide input into grid → class probability map → final detections
25. Why YOLO?
o Speed: YOLO runs faster than other object detection algorithms. A smaller model is able to process 155 frames per second.
o Accuracy: State-of-the-art performance on several object detection datasets, including COCO.
o Open source code is available in multiple deep learning frameworks.
o Code is well developed and easy to use.
Limitations: small objects that are grouped together do not have good recall.
27. Deployment
Load PyTorch model & predict bounding boxes
→ Crop image with bounding box output
→ Send cropped image to transcription service
→ API output: {Transcribed text, Bounding box}
Version 2
Version 3
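The deployment flow on this slide can be sketched roughly as follows. This is not the production code: `detect` and `transcribe` are hypothetical callables standing in for the PyTorch model and the external transcription service, the image is a plain list of pixel rows, and the output keys are assumed names.

```python
def crop(image, box):
    """Crop an image (list of pixel rows) to box = (x_min, y_min, x_max, y_max).

    Max coordinates are treated as exclusive, following Python slicing.
    """
    x_min, y_min, x_max, y_max = box
    return [row[x_min:x_max] for row in image[y_min:y_max]]


def enhanced_transcription(image, detect, transcribe):
    """Detect boxes, crop each one, transcribe the crop, and return both."""
    results = []
    for box in detect(image):
        text = transcribe(crop(image, box))
        results.append({"text": text, "bounding_box": box})
    return results


# Stand-ins for the real model and vendor call (assumptions for this sketch).
def fake_detect(image):
    return [(1, 1, 3, 3)]


def fake_transcribe(cropped):
    return f"{len(cropped)}x{len(cropped[0])} crop"


image = [[r * 4 + c for c in range(4)] for r in range(4)]
output = enhanced_transcription(image, fake_detect, fake_transcribe)
```

Passing the detector and the transcription call in as arguments keeps the pipeline testable without the model weights or the vendor service.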
28. Measuring effectiveness of the Enhanced Transcription
Annotation Task: Labelbox
Which Transcription is better?
29. Improves Coverage
If the entire image was sent to the transcription service, more than 5% of the images returned “no content found”.
Cropping the image using object detection removes low-quality surrounding elements, which facilitates recovery of transcription for 2.7% of images.
31. Diagram Embeddings
Pulley diagram
Newton’s second law
Friction
acceleration
Moment of Inertia
Extract diagram embeddings from pre-trained models such as ResNet.
Use case
• Similarity-based applications --- recommendation systems.
• Converting general predictive models into multimodal models with text, image, and structured data features.
• Categorizing diagrams and creating a diagram ontology to create rich metadata.
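As an illustration of the similarity use case, the sketch below ranks stored diagrams by cosine similarity to a query embedding. Cosine similarity is one common choice and is an assumption here; the talk does not say which measure is used. The three-dimensional vectors and catalog names are made up for the example, whereas real ResNet embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, catalog):
    """Return the name of the catalog entry whose embedding is closest to query."""
    return max(catalog, key=lambda name: cosine_similarity(query, catalog[name]))


# Hypothetical toy embeddings, not real model outputs.
catalog = {
    "pulley": [0.9, 0.1, 0.0],
    "circuit": [0.0, 0.2, 0.9],
}
best_match = most_similar([0.8, 0.2, 0.1], catalog)
```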
32. Takeaways
o Computer Vision models can see but they cannot read.
o Doing a deep dive on metrics ahead of building the model is a good practice.
o YOLO performs well out of the box. It's open source and readily available with very low latency.
o Building a service that combines outputs from external vendors requires careful load testing.
o Having a vision beyond immediate deliverables creates avenues for overall enrichment of ML products.
Classification and Localization --- Done using regression
Selective search --- extracts several thousand region proposals.
Each of these region proposals (RP) is labeled with a class and a ground-truth bounding box.
A pre-trained CNN is used to extract features for the region proposals through forward propagation.
These features are used to predict the class and bounding box of this region proposal using SVMs and linear regression.
ROI pooling is followed by fully connected (FC) layers for classification and bounding box regression. The FC layers after ROI pooling are not shared among different ROIs and take time. This makes R-CNN approaches slow, and the fully connected layers have a large number of parameters.
Fast R-CNN performs the CNN forward propagation once on the entire image.
Faster R-CNN reduces the total number of region proposals by using a region proposal network (RPN) instead of selective search to further improve the speed.
Yolo reasons globally about the full image …
YOLO models treat object detection as a regression problem. The model divides the image into an S × S grid and, for each grid cell, predicts B bounding boxes, confidence for those boxes, and C class probabilities. These predictions are encoded as an S × S × (B ∗ 5 + C) tensor.
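A short worked example of the tensor size may help. The values S = 7, B = 2, C = 20 are the ones used in the original YOLO paper for PASCAL VOC and are shown only for illustration; they are not the settings of the model described in this talk. The factor of 5 covers x, y, width, height, and confidence for each box. The helper `cell_for_center` is a hypothetical name that shows how a box center maps to a grid cell.

```python
def yolo_output_shape(S, B, C):
    """Shape of the prediction tensor: S x S cells, each with B boxes and C class scores."""
    return (S, S, B * 5 + C)


def cell_for_center(x, y, img_w, img_h, S):
    """Return (row, col) of the grid cell containing the point (x, y) in pixels."""
    col = min(int(x / img_w * S), S - 1)
    row = min(int(y / img_h * S), S - 1)
    return (row, col)


shape = yolo_output_shape(S=7, B=2, C=20)      # 2 * 5 + 20 = 30 values per cell
cell = cell_for_center(320, 240, 640, 480, 7)  # center of a 640x480 image
```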
The details of the architecture are beyond the scope of this presentation. YOLO v5 has improvements in data augmentation compared to previous models. ResNet is one of the backbones used in the architecture for extracting features. Transformers are used in the prediction head. Predictions from multiple heads are combined using techniques such as non-max suppression to predict the bounding boxes.
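Non-max suppression, mentioned above, can be sketched as a greedy loop: keep the highest-scoring box, drop any remaining box that overlaps it too much, and repeat. This is a generic textbook version, not YOLO v5's actual implementation; the 0.5 threshold and the `(box, score)` input format are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, iou_threshold=0.5):
    """detections: list of (box, score). Returns the detections that survive."""
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        # Discard boxes that overlap the kept box by at least the threshold.
        remaining = [d for d in remaining if iou(best[0], d[0]) < iou_threshold]
    return kept


detections = [
    ((0, 0, 10, 10), 0.9),
    ((1, 1, 10, 10), 0.8),    # heavy overlap with the first box
    ((20, 20, 30, 30), 0.7),  # separate object
]
kept = non_max_suppression(detections)
```

In practice this is usually run per class, so that overlapping boxes of different classes do not suppress each other.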
Additionally, a ResNet model is trained using image patches cropped from the training data as a classification training set.
Test for I/O contract
Send images that have no text and check for the output.
Make sure there is logging