This talk is about designing a deep-learning model pipeline for OCR (text segmentation and text extraction) using TensorFlow, targeting domain-specific information extraction
3. Flashback
- We deal with a lot of text-related problems/challenges, such as document clustering/classification, information extraction, etc.
- Form filling: https://github.com/Imaginea/i-tagger
- Infinity: a Pramati-level innovation competition
- 2018 with GANs @ https://github.com/dhiraa/asariri
- 2019 with Audio and CNNs
- Our explorations and code bases were scattered.
- We organized all our explorations under one code base called VitaFlow
- Planned R&D
- Information Extraction with an aim to generalize, i.e. independent of the dataset (Text + Image)
- Audio-related Android applications (TensorFlow Lite + Audio + Android) (need funding and resources ;))
4. Introduction
- Problem
- A pipeline to train and deploy DL models
- Address domain specific information extraction
- Traditional IE depends on rule-based engines
- Often not easily extensible to new data
- Solution
- Design a ML/DL model pipeline
- Plug and play modules at each stage
- Data sets
- Annotations
- Pre-processing / Post-processing
- Training and serving the models
- Metrics to evaluate
- Feedback loop
5. Pipeline (a code sketch of chaining these stages follows the list)
1. Raw Images
2. Image Annotations (Bounding boxes)
3. Text Detection - Text Localisation / Document Orientation Analysis + Fix
a. EAST
b. DOCT2TEXT
4. Text-Cleaner/Binarization
5. Text Recognition - OCR
a. Calamari (CNN+LSTM models)
b. Tesseract
6. Text Annotations
7. ML / Statistical Inference / Rules
8. Domain Specific Extraction
9. Data Store
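A minimal sketch of how such plug-and-play stages could be chained; the `Stage` interface, class names, and stub outputs are illustrative assumptions, not the actual VitaFlow API:

```python
# Hypothetical plug-and-play pipeline sketch; the Stage interface and
# stage names are illustrative, not the actual VitaFlow API.

class Stage:
    """One pluggable step: takes a dict of artifacts, returns it enriched."""
    def run(self, data: dict) -> dict:
        raise NotImplementedError

class TextDetection(Stage):    # e.g. backed by EAST
    def run(self, data):
        data["boxes"] = []     # stub: a real stage would predict bounding boxes
        return data

class TextRecognition(Stage):  # e.g. backed by Calamari or Tesseract
    def run(self, data):
        data["text"] = ""      # stub: a real stage would run OCR on the boxes
        return data

def run_pipeline(stages, data):
    for stage in stages:
        data = stage.run(data)  # each stage is independently swappable
    return data

result = run_pipeline([TextDetection(), TextRecognition()], {"image": None})
```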
7. [Sample receipt image with OCR output: vendor "tan chay yee", date 09/01/2019 8:01:11 PM, total amount 31.00, …]
[Architecture diagram: (Natural Scene) Text Segmentation (EAST/FOTS) → extracted text-line segments → Image-to-Text CNN + LSTM models (OCR) → Information Extraction (rules / statistical inference / ML-DL models, plus positional information) → Domain-Specific Information Extraction, e.g. Vendor: tan chay yee, Total: 31, Date: 09/01/2019]
9. OCR
- OCR: Text Localization + Text Extraction
- Text Localization
- EAST (https://arxiv.org/abs/1704.03155)
- FOTS (https://arxiv.org/abs/1801.01671)
- Text Extraction
- Calamari (https://github.com/Calamari-OCR/calamari)
- Tesseract (a minimal usage sketch follows)
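For the Tesseract route, recognition on a cropped text region can be as small as the following sketch (assuming the pytesseract wrapper is installed; the image path is illustrative):

```python
# Sketch: OCR a cropped text-line image with Tesseract via pytesseract.
# The image path is illustrative.
from PIL import Image
import pytesseract

line_image = Image.open("receipt_line.png")  # a cropped text region
print(pytesseract.image_to_string(line_image))
```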
10. ICDAR Dataset
- ICDAR 2015
Natural images with incidental scene text
- ICDAR 2019
Receipts and invoices
- What's unique about data preparation for OCR text recognition?
- It's Text + Image
- Format of images: JPEG or PNG
- Ground truth:
● One text file per image
● UTF-8 format
● Each line specifies the coordinates of one word's bounding box and its transcription in a comma-separated format (a small parsing sketch follows the example)
img_01.png <-> img_01.txt
x1_1,y1_1,x2_1,y2_1,x3_1,y3_1,x4_1,y4_1,transcript_1
x1_2,y1_2,x2_2,y2_2,x3_2,y3_2,x4_2,y4_2,transcript_2
x1_3,y1_3,x2_3,y2_3,x3_3,y3_3,x4_3,y4_3,transcript_3
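A small sketch of parsing one such ground-truth line into a bounding box and transcript (assumes the comma-separated format above):

```python
# Sketch: parse an ICDAR-style ground-truth line.
# Assumes the comma-separated format shown above.
def parse_gt_line(line: str):
    parts = line.strip().split(",")
    coords = [int(p) for p in parts[:8]]         # x1,y1 ... x4,y4
    transcript = ",".join(parts[8:]).strip()     # transcript may contain commas
    box = list(zip(coords[0::2], coords[1::2]))  # [(x1,y1), ..., (x4,y4)]
    return box, transcript

box, text = parse_gt_line("38,43,920,43,920,120,38,120,Total Amount : 31.00")
```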
11. Data Preparation
Input:
- RGB color image (height×width×3) or a grayscale image (height×width×1)
Output:
- Image matrix (height×width×3)
- Score map matrix (height×width×1): per-pixel text/no-text confidence
- Geometry map matrix (height×width×5): four distances to the box edges plus a rotation angle; a bit complicated, expect a post on this soon! (example tensor shapes are sketched below)
https://www.jeremyjordan.me/semantic-segmentation/
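As a rough sketch of one training example with these shapes (zero/random placeholders, not real annotations; the 512×512 size is an assumption):

```python
# Sketch of one training example's tensors, matching the shapes above.
# Values are placeholders, not real annotations; 512x512 is an assumed size.
import numpy as np

h, w = 512, 512
image = np.random.randint(0, 256, (h, w, 3), dtype=np.uint8)  # RGB input
score_map = np.zeros((h, w, 1), dtype=np.float32)  # per-pixel text confidence
geo_map = np.zeros((h, w, 5), dtype=np.float32)    # 4 edge distances + angle
```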
15. ResNet
- http://ethereon.github.io/netscope/#/gist/db945b393d40bfa26006
- http://teleported.in/posts/decoding-resnet-architecture/
- Increases the depth of the network without hurting its generalization power
- The residual block can be depicted mathematically as (a minimal Keras sketch of this block follows):
H(x) = F(x) + x, where F(x) = W2*relu(W1*x + b1) + b2
- During training, the residual network learns the weights of its layers such that if the identity mapping were optimal, all the weights of F(x) get set to 0. In effect F(x) becomes 0, so x is mapped directly to H(x) and no correction needs to be made. These identity mappings are what let the network grow deep. If there is a deviation from the optimal identity mapping, the weights and biases of F(x) are learned to adjust for it. Think of F(x) as learning how to adjust the predictions to match the actuals.
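A minimal Keras residual block matching H(x) = F(x) + x (layer widths and kernel sizes are illustrative, not the exact stem used in the pipeline):

```python
# Minimal residual block: H(x) = F(x) + x, F(x) = conv(relu(conv(x))).
# Layer widths are illustrative, not the exact stem used in the pipeline.
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=64):
    f = layers.Conv2D(filters, 3, padding="same")(x)  # W1*x + b1
    f = layers.ReLU()(f)                              # relu(.)
    f = layers.Conv2D(filters, 3, padding="same")(f)  # W2*(.) + b2
    return layers.Add()([f, x])                       # F(x) + x

inputs = tf.keras.Input(shape=(224, 224, 64))
outputs = residual_block(inputs)
model = tf.keras.Model(inputs, outputs)
```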
16. EAST
- An Efficient and Accurate Scene Text Detector
- No classical image-processing algorithms involved, such as edge detection, filtering, or smoothing
- No intermediate character- and word-segmentation graphs
- Basically, no complicated multi-stage algorithms
- Detects text in images and videos
- Outputs geometry and confidence scores for the detected text
- The network architecture is based on U-Net
- The feed-forward "stem" of the network may vary
- PVANet and VGG16 are used in the paper
- Our pipeline uses ResNet
- A popular text detector; it got adopted by the OpenCV library (a minimal OpenCV usage sketch follows)
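Since OpenCV ships EAST support in its dnn module, a minimal detection call could look like the sketch below (the frozen-model and image paths are assumptions; the output layer names follow OpenCV's published EAST example):

```python
# Sketch: run a frozen EAST model through OpenCV's dnn module.
# Model/image paths are assumptions; layer names follow OpenCV's EAST example.
import cv2

net = cv2.dnn.readNet("frozen_east_text_detection.pb")
image = cv2.imread("receipt.jpg")
blob = cv2.dnn.blobFromImage(image, 1.0, (320, 320),
                             (123.68, 116.78, 103.94), swapRB=True, crop=False)
net.setInput(blob)
scores, geometry = net.forward(["feature_fusion/Conv_7/Sigmoid",
                                "feature_fusion/concat_3"])
# scores: text-confidence map; geometry: edge distances + angle per location
```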
19. These skip connections from earlier layers in the network (prior to a downsampling operation) provide the detail necessary to reconstruct accurate shapes for segmentation boundaries (illustrated in the sketch below).
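A toy Keras illustration of such a skip connection: features saved before downsampling are concatenated back in after upsampling (channel counts are arbitrary):

```python
# Toy U-Net-style skip connection: pre-downsampling features are
# concatenated back after upsampling to recover spatial detail.
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(128, 128, 32))
skip = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
down = layers.MaxPooling2D()(skip)         # lose spatial detail
up = layers.UpSampling2D()(down)           # recover resolution, not detail
merged = layers.Concatenate()([up, skip])  # skip connection restores detail
```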
20. Loss Function
- Plain cross-entropy loss won't work well here, as one segmentation class can dominate the other (class imbalance)
- Dice Loss: Dice = 2*|A∩B| / (|A| + |B|), and the loss is 1 − Dice (a TensorFlow sketch follows)
- Here |A∩B| represents the common elements between sets A and B, and |A| represents the number of elements in set A (and likewise for set B)
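A generic soft Dice loss in TensorFlow (the smooth term is a standard trick to avoid division by zero; this is a sketch, not necessarily the exact loss used in the pipeline):

```python
# Generic soft Dice loss: 1 - 2|A∩B| / (|A| + |B|), with a smoothing term.
# A sketch, not necessarily the exact loss used in the pipeline.
import tensorflow as tf

def dice_loss(y_true, y_pred, smooth=1.0):
    # y_true: ground-truth mask, y_pred: predicted probabilities, both in [0, 1]
    y_true = tf.reshape(y_true, [-1])
    y_pred = tf.reshape(y_pred, [-1])
    intersection = tf.reduce_sum(y_true * y_pred)          # |A ∩ B|
    denom = tf.reduce_sum(y_true) + tf.reduce_sum(y_pred)  # |A| + |B|
    return 1.0 - (2.0 * intersection + smooth) / (denom + smooth)
```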
28. Takeaways
- Deliver a model pipeline, not just models
- Make the pipeline debuggable at each stage
- Provide a feedback loop, so that humans can aid the whole process
- Extracting text from images has its own challenges:
- Identifying the text against its background
- Varying sizes and fonts
- Similar-looking characters (o/0, y/g)
- Recovering text from scanned/aged images
- Multi-oriented text
- Black-and-white images
29. VitaFlow
For more information
● Code: https://github.com/Imaginea/vitaFlow
** We are in the process of piecing together all our individual efforts as a pipeline
** The README will be updated shortly with an end-to-end replication of this talk
Thank You! ** Conditions Apply ;)