Building Custom Deep Learning Based OCR models
By Anuj Sable
Introduction
OCR gives us different ways of looking at an image to find and recognize the text in it. When
we think about OCR, we inevitably think of lots of paperwork: bank cheques and legal
documents, ID cards and street signs. In this blog post, we will try to predict the text present
in number plate images.
What we are dealing with is an optical character recognition library that leverages deep
learning and attention mechanisms to predict what a particular character or word in an
image is, if there is one at all. Those are a lot of big words, so we'll take it step by step and
explore the state of OCR technology and the different approaches used for these tasks.
You can always skip directly to the code section of the article, or check the GitHub repository
if you are already familiar with the big words above.
OCR - Optical Character Recognition
Optical character recognition, or OCR, refers to a set of computer vision problems that
require us to convert images of printed or handwritten text into machine-readable text that
your computer can process, store and edit as a text file, or as part of data entry and
manipulation software. The images can include documents, invoices, legal forms and ID
cards, or OCR in the wild: reading street signs, shipping container numbers or vehicle
number plates.
People have tried solving the OCR problem with conventional computer vision techniques
like image filters, contour detection and image classification. These performed well on
narrow, template-based datasets that did not vary much in orientation, image quality and so
on, but to make models robust to such variations, so that a business can deploy its machine
learning applications at scale, new methods have to be explored.
There are a lot of services and products that perform differently on different kinds of OCR
tasks. If you are interested, here's a blog post about where these OCR APIs might fail and
how they can improve.
Deep Learning and OCR
Deep learning approaches have improved over the last few years, reviving interest in the
OCR problem, where neural networks can combine the tasks of localizing text in an image
and understanding what the text is. Deep convolutional architectures combined with
attention mechanisms and recurrent networks have gone a long way in this regard.
One of these deep learning approaches is the basis of Attention-OCR, the library we are
going to use to predict the text in number plate images.
Think of it like this. The overall pipeline of many OCR architectures follows this template: a
convolutional network extracts image features as encoded vectors, and a recurrent network
then uses these encoded features to predict where each of the letters in the image text
might be and what they are.
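To make that template concrete, here is a minimal tf.keras sketch of such a pipeline. The layer sizes, input shape and 37-class output (36 characters plus a blank) are illustrative placeholders of my own, not the architecture Attention-OCR actually uses:
import tensorflow as tf

# a convolutional feature extractor followed by a recurrent sequence model
inputs = tf.keras.Input(shape=(32, 128, 1))                     # grayscale text crop
x = tf.keras.layers.Conv2D(64, 3, padding='same', activation='relu')(inputs)
x = tf.keras.layers.MaxPool2D((2, 2))(x)                        # -> (16, 64, 64)
x = tf.keras.layers.Conv2D(128, 3, padding='same', activation='relu')(x)
x = tf.keras.layers.MaxPool2D((2, 1))(x)                        # -> (8, 64, 128), keep width
x = tf.keras.layers.Permute((2, 1, 3))(x)                       # width becomes the time axis
x = tf.keras.layers.Reshape((64, 8 * 128))(x)                   # one feature vector per column
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(256, return_sequences=True))(x)
outputs = tf.keras.layers.Dense(37, activation='softmax')(x)    # 36 chars + blank, per step
model = tf.keras.Model(inputs, outputs)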
Let's try to understand what's going on under the hood.
Attention Mechanisms
You might be aware of RNNs and LSTMs, neural network architectures that predict an output
at each time step, giving us the sequence generation we need for language. This breed of
neural networks is intended to learn patterns in sequential data by iteratively modifying its
state based on the current input and previous states. But due to limitations on memory and
issues like vanishing gradients, RNNs and LSTMs turn out to be unable to really capture the
influence of words farther away.
Attention mechanisms try to fix this. Attention is a way to get your model to learn long-range
dependencies in a sequence, and it has found several applications in natural language
processing and machine translation.
BERT attention visualisation - source
In a nutshell, attention is a feed-forward layer with trainable weights that helps us capture
the relationships between the different elements of a sequence. It works by passing the
input embeddings through a series of operations with query, key and value matrices to get
an encoded representation of our original input sequence.
Calculating encoded representations of our input embeddings (x) with key, value and query matrices - source
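Here is a minimal numpy sketch of that computation, as single-headed scaled dot-product attention; the shapes, weights and function names are my own, chosen for illustration:
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))   # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, W_q, W_k, W_v):
    # project the input embeddings into queries, keys and values
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # similarity of every pair of positions
    weights = softmax(scores, axis=-1)                # one attention distribution per query
    return weights @ V                                # encoded representation of the sequence

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                           # 4 tokens, 8-dim embeddings
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(attention(X, W_q, W_k, W_v).shape)              # (4, 8)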
Attention mechanisms come in flavors. Attention can be hard or soft depending on whether
the entire image is available to the attention mechanism or only a patch. Soft attention, by
laying each patch smoothly over the sequence, keeps the model differentiable, but increases
the time taken to run computations. A better explanation can be found here.
Transformers
You might have heard of BERT, GPT2 or more recently XLNet performing a little too well on
language modelling and generation tasks. The secret sauce is the different ways of applying
transformers.
source
If you understand how attention works, it shouldn't take much effort to grasp how
transformers work. In essence, the paper uses multi-headed attention: several query, key
and value matrices are trained independently and concatenated, and an additional set of
weights then extracts a usable matrix for the following network.
Another important addition is a positional embedding that encodes the time at which an
element in a sequence appears. These positional embeddings are added to our input
embeddings for the network to learn time dependencies better. This article is an amazing
resource to learn about the mathematics behind self-attention and transformers.
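As a concrete example, the sinusoidal positional encoding from the original transformer paper can be written in a few lines of numpy; the sequence length and embedding size below are arbitrary:
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]                        # position index
    i = np.arange(d_model)[None, :]                          # embedding dimension index
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])                     # even dimensions use sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])                     # odd dimensions use cosine
    return pe

embeddings = np.random.randn(10, 64)                         # 10 tokens, 64-dim embeddings
inputs = embeddings + positional_encoding(10, 64)            # added, not concatenated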
Visual Attention
Though attention and transformer networks evolved for applications in the NLP domain, they
have been adapted to convolutional networks to replicate the attention mechanisms of the
human brain and how it processes vision. To learn more, check this link or this study. The
fundamental idea is to replicate how the human eye works.
When you open your eyes to a new scene, some parts of the picture immediately catch your
'attention'. You focus on those parts of the picture first, extract information from them and
comprehend them. This information also guides your search for the next point of attention.
This method of distilling an image into its most important components is the basis of visual
attention models. Finding the next attention point is treated as a sequential task on
convolutional features extracted from the image.
RAM - Recurrent Attention Model
This paper approaches the problem of attention by using reinforcement learning to model
how the human eye works. It defines a glimpse vector that extracts features of an image
around a certain location.
Several such glimpse vectors, extracting features from different-sized crops of the image
around a common centre, are resized to a constant resolution. These glimpse vectors are
flattened and passed through the glimpse network to obtain a vector representation based
on visual attention.
A) The glimpse sensor B) The glimpse network takes an image and location coordinates, extracts different-sized crops
around the location and resizes them for further processing C) These resized fixed-length feature vectors are passed to an
RNN, which generates the next location to pay attention to. source
Following this, a location network uses an RNN to predict which part of the image the
algorithm should pay attention to next. This predicted location becomes the next input to
the glimpse network. The process is stochastic, which helps balance exploration and
exploitation while we back-propagate through the network to maximize our rewards. The
back-propagation uses the REINFORCE policy gradient on the log-likelihood of the attention
score.
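To make the glimpse idea concrete, here is a rough numpy sketch of a glimpse sensor that extracts concentric, progressively larger crops around a location and resizes them all to a common resolution. The crop sizes, padding strategy and nearest-neighbour downsampling are illustrative assumptions, not the paper's exact implementation:
import numpy as np

def glimpse_sensor(image, center, base_size=8, scales=3):
    # pad so every crop stays inside the image
    pad = (base_size * 2 ** (scales - 1)) // 2
    padded = np.pad(image, pad, mode='constant')
    cy, cx = center[0] + pad, center[1] + pad
    patches = []
    for s in range(scales):
        half = (base_size * 2 ** s) // 2
        patch = padded[cy - half:cy + half, cx - half:cx + half]
        patches.append(patch[::2 ** s, ::2 ** s])    # downsample back to base_size
    return np.stack(patches).reshape(-1)             # flattened glimpse vector

image = np.random.rand(64, 64)                       # toy grayscale image
glimpse = glimpse_sensor(image, center=(20, 30))
print(glimpse.shape)                                 # (3 * 8 * 8,) = (192,)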
DRAM - Deep Recurrent Attention Model
Instead of using a single RNN, DRAM uses two: a location RNN to predict the next glimpse
location, and a classification RNN dedicated to predicting the class labels, i.e. guessing which
character in the text we are looking at. A context network downsamples the image inputs for
more generalisable RNN states. The paper refers to what RAM calls the location network as
the emission network. Training optimizes the sequence log-likelihood loss over an
accumulated reward using the REINFORCE policy gradient.
The DRAM model - source
CRNN - Convolutional Recurrent Neural Networks
CRNNs don't treat the OCR task as a reinforcement learning problem but as a machine
learning problem with a custom loss: the CTC (Connectionist Temporal Classification) loss.
Convolutional layers act as feature extractors that pass their features to the recurrent layers,
bi-directional LSTMs. These are followed by a transcription layer that uses a probabilistic
approach to decode the LSTM outputs: each frame generated by the LSTM is decoded into a
character, and these characters are fed into a final decoder/transcription layer that outputs
the final predicted sequence.
source: https://arxiv.org/pdf/1507.05717.pdf
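To illustrate the decoding step, here is a tiny numpy sketch of best-path (greedy) CTC decoding: take the most likely label per frame, collapse repeats, drop blanks. The toy charset, the choice of index 0 as the blank, and the per-frame probabilities are all assumptions for the example:
import numpy as np

BLANK = 0                                   # assume index 0 is the CTC blank
CHARSET = {1: 'A', 2: 'B', 3: 'C'}          # toy charset

def ctc_greedy_decode(probs):
    path = probs.argmax(axis=-1)                                        # best label per frame
    collapsed = [p for i, p in enumerate(path) if i == 0 or p != path[i - 1]]
    return ''.join(CHARSET[p] for p in collapsed if p != BLANK)         # drop blanks

probs = np.array([[0.1, 0.8, 0.05, 0.05],   # frame 1: 'A'
                  [0.1, 0.8, 0.05, 0.05],   # frame 2: 'A' repeated, collapses
                  [0.9, 0.03, 0.03, 0.04],  # frame 3: blank separates characters
                  [0.1, 0.7, 0.1, 0.1],     # frame 4: 'A' again, kept
                  [0.05, 0.05, 0.1, 0.8]])  # frame 5: 'C'
print(ctc_greedy_decode(probs))             # -> 'AAC'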
Spatial Transformer Networks
Spatial Transformer Networks, introduced in this paper, augment input images by applying
affine transformations so that the trained model is robust to variations in data.
source
The network consists of a localisation net, a grid generator and a sampler. The localisation
net takes an input image and gives us the parameters of the transformation we want to
apply to it. The grid generator takes a desired output template, multiplies it by the
parameters obtained from the localisation net, and gives us the locations of the points at
which we want to sample the input to produce the desired result. A bilinear sampling kernel
is finally used to generate our transformed feature maps.
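A small numpy sketch of the grid generator may help: given the 2x3 affine parameters predicted by the localisation net, it computes, for every output pixel, the input location to sample from; the bilinear sampler then reads the input at those locations. Normalised coordinates in [-1, 1] follow the paper's convention, everything else is illustrative:
import numpy as np

def affine_grid(theta, out_h, out_w):
    # normalised output coordinates in [-1, 1]
    ys, xs = np.meshgrid(np.linspace(-1, 1, out_h),
                         np.linspace(-1, 1, out_w), indexing='ij')
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(out_h * out_w)])
    src = theta @ coords                       # where to sample the input
    return src.reshape(2, out_h, out_w)

theta_identity = np.array([[1.0, 0.0, 0.0],    # 2x3 affine parameters from
                           [0.0, 1.0, 0.0]])   # the localisation net
grid = affine_grid(theta_identity, 32, 32)     # sampling locations, shape (2, 32, 32)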
Attention OCR
Attention-OCR is an OCR project available in TensorFlow as an implementation of this paper,
and it came into being as a way to solve the image captioning problem. It can be thought of
as a CRNN followed by an attention decoder.
https://arxiv.org/pdf/1609.04938v2.pdf
First, we use layers of convolutional networks to extract encoded image features. These
extracted features are then passed through a recurrent network for the attention mechanism
to process. The attention mechanism used in the implementation is borrowed from the
Seq2Seq machine translation model, and we use this attention-based decoder to finally
predict the text in our image.
Building your own Attention OCR model
We will use attention-ocr to train a model on a set of images of number plates along with
their labels - the text present in the number plates and the bounding box coordinates of
those number plates. The dataset was acquired from here.
The steps followed are summarized here:
1. Gather annotated training data
2. Get crops for each frame of each video where the number plates are.
3. Generate tfrecords for all the cropped files.
4. Place them in models/research/attention_ocr/python/datasets as required (in the FSNS
dataset format). Follow this link or the following sections of this blog.
5. Train the model using Attention OCR.
6. Make prediction on your own cropped images.
Or you can explore the Nanonets API where all you have to do is upload annotated images
and let the platform handle the rest for you. More about this in the final section.
This blog will run you through everything you need to train and make predictions using
TensorFlow attention-ocr. Full code available here.
Getting training data
We have images of number plates, but we have neither the text in them nor the bounding
box coordinates of the number plates in these images. Use an annotation tool to get your
annotations and save them in a .csv file.
Get crops
We have stored our bounding box data as a .csv file. The .csv file has the following fields:
1. files
2. text
3. xmin
4. xmax
5. ymin
6. ymax
To crop the images down to just the annotated window, we have to deal with different-sized
images. To do this we read the csv data into a pandas dataframe and compute our
coordinates in such a way that we don't lose any information about the number plates while
also maintaining a constant crop size. This will prove helpful when we are training our OCR
model.
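A sketch of this step, assuming the .csv fields listed above, Pillow for image handling, and placeholder paths and crop size:
import pandas as pd
from PIL import Image

CROP_W, CROP_H = 200, 200                      # constant crop size (placeholder)

df = pd.read_csv('annotations.csv')            # fields: files, text, xmin, xmax, ymin, ymax
for _, row in df.iterrows():
    img = Image.open(row['files'])
    # centre a fixed-size window on the annotated box so every crop has
    # the same size while fully containing the number plate
    cx = (row['xmin'] + row['xmax']) // 2
    cy = (row['ymin'] + row['ymax']) // 2
    left = max(int(cx - CROP_W // 2), 0)
    top = max(int(cy - CROP_H // 2), 0)
    crop = img.crop((left, top, left + CROP_W, top + CROP_H))
    crop.save('crops/' + row['files'].split('/')[-1])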
Generate tfrecords
Having stored our equal-sized cropped images in a separate directory, we can begin using
those images to generate the tfrecords that we will use to train our model. The script to
generate tfrecords can be found in the repository shared above. These tfrecords, along with
the label mapping, have to be stored inside the following directory of the TensorFlow models
repository -
DATA_PATH = 'models/research/attention_ocr/python/datasets/data/number_plates'
The dataset has to be in the FSNS dataset format.
By default, your test and train tfrecords, along with the charset labels text file, are placed
inside a folder named 'fsns' inside the 'datasets' directory. You can change this to another
folder, upload your tfrecord files and charset-labels.txt there, and change the path in the
multiple places that reference it. I have used a directory called 'number_plates' inside the
datasets/data directory.
Generate the tfrecords by running the script from the repository; a sketch of the record-writing step is below.
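The actual script lives in the repository linked above; this sketch only shows the general shape of writing one FSNS-style example. The feature keys follow the FSNS record format, while the charset, null id, sequence length, file names and the example plate text are placeholders:
import tensorflow as tf

def _bytes(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _ints(values):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=values))

def make_example(png_bytes, text, charset, width, max_seq_len=16, null_id=42):
    labels = [charset[c] for c in text]
    padded = labels + [null_id] * (max_seq_len - len(labels))   # pad with the null code
    return tf.train.Example(features=tf.train.Features(feature={
        'image/encoded': _bytes(png_bytes),
        'image/format': _bytes(b'PNG'),
        'image/width': _ints([width]),
        'image/orig_width': _ints([width]),
        'image/class': _ints(padded),
        'image/unpadded_class': _ints(labels),
        'image/text': _bytes(text.encode('utf8')),
    }))

charset = {c: i for i, c in enumerate('0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ')}
with tf.io.TFRecordWriter('train.tfrecord') as writer:
    with open('crops/0.png', 'rb') as f:
        png = f.read()
    example = make_example(png, 'MH12DE1433', charset, width=200)
    writer.write(example.SerializeToString())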
Setting up our Attention-OCR
Once we have our tfrecords and charset labels stored in the required directory, we need to
write a dataset config script that will help us split our data into train and test for the
attention OCR training script to process.
Make a python file, name it 'number_plates.py', and place it inside the following
directory:
'models/research/attention_ocr/python/datasets'
The contents of number_plates.py can be found in the README.md file here.
Also change the __init__.py file in the datasets directory to include the number_plates.py
script.
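For reference, here is a sketch of what number_plates.py can look like, modelled on the fsns.py dataset config that ships with the repository. The split sizes, image shape, sequence length and null code are placeholders that must match your own tfrecords:
import datasets.fsns as fsns

DEFAULT_DATASET_DIR = 'models/research/attention_ocr/python/datasets/data/number_plates'

DEFAULT_CONFIG = {
    'name': 'number_plates',
    'splits': {
        'train': {'size': 200, 'pattern': 'train.tfrecord'},   # placeholder sizes
        'test': {'size': 50, 'pattern': 'test.tfrecord'},
    },
    'charset_filename': 'charset-labels.txt',
    'image_shape': (200, 200, 3),                              # must match your crops
    'num_of_views': 1,
    'max_sequence_length': 16,
    'null_code': 42,                                           # id used for padding
    'items_to_descriptions': {
        'image': 'A (200, 200, 3) float tensor.',
        'label': 'Character codes.',
    },
}

def get_split(split_name, dataset_dir=None, config=None):
    # reuse the FSNS reader with our own config
    return fsns.get_split(split_name,
                          dataset_dir or DEFAULT_DATASET_DIR,
                          config or DEFAULT_CONFIG)
The __init__.py edit simply registers the new module, for example by importing number_plates alongside the existing datasets.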
Train the model
Move into the following directory:
models/research/attention_ocr
Open the file named 'common_flags.py', specify where you'd like to log your training, and
run the following command in your terminal:
# change this if you changed the dataset name in the
# number_plates.py script or if you want to change the
# number of training steps
python train.py --dataset_name=number_plates --max_number_of_steps=3000
Evaluate the model
Run the following command from the terminal.
python eval.py --dataset_name=number_plates
Get predictions
Now, from the same directory, run the following command in your shell.
python demo_inference.py --dataset_name=number_plates --batch_size=8 \
  --checkpoint='models/research/attention_ocr/number_plates_model_logs/model.ckpt-6000' \
  --image_path_pattern=/home/anuj/crops/%d.png
We learned about attention mechanisms, transformers and the different ways visual
attention is applied - RAM, DRAM and CRNNs. We learned about STNs. Finally, we covered
the deep learning approach we used - Attention OCR.
From a programming perspective, we learned how to train attention-ocr on your own
dataset and run inference using a trained model. The code can be found here and in my
attention-ocr fork.
There's of course a better, much simpler and more intuitive way to do this.
OCR with Nanonets
The Nanonets OCR API allows you to build OCR models with ease. You can upload your data,
annotate it, set the model to train and wait to get predictions through a browser-based UI,
without writing a single line of code, worrying about GPUs or finding the right architectures
for your deep learning models. You can also acquire the JSON response of each prediction to
integrate it with your own systems and build machine-learning-powered apps on state-of-
the-art algorithms and a strong infrastructure.
Using the GUI: https://app.nanonets.com/
You can also use the Nanonets-OCR API by following the steps below:
Using NanoNets API
Below, we will give you a step-by-step guide to training your own model using the Nanonets
API, in 9 simple steps.
Step 1: Clone the Repo
git clone https://github.com/NanoNets/nanonets-ocr-sample-python
cd nanonets-ocr-sample-python
sudo pip install requests
sudo pip install tqdm
Step 2: Get your free API Key
Get your free API Key from https://app.nanonets.com/#/keys
Step 3: Set the API key as an Environment Variable
export NANONETS_API_KEY=YOUR_API_KEY_GOES_HERE
Step 4: Create a New Model
python ./code/create-model.py
Note: This generates a MODEL_ID that you need for the next step
Step 5: Add Model Id as Environment Variable
export NANONETS_MODEL_ID=YOUR_MODEL_ID
Step 6: Upload the Training Data
Collect the images of the objects you want to detect. Once you have the dataset ready in the
images folder (image files), start uploading it.
python ./code/upload-training.py
Step 7: Train Model
Once the images have been uploaded, begin training the model:
python ./code/train-model.py
Step 8: Get Model State
The model takes ~30 minutes to train. You will get an email once the model is trained. In the
meantime, you can check the state of the model:
watch -n 100 python ./code/model-state.py
Step 9: Make Prediction
Once the model is trained, you can make predictions using the model:
python ./code/prediction.py PATH_TO_YOUR_IMAGE.jpg