Transformer-Based OCR
As you probably already know, Optical Character Recognition (OCR)
is the electronic conversion of images of typed, handwritten, or printed
text into machine-encoded text. The source can be a scanned
document, a photo of a document, or subtitle text superimposed on an
image. OCR converts such sources into machine-readable text.
Let’s understand how an OCR pipeline works before we dig deeper
into transformer-based OCR.
A typical OCR pipeline consists of two modules.
1. A Text Detection Module
2. A Text Recognition Module
Text Detection Module
The text detection module, as the name suggests, detects where text is
present in the source. It aims to localize all the text blocks within
the image, either at word level (individual words) or at text-line
level. This task is comparable to an object detection problem, except
that the objects of interest are text blocks. Popular object detection
models include YOLOv4/5, Detectron, Mask R-CNN, etc.
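To make the hand-off between the two modules concrete, here is a minimal sketch of cropping detected regions out of an image so they can be passed to a recognizer. The function name crop_regions, the nested-list image, and the (left, top, right, bottom) box format are assumptions made for illustration; a real detector returns boxes in its own format.

```python
def crop_regions(image, boxes):
    """Cut each detected box out of the image.

    image: 2D list of pixel values (a list of rows).
    boxes: list of (left, top, right, bottom) tuples; right and bottom
           are exclusive. This box format is an assumption.
    """
    crops = []
    for left, top, right, bottom in boxes:
        # Slice the rows first, then the columns inside each row.
        crops.append([row[left:right] for row in image[top:bottom]])
    return crops


# Toy 4x6 "image" whose pixel value encodes its position (10*row + col).
image = [[10 * r + c for c in range(6)] for r in range(4)]

# Hypothetical detector output: two text boxes.
boxes = [(0, 0, 3, 2), (2, 1, 6, 4)]
crops = crop_regions(image, boxes)

for crop in crops:
    print(len(crop), "rows x", len(crop[0]), "cols")
```

Each crop is what the text recognition module described next would receive as input.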
Text Recognition Module
The text recognition module aims to understand the content of each
detected text block and convert the visual signals into natural
language tokens.
A typical text recognition module consists of two sub-modules.
1. Image Understanding Module
2. Word Piece Generation Module
The text recognition module works as follows.
● The individual localized text boxes are resized to a fixed size,
let's say 224x224, and passed as input to the image understanding
module, which is typically a CNN (for example, a ResNet with
self-attention).
● The image features from a particular network depth are extracted
and passed as input to the Word Piece Generation Module,
which is an RNN-based network. The output of this RNN is the
machine-encoded text of the localized text boxes.
● Using an appropriate loss function, the Text Recognition Module
is trained until its performance stops improving.
What makes transformer-based OCR different?
Transformer-based OCR (TrOCR) is an end-to-end transformer-based
model for text recognition. It is one of the first works to jointly
leverage pre-trained image and text transformers.
In the architecture diagram of transformer-based OCR, the left-hand
side is the Vision Transformer encoder and the right-hand side is the
RoBERTa (text transformer) decoder.
Vision Transformer (ViT) or Encoder:
An image is split into a grid of fixed-size patches, where each patch
is treated similarly to a token in a sentence. The image patches are
flattened (2D → 1D) and linearly projected, and positional embeddings
are added to the projections. The resulting sequence is propagated
through the transformer encoder layers.
In the case of OCR, the images are the localized text boxes. To
ensure a consistent input size, the image region of each text box is
resized to HxW. The resized image is then decomposed into
N = HW/(PxP) patches, where each patch has size PxP and P is the
patch size.
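The patch arithmetic above can be checked with a small sketch. The function name split_into_patches and the toy 4x8 image are assumptions made for illustration; the code only demonstrates that an HxW image with patch size P yields HW/(PxP) flattened patches of length PxP each.

```python
def split_into_patches(image, patch_size):
    """Split a 2D image (list of rows) into flattened PxP patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("H and W must both be divisible by P")

    patches = []
    # Walk the grid row by row, left to right.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten the PxP block (2D -> 1D) in row-major order.
            patch = [image[top + dy][left + dx]
                     for dy in range(patch_size)
                     for dx in range(patch_size)]
            patches.append(patch)
    return patches


# Toy values chosen for readability, not taken from the source.
H, W, P = 4, 8, 2
image = [[r * W + c for c in range(W)] for r in range(H)]
patches = split_into_patches(image, P)

print(len(patches))   # number of patches, HW/(PxP)
print(patches[0])     # first flattened patch
```

With the same formula, an illustrative 384x384 input with P = 16 would give 384*384/(16*16) = 576 patches.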
After that, the patches are flattened and linearly projected to
D-dimensional vectors, which are the patch embeddings. The patch
embeddings and two special tokens are given learnable 1D position
embeddings according to their absolute positions. Then, the input
sequence is passed through a stack of identical encoder layers.
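As a rough sketch of how the encoder input is assembled, the snippet below adds one position embedding to every element of the sequence. Placing the two special tokens at the front of the sequence, the tiny dimension D = 4, and the random values standing in for learned parameters are all assumptions made for illustration.

```python
import random


def build_encoder_input(patch_embeddings, special_tokens, position_embeddings):
    """Prepend the special tokens, then add one position embedding per slot."""
    sequence = special_tokens + patch_embeddings
    if len(sequence) != len(position_embeddings):
        raise ValueError("need exactly one position embedding per token")
    return [[t + p for t, p in zip(token, position)]
            for token, position in zip(sequence, position_embeddings)]


def random_vector(dim):
    # Stand-in for a learned parameter vector.
    return [random.uniform(-1.0, 1.0) for _ in range(dim)]


random.seed(0)
D = 4             # embedding dimension (toy value)
num_patches = 6   # N (toy value)

patch_embeddings = [random_vector(D) for _ in range(num_patches)]
special_tokens = [random_vector(D) for _ in range(2)]
position_embeddings = [random_vector(D) for _ in range(num_patches + 2)]

encoder_input = build_encoder_input(
    patch_embeddings, special_tokens, position_embeddings)
print(len(encoder_input), "tokens of dimension", len(encoder_input[0]))
```

The sequence length seen by the encoder is therefore N + 2, not N.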
Each Transformer layer has a multi-head self-attention module and a
fully connected feed-forward network. Each of these two sub-layers is
followed by a residual connection and layer normalization.
Note: Residual connections ensure gradient flow during
backpropagation.
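The "sub-layer, then residual connection, then layer normalization" pattern can be sketched as follows. The toy_feed_forward function is a hypothetical stand-in for a real attention or feed-forward sub-layer, and the layer normalization here omits the learnable scale and bias for brevity.

```python
import math


def layer_norm(vector, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(vector) / len(vector)
    variance = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(variance + eps) for v in vector]


def residual_block(vector, sublayer):
    """Apply a sub-layer, add the input back, then normalize."""
    output = sublayer(vector)
    # The "+ x" term is the residual connection: it gives gradients a
    # direct path around the sub-layer during backpropagation.
    return layer_norm([x + y for x, y in zip(vector, output)])


def toy_feed_forward(vector):
    # Hypothetical stand-in for a real sub-layer: ReLU followed by scaling.
    return [0.5 * max(0.0, v) for v in vector]


x = [1.0, -2.0, 3.0, 0.5]
y = residual_block(x, toy_feed_forward)
print(y)
```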
RoBERTa or Decoder:
The output embeddings from a certain depth of the Vision Transformer
are extracted and passed as input to the decoder module.
The decoder module is also a transformer with a stack of identical
layers that have a similar structure to the layers in the encoder,
except that the decoder inserts an “encoder-decoder attention” module
between the multi-head self-attention and the feed-forward network so
that it can attend to different parts of the encoder output. In the
encoder-decoder attention module, the keys and values come from
the encoder output, while the queries come from the decoder input.
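The roles of queries, keys, and values in encoder-decoder attention can be illustrated with a single-head sketch. This is a simplification under stated assumptions: the learned projection matrices and the multiple heads of a real decoder layer are left out, and the toy vectors are invented for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: one output vector per query."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this decoder query to every encoder position.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        weights = softmax(scores)
        all_weights.append(weights)
        # Weighted average of the encoder values.
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(len(values[0]))])
    return outputs, all_weights


# Keys and values both come from the encoder output (three positions).
encoder_output = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Queries come from the decoder side (two positions).
decoder_states = [[2.0, 0.0], [0.0, 2.0]]

context, weights = cross_attention(
    decoder_states, encoder_output, encoder_output)
print(weights)
```

Each decoder position gets its own attention distribution over the encoder positions, which is how different output tokens can focus on different image patches.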
The embeddings from the decoder are projected from the model
dimension (768) to the vocabulary size V (50265).
A softmax function then computes the probabilities over the
vocabulary, and beam search is used to produce the final output.
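The last step, softmax over the vocabulary followed by beam search, can be sketched with a toy model. The four-token vocabulary, the toy_next_logits table, and the beam width of 2 are hypothetical choices for illustration; a real decoder would produce logits over all V tokens from the projected decoder embeddings, and practical beam search often adds length normalization, which is omitted here.

```python
import math


def log_softmax(logits):
    """Log-probabilities from raw logits, computed stably."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_total for x in logits]


def beam_search(next_logits, bos_id, eos_id, beam_width=2, max_len=5):
    """Keep the beam_width best partial sequences at every step."""
    beams = [([bos_id], 0.0)]   # (tokens, summed log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            log_probs = log_softmax(next_logits(tokens))
            for token_id, log_prob in enumerate(log_probs):
                candidates.append((tokens + [token_id], score + log_prob))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            if tokens[-1] == eos_id:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)   # include hypotheses that never emitted EOS
    return max(finished, key=lambda c: c[1])


# Hypothetical toy vocabulary and "model": logits depend on the last token.
VOCAB = ["<s>", "</s>", "a", "b"]
TOY_LOGITS = {
    0: [-5.0, -5.0, 2.0, 1.0],   # after <s>, prefer "a"
    2: [-5.0, 0.0, -1.0, 2.0],   # after "a", prefer "b"
    3: [-5.0, 3.0, 0.0, 0.0],    # after "b", prefer </s>
}


def toy_next_logits(tokens):
    return TOY_LOGITS[tokens[-1]]


best_tokens, best_score = beam_search(toy_next_logits, bos_id=0, eos_id=1)
print([VOCAB[t] for t in best_tokens], best_score)
```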
Advantages:
● TrOCR is an end-to-end Transformer-based OCR model for text
recognition built on pre-trained CV and NLP models. It is the first
work that jointly leverages pre-trained image and text Transformers
for the text recognition task in OCR.
● TrOCR achieves state-of-the-art accuracy with a standard
transformer-based encoder-decoder model, which is convolution-free
and does not rely on any complex pre/post-processing step.
References:
TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models
https://arxiv.org/pdf/2109.10282.pdf
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
https://arxiv.org/pdf/2010.11929v2.pdf
