Transformer-Based OCR
As you probably already know, Optical Character Recognition (OCR)
is the electronic conversion of images of typed, handwritten, or printed
text into machine-encoded text. The source can be a scanned
document, a photo of a document, or subtitle text superimposed on an
image.
Let's understand how an OCR pipeline works before we dig deeper
into Transformer-based OCR.
A typical OCR pipeline consists of two modules.
1. A Text Detection Module
2. A Text Recognition Module
Text Detection Module
The text detection module, as the name suggests, detects where text is
present in the source. It aims to localize all the text blocks within the
image, either at word level (individual words) or at text-line level.
This task is comparable to object detection, except that the objects of
interest are text blocks. Popular object detection models
include YOLOv4/5, Detectron, Mask R-CNN, etc.
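As a rough illustration of how the detection output feeds the next stage, the sketch below crops each detected box out of an image. The nested-list image representation, the (left, top, right, bottom) box format, and the name crop_text_boxes are assumptions made for this example and do not come from any particular detector's API.

```python
from typing import List, Tuple

Image = List[List[int]]          # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]  # (left, top, right, bottom); right/bottom exclusive


def crop_text_boxes(image: Image, boxes: List[Box]) -> List[Image]:
    """Cut each detected box out of the image, clamping it to the image bounds."""
    height = len(image)
    width = len(image[0]) if height else 0
    crops = []
    for left, top, right, bottom in boxes:
        left, right = max(0, left), min(width, right)
        top, bottom = max(0, top), min(height, bottom)
        if left >= right or top >= bottom:
            continue  # skip boxes that are empty or fully outside the image
        crops.append([row[left:right] for row in image[top:bottom]])
    return crops
```

Each crop would then be resized and handed to the text recognition module described next.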
Text Recognition Module
The text recognition module aims to understand the content of the
detected text block and convert the visual signals into natural
language tokens.
A typical text recognition module consists of two sub-modules.
1. Image Understanding Module
2. Word Piece Generation Module
The text recognition module works as follows.
● The individual localized text boxes are resized to a fixed size, say
224x224, and passed as input to the image understanding
module, which is typically a CNN (for example, a ResNet with
self-attention).
● The image features from a particular network depth are extracted
and passed as input to the Word Piece Generation Module,
which is an RNN-based network. The output of this RNN
is the machine-encoded text of the localized text boxes.
● Using an appropriate loss function, the Text Recognition Module
is trained until its performance converges.
What makes transformer-based OCR different?
TrOCR (Transformer-based OCR) is an end-to-end Transformer model for
text recognition. It is one of the first works to jointly
leverage pre-trained image and text Transformers.
In the architecture diagram, the left-hand side is the Vision
Transformer encoder and the right-hand side is the RoBERTa
(text Transformer) decoder.
Vision Transformer (ViT) or Encoder:
An image is split into a grid of fixed-size patches, where each patch is
treated similarly to a token in a sentence. The image patches are
flattened (2D → 1D), linearly projected, and combined with positional
embeddings. The resulting sequence is propagated through the
transformer encoder layers.
In the case of OCR, the input is a series of localized text boxes. To
ensure consistency, the image region of each text box is resized to a
fixed size HxW. The image is then decomposed into N = HW/(PxP)
patches, where P is the patch size (each patch is PxP pixels).
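The patch arithmetic above can be sketched in a few lines. This is a minimal illustration: the helper names num_patches and patchify are made up for this example, and the 384x384 input with 16x16 patches is used only as an example configuration.

```python
from typing import List

Image = List[List[float]]  # H rows of W pixel values


def num_patches(height: int, width: int, patch: int) -> int:
    """Return N = HW / (P*P); H and W must be multiples of P."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    return (height // patch) * (width // patch)


def patchify(image: Image, patch: int) -> List[List[float]]:
    """Split an H x W image into flattened P*P patches in row-major order."""
    height, width = len(image), len(image[0])
    num_patches(height, width, patch)  # validates divisibility
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for row in image[top:top + patch]:
                flat.extend(row[left:left + patch])  # 2D -> 1D
            patches.append(flat)
    return patches


print(num_patches(384, 384, 16))  # 24 * 24 = 576 patches
```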
After that, the patches are flattened and linearly projected to
D-dimensional vectors, which are the patch embeddings. The patch
embeddings and two special tokens are given learnable 1D position
embeddings according to their absolute positions. Then, the input
sequence is passed through a stack of identical encoder layers.
Each Transformer layer has a multi-head self-attention module and a
fully connected feed-forward network. Each of these two parts is
followed by a residual connection and layer normalization.
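To make the construction of the encoder input concrete, here is a small sketch of the projection and position-embedding step. It assumes plain Python lists, omits the bias term of the linear projection, and uses hypothetical names (linear, embed_patches); in a real model the weights, special tokens, and position embeddings are all learned.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def linear(x: Vector, weight: Matrix) -> Vector:
    """Project a flattened patch with a D x (P*P) weight matrix (bias omitted)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def embed_patches(patches: Matrix, weight: Matrix,
                  special_tokens: Matrix, position_embeddings: Matrix) -> Matrix:
    """Build the encoder input: special tokens first, then the projected
    patches, each added element-wise to the embedding of its position."""
    sequence = list(special_tokens) + [linear(p, weight) for p in patches]
    if len(sequence) != len(position_embeddings):
        raise ValueError("need exactly one position embedding per token")
    return [[t + p for t, p in zip(token, pos)]
            for token, pos in zip(sequence, position_embeddings)]
```

With N patches and two special tokens, the resulting sequence has N + 2 vectors of dimension D.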
Note: Residual connections help gradients flow during
backpropagation.
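A minimal sketch of the "sublayer, then residual connection, then layer normalization" pattern is shown below. It follows the ordering the text describes (some implementations normalize before the sublayer instead), leaves out the learnable gain and bias of layer normalization, and uses illustrative names.

```python
import math
from typing import Callable, List

Vector = List[float]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def residual_block(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """Compute LayerNorm(x + sublayer(x)). The "+ x" path lets the input,
    and its gradient, bypass the sublayer."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])
```

Here sublayer stands in for either the self-attention module or the feed-forward network.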
RoBERTa or Decoder:
The output embeddings from a certain depth of the ViT encoder
are extracted and passed as input to the decoder module.
The decoder module is also a transformer with a stack of identical
layers whose structure is similar to that of the encoder layers, except
that the decoder inserts an “encoder-decoder attention” module between the
multi-head self-attention and the feed-forward network so that it can
attend to different parts of the encoder output. In the
encoder-decoder attention module, the keys and values come from
the encoder output, while the queries come from the decoder input.
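The encoder-decoder attention described above can be sketched as single-head scaled dot-product attention. This simplified version assumes plain Python lists and leaves out the learned query/key/value projections and the multiple heads of a real layer; the function names are illustrative.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """For each decoder query, return a weighted average of the encoder
    values, weighted by the scaled dot product of the query with each key."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs


# Keys and values both come from the encoder output; queries come from the decoder.
encoder_output = [[1.0, 0.0], [0.0, 1.0]]
decoder_states = [[10.0, 0.0]]
attended = cross_attention(decoder_states, encoder_output, encoder_output)
```

In this toy case the single query is closely aligned with the first encoder vector, so the output is dominated by that vector.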
The embeddings from the decoder are projected from the model
dimension (768) to the vocabulary size V (50265).
A softmax over this projection gives the probabilities over the
vocabulary, and beam search is used to get the final output.
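Beam search itself can be sketched independently of the model. In the example below, step_fn is a hypothetical stand-in for "run the decoder, project to the vocabulary, apply softmax": it maps a token prefix to next-token probabilities. The toy vocabulary and probabilities are invented, and refinements such as length normalization are left out.

```python
import math
from typing import Callable, Dict, List, Tuple

Hypothesis = Tuple[List[str], float]  # (tokens, summed log-probability)


def beam_search(step_fn: Callable[[List[str]], Dict[str, float]],
                bos: str, eos: str, beam_size: int = 2,
                max_len: int = 10) -> List[str]:
    """Keep the beam_size best partial sequences at every step and return
    the highest-scoring sequence found."""
    beams: List[Hypothesis] = [([bos], 0.0)]
    finished: List[Hypothesis] = []
    for _ in range(max_len):
        candidates = [(tokens + [tok], score + math.log(prob))
                      for tokens, score in beams
                      for tok, prob in step_fn(tokens).items() if prob > 0]
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            (finished if tokens[-1] == eos else beams).append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    return max(finished, key=lambda c: c[1])[0]


def toy_step(prefix: List[str]) -> Dict[str, float]:
    """Invented next-token distributions, used only to exercise the search."""
    table = {("<s>",): {"a": 0.6, "b": 0.4},
             ("<s>", "a"): {"x": 0.5, "y": 0.5},
             ("<s>", "b"): {"z": 0.9, "w": 0.1}}
    return table.get(tuple(prefix), {"</s>": 1.0})
```

With this toy distribution, a beam of size 1 (greedy decoding) commits to "a" and ends with probability 0.3, whereas a beam of size 2 keeps "b" alive and finds the sequence with probability 0.36.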
Advantages:
● TrOCR, an end-to-end Transformer-based OCR model for text
recognition built on pre-trained CV and NLP models, is the first work
that jointly leverages pre-trained image and text Transformers for
the text recognition task in OCR.
● TrOCR achieves state-of-the-art accuracy with a standard
transformer-based encoder-decoder model, which is convolution-free
and does not rely on any complex pre/post-processing steps.
References:
TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models
https://arxiv.org/pdf/2109.10282.pdf
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
https://arxiv.org/pdf/2010.11929v2.pdf

