Optical Character Recognition: from data driven to self-supervised learning (ICDAR WML 2023 Keynote)

Optical Character Recognition
From data-driven to self-supervised learning
Lianwen Jin
South China University of Technology
http://www.dlvc-lab.net/lianwen/
August 26, 2023
ICDAR WML 2023

Outline
• Introduction
• Data Synthesis for OCR
  - Handwriting Style Synthesis for Arbitrary-Length and Out-of-Vocabulary Text
  - Forgery Synthesis for Handwritten Signature Verification
• Weakly Supervised Learning Methods for OCR
  - Single-Point Text Spotting (SPTS)
  - Weakly Supervised Handwritten Chinese Text Recognition (Page-level)
• Self-Supervised Learning Methods for OCR
  - A Brief Review of SSL in OCR
  - Revisiting Scene Text Recognition: A Data Perspective
• Discussion and Conclusion

• OCR (Optical Character Recognition) is an important research field in PR and AI
• Scene Text Recognition (STR) has attracted great research attention in recent years
• Research trends: from visually perceptive to semantically driven, from strongly supervised to self-supervised learning
Irregular Scene Text Recognition
RARE, CVPR’16
ASTER, TPAMI’18
MORAN, PR’19
ESIR, CVPR’19
ScRN, ICCV’19
SAR, AAAI’19
Encoder/Decoder Model
FAN, ICCV’17
DAN, AAAI’20
GTC, AAAI’20
ACE, CVPR’20
RobustScanner, ECCV’20
SCATTER, CVPR’20
IFA, CVPR’21
Segmentation-based
CA-FCN, AAAI’19
Mask TextSpotter v2, TPAMI’19
TextScanner, AAAI’20
SegHCCR, TMM’22
PageNet, IJCV 2022
SegCTC, ICDAR 2023
Image Enhancement
PlugNet, ECCV’20
SPIN, AAAI’21
STT, CVPR’21
TATT, CVPR’22
PSRB-DIP, ICDAR’23
Data Synthesis
Synth90K, IJCV’16
SynthText, CVPR’16
VerisimilarText, ECCV’18
UnRealText, CVPR’20
ScrabbleGAN, CVPR’20
GANwriting, ECCV’20, TPAMI’22
HiGAN, AAAI’21; HiGAN++, TOG’22
VATr, CVPR 2023
GC-DDPM, ICDAR 2023
Language Modeling
SRN, CVPR’20; SEED, CVPR’20
DictGuide, CVPR’21
Bhunia et al., CVPR’21
FromTwoToOne, ICCV’21
ABINet/++, CVPR’21, TPAMI’22
ViSA, ICDAR 2023
CLIP-TCM, CVPR 2023
Self-Supervised Learning
Bhunia et al., CVPR’21
Baek et al., CVPR’21; SeqCLR, CVPR’21
ConCLR, AAAI’22
SimAN, CVPR’22
DiG, ECCV’22
Text-DIAE, AAAI 2023
RCLSTR, MM 2023
DualMAE, ICDAR 2023
SelfDocSeg, ICDAR 2023
OCR is an important research problem in AI & CV

Deep learning is going to be able to do everything?
https://www.technologyreview.com/2020/11/03/1011616/ai-godfather-geoffrey-hinton-deep-learning-will-do-everything/
Deep learning is everywhere.
CNNs are everywhere!
Deep learning has also become one of the dominant methodologies in the field of OCR

Image sources and related materials:
“…The emergence of GPUs and the availability of large datasets were key enablers of deep learning…”
- Yoshua Bengio, Yann LeCun, and Geoffrey Hinton
• “Data is food for AI”
• “Tuning data is more important than tuning models.”
• 80% Data + 20% Model = Better Machine Learning (Andrew Ng)
Data is fundamental to the success of deep learning
A commonly cited rule of thumb is that 80% of the success of a deep learning project is due to the quality and
quantity of available data, while the remaining 20% is due to the specific machine learning model being used

Data Issue
• However, we cannot always obtain sufficient training data
• For instance, data involving personal privacy, financial data, government data…
• How can we mitigate the heavy dependence of deep learning models on large amounts of data?
• Data Synthesis/Data Augmentation
• Weakly Supervised Learning/Weakly Annotated Data
• Self-Supervised Learning (SSL)
• …

Data Driven
• Offline handwriting synthesis: SLOGAN: Handwriting Style Synthesis for Arbitrary-Length and Out-of-Vocabulary Text, IEEE TNNLS, 2022
• Online handwriting synthesis: SynSig2Vec: Forgery-free Learning of Dynamic Signature Representations and 1D CNN, IEEE TPAMI 2022
How can we mitigate the heavy dependence of deep learning models on large amounts of data?
Data Synthesis/Data Augmentation
Weakly Supervised Learning/Weakly Annotated Data
Self-Supervised Learning (SSL)
Outline

Handwriting Style Synthesis for Arbitrary-Length Text
Imbalanced distribution of handwriting styles in the IAM dataset: (a) different writers, (b) same writer
• Motivation
  - Existing data synthesis methods do not guarantee diverse handwriting styles
  - Biased training datasets make it difficult to train models that perform well
  - The distribution of handwriting styles in the popular IAM training set [3] is significantly biased in both style and frequency
Data from the popular IAM benchmark

• Style representation: we propose a style bank that parameterizes specific handwriting styles as latent vectors
• Writing diversity: because the handwriting style is parameterized, the style parameters can be randomly adjusted after training to obtain a variety of new styles
• Content embedding of the text: the input is text and the output is a handwriting image
SLOGAN: Handwriting Style Synthesis for Arbitrary-Length and OOV Text
Canjie Luo, Yuanzhi Zhu, Lianwen Jin, et al. SLOGAN: Handwriting Style Synthesis for Arbitrary-Length and Out-of-Vocabulary Text, IEEE TNNLS, 2022.

Style bank of parameterized handwriting style vectors
• The style bank embeds the writer ID into a latent vector, which the generator uses to transfer the printed style to the corresponding handwriting style.
• The style bank is learnt automatically.
• The style bank and the generator are jointly optimized under the supervision of the dual discriminators.
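The style bank described above behaves like an embedding table indexed by writer ID. The following is a minimal sketch of that lookup only, not the authors' implementation; the class name `StyleBank`, the vector size, and the Gaussian initialization are assumptions made for illustration, and the joint training with the generator and discriminators is omitted.

```python
import random


class StyleBank:
    """Hypothetical lookup table: one latent style vector per writer ID."""

    def __init__(self, num_writers, style_dim, seed=0):
        rng = random.Random(seed)
        self.style_dim = style_dim
        # One row per writer. In a trained model these rows would be
        # learnable parameters optimized together with the generator.
        self.table = [
            [rng.gauss(0.0, 1.0) for _ in range(style_dim)]
            for _ in range(num_writers)
        ]

    def lookup(self, writer_id):
        """Return the latent style vector for the given writer ID."""
        return self.table[writer_id]


bank = StyleBank(num_writers=4, style_dim=8)
z_writer2 = bank.lookup(2)  # style vector that would be fed to the generator
```

Keeping the styles in an explicit table is what later allows new styles to be obtained by sampling or editing vectors rather than by collecting new writers.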

Discriminator
We design two types of discriminators to help train our GAN-based model: a separated character discriminator and a cursive discriminator.
[Figure labels: Character Style Discriminator, Content Discriminator, Cursive Joint Discriminator, Writer ID Discriminator]

Experiments: Style Diversity
• Handwriting styles are parameterized as a latent vector z, in which each element z_k can be manipulated to control the generated styles.
• Interpolating the entire vector z between two random values achieves style interpolation.
• Manipulating certain elements z_k changes specific attributes:
  - Stroke Thickness
  - Paper Background
  - Character Slant
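Both operations on the latent style vector can be sketched in a few lines. This is an illustrative example only: the function names `interpolate_styles` and `tweak_element` are hypothetical, linear interpolation is an assumed choice, and which index k controls which attribute depends on the trained model.

```python
def interpolate_styles(z_a, z_b, alpha):
    """Linear interpolation between two style vectors (alpha in [0, 1])."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(z_a, z_b)]


def tweak_element(z, k, delta):
    """Return a copy of z with element k shifted by delta."""
    z_new = list(z)
    z_new[k] += delta
    return z_new


z_a = [0.0, 1.0, -2.0]
z_b = [1.0, 3.0, 2.0]
z_mid = interpolate_styles(z_a, z_b, 0.5)      # halfway between the two styles
z_edited = tweak_element(z_a, k=1, delta=0.5)  # change one attribute only
```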

Experiments: Synthesis of arbitrary text
Different adjacent-character intervals; curved text

Experiments: Synthesis of arbitrary text
Input:
Output (three styles):

Comparison with existing methods
GAN metrics and recognition metrics:
WER: Word Error Rate
CER: Character Error Rate
FID: Fréchet Inception Distance
GS: Geometric Score
Canjie Luo, Yuanzhi Zhu, Lianwen Jin, et al. SLOGAN: Handwriting Style Synthesis for Arbitrary-Length and Out-of-Vocabulary Text. IEEE TNNLS, 2022.
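For readers unfamiliar with the recognition metrics, CER and WER are commonly computed as the Levenshtein edit distance between reference and prediction, normalized by the reference length. The sketch below follows that common definition; the exact evaluation protocol behind the comparison may differ, and the function names are chosen here for illustration. A non-empty reference is assumed.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits / reference characters."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits / reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


example_cer = cer("hello", "hallo")
example_wer = wer("the cat sat", "the cat sit")
```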

Comparison with existing methods (cont.)

Distribution of the handwriting styles
• Distribution of the handwriting styles of the word “the”, visualized via t-SNE.
Left: distribution of the existing styles in the IAM dataset. Right: distribution after adding our generated samples.
• In the left plot, the large amount of empty space in the original distribution reflects the limited range of styles. In the right plot, adding our generated styles makes the distribution more even, which indicates that the style bias is significantly rectified.

SynSig2Vec: A New Forgery-free Dynamic Signature Representation Learning Method for Signature Verification
- Songxuan Lai, Lianwen Jin, et al., SynSig2Vec: Learning Representations from Synthetic Dynamic Signatures for Real-world Verification, AAAI 2020
- Songxuan Lai, Lianwen Jin, et al., SynSig2Vec: Forgery-free Learning of Dynamic Signature Representations and 1D CNN, IEEE TPAMI 2022
• Handwritten signature verification is an important biometric problem with wide applications
• Signatures may be skillfully imitated by a forger
• Skilled forgery data are difficult to collect
• SynSig2Vec requires only genuine signatures for training, yet achieves SOTA performance
Code: https://github.com/LaiSongxuan/SynSig2Vec

Signature Parameterization using Sigma Lognormal
• The kinematic theory of rapid human movements suggests that human handwriting consists of controlling the pen-tip velocity with overlapped lognormal impulse responses.
• The magnitude and direction of the velocity v_i(t) of the i-th stroke are described by a lognormal:
• The velocity of a signature can be modeled as the sigma lognormal:
• Six parameters for stroke i:
• A handwritten signature can be parameterized as:
The velocity profile of a typical human handwriting component consists of lognormal impulse responses
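The equations on this slide were images and are not preserved in the text. For reference, the sigma-lognormal model is usually written as below in the kinematic-theory literature; the symbol names (D_i, t_{0i}, μ_i, σ_i, θ_{si}, θ_{ei}, and N for the number of strokes) follow that common convention and are assumed to match the slide.

```latex
% Speed and direction of the i-th stroke
|\vec{v}_i(t)| = \frac{D_i}{\sigma_i \sqrt{2\pi}\,(t - t_{0i})}
  \exp\!\left( -\frac{\left(\ln(t - t_{0i}) - \mu_i\right)^2}{2\sigma_i^2} \right)

\phi_i(t) = \theta_{si} + \frac{\theta_{ei} - \theta_{si}}{2}
  \left[ 1 + \operatorname{erf}\!\left( \frac{\ln(t - t_{0i}) - \mu_i}{\sigma_i \sqrt{2}} \right) \right]

% Velocity of the whole signature: vector sum over the N strokes
\vec{v}(t) = \sum_{i=1}^{N} \vec{v}_i(t)

% Six parameters per stroke, stacked into the parameter matrix P
P_i = [D_i,\ t_{0i},\ \mu_i,\ \sigma_i,\ \theta_{si},\ \theta_{ei}], \qquad
P = [P_1, P_2, \ldots, P_N]^{\mathsf{T}}
```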

Signature Synthesis by perturbing the parameter matrix P
R_D, R_{t0}, R_μ, R_σ, R_{θs}, R_{θe} are empirically determined random parameters
• Given the parameter matrix P, a trajectory can be recovered as follows:
• Introducing random perturbations to the parameter matrix P generates distorted signatures
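A rough sketch of both steps follows, assuming the standard sigma-lognormal velocity model and simple Euler integration for trajectory recovery. The perturbation rule in `perturb` (uniform noise, multiplicative for D, μ, σ and additive for t0 and the angles) is a hypothetical stand-in: the slide does not preserve the actual distributions of the random variables.

```python
import math
import random


def stroke_velocity(t, D, t0, mu, sigma, theta_s, theta_e):
    """Velocity (vx, vy) contributed by one lognormal stroke at time t."""
    if t <= t0:
        return 0.0, 0.0
    u = (math.log(t - t0) - mu) / sigma
    speed = D / (sigma * math.sqrt(2 * math.pi) * (t - t0)) * math.exp(-0.5 * u * u)
    angle = theta_s + 0.5 * (theta_e - theta_s) * (1 + math.erf(u / math.sqrt(2)))
    return speed * math.cos(angle), speed * math.sin(angle)


def synthesize_trajectory(P, duration, dt=0.005):
    """Recover pen positions by integrating the summed stroke velocities."""
    x = y = 0.0
    points = [(x, y)]
    for n in range(1, int(round(duration / dt)) + 1):
        t = n * dt
        vx = vy = 0.0
        for row in P:
            sx, sy = stroke_velocity(t, *row)
            vx += sx
            vy += sy
        x += vx * dt
        y += vy * dt
        points.append((x, y))
    return points


def perturb(P, scale, rng):
    """Hypothetical perturbation: larger `scale` gives stronger distortion."""
    def noise():
        return rng.uniform(-scale, scale)
    return [
        (D * (1 + noise()), t0 + 0.05 * noise(), mu * (1 + noise()),
         sigma * (1 + noise()), th_s + noise(), th_e + noise())
        for D, t0, mu, sigma, th_s, th_e in P
    ]


# One straight stroke of length D = 5 along the x axis.
P = [(5.0, 0.0, -1.5, 0.3, 0.0, 0.0)]
trajectory = synthesize_trajectory(P, duration=1.0)
distorted = perturb(P, scale=0.1, rng=random.Random(0))
```

Because each lognormal profile integrates to D_i, a single straight stroke should end roughly D_i units from its start, which gives a quick sanity check on the integration.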

Signature Distortion Levels
Configurations of the random variables that decide the signature perturbation levels.
• By carefully setting the perturbation level, we can generate three categories of signatures:
- Low-distortion samples: used as data augmentation of the genuine signatures
- Medium-distortion samples: used as skilled forgeries
- High-distortion samples: used as random forgeries
• In this way, negative training data can be generated from genuine data alone to train a deep learning model

Sig2Vec: An effective signature feature representation NN
$F \rightarrow \{F_i\}_{i=1}^{N_{sp}}, \quad F_i \in \mathbb{R}^{|F| \times D_{sp}}$
$\mathrm{vec}_i = \mathrm{softmax}\left( \frac{w_i F_i^{T}}{\sqrt{D_{sp}}} \right) F_i$
$\mathrm{vec} = [\mathrm{vec}_1, \ldots, \mathrm{vec}_{N_{sp}}]$
The architecture of the proposed Sig2Vec model
• The Sig2Vec model extracts holistic representations from the time functions of online handwritten signatures
• The model is trained with both cross-entropy and average-precision losses
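The pooling formula on this slide reads as an attention-style weighted average: each feature group is scored against a weight vector, the scores are normalized with a softmax over time, and the pooled vectors are concatenated. The sketch below illustrates that reading only; splitting the channels of F into equal groups is an assumption, and the names `attention_pool` and `sig2vec_pool` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(F_i, w_i):
    """Pool a T x D feature group into one D-dim vector.

    Computes softmax(w_i F_i^T / sqrt(D)) F_i for a single group.
    """
    d = len(w_i)
    scores = [sum(w * f for w, f in zip(w_i, row)) / math.sqrt(d) for row in F_i]
    weights = softmax(scores)
    return [sum(a * row[k] for a, row in zip(weights, F_i)) for k in range(d)]


def sig2vec_pool(F, weight_vectors):
    """Split channels into equal groups (assumed), pool each, concatenate."""
    n_groups = len(weight_vectors)
    d = len(F[0]) // n_groups
    vec = []
    for g, w in enumerate(weight_vectors):
        F_g = [row[g * d:(g + 1) * d] for row in F]
        vec.extend(attention_pool(F_g, w))
    return vec


# Three time steps, four channels, two groups; zero weights give a plain mean.
F = [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0], [5.0, 6.0, 7.0, 8.0]]
vec = sig2vec_pool(F, weight_vectors=[[0.0, 0.0], [0.0, 0.0]])
```

The output length is fixed regardless of the number of time steps, which is what turns a variable-length signature into a fixed-size representation.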

Average Precision Optimization for Ranking
• We compute and rank the similarities between signatures and incorporate the Average Precision of the ranking into the loss function for optimization.
  - As AP (Average Precision) is non-differentiable with respect to the network’s outputs, we use the weight update rule for AP optimization given by the General Loss Gradient Theorem [1]
[1] Y. Song, A. Schwing, R. Urtasun et al., Training deep neural networks via direct loss minimization, ICML 2016
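To make the quantity being optimized concrete, the following sketch computes the Average Precision of a ranking from similarity scores and binary labels. It shows only the metric itself, under the usual definition (mean of precision at each positive's rank); the direct-loss-minimization update from [1] is not reproduced here.

```python
def average_precision(scores, labels):
    """AP of the ranking induced by `scores`; labels are 1 (positive) or 0."""
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precision_sum += hits / rank  # precision at this positive's rank
    return precision_sum / hits if hits else 0.0


# Genuine signatures are positives; one negative is ranked above a positive.
ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
```

AP depends on the scores only through their sort order, which is why it has no useful gradient with respect to the network outputs.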

Experimental Results
• Effectiveness of Average Precision Optimization
  - Compared with the traditional BCE or triplet loss, AP optimization achieves a lower error rate and converges faster

Songxuan Lai, Lianwen Jin, et al., SynSig2Vec: Forgery-free Learning of Dynamic Signature Representations …and 1D CNN, IEEE TPAMI 2022

Dataset     Method                                  Templates  Global threshold  User-specific threshold
MCYT-100    SRSS based on ΣΛ model [1]              1          13.56             -
MCYT-100    Stroke-RNN [2]                          1          10.46             -
MCYT-100    SynSig2Vec (ours)                       1          3.84              1.59
MCYT-100    Symbolic representation [3]             5          5.70              2.20
MCYT-100    DTW cost matrix information [4]         5          2.76              1.15
MCYT-100    DTW with SCC [5]                        5          -                 2.15
MCYT-100    Deep DTW [6]                            5          2.40              -
MCYT-100    Single-template strategy [7]            5          -                 1.28
MCYT-100    Single-template strategy+LS-DTW [8]     5          -                 0.72
SVC-Task2   SRSS based on ΣΛ model [1]              1          18.25             -
SVC-Task2   DTW cost matrix information [4]         5          7.80              2.53
SVC-Task2   DTW with SCC [5]                        5          -                 2.63
SVC-Task2   Single-template strategy [7]            5          -                 2.98
SVC-Task2   Single-template strategy+LS-DTW [8]     5          -                 2.08

[1] Diaz M, et al. IEEE TCYB 2016.
[2] Li C, et al. ICDAR 2019.
[3] Guru D, et al. ESWA 2017.
[4] Sharma A, et al. IEEE TCYB 2017.
[5] Xia X, et al. PR 2017.
[6] Wu X, et al. ICDAR 2019.
[7] Okawa. PR 2020.
[8] Okawa. PR 2021.
Code: https://github.com/LaiSongxuan/SynSig2Vec
• Our method achieves state-of-the-art results on two widely used benchmarks, the MCYT and SVC datasets.
• The advantage of our method is that it requires only genuine signatures for training, which is more practical for developing real-world deep learning systems

Weakly Supervised Text Recognition
How can we mitigate the heavy dependence of deep learning models on large labeled data?
Outline

SPTS: Single-Point Text Spotting
• End-to-end scene text spotting has made significant progress in recent years.
• To train an end-to-end text spotting model, existing methods commonly require manual annotations such as horizontal rectangles, rotated rectangles, quadrangles, and polygons, which are very expensive to obtain
• We show that text spotters can be supervised by a simple yet effective single-point representation (single point + transcription).
Dezhi Peng, et al., Lianwen Jin*, SPTS: Single-Point Text Spotting, ACM MM 2022 (oral)
Code: https://github.com/shannanyinxiang/SPTS

SPTS: Single-Point Text Spotting
• A new simple yet effective Transformer-based scene text spotter, inspired by [1].
a) Scene text spotting is formulated as a language modeling task.
b) An intuitive assumption: if a deep learning model knows what and where the text instances are, it can be taught to generate the desired sequence of results (locations and transcriptions)
• The overall framework:
a) A CNN + Transformer encoder extracts the visual and context features
b) A Transformer decoder predicts a sequence that is subsequently translated into points and transcriptions.
c) Complex post-processing and RoI operations are avoided, and better fusion of text detection and recognition is achieved.
Dezhi Peng, et al., Lianwen Jin*, SPTS: Single-Point Text Spotting, ACM MM 2022
[1] T. Chen, S. Saxena, L. Li, D Fleet, G. Hinton, Pix2seq: A Language Modeling Framework for Object Detection, ICLR 2022.

Sequence Construction
How to construct the output sequence:
• The transcriptions are naturally discrete, since they consist of character categories.
• The continuous coordinates (x, y) of points are discretized into integers in [1, n_bins].
• Because transcriptions vary in length, we pad (using <PAD> tokens) or truncate them to a fixed length l_tr.
• The sequences of text instances are randomly ordered and then concatenated.
• <SOS> (start of sequence) and <EOS> (end of sequence) tokens are inserted at the head and tail of the sequence, respectively.
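The construction steps can be sketched as follows. This is an illustrative version, not the released SPTS code: tokens are kept symbolic (integers for coordinate bins, characters for text) instead of being mapped into one shared vocabulary, and the defaults for `n_bins` and `l_tr` as well as the exact binning rule in `quantize` are assumptions.

```python
import random

PAD, SOS, EOS = "<PAD>", "<SOS>", "<EOS>"


def quantize(value, size, n_bins):
    """Map a continuous coordinate in [0, size] to an integer in [1, n_bins]."""
    ratio = min(max(value / size, 0.0), 1.0)
    return min(int(ratio * n_bins) + 1, n_bins)


def build_sequence(instances, width, height, n_bins=1000, l_tr=25, rng=None):
    """Build the target sequence from ((x, y), transcription) pairs."""
    rng = rng or random.Random()
    order = list(instances)
    rng.shuffle(order)  # text instances are randomly ordered
    seq = [SOS]
    for (x, y), text in order:
        chars = list(text)[:l_tr]              # truncate long transcriptions
        chars += [PAD] * (l_tr - len(chars))   # pad short ones
        seq += [quantize(x, width, n_bins), quantize(y, height, n_bins)] + chars
    seq.append(EOS)
    return seq


instances = [((10.0, 20.0), "STOP"), ((50.0, 80.0), "EXIT")]
seq = build_sequence(instances, width=100, height=100,
                     n_bins=10, l_tr=6, rng=random.Random(0))
```

Every instance occupies exactly 2 + l_tr tokens, so the sequence can later be cut back into instances without any separator token.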

Training
• The input sequence and output sequence of the decoder:
• The model is trained to maximize the likelihood given in Eq. (1), where I is the input image, s̃ is the output sequence, s is the input sequence, and the token weight w_i is set to 1.
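Eq. (1) was an image on the slide and is not preserved in the text. A plausible form, following the Pix2seq-style objective that SPTS builds on, is the weighted log-likelihood below; the index convention and the symbol L for the sequence length are assumptions.

```latex
% Weighted log-likelihood of the output tokens, conditioned on the image
% and on the preceding input tokens (teacher forcing)
\text{maximize} \quad \sum_{i=1}^{L} w_i \, \log P\left( \tilde{s}_i \mid I,\ s_{1:i} \right)
```

With every w_i set to 1, this reduces to the ordinary token-level cross-entropy used for autoregressive language models.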

Inference
• The model auto-regressively predicts the sequence until the <EOS> token occurs.
• The predicted sequence is divided into multiple segments, each of which contains 2 + l_tr tokens (2 tokens for the coordinates of the point and l_tr tokens for the transcription).
• The segments are translated into the point coordinates and transcriptions of text instances.
• The average likelihood of the tokens in each segment is assigned as the score of the corresponding text instance.
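The post-processing described above can be sketched as follows. The function name `decode_sequence` and the dictionary layout of the results are hypothetical, and the inputs are assumed to be the predicted tokens (with <SOS> and <EOS> already removed) together with the probability the model assigned to each token.

```python
def decode_sequence(tokens, probs, l_tr, pad="<PAD>"):
    """Split a predicted token sequence into scored text instances."""
    seg_len = 2 + l_tr  # 2 coordinate tokens + l_tr transcription tokens
    results = []
    for start in range(0, len(tokens) - seg_len + 1, seg_len):
        seg = tokens[start:start + seg_len]
        seg_probs = probs[start:start + seg_len]
        text = "".join(ch for ch in seg[2:] if ch != pad)
        results.append({
            "point": (seg[0], seg[1]),           # still in bin units
            "text": text,
            "score": sum(seg_probs) / seg_len,   # average token likelihood
        })
    return results


tokens = [2, 3, "H", "I", "<PAD>", 7, 8, "O", "K", "<PAD>"]
probs = [1.0, 1.0, 0.5, 0.5, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
instances = decode_sequence(tokens, probs, l_tr=3)
```

Mapping the bin indices back to pixel coordinates would invert whatever quantization was used when the training sequences were built.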

Experimental Datasets
• Curved Synthetic Dataset 150k [1]: 150k samples with 1/3 curved texts and 2/3 horizontal texts.
• ICDAR 2013 [2]: 229 training and 233 testing samples with horizontal texts.
• ICDAR 2015 [3]: 1000 training and 500 testing samples with multi-oriented texts.
• Total-Text [4]: 1255 training and 300 testing samples with arbitrarily shaped texts.
• SCUT-CTW1500 [5]: 1000 training and 500 testing samples with arbitrarily shaped texts.
(a) ICDAR 2013 (b) ICDAR 2015 (c) Total-Text (d) SCUT-CTW1500
[1] Y. Liu, et al. ABCNet: Real-time Scene Text Spotting with Adaptive Bezier-Curve Network. CVPR 2020.
[2] D. Karatzas, et al. ICDAR 2013 robust reading competition. ICDAR 2013.
[3] D. Karatzas, et al. ICDAR 2015 competition on robust reading. ICDAR 2015.
[4] C. K. Ch’ng, et al. Total-Text: A comprehensive dataset for scene text detection and recognition. ICDAR 2017.
[5] Y. Liu, et al. Curved scene text detection via transverse and longitudinal sequence connection. PR 2019.


Visualization
• Visualization results on the Total-Text (first row) and SCUT-CTW1500 (second row) benchmarks.

SPTS v2
Yuliang Liu, …., Lianwen Jin*, SPTS v2: Single-Point Text Spotting, (under review)
https://arxiv.org/abs/2301.01635
• A new Instance Assignment Decoder (IAD)
• A new Parallel Recognition Decoder

No-Point Text Spotting
• We further show that SPTS can be trained even without the supervision of single-point annotations
• The No-Point Text Spotting (NPTS) model is obtained by removing the point coordinates from the constructed sequence.
• The right figure shows qualitative results of NPTS. The table below compares SPTS and NPTS.
• The experimental results indicate that the NPTS model can learn to implicitly locate text based merely on transcriptions.

Extension of SPTS: Single-Point Object Detection
• The single-point object detection experiments are conducted using the Pascal VOC object detection dataset.
• The model is trained with central points and corresponding categories.
• Preliminary qualitative results on the validation set are shown in Figure 10.
• The single point may offer an extremely low-cost annotation scheme for general object detection.

PageNet: Towards End-to-End Weakly Supervised Page-Level Handwritten Chinese Text Recognition (HCTR)
• We propose PageNet, the first method for end-to-end weakly supervised page-level HCTR.
• The model is trained without any bounding boxes, i.e. given only line-level transcripts, yet it outputs segmentation and recognition results at both the line level and the character level.
• To the best of our knowledge, PageNet is the first method to address the reading order problem in page-level HCTR. The model can handle pages with multidirectional reading order and arbitrarily curved text lines.
[Figure: Input (page image) | Supervision (line-level transcripts: 你的價值。 / 不會在别人的肯定上 / 你怎麽看自己。 / 才是最重要, roughly “Your worth does not rest on others' approval; how you see yourself is what matters most”) | Output (line-level and character-level segmentation and recognition)]
Dezhi Peng, Lianwen Jin*, et al., PageNet: Towards End-to-End Weakly Supervised Page-Level Handwritten Chinese Text Recognition, IJCV, 2022
Code: https://github.com/shannanyinxiang/PageNet

Annotation comparison
• Comparison of the required annotations versus the model output of existing page-level methods
Dezhi Peng, Lianwen Jin*, et al., PageNet: Towards End-to-End Weakly Supervised Page-Level Handwritten Chinese Text Recognition, IJCV 2022
• Bluche T., Joint line segmentation and transcription for end-to-end handwritten paragraph recognition, NIPS, 2016.
• Yousef M., et al., OrigamiNet: Weakly-Supervised, Segmentation-Free, One-Step, Full Page Text Recognition by learning to unfold, CVPR, 2020.
• Wigington C., et al., Start, follow, read: End-to-end full-page handwriting recognition, ECCV, 2018.
• Huang Y. et al., Adversarial feature enhancing network for end-to-end handwritten paragraph recognition, ICDAR 2019
• Ma W., et al., Joint layout analysis, character detection and recognition for historical document digitization, ICFHR 2020

Architecture of PageNet
Dezhi Peng, Lianwen Jin*, et al., PageNet: Towards End-to-End Weakly Supervised Page-Level Handwritten Chinese Text Recognition, IJCV 2022
Code: https://github.com/shannanyinxiang/PageNet

Reading Order Module of PageNet
• Given an unordered set of characters, the reading order problem is to determine the order in which the characters are read.
• We solve the reading order problem by making three predictions: the start-of-line distribution, the 4-directional reading order prediction, and the end-of-line distribution.
• From a character, we find the next character in the reading order by iteratively moving from one grid cell to the next in the direction with maximum probability until arriving at a new character, as illustrated in the visualization of search paths in Fig. 3.

Graph-based Decoding Algorithm
Pipeline of the graph-based decoding algorithm.
• Nodes. Each character detection and recognition result is viewed as a node. Therefore, each node corresponds to a grid cell in which the bounding box and category of a character are predicted.
• Edges. Based on the 4-directional reading order prediction, we find the next node of every node.
• Reading Order. We decide whether a node is a start-of-line or an end-of-line according to the start-of-line distribution and the end-of-line distribution. The reading order is then represented by the paths that start at a start-of-line node and end at an end-of-line node.
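The decoding idea can be sketched on a toy grid. This is a simplified illustration rather than PageNet's implementation: the per-cell directions are assumed to be already reduced to their most probable value, start-of-line and end-of-line nodes are given as sets instead of thresholded distributions, and the names `next_char` and `decode_lines` are hypothetical.

```python
DIRS = {"R": (0, 1), "L": (0, -1), "D": (1, 0), "U": (-1, 0)}


def next_char(cell, direction, chars, max_steps=1000):
    """Follow the per-cell directions until another character cell is hit."""
    r, c = cell
    rows, cols = len(direction), len(direction[0])
    for _ in range(max_steps):
        dr, dc = DIRS[direction[r][c]]
        r, c = r + dr, c + dc
        if not (0 <= r < rows and 0 <= c < cols):
            return None  # walked off the page without finding a character
        if (r, c) in chars:
            return (r, c)
    return None


def decode_lines(chars, direction, starts, ends):
    """Read each line as a path from a start-of-line to an end-of-line node."""
    lines = []
    for start in starts:
        path, cell, seen = [], start, set()
        while cell is not None and cell not in seen:
            seen.add(cell)
            path.append(chars[cell])
            if cell in ends:
                break
            cell = next_char(cell, direction, chars)
        lines.append("".join(path))
    return lines


# 2 x 4 grid, reading left to right; characters sit in some of the cells.
direction = [["R"] * 4, ["R"] * 4]
chars = {(0, 0): "A", (0, 2): "B", (0, 3): "C", (1, 1): "D", (1, 3): "E"}
lines = decode_lines(chars, direction, starts=[(0, 0), (1, 1)],
                     ends={(0, 3), (1, 3)})
```

Because the direction is predicted per cell rather than per page, the same walk also handles rotated pages and curved lines, where the direction changes along the path.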

Weakly Supervised Learning
• Matching: match the results of PageNet with the line-level transcripts in the annotations to find reliable results
• Weakly annotated data using font is used
• Updating: Use the reliable results to update pseudo-labels
• Optimization: Calculate the losses using the updated pseudo-labels to optimize the parameters.

Weakly Supervised Learning
• Optimization
  - Detection branch
  - Classification branch
  - Location branch
  - Start of line & End of line
  - 4-directional reading order
• Total Loss

• Det + Recog (Table 2): Faster R-CNN + RRPN + CRNN Recognizer
• Det + Recog (Table 4): Mask R-CNN + CRNN recognizer

Visualization (MTH v2)
Results on historical Chinese dataset

Visualization (SCUT-HCCDoc)
Results on camera-captured document images

Multi-directional Reading Order
0° 180°
90° 270°

Recognition of Curved Text Line

Self-Supervised Learning for OCR
How can we mitigate the heavy dependence of deep learning models on large labeled data?
Outline

Self-supervised Learning (SSL)
• Self-supervised learning (SSL) is a machine learning approach that learns from unlabeled data. Training on a pre-defined pretext task yields a pre-trained model that improves performance on various downstream tasks
• SSL avoids much of the cost of collecting and annotating large-scale datasets
Longlong Jing and Yingli Tian, Self-supervised Visual Feature Learning with Deep Neural Networks: A Survey, IEEE TPAMI 2020
A general pipeline of SSL

Timeline of Visual SSL
C. Zhang, A Survey on Masked Autoencoder for Self-supervised Learning in Vision and Beyond, arXiv 20220730, IJCAI 2023.

SSL methods can be divided into two main categories: discriminative and generative.
• Discriminative SSL (e.g. contrastive learning). Typical methods: BYOL, SimCLR, MoCo, SimSiam…
• Generative SSL (e.g. masked image modeling, MIM). Typical methods: BEiT, iBERT, CAE, MAE, SimMIM…
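As a concrete instance of the discriminative family, the sketch below computes an InfoNCE-style contrastive loss for one anchor, one positive (another view of the same sample) and a set of negatives. It is a generic illustration, not the loss of any specific method listed above; the temperature value and the use of cosine similarity are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Negative log-probability of picking the positive among all candidates."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]


# Low loss when the positive is close to the anchor, high loss otherwise.
loss_good = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
loss_bad = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

Generative methods such as MIM instead hide part of the input and train the model to reconstruct it, so they need no negatives at all.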

SSL-based large pre-training models for document understanding
[Figure: comparison of pre-trained document understanding models by input modalities (token, visual, and segment embeddings; 1D/2D position embeddings), encoder (modality-adaptive, spatial-aware, cross-modality, and multi-modal self-attention) and pre-training strategies (MVLM, MLM&MVM, MDC, CPC and others). Models shown: LayoutLM (KDD 2020), LayoutLMv2 (ACL 2021; TIM, TIA), SelfDoc (CVPR 2021), StructuralLM (ACL 2021), StrucTexT (ACM MM 2021; SLP, PBD), DocFormer (ICCV 2021; LTR, TDI)]
Richer downstream tasks and multi-modal fusion:
• MVLM: Masked Visual Language Modeling
• MDC: Multi-label Document Classification
• CPC: Cell Position Classification
• SLP: Segment Length Prediction
• LTR: Learn To Reconstruct
• TDI: Text Describes Image
• TIA: Text-Image Alignment
• TIM: Text-Image Matching
• PBD: Paired Box Direction
• BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents, AAAI 2022
• LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding, ACL 2022
• ……
SSL in the field of document understanding

SSL Methods in the field of OCR
SeqCLR[1]
[1] Aberdam A, Litman R, Tsiper S, et al. Sequence-to-sequence contrastive learning for text recognition, CVPR 2021
[2] Liu H, Wang B, Bao Z, et al. Perceiving Stroke-Semantic Context: Hierarchical Contrastive Learning for Robust Scene Text Recognition. AAAI 2022.
PerSec[2]
Based on Contrastive Learning
Based on SimCLR

SSL Methods in OCR
PrimCLR [3]
• Primitive contrastive learning
• Handwritten mathematical expression
STR-CPC [4]
• Based on CPC (contrastive predictive coding)
• A widthwise causal convolution is designed to alleviate the information overlap problem
• Progressive Recovery Training Strategy (PRTS)
CMT-Co [5]
• Based on MoCo v2
• A new character unit cropping module is designed
[3] H. Guo, et al., Primitive Contrastive Learning for Handwritten Mathematical Expression Recognition, ICPR 2022
[4] X. Jiang, et al., Scene Text Recognition with Self-supervised Contrastive Predictive Coding, ICPR 2022
[5] Xiaoyi Zhang, et al., CMT-Co: Contrastive Learning with Character Movement Task for Handwritten Text Recognition, ICFHR 2022

SSL Methods in the field of OCR
SimAN [6]
MaskOCR [7]: generative SSL
DiG [8]: hybrid generative & discriminative SSL
[6] Luo C, Jin L, Chen J. SimAN: Exploring Self-Supervised Representation Learning of Scene Text via Similarity-Aware Normalization, CVPR 2022.
[7] Lyu P, et al. MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining. arXiv preprint arXiv:2206.00311, 2022.
[8] Yang M, Liao M, Lu P, et al. Reading and Writing: Discriminative and Generative Modeling for Self-Supervised Text Recognition, ACM MM, 2022.

SSL Methods in OCR
Text-DIAE [9]
• A new SSL method called Text-Degradation Invariant Auto Encoder (Text-DIAE)
• Three pretext tasks (masking, blur and background noise)
• Fast convergence
RCLSTR [10]
• Relational Contrastive Learning (RCL) for STR
• Hierarchical representation learning (word, subword, frame)
• Relational regularization, hierarchical relations, inter-hierarchy relational consistency
[9] Text-DIAE: A Self-Supervised Degradation Invariant Autoencoder for Text Recognition and Document Enhancement, AAAI 2023
[10] Jinglei Zhang, Tiancheng Lin, Relational Contrastive Learning for Scene Text Recognition, ACM MM 2023.

SSL Methods in OCR
SelfDocSeg [11]
• A new SSL method for document layout analysis based on BYOL
• Pseudo-layouts (layout masks) generated from the document images are used to pre-train the image encoder
• Fine-tuning with an object detector
Dual-MAE [12]
• Decouples visual and semantic feature learning with different masking strategies
• A Siamese network is used to align the dual features
[11] Subhajit Maity, et al., SelfDocSeg: A self-supervised vision-based approach towards Document Segmentation, ICDAR 2023
[12] Z. Qiao, et al., Decoupling Visual-Semantic Features Learning with Dual Masked Autoencoder for Self-Supervised STR, ICDAR 2023

SSL Methods in OCR
Qing Jiang, Jiapeng Wang, Dezhi Peng, Chongyu Liu, Lianwen Jin*, Revisiting Scene Text Recognition: A Data Perspective, ICCV 2023
Paper Preprint: https://arxiv.org/abs/2307.08723
MAERec: A simple yet effective scene text recognizer based on ViT & MAE-SSL
• A ViT backbone and a Transformer decoder, pre-trained with MAE self-supervised learning with some minor modifications
Code & Data are available at:
https://union14m.github.io
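The core of MAE-style pre-training is to split the image into patches, hide a random subset, and train the model to reconstruct the hidden patches from the visible ones. The sketch below shows only that patch-splitting and masking step under generic assumptions; the 0.75 mask ratio, the 4-pixel patch size and the function names are illustrative and are not taken from the MAERec code.

```python
import random


def patchify(image, patch):
    """Split an H x W image (list of rows) into flattened square patches."""
    h, w = len(image), len(image[0])
    patches = []
    for r in range(0, h - patch + 1, patch):
        for c in range(0, w - patch + 1, patch):
            patches.append([image[r + i][c + j]
                            for i in range(patch) for j in range(patch)])
    return patches


def random_patch_mask(num_patches, mask_ratio=0.75, rng=None):
    """Return (visible, masked) patch indices for one training sample."""
    rng = rng or random.Random()
    num_masked = int(round(num_patches * mask_ratio))
    order = list(range(num_patches))
    rng.shuffle(order)
    return sorted(order[num_masked:]), sorted(order[:num_masked])


# An 8 x 32 "text line" image gives 2 x 8 = 16 patches of 4 x 4 pixels.
image = [[0.0] * 32 for _ in range(8)]
patches = patchify(image, patch=4)
visible, masked = random_patch_mask(len(patches), rng=random.Random(0))
```

Only the visible patches would be passed to the ViT encoder, which is what makes pre-training on millions of unlabeled text images affordable.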


Discussion & Prospects

Is the problem of OCR/STR nearly solved?
• Great progress in the field of OCR has been achieved in recent years
• Recent progress in scene text recognition (STR) shows a trend of accuracy saturation
1. Are the common benchmarks still sufficient to drive future progress?
2. Does this accuracy saturation imply that STR is solved?
• The six widely used datasets (IC13, SVT, IIIT, IC15, SVTP, CUTE80) are relatively small in scale
• The six datasets are not representative of the variety of real-world scenarios.
• The six datasets are not very challenging, and thus conceal the underlying issues that STR faces

Revisiting OCR: A Data Perspective
• We consolidate a large-scale real STR dataset, Union14M, to investigate the challenges faced by STR models in the real world
• The Union14M benchmark contains two subsets: Union14M-L and Union14M-U
• Union14M-L: 4M labeled images, 3,230,742 training samples, 400,000 validation samples, 409,383 testing samples
• Union14M-U: 10M unlabeled images
13 representative STR models trained on the MJ & ST synthetic datasets show a significant performance degradation when evaluated on the Union14M-L test set.
Qing Jiang, Jiapeng Wang, Dezhi Peng, Chongyu Liu, Lianwen Jin*, Revisiting Scene Text Recognition: A Data Perspective, ICCV 2023

Revisiting OCR: A Data Perspective (cont.)
• Experiments on the new Union14M-L benchmark show that STR is still far from being solved
• Training with the new Union14M dataset produces much better performance
Left table: Recognition accuracy of models trained on synthetic datasets (MJ and ST)
Right table: Recognition accuracy of models trained on the training set of Union14M-L. For MAERec, S and B represent the use of ViT-Small and ViT-Base as the backbone, respectively. PT denotes pre-training
Qing Jiang, Jiapeng Wang, Dezhi Peng, Chongyu Liu, Lianwen Jin*, Revisiting Scene Text Recognition: A Data Perspective, ICCV 2023

Revisiting OCR: A Data Perspective (cont.)
• Self-supervised pre-training is an effective way to utilize massive amounts of unlabeled data (the MAERec model)
• The quality of a dataset is more important than its quantity
Code & Data are available at: https://union14m.github.io
Paper preprint: https://arxiv.org/abs/2307.08723

GPT-4: Potential OCR ability of LLMs?
• On Nov. 30, 2022, OpenAI released the ChatGPT large language model (LLM)
• GPT-4 (released on March 14, 2023) already shows great potential for OCR

Multimodal LLMs Demonstrate Amazing Zero-shot OCR Capabilities
Q. Ye, H. Xu, G. Xu, et al., mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality, arXiv 2023.04
Large multimodal models have recently become a persistent research hotspot

Potential and limitations of multimodal LLMs in OCR
Yuliang Liu, Zhang Li, et al., On the Hidden Mystery of OCR in Large Multimodal Models, arXiv 2023
• A comprehensive study of existing publicly available multimodal models, evaluating their performance in various OCR tasks
• The preliminary assessment reveals that LMMs can achieve promising results, especially in text recognition

CLIP for Scene Text Detection/Recognition
W. Yu, Y. Liu, .. Xiang Bai*, Turning a CLIP Model into a Scene Text Detector, CVPR 2023
Shuai Zhao, et al., CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model, arXiv, 2023.05.23
• Introducing multimodal large models (e.g. CLIP) to help improve text detection and recognition performance is another promising direction in recent research

Toward a Unified Document Understanding Model (UDoP)
Zineng Tang, Ziyi Yang, et al., Unifying Vision, Text, and Layout for Universal Document Processing, CVPR 2023
• Five major problem categories, more than 10 Document AI subtasks
• SOTA on 8 Document AI tasks
• Designing a unified OCR model that can handle different tasks is also an important research topic

Conclusion
• OCR has been an active research field for over 40 years. It has become one of the core AI technologies in many industry applications
• Many problems are still not completely solved
  - Complex layout analysis, and understanding and reconstruction of complex documents
  - End-to-end Visual Information Extraction (VIE)
  - Table recognition & understanding in the wild
  - Chart analysis and understanding
  - TextVQA, DocVQA in the wild
  - Robust tampered text detection in document images
• Large-scale pre-training models and SSL for OCR are important future research topics
  - OCR Big Model / OCR Foundation Model
  - Universal OCR model that can deal with various OCR tasks

Thank you!
August 26, 2023
Lianwen Jin
Email: eelwjin@scut.edu.cn (primary); lianwen.jin@gmail.com (secondary)
URL: http://www.dlvc-lab.net/lianwen/
Lab of Deep Learning & Vision Computing
South China University of Technology
