Optical Character Recognition
From data-driven to self-supervised learning
Lianwen Jin
South China University of Technology
http://www.dlvc-lab.net/lianwen/
August 26, 2023
ICDAR WML 2023
Outline
• Introduction
• Data Synthesis for OCR
  - Handwriting Style Synthesis for Arbitrary-Length and Out-of-Vocabulary Text
  - Forgery Synthesis for Handwritten Signature Verification
• Weakly Supervised Learning Methods for OCR
  - Single-Point Text Spotting (SPTS)
  - Weakly Supervised Handwritten Chinese Text Recognition (Page-level)
• Self-Supervised Learning Methods for OCR
  - A Brief Review of SSL in OCR
  - Revisiting Scene Text Recognition: A Data Perspective
• Discussion and Conclusion
• OCR (Optical Character Recognition) is an important research field in PR and AI
• Scene Text Recognition (STR) has attracted great research attention in recent years
• Research trends: from visually perceptive to semantically driven, from strongly supervised to self-supervised learning
Irregular Scene Text Recognition
RARE, CVPR’16
ASTER, TPAMI’18
MORAN, PR’19
ESIR, CVPR’19
ScRN, ICCV’19
SAR, AAAI’19
Encoder/Decoder Model
FAN, ICCV’17
DAN, AAAI’20
GTC, AAAI’20
ACE, CVPR ‘20
RobustScanner, ECCV’20
SCATTER, CVPR’20
IFA, CVPR’21
Segmentation-based
CA-FCN, AAAI’19
Mask TextSpotter v2, TPAMI’19
TextScanner, AAAI’20
SegHCCR, TMM’22
PageNet, IJCV 2022
SegCTC, ICDAR 2023
Image Enhancement
PlugNet, ECCV’20
SPIN, AAAI’21
STT, CVPR’21
TATT, CVPR’22
PSRB-DIP, ICDAR’23
Data Synthesis
Synth90K, IJCV’16
SynthText, CVPR’16
VerisimilarText, ECCV’18
UnRealText, CVPR’20
ScrabbleGAN, CVPR’20
GANwriting, ECCV’20, TPAMI’22
HiGAN, AAAI’21; HiGAN++, TOG’22
VATr, CVPR 2023
GC-DDPM, ICDAR 2023
Language Modeling
SRN, CVPR’20; SEED, CVPR’20
DictGuide, CVPR’21
Bhunia et al., CVPR’21
FromTwoToOne, ICCV’21
ABINet/++, CVPR’21, TPAMI’22
ViSA, ICDAR 2023
CLIP-TCM, CVPR 2023
Self-Supervised Learning
Bhunia et al., CVPR’21
Baek et al., CVPR’21; SeqCLR, CVPR’21
ConCLR, AAAI’22
SimAN, CVPR’22
DiG, ECCV’22
Text-DIAE, AAAI 2023
RCLSTR, MM 2023
DualMAE, ICDAR 2023
SelfDocSeg, ICDAR 2023
OCR is an important research problem in AI & CV
Deep learning is going to be able to do everything?
https://www.technologyreview.com/2020/11/03/1011616/ai-godfather-geoffrey-hinton-deep-learning-will-do-everything/
Deep Learning is everywhere.
CNNs are everywhere!
Deep learning has also become one of the dominant methodologies in the field of OCR
Image sources and related materials:
“…The emergence of GPUs and the availability of large datasets were key enablers of
deep learning…”
——Yoshua Bengio, Yann LeCun, and Geoffrey Hinton
• “Data is food for AI”
• “Tuning data is more important than tuning models.”
• 80% Data + 20% Model = Better Machine Learning (Andrew Ng)
Data is fundamental to the success of deep learning
A commonly cited rule of thumb is that 80% of the success of a deep learning project is due to the quality and quantity of the available data, while the remaining 20% is due to the specific machine learning model being used
Data Issue
• However, we cannot always obtain sufficient training data
• For instance, data involving personal privacy, financial data, government data…
• How can we mitigate the heavy dependence of deep learning models on large datasets?
• Data Synthesis/Data Augmentation
• Weakly Supervised Learning/ Weakly Annotated Data
• Self-Supervised Learning (SSL)
• …
Data Driven
• Offline handwriting synthesis: SLOGAN: Handwriting Style Synthesis for Arbitrary-Length and Out-of-Vocabulary Text, IEEE TNNLS, 2022
• Online handwriting synthesis: SynSig2Vec: Forgery-free Learning of Dynamic Signature Representations and 1D CNN, IEEE TPAMI 2022
How can we mitigate the heavy dependence of deep learning models on large datasets?
Data Synthesis/Data Augmentation
Weakly Supervised Learning/ Weakly Annotated Data
Self-Supervised Learning (SSL)
Outline
8
Handwriting Style Synthesis for Arbitrary-Length Text
Motivation
• Existing data synthesis methods do not guarantee diverse handwriting styles
• Biased training datasets make it difficult to train well-performing models
• The distribution of handwriting styles in the popular IAM training set [3] is significantly biased in terms of both style and frequency
[Figure: imbalanced distribution of handwriting styles in the IAM dataset; (a) different writers, (b) same writer. Data from the popular IAM benchmark.]
• Style representation : we propose a style bank to parameterize the specific handwriting styles as latent vectors
• Writing Diversity: The handwriting style is parameterized; after the training is completed, the style parameters are
randomly adjusted to obtain a variety of new styles.
• Content embedding of the text: Input text, output handwriting image
SLOGAN: Handwriting Style Synthesis for Arbitrary-Length and OOV Text
Canjie Luo, Yuanzhi Zhu, Lianwen Jin, et al. SLOGAN: Handwriting Style Synthesis for Arbitrary-Length and Out-of-Vocabulary Text, IEEE TNNLS, 2022.
9
SLOGAN
10
Style bank of parameterized handwriting style vectors
10
• The style bank embeds the writer ID into a latent vector, which is taken by the generator to transfer the
printed style to the corresponding handwriting style.
• The style bank can be learnt automatically.
• The style bank and the generator are jointly optimized under the supervision of the dual discriminators.
Discriminator
We design two types of discriminators to help train our GAN-based model: a separated character discriminator and a cursive discriminator.
[Figure labels: Character Style Discriminator; Content Discriminator; Cursive Joint Discriminator; Writer ID Discriminator]
• Handwriting styles are parameterized as a latent vector z, whose elements z_k can be manipulated to control the generated styles.
• Interpolating the entire vector z between two random values achieves style interpolation.
• Manipulating certain elements z_k achieves specific attribute changes.
Experiments: Style Diversity
• Stroke Thickness
• Paper Background
• Character Slant
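The two manipulations described above, interpolating the whole style vector and editing a single element, can be sketched as follows. This is a minimal illustration only: the vector dimension, the edited index, and the helper names `interpolate_styles` and `edit_element` are assumptions, and no generator is included; in practice each resulting vector would be fed to the trained generator.

```python
import random


def interpolate_styles(z_a, z_b, steps):
    """Linearly interpolate between two style vectors, endpoints included."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    path = []
    for s in range(steps):
        alpha = s / (steps - 1)
        path.append([(1 - alpha) * a + alpha * b for a, b in zip(z_a, z_b)])
    return path


def edit_element(z, k, value):
    """Return a copy of z with element z_k set to a new value."""
    z_new = list(z)
    z_new[k] = value
    return z_new


rng = random.Random(0)
STYLE_DIM = 8  # hypothetical size of the style vector
z_a = [rng.gauss(0, 1) for _ in range(STYLE_DIM)]
z_b = [rng.gauss(0, 1) for _ in range(STYLE_DIM)]

# Five intermediate styles between two random styles.
path = interpolate_styles(z_a, z_b, steps=5)

# Change one (hypothetical) attribute dimension while keeping the rest fixed.
z_edited = edit_element(z_a, k=2, value=1.5)
```

Which index controls which attribute (stroke thickness, background, slant) is learned by the model and is not specified in the slides.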
Experiments: Synthesis of arbitrary text
[Figure: synthesized samples with different intervals between adjacent characters, and curved text]
[Figure: input text and outputs in three styles]
ICDAR WML 2023
Comparison with existing methods
15
GAN metrics:
FID: Fréchet Inception Distance
GS: Geometry Score
Recognition metrics:
WER: Word Error Rate
CER: Character Error Rate
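For reference, the two recognition metrics are usually computed from the Levenshtein edit distance between a reference transcription and a recognizer's output. The sketch below uses that standard definition; it is not taken from the SLOGAN evaluation code, and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

For example, `cer("hello", "hallo")` counts one substitution over five characters, and `wer("the quick brown fox", "the quick fox")` counts one deleted word out of four.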
Canjie Luo, Yuanzhi Zhu, Lianwen Jin, et al. SLOGAN: Handwriting Style Synthesis for Arbitrary-Length and Out-of-Vocabulary Text. IEEE TNNLS, 2022.
Distribution of the handwriting styles
• Distribution of the handwriting styles of the word “the”, visualized via t-SNE.
[Figure: left, distribution of the existing styles in the IAM dataset; right, distribution after adding our generated samples]
• In the left plot, the large amount of empty space in the original distribution shows how limited the existing styles are. In the right plot, our generated styles fill this space and make the distribution more even, which indicates that the style bias is significantly rectified.
SynSig2Vec: A New Forgery-free Dynamic Signature Representation
Learning Method for Signature Verification
- Songxuan Lai, Lianwen Jin, et al., SynSig2Vec: Learning Representations from Synthetic Dynamic Signatures for Real-world Verification, AAAI 2020
- Songxuan Lai, Lianwen Jin, et al., SynSig2Vec: Forgery-free Learning of Dynamic Signature Representations and 1D CNN, IEEE TPAMI 2022
• Handwritten signature verification is an important biometric problem with wide applications
• Signatures may be skillfully imitated by a forger
• Skilled forgery data are difficult to collect
• SynSig2Vec requires only genuine signatures for training, yet achieves SOTA performance
Code: https://github.com/LaiSongxuan/SynSig2Vec
Signature Parameterization using Sigma Lognormal
• The kinematic theory of rapid human movements suggests that human handwriting is produced by controlling the pen-tip velocity with overlapping lognormal impulse responses.
• The magnitude and direction of the velocity v_i(t) of the i-th stroke are described by the following lognormal:
• The velocity of a signature can be modeled as the sigma lognormal:
• Six parameters for stroke i:
• A handwritten signature can be parameterized as:
[Figure: the velocity profile of a typical human handwriting component consists of lognormal impulse responses]
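The slide's equations are not reproduced in this transcript. As a rough illustration, the sketch below uses the standard form of the Sigma-Lognormal model from the kinematic theory, in which each stroke has six parameters (D, t0, mu, sigma, theta_s, theta_e). The parameter order in the tuple and the function names are assumptions made for this example, not the notation of the SynSig2Vec paper.

```python
import math


def lognormal_speed(t, D, t0, mu, sigma):
    """Speed magnitude |v_i(t)| of one stroke: a scaled lognormal in (t - t0)."""
    if t <= t0:
        return 0.0
    x = t - t0
    norm = sigma * math.sqrt(2 * math.pi) * x
    return D / norm * math.exp(-((math.log(x) - mu) ** 2) / (2 * sigma ** 2))


def stroke_angle(t, t0, mu, sigma, theta_s, theta_e):
    """Direction of one stroke, moving from theta_s to theta_e over time."""
    if t <= t0:
        return theta_s
    x = t - t0
    progress = 0.5 * (1 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2))))
    return theta_s + (theta_e - theta_s) * progress


def signature_velocity(t, strokes):
    """Sigma lognormal: vector sum of the velocities of all strokes at time t."""
    vx = vy = 0.0
    for D, t0, mu, sigma, theta_s, theta_e in strokes:
        speed = lognormal_speed(t, D, t0, mu, sigma)
        phi = stroke_angle(t, t0, mu, sigma, theta_s, theta_e)
        vx += speed * math.cos(phi)
        vy += speed * math.sin(phi)
    return vx, vy
```

Because the lognormal integrates to one, the speed of a single stroke integrates to D, so D can be read as the length of the stroke.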
Signature Synthesis by perturbing the parameter matrix P
R_D, R_{t_0}, R_\mu, R_\sigma, R_{\theta_s}, R_{\theta_e} are empirically determined random parameters
• Given the parameter matrix P, a trajectory can be recovered as follows:
• Random perturbations are introduced to the parameter matrix P to generate distorted signatures
Signature Distortion Levels
[Table: configurations of the random variables that decide the signature perturbation levels]
• By carefully setting the perturbation level, we can generate three categories of signatures:
- Low-distortion samples: used as data augmentation of the genuine signatures
- Medium-distortion samples: used as skilled forgeries
- High-distortion samples: used as random forgeries
• In this way, we can generate negative training data from genuine data alone to train a deep learning model
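The idea of turning one genuine signature into positive and negative training samples can be sketched as below. The perturbation ranges, the uniform multiplicative noise, and the names `LEVELS`, `perturb_parameters` and `make_training_samples` are all hypothetical; the actual random variables and their configurations are the empirically determined ones in the slide's table, which this transcript does not reproduce.

```python
import random

# Hypothetical relative perturbation range for each distortion level.
LEVELS = {"low": 0.05, "middle": 0.15, "high": 0.40}

# Low distortion keeps the genuine label; higher distortion acts as forgery.
LABELS = {"low": 1, "middle": 0, "high": 0}


def perturb_parameters(P, level, rng):
    """Jitter every stroke parameter by a random relative amount."""
    scale = LEVELS[level]
    return [tuple(p * (1 + rng.uniform(-scale, scale)) for p in stroke)
            for stroke in P]


def make_training_samples(P, rng, per_level=2):
    """Build (parameter matrix, label) pairs from one genuine signature."""
    samples = []
    for level in ("low", "middle", "high"):
        for _ in range(per_level):
            samples.append((perturb_parameters(P, level, rng), LABELS[level]))
    return samples


rng = random.Random(0)
# One genuine signature with two strokes: (D, t0, mu, sigma, theta_s, theta_e).
P = [(2.0, 0.1, -1.5, 0.3, 0.2, 1.0), (1.5, 0.4, -1.2, 0.25, 1.0, 0.3)]
samples = make_training_samples(P, rng)
```

Each perturbed parameter matrix would then be rendered back into a trajectory before being fed to the network.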
Sig2Vec: An effective signature feature representation NN
F \to \{F_i\}_{i=1}^{N}, \quad F_i \in \mathbb{R}^{|\cdot| \times D}
vec_i = \mathrm{softmax}\left( \frac{w_i F_i^{\top}}{\sqrt{D}} \right) F_i
vec = [vec_1, \ldots, vec_N]
[Figure: the architecture of the proposed Sig2Vec model]
• The Sig2Vec model extracts holistic representations from the time functions of online handwritten signatures
• The model is trained using both cross-entropy and average precision loss functions
Average Precision Optimization for Ranking
• We compute and rank the similarities between signatures and incorporate the Average Precision (AP) of the ranking into the loss function for optimization.
  - As AP is non-differentiable with respect to the network’s outputs, we use the weight update rule for AP optimization according to the General Loss Gradient Theorem [1]
[1] Y. Song, A. Schwing, R. Urtasun et al., Training deep neural networks via direct loss minimization, ICML 2016
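To make the ranking objective concrete, the sketch below computes the Average Precision of a ranked list of similarity scores. It shows only the metric itself under its usual definition; the direct loss minimization update from [1] is not implemented here, and the function name is illustrative.

```python
def average_precision(scores, labels):
    """AP of a ranking: mean precision at the rank of each positive sample.

    scores: similarity of each candidate signature to the template
    labels: 1 for a genuine signature, 0 for a forgery
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0
```

With scores `[0.9, 0.8, 0.7, 0.6]` and labels `[1, 0, 1, 0]`, the positives sit at ranks 1 and 3, so the AP is the mean of 1/1 and 2/3. The metric depends on the scores only through their order, which is why it has no useful gradient and needs a special update rule.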
Experimental Results
• Effectiveness of Average Precision Optimization
  - Compared with the traditional BCE or triplet loss, AP optimization achieves a lower error rate and converges faster
Experimental Results
Songxuan Lai, Lianwen Jin, et al., SynSig2Vec: Forgery-free Learning of Dynamic Signature Representations …and 1D CNN, IEEE TPAMI 2022
Dataset   | Method                                | Number of templates | Global threshold | User-specific threshold
MCYT-100  | SRSS based on ΣΛ model [1]            | 1 | 13.56 | -
MCYT-100  | Stroke-RNN [2]                        | 1 | 10.46 | -
MCYT-100  | SynSig2Vec (ours)                     | 1 | 3.84  | 1.59
MCYT-100  | Symbolic representation [3]           | 5 | 5.70  | 2.20
MCYT-100  | DTW cost matrix information [4]       | 5 | 2.76  | 1.15
MCYT-100  | DTW with SCC [5]                      | 5 | -     | 2.15
MCYT-100  | Deep DTW [6]                          | 5 | 2.40  | -
MCYT-100  | Single-template strategy [7]          | 5 | -     | 1.28
MCYT-100  | Single-template strategy + LS-DTW [8] | 5 | -     | 0.72
MCYT-100  | SynSig2Vec (ours)                     | 4 | 2.38  | 0.96
SVC-Task2 | SRSS based on ΣΛ model [1]            | 1 | 18.25 | -
SVC-Task2 | SynSig2Vec (ours)                     | 1 | 12.16 | 5.83
SVC-Task2 | DTW cost matrix information [4]       | 5 | 7.80  | 2.53
SVC-Task2 | DTW with SCC [5]                      | 5 | -     | 2.63
SVC-Task2 | Single-template strategy [7]          | 5 | -     | 2.98
SVC-Task2 | Single-template strategy + LS-DTW [8] | 5 | -     | 2.08
SVC-Task2 | SynSig2Vec (ours)                     | 5 | 3.88  | 2.08
[1] Diaz M, et al. IEEE TCYB 2016.
[2] Li C, et al. ICDAR 2019.
[3] Guru D, et al. ESWA 2017.
[4] Sharma A, et al. IEEE TCYB 2017.
[5] Xia X, et al. PR 2017.
[6] Wu X, et al. ICDAR 2019.
[7] Okawa. PR 2020.
[8] Okawa. PR 2021.
Code: https://github.com/LaiSongxuan/SynSig2Vec
• Our method achieves state-of-the-art results on two widely used benchmarks, the MCYT and SVC datasets.
• The advantage of our method is that it requires only genuine signatures for training, which is more practical for developing real-world systems based on deep learning
ICDAR WML 2023
Weakly Supervised Text Recognition
How can we mitigate the heavy dependence of deep learning models on large labeled datasets?
Data Synthesis/Data Augmentation
Weakly Supervised Learning/ Weakly Annotated Data
Self-Supervised Learning (SSL)
Outline
ICDAR WML 2023
SPTS: Single-Point Text Spotting
• End-to-end scene text spotting has made significant progress in recent years.
• To train an end-to-end text spotting model, existing methods commonly require manual annotations such as horizontal rectangles, rotated rectangles, quadrangles, and polygons, which are expensive to produce.
• We show that text spotters can be supervised by a simple yet effective single-point representation (single point + transcription).
Dezhi Peng, et al., Lianwen Jin*, SPTS: Single-Point Text Spotting, ACM MM 2022 (oral).
Code: https://github.com/shannanyinxiang/SPTS
SPTS: Single-Point Text Spotting
• A new simple yet effective Transformer-based scene text spotter, inspired by Pix2seq [1].
a) Scene text spotting is formulated as a language modeling task.
b) An intuitive assumption: if a deep learning model knows what and where the text instances are, it can be taught to generate the desired sequence of results (locations and transcriptions).
• The overall framework:
a) A CNN and a Transformer encoder extract the visual and contextual features.
b) A Transformer decoder predicts a sequence that is subsequently translated into points and transcriptions.
c) Complex post-processing and RoI operations are avoided, and a better fusion of text detection and recognition is achieved.
Dezhi Peng, et al., Lianwen Jin*, SPTS: Single-Point Text Spotting, ACM MM 2022
Code: https://github.com/shannanyinxiang/SPTS
[1] T. Chen, S. Saxena, L. Li, D Fleet, G. Hinton, Pix2seq: A Language Modeling Framework for Object Detection, ICLR 2022.
Sequence Construction
• The transcriptions are naturally discrete, consisting of character categories.
• The continuous coordinates (x, y) of the points are discretized into integers in [1, n_bins].
• Because the transcriptions vary in length, we pad (using <PAD> tokens) or truncate them to a fixed length l_tr.
• The sequences of the text instances are randomly ordered and then concatenated.
• The <SOS> (start of sequence) and <EOS> (end of sequence) tokens are inserted at the head and tail of the sequence, respectively.
How to construct the output sequence
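The construction steps above can be sketched as follows. This is a simplified illustration, not the released SPTS code: tokens are kept as Python integers and strings rather than indices into a shared vocabulary, and the binning formula, function names, and argument names are assumptions.

```python
import random

PAD, SOS, EOS = "<PAD>", "<SOS>", "<EOS>"


def quantize(value, size, n_bins):
    """Map a coordinate in [0, size] to an integer bin in [1, n_bins]."""
    bin_index = int(value * n_bins / size) + 1
    return min(max(bin_index, 1), n_bins)


def build_sequence(instances, width, height, n_bins, l_tr, rng):
    """Turn ((x, y), transcription) pairs into one token sequence."""
    instances = list(instances)
    rng.shuffle(instances)                    # random instance order
    seq = [SOS]
    for (x, y), text in instances:
        chars = list(text)[:l_tr]             # truncate to l_tr characters
        chars += [PAD] * (l_tr - len(chars))  # or pad up to l_tr
        seq += [quantize(x, width, n_bins), quantize(y, height, n_bins)]
        seq += chars
    seq.append(EOS)
    return seq


rng = random.Random(0)
instances = [((50, 20), "Hi"), ((10, 90), "OCR!")]
seq = build_sequence(instances, width=100, height=100, n_bins=1000, l_tr=3, rng=rng)
```

Each instance contributes exactly 2 + l_tr tokens, which is what lets the decoder output be split back into instances without any extra separator token.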
Training
• The input sequence and output sequence of the decoder:
• The model is trained to maximize the likelihood given in Eq. (1), where I is the input image, \tilde{s} is the output sequence, s is the input sequence, and the token weight w_i is set to 1.
Inference
• The model auto-regressively predicts the sequence until the <EOS> token occurs.
• The predicted sequence is divided into multiple segments, each of which contains 2 + l_tr tokens (2 tokens for the coordinates of the point and l_tr tokens for the transcription).
• The segments are translated into the point coordinates and transcriptions of text instances.
• The average likelihood of the tokens in each segment is assigned as the score of the corresponding text instance.
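The inference steps above can be sketched as a small decoding routine. As before, this is an illustration under assumptions: tokens are plain Python values, bin centers are used to map integers back to coordinates, and the name `decode_sequence` and its arguments are hypothetical rather than taken from the SPTS code.

```python
PAD, EOS = "<PAD>", "<EOS>"


def decode_sequence(tokens, probs, l_tr, n_bins, width, height):
    """Split a predicted token sequence into scored text instances.

    tokens: predicted tokens (without <SOS>), possibly ending with <EOS>
    probs:  likelihood of each predicted token, same length as tokens
    """
    if EOS in tokens:
        end = tokens.index(EOS)
        tokens, probs = tokens[:end], probs[:end]
    seg_len = 2 + l_tr
    results = []
    for start in range(0, len(tokens) - seg_len + 1, seg_len):
        seg = tokens[start:start + seg_len]
        seg_probs = probs[start:start + seg_len]
        # Map each bin index back to the center of its bin.
        x = (seg[0] - 0.5) / n_bins * width
        y = (seg[1] - 0.5) / n_bins * height
        text = "".join(ch for ch in seg[2:] if ch != PAD)
        score = sum(seg_probs) / seg_len  # average token likelihood
        results.append({"point": (x, y), "text": text, "score": score})
    return results
```

An incomplete trailing segment (fewer than 2 + l_tr tokens before `<EOS>`) is dropped in this sketch.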
31
ICDAR WML 2023
Experimental Datasets
• Curved Synthetic Dataset 150k[1]: 150k samples with 1/3 curved texts and 2/3 horizontal texts.
• ICDAR 2013[2]: 229 training and 233 testing samples with horizontal texts.
• ICDAR 2015[3]: 1000 training and 500 testing samples with multi-oriented texts.
• Total-Text[4]: 1255 training and 300 testing samples with arbitrarily shaped texts.
• SCUT-CTW1500[5]: 1000 training and 500 testing samples with arbitrarily shaped texts.
[1] Y. Liu, et al. ABCNet: Real-time Scene Text Spotting with Adaptive Bezier-Curve Network. CVPR 2020.
[2] D. Karatzas, et al. ICDAR 2013 robust reading competition. ICDAR 2013.
[3] D. Karatzas, et al. ICDAR 2015 competition on robust reading. ICDAR 2015.
[4] C. K. Ch’ng, et al. Total-Text: A comprehensive dataset for scene text detection and recognition. ICDAR 2017.
[5] Y. Liu, et al. Curved scene text detection via transverse and longitudinal sequence connection. PR 2019.
[Figure: sample images from (a) ICDAR 2013, (b) ICDAR 2015, (c) Total-Text, (d) SCUT-CTW1500]
ICDAR WML 2023
Experimental Results
33
ICDAR WML 2023
Visualization
• Visualization results on Total-Text (the first row) and SCUT-CTW1500 (the second row) benchmarks.
34
Dezhi Peng, et al., Lianwen Jin*, SPTS: Single-Point Text Spotting, ACM MM 2022. oral
ICDAR WML 2023
SPTS v2
35
Yuliang Liu, …, Lianwen Jin*, SPTS v2: Single-Point Text Spotting (under review)
https://arxiv.org/abs/2301.01635
• A new Instance Assignment Decoder (IAD)
• A new Parallel Recognition Decoder
ICDAR WML 2023
No-Point Text Spotting
• We further show that SPTS can be trained even without the supervision of single-point annotations.
• The No-Point Text Spotting (NPTS) model is obtained by removing the point coordinates from the constructed sequence.
• The figure on the right shows qualitative results of NPTS. The table below compares SPTS and NPTS.
• The experimental results indicate that the NPTS model can learn to locate text implicitly, based on transcriptions alone.
Dezhi Peng, et al., Lianwen Jin*, SPTS: Single-Point Text Spotting, ACM MM 2022. oral
Code: https://github.com/shannanyinxiang/SPTS
Extension of SPTS: Single-Point Object Detection
• The single-point object detection experiments are conducted using the Pascal VOC object detection dataset.
• The model is trained with central points and corresponding categories.
• Preliminary qualitative results on the validation set are shown in Figure 10.
• A single point may be a viable, extremely low-cost annotation for general object detection.
Dezhi Peng, et al., Lianwen Jin*, SPTS: Single-Point Text Spotting, ACM MM 2022. oral
Code: https://github.com/shannanyinxiang/SPTS
PageNet: Weakly Supervised HCTR
• We propose PageNet, the first method for end-to-end weakly supervised page-level handwritten Chinese text recognition (HCTR).
• The model is trained without any bounding boxes, given only line-level transcripts. Nevertheless, it can output segmentation and recognition results at both the line level and the character level.
• To the best of our knowledge, PageNet is the first method to address the reading order problem in page-level HCTR. The model can handle pages with multi-directional reading order and arbitrarily curved text lines.
[Figure: input page image; supervision: line-level transcripts only (“你的價值。”, “不會在别人的肯定上”, “你怎麽看自己。”, “才是最重要”); output: line-level and character-level detection and recognition results]
Dezhi Peng, Lianwen Jin*, et al., PageNet: Towards End-to-End Weakly Supervised Page-Level Handwritten Chinese Text Recognition, IJCV, 2022
Code: https://github.com/shannanyinxiang/PageNet
2
PageNet: Towards End-to-End Weakly Supervised Page-Level Handwritten Chinese Text Recognition (HCTR)
38
Annotation comparison
• Comparison of the required annotations versus the model output of existing page-level methods
Dezhi Peng, Lianwen Jin*, et al., PageNet: Towards End-to-End Weakly Supervised Page-Level Handwritten Chinese Text Recognition, IJCV 2022
• Bluche T., Joint line segmentation and transcription for end-to-end handwritten paragraph recognition, NIPS, 2016.
• Yousef M., et al., OrigamiNet: Weakly-Supervised, Segmentation-Free, One-Step, Full Page Text Recognition by learning to unfold, CVPR, 2020.
• Wigington C., et al., Start, follow, read: End-to-end full-page handwriting recognition, ECCV, 2018.
• Huang Y. et al., Adversarial feature enhancing network for end-to-end handwritten paragraph recognition, ICDAR 2019
• Ma W., et al., Joint layout analysis, character detection and recognition for historical document digitization, ICFHR 2020
39
ICDAR WML 2023
Architecture of PageNet
40
Dezhi Peng, Lianwen Jin*, et al., PageNet: Towards End-to-End Weakly Supervised Page-Level Handwritten Chinese Text Recognition, IJCV 2022
Code: https://github.com/shannanyinxiang/PageNet
ICDAR WML 2023
Reading Order Module of PageNet
• Given an unordered set of characters, the reading order problem is to determine the order in which the characters are read.
• We solve the reading order problem by making three predictions: the start-of-line distribution, the 4-directional reading order prediction, and the end-of-line distribution.
• Starting from a character, we find the next character in the reading order by iteratively moving from one grid cell to the next in the direction with maximum probability until arriving at a new character, as illustrated by the search paths visualized in Fig. 3.
Graph-based Decoding Algorithm
[Figure: pipeline of the graph-based decoding algorithm]
• Nodes. Each character detection and recognition result is viewed as a node. Each node therefore corresponds to a grid cell in which the bounding box and category of a character are predicted.
• Edges. Based on the 4-directional reading order prediction, we find the next node of every node.
• Reading order. We determine whether a node is a start-of-line or an end-of-line according to the start-of-line and end-of-line distributions. The reading order is then represented by the paths that start at a start-of-line node and end at an end-of-line node.
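Once the nodes, edges, and start/end flags are available, assembling the text lines amounts to following paths through the graph. The sketch below assumes integer node IDs, a dictionary of next-node edges, and sets of start and end nodes; these structures and the name `decode_lines` are illustrative, not PageNet's actual implementation.

```python
def decode_lines(chars, next_node, starts, ends):
    """Read out text lines as paths from start-of-line to end-of-line nodes.

    chars:     dict node id -> recognized character
    next_node: dict node id -> id of the next node in reading order
    starts:    set of node ids classified as start-of-line
    ends:      set of node ids classified as end-of-line
    """
    lines = []
    for start in sorted(starts):
        path, node, visited = [], start, set()
        while node is not None and node not in visited:
            visited.add(node)
            path.append(node)
            if node in ends:
                lines.append("".join(chars[n] for n in path))
                break
            node = next_node.get(node)
    return lines


chars = {0: "你", 1: "的", 2: "才", 3: "是"}
lines = decode_lines(chars, next_node={0: 1, 2: 3}, starts={0, 2}, ends={1, 3})
```

In this sketch, a path that never reaches an end-of-line node (a dead end or a cycle) is discarded.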
ICDAR WML 2023
Weakly Supervised Learning
43
• Matching: match the results of PageNet with the line-level transcripts in the annotations to find reliable results
• Weakly annotated data using font is used
• Updating: Use the reliable results to update pseudo-labels
• Optimization: Calculate the losses using the updated pseudo-labels to optimize the parameters.
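The matching step can be sketched as below. Note the swapped-in technique: this example scores similarity with Python's `difflib.SequenceMatcher` and uses a greedy one-to-one assignment with a hypothetical threshold of 0.8, whereas the matching criterion actually used by PageNet is not described in these slides.

```python
from difflib import SequenceMatcher


def match_lines(predicted, transcripts, threshold=0.8):
    """Greedily match predicted lines to annotated line-level transcripts.

    Returns a dict: predicted line index -> transcript index, keeping only
    matches whose similarity reaches the threshold (the reliable results).
    """
    matches = {}
    used = set()
    for p_idx, pred in enumerate(predicted):
        best_idx, best_score = None, 0.0
        for t_idx, ref in enumerate(transcripts):
            if t_idx in used:
                continue
            score = SequenceMatcher(None, pred, ref).ratio()
            if score > best_score:
                best_idx, best_score = t_idx, score
        if best_idx is not None and best_score >= threshold:
            matches[p_idx] = best_idx
            used.add(best_idx)
    return matches


predicted = ["你的價值。", "不會在别人的肯定下", "abc"]
transcripts = ["不會在别人的肯定上", "你的價值。"]
matches = match_lines(predicted, transcripts)
```

The matched lines would then supply the pseudo-labels in the updating step, while unmatched predictions are left out.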
Weakly Supervised Learning
• Optimization
Ø Detection branch
Ø Classification branch
Ø Location branch
Ø Start of line & End of line
Ø 4-directional reading order
• Total Loss
44
ICDAR WML 2023
Experimental Results
45
• Det + Recog (Table 2): Faster R-CNN + RRPN + CRNN Recognizer
• Det + Recog (Table 4): Mask R-CNN + CRNN recognizer
Visualization (ICDAR13)
46
Visualization (MTH v2)
47
Results on historical Chinese dataset
Visualization (SCUT-HCCDoc)
48
Results on camera-captured document images
Multi-directional Reading Order
0° 180°
90° 270°
Recognition of Curved Text Line
50
ICDAR WML 2023
Self-Supervised Learning for OCR
How can we mitigate the heavy dependence of deep learning models on large labeled datasets?
Data Synthesis/Data Augmentation
Weakly Supervised Learning/ Weakly Annotated Data
Self-Supervised Learning (SSL)
Outline
• Self-supervised learning (SSL) is a machine learning approach for unlabeled data. Training on a pre-defined pretext task yields a pre-trained model that can improve performance on various downstream tasks.
• SSL can be used to avoid the high cost of collecting and annotating large-scale datasets.
Longlong Jing and Yingli Tian, Self-supervised Visual Feature Learning with Deep Neural Networks: A Survey, IEEE TPAMI 2020
A general Pipeline of SSL
Self-supervised Learning (SSL)
53
Timeline of Visual SSL
C. Zhang, A Survey on Masked Autoencoder for Self-supervised Learning in Vision and Beyond, arXiv, July 2022; IJCAI 2023.
Self-supervised Learning (SSL)
SSL methods can be divided into two main categories: discriminative and generative.
• Discriminative SSL (e.g., contrastive learning). Typical methods: BYOL, SimCLR, MoCo, SimSiam, …
• Generative SSL (e.g., Masked Image Modeling, MIM). Typical methods: BEiT, iBERT, CAE, MAE, SimMIM, …
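As a small illustration of the discriminative family, the sketch below computes an InfoNCE-style contrastive loss for one anchor, one positive, and a set of negatives, the kind of objective used in methods such as SimCLR and MoCo. The temperature value and the function names are assumptions, and plain Python lists stand in for the feature vectors produced by an encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Negative log-probability of picking the positive among all candidates."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[0]
```

The loss is small when the anchor is closer to its positive (another view of the same image) than to the negatives, and large otherwise.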
Self-supervised Learning (SSL)
SSL-based large pre-trained models for document understanding
[Figure: comparison of pre-trained document understanding models by input modalities, encoder, and pre-training strategies]
Models shown: LayoutLM (KDD 2020), LayoutLMv2 (ACL 2021), SelfDoc (CVPR 2021), StructuralLM (ACL 2021), StrucTexT (ACM MM 2021), DocFormer (ICCV 2021)
Modalities: token embeddings; token & visual embeddings; token & visual segment embeddings; semantically meaningful component embeddings; position embeddings (1D, 2D)
Encoders: modality-adaptive self-attention; spatial-aware self-attention; cross-modality encoder; multi-modal self-attention
Pre-training strategies: MVLM, MLM & MVM, MDC, CPC, TIM, TIA, SLP, PBD, LTR, TDI
Richer downstream task and multi-modal fusion:
• MVLM: Masked Visual Language Modeling
• MDC: Multi-label Document Classification
• CPC: Cell Position Classification
• SLP: Segment Length Prediction
• LTR: Learn To Reconstruct
• TDI: Text Describes Image
• TIA: Text-Image Alignment
• TIM: Text-Image Matching
• PBD: Paired Box Direction
• BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents, AAAI 2022
• LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding, ACL 2022
• ……
• SSL in the field of document understanding
SSL Methods in the field of OCR
• SeqCLR [1]: based on SimCLR
• PerSec [2]: based on contrastive learning
[1] Aberdam A, Litman R, Tsiper S, et al. Sequence-to-sequence contrastive learning for text recognition, CVPR 2021
[2] Liu H, Wang B, Bao Z, et al. Perceiving Stroke-Semantic Context: Hierarchical Contrastive Learning for Robust Scene Text Recognition. AAAI 2022.
SSL Methods in OCR
PrimCLR [3]
• Primitive contrastive learning
• Handwritten mathematical expression recognition
STR-CPC [4]
• Based on CPC (contrastive predictive coding)
• A widthwise causal convolution is designed to alleviate the information overlap problem
• Progressive Recovery Training Strategy (PRTS)
CMT-Co [5]
• Based on MoCo v2
• A new character unit cropping module is designed
[3] H. Guo, et al., Primitive Contrastive Learning for Handwritten Mathematical Expression Recognition, ICPR 2022
[4] X. Jiang, et al., Scene Text Recognition with Self-supervised Contrastive Predictive Coding, ICPR 2022
[5] Xiaoyi Zhang, et al., CMT-Co: Contrastive Learning with Character Movement Task for Handwritten Text Recognition, ICFHR 2022
SSL Methods in the field of OCR
• SimAN [6]
• MaskOCR [7]: generative SSL
• DiG [8]: hybrid generative and discriminative SSL
[6] Luo C, Jin L, Chen J. SimAN: Exploring Self-Supervised Representation Learning of Scene Text via Similarity-Aware Normalization, CVPR 2022.
[7] Lyu P, et al. MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining. arXiv preprint arXiv:2206.00311, 2022.
[8] Yang M, Liao M, Lu P, et al. Reading and Writing: Discriminative and Generative Modeling for Self-Supervised Text Recognition, ACM MM, 2022.
SSL Methods in OCR
Text-DIAE [9]
• A new SSL method called Text-Degradation Invariant Auto Encoder (Text-DIAE)
• Three pretext tasks (masking, blur, and background noise)
• Fast convergence
RCLSTR [10]
• Relational Contrastive Learning (RCL) for STR
• Hierarchical representation learning (word, subword, frame)
• Relational regularization, hierarchical relations, inter-hierarchy relational consistency
[9] Text-DIAE: A Self-Supervised Degradation Invariant Autoencoder for Text Recognition and Document Enhancement, AAAI 2023
[10] Jinglei Zhang, Tiancheng Lin, Relational Contrastive Learning for Scene Text Recognition, ACM MM 2023.
SSL Methods in OCR
SelfDocSeg [11]
• A new SSL method for document layout analysis based on BYOL
• Pseudo-layouts (layout masks) are generated from the document images to pre-train the image encoder
• Fine-tuning with an object detector
Dual-MAE [12]
• Decouples visual and semantic feature learning with different masking strategies
• A Siamese network is used to align the dual features
[11] Subhajit Maity, et al., SelfDocSeg: A self-supervised vision-based approach towards Document Segmentation, ICDAR 2023
[12] Z. Qiao, et al., Decoupling Visual-Semantic Features Learning with Dual Masked Autoencoder for Self-Supervised STR, ICDAR 2023
SSL Methods in OCR
Qing Jiang, Jiapeng Wang, Dezhi Peng, Chongyu Liu, Lianwen Jin*, Revisiting Scene Text Recognition: A Data Perspective, ICCV 2023
Paper Preprint: https://arxiv.org/abs/2307.08723
• MAERec: A simple yet effective scene text recognizer based on ViT & MAE-SSL
• A ViT backbone and a Transformer decoder, pre-trained with MAE self-supervised learning with some minor modifications
Code & Data are available at:
https://union14m.github.io
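The core of MAE-style pre-training is to hide a large random subset of image patches and train the model to reconstruct them from the visible ones. The sketch below shows only the random patch selection. The 32×128 image size, 4×4 patch size, and 75% mask ratio are illustrative assumptions rather than MAERec's actual settings, and `random_mask` is a hypothetical helper name.

```python
import random


def random_mask(num_patches, mask_ratio, rng):
    """Randomly split patch indices into visible and masked sets."""
    num_masked = int(num_patches * mask_ratio)
    order = list(range(num_patches))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


# Hypothetical setting: a 32x128 text image cut into 4x4 patches.
num_patches = (32 // 4) * (128 // 4)
visible, masked = random_mask(num_patches, mask_ratio=0.75, rng=random.Random(0))
```

Only the visible patches are passed through the ViT encoder, which keeps pre-training on large unlabeled sets relatively cheap.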
ICDAR WML 2023 63
Discussion & Prospects
ICDAR WML 2023
Is the problem of OCR/STR nearly solved?
• Great progress has been achieved in the field of OCR in recent years
• Current progress in scene text recognition (STR) shows a trend of accuracy saturation
1. Are the common benchmarks still sufficient to drive future progress?
2. Does this accuracy saturation imply that STR is solved?
Qing Jiang, Jiapeng Wang, Dezhi Peng, Chongyu Liu, Lianwen Jin*, Revisiting Scene Text Recognition: A Data Perspective, ICCV 2023
• The six widely used datasets (IC13, SVT, IIIT, IC15, SVTP, CUTE80) are relatively small in scale
• The six datasets lack representativeness of various real-world scenarios.
• The six datasets are less challenging, thus concealing the underlying issues that STR faces
64
1
ICDAR WML 2023
Revisiting OCR: A Data Perspective
• We consolidate a large-scale real STR dataset, Union14M, to investigate the challenges faced by STR models in the real world
• The Union14M benchmark contains two subsets, Union14M-L and Union14M-U
• Union14M-L: 4M labeled images (3,230,742 training samples, 400,000 validation samples, 409,383 testing samples)
• Union14M-U: 10M unlabeled images
13 representative STR models trained on the MJ and ST synthetic datasets show a significant performance degradation when evaluated on the Union14M-L test set.
Qing Jiang, Jiapeng Wang, Dezhi Peng, Chongyu Liu, Lianwen Jin*,
Revisiting Scene Text Recognition: A Data Perspective, ICCV 2023
ICDAR WML 2023
Revisiting OCR: A Data Perspective (cont.)
• Experiments on the new Union14M-L benchmark show that STR is still far from being solved
• Training with the new Union14M dataset produces much better performance
Left table: Recognition accuracy of models
trained on synthetic datasets (MJ and ST)
Right table: Recognition accuracy of models
trained on the training set of Union14M-L.
For MAERec, S and B represent the use of
ViT-Small and ViT-Base as the backbone,
respectively. PT denotes pre-training
Qing Jiang, Jiapeng Wang, Dezhi Peng, Chongyu
Liu, Lianwen Jin*, Revisiting Scene Text
Recognition: A Data Perspective, ICCV 2023
66
ICDAR WML 2023
Revisiting OCR: A Data Perspective (cont.)
• Self-supervised pre-training is an effective way to utilize massive amounts of unlabeled data (the MAERec model)
• Dataset quality is more important than dataset quantity
Qing Jiang, Jiapeng Wang, Dezhi Peng, Chongyu Liu, Lianwen Jin*, Revisiting Scene Text Recognition: A Data Perspective, ICCV 2023
Code & Data are available at: https://union14m.github.io
Paper preprint: https://arxiv.org/abs/2307.08723
67
ICDAR WML 2023
GPT4: Potential OCR ability of LLM?
68
• On Nov. 30, 2022, OpenAI released ChatGPT, a large language model (LLM)
• GPT-4 (released on March 14, 2023) already shows great potential for OCR
ICDAR WML 2023
Multimodal LLMs Demonstrate Amazing Zero-shot OCR Capabilities
Q. Ye, H. Xu, G. Xu, et al., mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality, arXiv 2023.04
Large multimodal models have recently become a sustained research hotspot
ICDAR WML 2023
Potential and limitations of multimodal LLMs in OCR
Yuliang Liu, Zhang Li, et al., On the Hidden Mystery of OCR in Large Multimodal Models, arXiv 2023
• A comprehensive study of existing publicly available multimodal models, evaluating their performance in various OCR tasks
• The preliminary assessment reveals that LMM can achieve promising results, especially in text recognition
CLIP for Scene Text Detection/Recognition
W. Yu, Y. Liu, ..., Xiang Bai*, Turning a CLIP Model into a Scene Text Detector, CVPR 2023
Shuai Zhao, et al., CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model, arXiv, 2023.05.23
• Introducing large multimodal models (e.g., CLIP) to improve text detection and recognition performance is another promising research direction
Toward a Unified Document Understanding Model (UDOP)
Zineng Tang, Ziyi Yang, et al., Unifying Vision, Text, and Layout for Universal Document Processing, CVPR 2023
• Five major problem categories, more than 10 Document AI subtasks
• SOTA on 8 Document AI tasks
• Designing a unified OCR model that can handle different tasks is also an important research topic
Conclusion
• OCR has been an active research field for over 40 years and has become one of the core AI technologies in many industry applications
• Many problems are still not completely solved
  - Complex layout analysis; understanding and reconstruction of complex documents
  - End-to-end Visual Information Extraction (VIE)
  - Table recognition and understanding in the wild
  - Chart analysis and understanding
  - TextVQA and DocVQA in the wild
  - Robust tampered text detection in document images
• Large-scale pre-trained models and SSL for OCR are important future research topics
  - OCR Big Model / OCR Foundation Model
    - A universal OCR model that can handle various OCR tasks
Thank you!
August 26, 2023
Lianwen Jin
Email: eelwjin@scut.edu.cn (primary); lianwen.jin@gmail.com (secondary)
URL: http://www.dlvc-lab.net/lianwen/
Lab of Deep Learning & Vision Computing
South China University of Technology
  • 4. ICDAR WML 2023 Deep learning is going to be able to do everything? https://www.technologyreview.com/2020/11/03/1011616/ai-godfather- geoffrey-hinton-deep-learning-will-do-everything/ Deep Learning is everywhere. CNNs are everywhere! Deep Learning have also become one of the most dominant methodology in the fields of OCR 4
  • 5. ICDAR WML 2023 图像来源及相关材料: “…The emergence of GPUs and the availability of large datasets were key enablers of deep learning…” ——Yoshua Bengio, Yann LeCun, and Geoffrey Hinton • “Data is food for AI” • “Tuning data is more important than tuning models 。” • 80% Data + 20% Model = Better Machine Learning ——Andrew Ng Data is a fundamental key to enable Deep Learning success 5 A commonly cited rule of thumb is that 80% of the success of a deep learning project is due to the quality and quantity of available data, while the remaining 20% is due to the specific machine learning model being used
  • 6. ICDAR WML 2023 Data Issue • However, we cannot always obtain sufficient training data • For instance, data involving personal privacy, financial data, government data… • How to mitigate the high dependence of deep learning models on large data ? • Data Synthesis/Data Augmentation • Weakly Supervised Learning/ Weakly Annotated Data • Self-Supervised Learning (SSL) • … 6
  • 7. ICDAR WML 2023 7 Data Driven • Offline handwriting synthesis: SLOGAN: Handwriting Style Synthesis for Arbitrary-Length and Out-of-Vocabulary Text , IEEE TNNLS, 2022 • Online handwriting synthesis: : SynSig2Vec: Forgery-free Learning of Dynamic Signature Representations and 1D CNN, IEEE TPAMI 2022 How to mitigate the high dependence of deep learning models on large data ? Data Synthesis/Data Augmentation Weakly Supervised Learning/ Weakly Annotated Data Self-Supervised Learning (SSL) Outline
  • 8. 8 Handwriting Style Synthesis for Arbitrary-Length Text Imbalanced distribution of handwriting styles in the IAM dataset (a) (b) Different writers Same writer l Motivation! Ø Existing data synthesis methods do not provide a good guarantee of handwriting style diversity Ø Biased training datasets make it difficult to train models with good performance Ø the distribution of handwriting styles from the popular IAM training set [3] reveals significant biases in terms of both the style and frequency distribution 8 Data from the popular IAM benchmark
  • 9. 9 • Style representation : we propose a style bank to parameterize the specific handwriting styles as latent vectors • Writing Diversity: The handwriting style is parameterized; after the training is completed, the style parameters are randomly adjusted to obtain a variety of new styles. • Content embedding of the text: Input text, output handwriting image SLOGAN: Handwriting Style Synthesis for Arbitrary-Length and OOV Text Canjie Luo, Yuanzhi Zhu, Lianwen Jin, et al. SLOGAN: Handwriting Style Synthesis for Arbitrary-Length and Out-of-Vocabulary Text, IEEE TNNLS, 2022. 9 SLOGAN
  • 10. 10 Style bank of parameterized handwriting style vectors 10 • The style bank embeds the writer ID into a latent vector, which is taken by the generator to transfer the printed style to the corresponding handwriting style. • The style bank can be learnt automatically. • The style bank and the generator are jointly optimized under the supervision of the dual discriminators.
  • 11. 11 Character Style Discriminator: Discriminator 11 Content Discriminator: Cursive Joint Discriminator: Writer ID Discriminator: We design two types of discriminators to help train our GAN based model, which consists of a separated character discriminator and a cursive discriminator.
  • 12. 12 l Handwriting styles are parameterized as latent vector , in which the element zk is manipulatable to control the generated styles. l We can interpolate the entire vector z between two random values achieve style interpolation l Manipulating certain elements zk to achieve special attribute changes Experiments: Style Diversity • Stroke Thickness • Paper Background • Character Slant 12
  • 13. 13 Different adjacent character interval curved text Experiments: Synthesis of arbitrary text 13
  • 15. ICDAR WML 2023 Comparison with existing methods 15 GAN Metric Recognition Metric WER : Word Error Rate CER : Character Error Rate FID: Frechet Inception Distance GS: Geometric-Score Canjie Luo, Yuanzhi Zhu, Lianwen Jin, et al. SLOGAN: Handwriting Style Synthesis for Arbitrary-Length and Out-of-Vocabulary Text. IEEE TNNLS, 2022.
  • 16. ICDAR WML 2023 Comparison with existing methods 16
  • 17. ICDAR WML 2023 Distribution of the handwriting styles 17 l Distribution of the handwriting styles of the word “the” via t-SNE. Distribution of the existing styles in the IAM dataset Distribution after adding our generated samples • From left, it can be seen that the large amount of empty space in the original distribution suggests the limitation of styles. From right, it can be seen that with our generated various styles, the distribution is more even and reasonable, which indicates the bias of the style is significantly rectified
  • 18. 18 SynSig2Vec: A New Forgery-free Dynamic Signature Representation Learning Method for Signature Verification - Songxuan Lai, Lianwen Jin, et al., SynSig2Vec: Learning Representations from Synthetic Dynamic Signatures for Real-world Verification, AAAI 2020 - Songxuan Lai, Lianwen Jin, et al., SynSig2Vec: Forgery-free Learning of Dynamic Signature Representations and 1D CNN, IEEE TPAMI 2022 2 l Handwriting signature verification is an important biometric problem and has wide applications l Signatures may be skillfully imitated by a forger l Skilled forgery data are difficulty to collect l SynSig2Vec requires only genuine signatures for training, yet achieves SOTA performance Code: https://github.com/LaiSongxuan/SynSig2Vec
  • 19. ICDAR WML 2023 19 Signature Parameterization using Sigma Lognormal • The kinematic theory of rapid human movements suggests that human handwriting consists in controlling the pen- tip velocity with overlapped lognormal impulse responses. • The magnitude and direction of the velocity v_i(t) of the ith stroke is described as the following lognormal: • Velocity of a signature can be modeled as the sigma lognormal: • Six parameters for stroke i: • A handwriting signature can be parameterized as: The velocity profile of a typical human handwriting component consists of lognormal impulse responses
  • 20. ICDAR WML 2023 20 Signature Synthesis by perturbating the parameter matrix P 𝑅!, 𝑅"#, 𝑅$, 𝑅%, 𝑅&', 𝑅&( are empirical determined random parameters • Given the parameter matrix P, a trajectory can be recovered as follows: • Introducing random perturbations to the parameter matrix P to generate distorted signatures
  • 21. ICDAR WML 2023 Signature Distortion Levels 21 Configurations of the random variables that decide the signature perturbations levels. • By carefully setting the perturbations level, we can generate three categories of signatures: - Low distorted samples: used as data augmentation of the true signatures - Middle distorted samples: used as skilled forgeries - High distorted samples: used as random forgeries • By this way, we can generate the negative training data based on only the genuine data to train a deep learning model
  • 22. ICDAR WML 2023 Sig2Vec : An effective signature feature representation NN 22 𝐹 → {𝐹)})*+ ,!" ,𝐹) ∈ ℛ|.|×!!" vec) = softmax 𝑤)𝐹) 0 𝐷12 𝐹) 6 vec = [vec+, …, vec,!" The architecture of the proposed Sig2Vec model • Sig2Vec model extracts holistic representations from time functions of online writing signatures • The model is trained using both cross entropy and average precision loss function
  • 23. ICDAR WML 2023 23 Average Precision Optimization for Ranking l We compute and rank similarities of different signatures and incorporate the Average Precision of the ranking into the loss function for optimization. Ø As AP (Average Precision) is non-differential to the network’s outputs, we use the weight update rule for AP optimization according the the General Loss Gradient Theorem[1] [1] Y. Song, A. Schwing, R. Urtasun et al., Training deep neural networks via direct loss minimization, ICML 2016
  • 24. ICDAR WML 2023 24 Experimental Results l Effectiveness of Average Precision Optimization Ø Comparing with traditional BCE or Triplet loss, AP achieves lower error rate and converges faster
  • 25. ICDAR WML 2023 Experimental Results 25 Songxuan Lai, Lianwen Jin, et al., SynSig2Vec: Forgery-free Learning of Dynamic Signature Representations …and 1D CNN, IEEE TPAMI 2022 Datasets Methods Number of templates Threshold Global User-specific MCYT-100 SRSS based on ΣΛ model[1] 1 13.56 - Stroke-RNN[2] 1 10.46 - SynSig2Vec (ours) 1 3.84 1.59 Symbolic representation[3] 5 5.70 2.20 DTW cost matrix information[4] 5 2.76 1.15 DTW with SCC[5] 5 - 2.15 Deep DTW[6] 5 2.40 - Single-template strategy[7] 5 - 1.28 Single-template strategy+LS-DTW[8] 5 - 0.72 SynSig2Vec (ours) 4 2.38 0.96 SVC-Task2 SRSS based on ΣΛ model[1] 1 18.25 - SynSig2Vec (ours) 1 12.16 5.83 DTW cost matrix information[4] 5 7.80 2.53 DTW with SCC[5] 5 - 2.63 Single-template strategy[7] 5 - 2.98 Single-template strategy+LS-DTW[8] 5 - 2.08 SynSig2Vec (ours) 5 3.88 2.08 [1] Diaz M, et al. IEEE TCYB 2016. [2] Li C, et al. ICDAR 2019. [3] Guru D, et al. ESWA 2017. [4] Sharma A, et al. IEEE TCYB 2017. [5] Xia X, et al. PR 2017. [6] Wu X, et al. ICDAR 2019. [7] Okawa. PR 2020. [8] Okawa. PR 2021. Code: https://github.com/LaiSongxuan/SynSig2Vec • Our method achieves state-of-the-art results on two widely used benchmarks, the MCT and SVC datasets. • The advantage of out method is that it requires only genuine signatures for training, which is more practical for developing deep learning-based real-world systems
  • 26. ICDAR WML 2023 Weakly Supervised Text Recognition How to mitigate the high dependence of deep learning models on large labeled data ? Data Synthesis/Data Augmentation Weakly Supervised Learning/ Weakly Annotated Data Self-Supervised Learning (SSL) Outline
  • 27. ICDAR WML 2023 SPTS: Single-Point Text Spotting • End-to-end scene text spotting has made significant progress in recent years. • To train an end-to-end text spotting model, existing methods commonly regard manual annotations such as horizontal rectangles, rotated rectangles, quadrangles, and polygons as a prerequisite, which are very expensive • We show that text spotters can be supervised by a simple yet effective single-point representation (single point+transcription). 27 Dezhi Peng, et al., Lianwen Jin*, SPTS: Single-Point Text Spotting, ACM MM 2022. oral 1 Code: https://github.com/shannanyinxiang/SPTS
  • 28. ICDAR WML 2023 SPTS: Single-Point Text Spotting • A new simple yet effective Transformer-based scene text spotter, inspired by [1]. a) The scene text spotting is formed as a language modeling task. b) An intuitive assumption: if a deep learning model knows what and where the text instances are, it can be taught to generate the desired sequence of results (locations and transcriptions) • The overall framework: a) CNN + Transformer encoder extract the visual and context features b) Transformer decoder predicts a sequence that is subsequently translated into points and transcriptions. c) The complex post-processing and RoI operations are avoided, and better fusion of text detection and recognition is achieved. 28 Dezhi Peng, et al., Lianwen Jin*, SPTS: Single-Point Text Spotting, ACM MM 2022 Code: https://github.com/shannanyinxiang/SPTS [1] T. Chen, S. Saxena, L. Li, D Fleet, G. Hinton, Pix2seq: A Language Modeling Framework for Object Detection, ICLR 2022.
  • 29. ICDAR WML 2023 Sequence Construction • The transcriptions are naturally discrete which consist of character categories. • The continuous coordinates 𝑥, 𝑦 of points are discretized into integers between 1, 𝑛3)4' . • To solve the problem that the transcriptions are of various lengths, we pad (using <PAD> tokens) or truncate them to a fixed length 𝑙"5. • The sequences of text instances are randomly ordered and then concatenated. • The <SOS> (start of seq.) and <EOS> (end of seq.) are inserted to the head and tail of the sequence, respectively. 29 How to construct the output sequence
  • 30. ICDAR WML 2023 Training • The input sequence and output sequence of the decoder: 30 • The model is trained to maximize the likelihood loss given in Eq. (1), where 𝐼 is the input image, ̃ 𝑠 is the output sequence, 𝑠 is the input sequence, and 𝑤! is set to 1.
  • 31. ICDAR WML 2023 Inference • The model auto-regressively predict the sequence until the <EOS> occurs. • The predicted sequence is divided into multiple segments, each of which contains 2 + 𝑙"# tokens (2 tokens for the coordinate of the point and 𝑙"# tokens for the transcription). • The segments are translated into the point coordinates and transcriptions of text instances. • The average likelihood of the tokens in each segment is assigned as the score of the corresponding text instance. 31
  • 32. ICDAR WML 2023 Experimental Datasets • Curved Synthetic Dataset 150k[1]: 150k samples with 1/3 curved texts and 2/3 horizontal texts. • ICDAR 2013[2]: 229 training and 233 testing samples with horizontal texts. • ICDAR 2015[3]: 1000 training and 500 testing samples with multi-oriented texts. • Total-Text[4]: 1255 training and 300 testing samples with arbitrarily shaped texts. • SCUT-CTW1500[5]: 1000 training and 500 testing samples with arbitrarily shaped texts. 32 [1] Y. Liu, et al. ABCNet: Real-time Scene Text Spotting with Adaptive Bezier-Curve Network. CVPR 2020. [2] D. Karatzas, et al. ICDAR 2013 robust reading competition. ICDAR 2013. [4] C. K. Ch’ng, et al. Total-Text: A comprehensive dataset for scene text detection and recognition, ICDAR 2017. [3] D. Karatzas, et al. ICDAR 2015 competition on robust reading. ICDAR 2015. [5] Y. Liu, et al. Curved scene text detection via transverse and longitudinal sequence connection. PR 2019. (a) ICDAR 2013 (b) ICDAR 2015 (c) Total-Text (d) SCUT-CTW1500
  • 34. ICDAR WML 2023 Visualization • Visualization results on Total-Text (the first row) and SCUT-CTW1500 (the second row) benchmarks. 34 Dezhi Peng, et al., Lianwen Jin*, SPTS: Single-Point Text Spotting, ACM MM 2022. oral
  • 35. ICDAR WML 2023 SPTS v2 35 Yuliang Liu, …., Lianwen Jin*, SPTS v2: Single-Point Text Spotting, (under review) https://arxiv.org/abs/2301.01635 • A new Instance Assignment Decoder (IAD) • A new Parallel Recognition Decoder
  • 36. ICDAR WML 2023 No-Point Text Spotting • We further show that SPTS can be trained even without the supervision of single-point annotations • No-Point Text Spotting (NPTS) model is obtained by removing the point coordinates from the constructed sequence. • The right figure shows the qualitative results of NPTS. The table below compares SPTS and NPTS. • The experimental results indicate that the NPTS model can learn the ability to implicitly find out the locations of text merely based on transcriptions. 36 Dezhi Peng, et al., Lianwen Jin*, SPTS: Single-Point Text Spotting, ACM MM 2022. oral Code: https://github.com/shannanyinxiang/SPTS
  • 37. ICDAR WML 2023 Extension of SPTS:Single-Point Object Detection 37 • The single-point object detection experiments are conducted using the Pascal VOC object detection dataset. • The model is trained with central points and corresponding categories. • Preliminary qualitative results on the validation set are shown in Figure 10. • The singe point might be viable to provide extremely low-cost annotation for general object detection. Dezhi Peng, et al., Lianwen Jin*, SPTS: Single-Point Text Spotting, ACM MM 2022. oral Code: https://github.com/shannanyinxiang/SPTS
  • 38. PageNet: Weakly Supervised HCTR • We propose PageNet which is the first method for end-to-end weakly-supervised page-level HCTR. • The model is trained without any bounding box, i.e. only given the line-level transcripts, however, it can output segmentation and recognition at both line-level and character-level. • To the best of our knowledge, PageNet is the first method to address the reading order problem in page level HCTR. The model can handle pages with multidirectional reading order and arbitrarily curved text lines. 你的價值。 不會在别人的肯定上 你怎麽看自己。 才是最重要 你 的 價 值 。 不 會 在 别 人 的 肯 定 上 你 怎 麽 看 自 己 。 才 是 最 重 要 Input Supervision Output Dezhi Peng, Lianwen Jin*, et al., PageNet: Towards End-to-End Weakly Supervised Page-Level Handwritten Chinese Text Recognition, IJCV, 2022 Code: https://github.com/shannanyinxiang/PageNet 2 PageNet: Towards End-to-End Weakly Supervised Page-Level Handwritten Chinese Text Recognition (HCTR) 38
  • 39. Annotation comparison • Comparison of the required annotations and the model outputs of existing page-level methods. Dezhi Peng, Lianwen Jin*, et al., PageNet: Towards End-to-End Weakly Supervised Page-Level Handwritten Chinese Text Recognition, IJCV 2022. • Bluche T., Joint line segmentation and transcription for end-to-end handwritten paragraph recognition, NIPS 2016. • Yousef M., et al., OrigamiNet: Weakly-Supervised, Segmentation-Free, One-Step, Full Page Text Recognition by learning to unfold, CVPR 2020. • Wigington C., et al., Start, follow, read: End-to-end full-page handwriting recognition, ECCV 2018. • Huang Y., et al., Adversarial feature enhancing network for end-to-end handwritten paragraph recognition, ICDAR 2019. • Ma W., et al., Joint layout analysis, character detection and recognition for historical document digitization, ICFHR 2020.
  • 40. Architecture of PageNet. Dezhi Peng, Lianwen Jin*, et al., PageNet: Towards End-to-End Weakly Supervised Page-Level Handwritten Chinese Text Recognition, IJCV 2022. Code: https://github.com/shannanyinxiang/PageNet
  • 41. Reading Order Module of PageNet • Given an unordered set of characters, the reading order problem is to determine the order in which the characters are read. • We solve the reading order problem by making three predictions: the start-of-line distribution, the 4-directional reading order prediction, and the end-of-line distribution. • From a character, we find the next character in the reading order by iteratively moving from one grid to the next, following the direction with maximum probability, until arriving at a new character, as illustrated by the visualization of search paths in Fig. 3.
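The grid walk described above can be sketched as follows. This is a minimal illustration under stated assumptions, not PageNet's implementation: the direction encoding (0=up, 1=right, 2=down, 3=left), the array layout, and the function name `next_character` are all hypothetical.

```python
import numpy as np

# Assumed direction encoding (hypothetical): 0=up, 1=right, 2=down, 3=left.
STEPS = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}

def next_character(start, direction_probs, is_char, max_steps=None):
    """Walk from `start` to the next grid cell that contains a character.

    direction_probs: (H, W, 4) array of per-grid direction probabilities.
    is_char:         (H, W) boolean array, True where a character was detected.
    Returns the (row, col) of the next character, or None if the walk leaves
    the grid or does not reach a character within `max_steps` moves.
    """
    h, w = is_char.shape
    if max_steps is None:
        max_steps = h * w  # upper bound that also guards against cycles
    r, c = start
    for _ in range(max_steps):
        d = int(np.argmax(direction_probs[r, c]))  # most probable direction
        dr, dc = STEPS[d]
        r, c = r + dr, c + dc
        if not (0 <= r < h and 0 <= c < w):
            return None
        if is_char[r, c]:
            return (r, c)
    return None
```

The step limit is a design choice of this sketch: because each move depends only on the current grid, a cyclic prediction would otherwise loop forever.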
  • 42. Graph-based Decoding Algorithm (pipeline of the graph-based decoding algorithm) • Nodes. Each character detection and recognition result is viewed as a node. Therefore, each node corresponds to a grid in which the bounding box and category of a character are predicted. • Edges. Based on the 4-directional reading order prediction, we find the next node of every node. • Reading Order. We decide whether a node is a start-of-line or an end-of-line according to the start-of-line distribution and the end-of-line distribution. The reading order is then represented by the paths that start at a start-of-line node and end at an end-of-line node.
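A small sketch of the decoding step, assuming the nodes and edges have already been extracted. The dictionary-based data layout, the 0.5 threshold on the start-of-line and end-of-line scores, and the function name `decode_lines` are illustrative assumptions rather than details given on the slide.

```python
def decode_lines(chars, next_node, sol_prob, eol_prob, threshold=0.5):
    """Follow edges from every start-of-line node to an end-of-line node.

    chars:     {node_id: recognized character}
    next_node: {node_id: id of the next node in reading order}
    sol_prob:  {node_id: start-of-line score}
    eol_prob:  {node_id: end-of-line score}
    Returns one decoded string per detected text line.
    """
    lines = []
    for start in sorted(chars):
        if sol_prob[start] < threshold:
            continue  # not a start-of-line node
        path, node, seen = [], start, set()
        while node is not None and node not in seen:  # `seen` guards against cycles
            seen.add(node)
            path.append(node)
            if eol_prob[node] >= threshold:
                break  # reached an end-of-line node
            node = next_node.get(node)
        lines.append("".join(chars[n] for n in path))
    return lines
```

For example, with four nodes `a, b, c, d`, edges `a→b→c→d`, start-of-line nodes `a` and `c`, and end-of-line nodes `b` and `d`, the sketch returns the two lines `"ab"` and `"cd"`.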
  • 43. Weakly Supervised Learning • Matching: match the results of PageNet with the line-level transcripts in the annotations to find reliable results. • Weakly annotated data using font is used. • Updating: use the reliable results to update the pseudo-labels. • Optimization: calculate the losses using the updated pseudo-labels to optimize the parameters.
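The matching and updating steps can be pictured with a simplified sketch. It assumes that a predicted character counts as "reliable" when it falls inside a matching block between the predicted line and its transcript, computed here with Python's standard `difflib`; the actual matching criterion used by PageNet may differ, and the function names and pseudo-label layout are hypothetical.

```python
from difflib import SequenceMatcher

def reliable_indices(predicted: str, transcript: str):
    """Indices of predicted characters that align with the line transcript."""
    matcher = SequenceMatcher(None, predicted, transcript, autojunk=False)
    reliable = []
    for block in matcher.get_matching_blocks():
        reliable.extend(range(block.a, block.a + block.size))
    return reliable

def update_pseudo_labels(pseudo_labels, line_id, predicted_chars, transcript):
    """Store (character, box) pairs of reliable predictions as pseudo-labels.

    predicted_chars: list of (character, bounding_box) for one predicted line.
    pseudo_labels:   {line_id: {index_in_prediction: (character, bounding_box)}}
    """
    text = "".join(ch for ch, _ in predicted_chars)
    for i in reliable_indices(text, transcript):
        pseudo_labels.setdefault(line_id, {})[i] = predicted_chars[i]
    return pseudo_labels
```

The pseudo-labels collected this way would then feed the optimization step, where the losses are computed against them.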
  • 44. Weakly Supervised Learning • Optimization: detection branch; classification branch; location branch; start of line & end of line; 4-directional reading order • Total Loss
  • 45. Experimental Results • Det + Recog (Table 2): Faster R-CNN + RRPN + CRNN recognizer • Det + Recog (Table 4): Mask R-CNN + CRNN recognizer
  • 47. Visualization (MTH v2): results on a historical Chinese dataset
  • 48. Visualization (SCUT-HCCDoc): results on camera-captured document images
  • 50. Recognition of Curved Text Line
  • 51. Self-Supervised Learning for OCR (outline) • How can we mitigate the heavy dependence of deep learning models on large labeled datasets? • Data synthesis / data augmentation • Weakly supervised learning / weakly annotated data • Self-supervised learning (SSL)
  • 52. Self-supervised Learning (SSL) • Self-supervised learning (SSL) is a machine learning method for unlabeled data. By training with a pre-defined pretext task, a good pre-trained model is obtained and can be used in various downstream tasks with enhanced performance. • SSL can be used to avoid the extensive cost of collecting and annotating large-scale datasets. [Figure: a general pipeline of SSL] Longlong Jing and Yingli Tian, Self-supervised Visual Feature Learning with Deep Neural Networks: A Survey, IEEE TPAMI 2020.
  • 53. Self-supervised Learning (SSL) [Figure: timeline of visual SSL] C. Zhang, A Survey on Masked Autoencoder for Self-supervised Learning in Vision and Beyond, arXiv (2022-07-30), IJCAI 2023.
  • 54. Self-supervised Learning (SSL) • SSL methods can be divided into two main categories: discriminative and generative. • Discriminative SSL (e.g., contrastive learning). Typical methods: BYOL, SimCLR, MoCo, SimSiam, ... • Generative SSL (e.g., masked image modeling, MIM). Typical methods: BEiT, iBOT, CAE, MAE, SimMIM, ...
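To make the two categories concrete, here is a minimal NumPy sketch of one objective from each family. `info_nce` is a simplified, one-directional contrastive loss (not the exact loss of any listed method), and `random_mask` shows the kind of random patch masking used in masked image modeling; the function names, the temperature, and the mask ratio in the usage note are illustrative assumptions.

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """Contrastive loss: row i of z1 should match row i of z2 (two views).

    z1, z2: (N, D) embeddings of two augmented views of the same N samples.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature                     # cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))            # positives on the diagonal

def random_mask(num_patches, mask_ratio, rng):
    """Boolean mask over image patches; True marks a patch hidden from the encoder."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask
```

In a generative setup, the model would be trained to reconstruct the patches selected by, for example, `random_mask(196, 0.75, np.random.default_rng(0))`, whereas the discriminative loss only compares embeddings.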
  • 55. SSL in the field of document understanding: SSL-based large pre-trained models for document understanding [Figure: comparison of LayoutLM (KDD 2020), LayoutLMv2 (ACL 2021), SelfDoc (CVPR 2021), StructuralLM (ACL 2021), StrucTexT (ACM MM 2021), and DocFormer (ICCV 2021) by modalities (token, visual, and 1D/2D position embeddings), encoder, and pre-training strategies] Richer downstream tasks and multi-modal fusion: • MVLM: Masked Visual Language Modeling • MDC: Multi-label Document Classification • CPC: Cell Position Classification • SLP: Segment Length Prediction • LTR: Learn To Reconstruct • TDI: Text Describes Image • TIA: Text-Image Alignment • TIM: Text-Image Matching • PBD: Paired Box Direction. Further models: • BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents, AAAI 2022 • LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding, ACL 2022 • ...
  • 56. SSL Methods in the field of OCR • SeqCLR [1]: based on SimCLR • PerSec [2]: based on contrastive learning. [1] Aberdam A., Litman R., Tsiper S., et al., Sequence-to-sequence contrastive learning for text recognition, CVPR 2021. [2] Liu H., Wang B., Bao Z., et al., Perceiving Stroke-Semantic Context: Hierarchical Contrastive Learning for Robust Scene Text Recognition, AAAI 2022.
  • 57. SSL Methods in OCR • PrimCLR [3]: primitive contrastive learning; handwritten mathematical expression recognition • STR-CPC [4]: based on CPC (contrastive predictive coding); a widthwise causal convolution is designed to alleviate the information overlap problem; Progressive Recovery Training Strategy (PRTS) • CMT-Co [5]: based on MoCo v2; a new character unit cropping module is designed. [3] H. Guo, et al., Primitive Contrastive Learning for Handwritten Mathematical Expression Recognition, ICPR 2022. [4] X. Jiang, et al., Scene Text Recognition with Self-supervised Contrastive Predictive Coding, ICPR 2022. [5] Xiaoyi Zhang, et al., CMT-Co: Contrastive Learning with Character Movement Task for Handwritten Text Recognition, ICFHR 2022.
  • 58. SSL Methods in the field of OCR • SimAN [6] • MaskOCR [7]: generative SSL • DiG [8]: hybrid generative & discriminative SSL. [6] Luo C., Jin L., Chen J., SimAN: Exploring Self-Supervised Representation Learning of Scene Text via Similarity-Aware Normalization, CVPR 2022. [7] Lyu P., et al., MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining, arXiv preprint arXiv:2206.00311, 2022. [8] Yang M., Liao M., Lu P., et al., Reading and Writing: Discriminative and Generative Modeling for Self-Supervised Text Recognition, ACM MM 2022.
  • 59. SSL Methods in OCR • Text-DIAE [9]: a new SSL method called Text-Degradation Invariant Auto Encoder (Text-DIAE); three pretext tasks (masking, blur, and background noise); fast convergence • RCLSTR [10]: Relational Contrastive Learning (RCL) for STR; hierarchical representation learning (word, subword, frame); relational regularization, hierarchical relations, inter-hierarchy relational consistency. [9] Text-DIAE: A Self-Supervised Degradation Invariant Autoencoder for Text Recognition and Document Enhancement, AAAI 2023. [10] Jinglei Zhang, Tiancheng Lin, Relational Contrastive Learning for Scene Text Recognition, ACM MM 2023.
  • 60. SSL Methods in OCR • SelfDocSeg [11]: a new SSL method for document layout analysis based on BYOL; pseudo-layouts (layout masks) are generated from the document images to pre-train the image encoder; fine-tuning by an object detector • Dual-MAE [12]: decouples visual and semantic feature learning with different masking strategies; a Siamese network is used to align the dual features. [11] Subhajit Maity, et al., SelfDocSeg: A self-supervised vision-based approach towards Document Segmentation, ICDAR 2023. [12] Z. Qiao, et al., Decoupling Visual-Semantic Features Learning with Dual Masked Autoencoder for Self-Supervised STR, ICDAR 2023.
  • 61. SSL Methods in OCR • MAERec: a simple yet effective scene text recognizer based on ViT & MAE-SSL • A ViT backbone and a Transformer decoder, trained with MAE self-supervised learning with some minor modifications. Qing Jiang, Jiapeng Wang, Dezhi Peng, Chongyu Liu, Lianwen Jin*, Revisiting Scene Text Recognition: A Data Perspective, ICCV 2023. Paper preprint: https://arxiv.org/abs/2307.08723 Code & data are available at: https://union14m.github.io
  • 62. SSL Methods in OCR • MAERec: a simple yet effective scene text recognizer based on ViT & MAE-SSL. Qing Jiang, Jiapeng Wang, Dezhi Peng, Chongyu Liu, Lianwen Jin*, Revisiting Scene Text Recognition: A Data Perspective, ICCV 2023. Paper preprint: https://arxiv.org/abs/2307.08723 Code & data are available at: https://union14m.github.io
  • 63. Discussion & Prospects
  • 64. (1) Is the problem of OCR/STR nearly solved? • Great progress has been made in the field of OCR in recent years. • Current progress in scene text recognition (STR) shows a trend of accuracy saturation, which raises two questions: 1. Are the common benchmarks still sufficient to promote future progress? 2. Does this accuracy saturation imply that STR is solved? • The six widely used datasets (IC13, SVT, IIIT, IC15, SVTP, CUTE80) are relatively small in scale. • The six datasets are not representative of the variety of real-world scenarios. • The six datasets are not very challenging, which conceals the underlying issues that STR faces. Qing Jiang, Jiapeng Wang, Dezhi Peng, Chongyu Liu, Lianwen Jin*, Revisiting Scene Text Recognition: A Data Perspective, ICCV 2023.
  • 65. Revisiting OCR: A Data Perspective • We consolidate a large-scale real STR dataset, Union14M, to investigate the challenges faced by STR models in the real world. • The Union14M benchmark contains two subsets, Union14M-L and Union14M-U. • Union14M-L: 4M labeled images (3,230,742 training samples, 400,000 validation samples, 409,383 testing samples). • Union14M-U: 10M unlabeled images. • 13 representative STR models trained on the synthetic datasets MJ and ST show a significant performance degradation when evaluated on the Union14M-L test set. Qing Jiang, Jiapeng Wang, Dezhi Peng, Chongyu Liu, Lianwen Jin*, Revisiting Scene Text Recognition: A Data Perspective, ICCV 2023.
  • 66. Revisiting OCR: A Data Perspective (cont.) • Experiments on the new Union14M-L benchmark show that STR is still far from being solved. • Training with the new Union14M dataset produces much better performance. Left table: recognition accuracy of models trained on the synthetic datasets (MJ and ST). Right table: recognition accuracy of models trained on the training set of Union14M-L. For MAERec, S and B denote the use of ViT-Small and ViT-Base as the backbone, respectively; PT denotes pre-training. Qing Jiang, Jiapeng Wang, Dezhi Peng, Chongyu Liu, Lianwen Jin*, Revisiting Scene Text Recognition: A Data Perspective, ICCV 2023.
  • 67. Revisiting OCR: A Data Perspective (cont.) • Self-supervised pre-training is an effective way to utilize massive amounts of unlabeled data (the MAERec model). • The quality of a dataset is more important than its quantity. Qing Jiang, Jiapeng Wang, Dezhi Peng, Chongyu Liu, Lianwen Jin*, Revisiting Scene Text Recognition: A Data Perspective, ICCV 2023. Code & data are available at: https://union14m.github.io Paper preprint: https://arxiv.org/abs/2307.08723
  • 68. (2) GPT-4: Potential OCR ability of LLMs? • On Nov. 30, 2022, OpenAI released ChatGPT, a large language model (LLM). • GPT-4 (released on March 14, 2023) already shows great potential for OCR.
  • 69. Multimodal LLMs Demonstrate Amazing Zero-shot OCR Capabilities • Large multimodal models have recently been a continuing research hotspot. Q. Ye, H. Xu, G. Xu, et al., mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality, arXiv 2023.04.
  • 70. Potential and limitations of multimodal LLMs in OCR • A comprehensive study of existing publicly available multimodal models, evaluating their performance in various OCR tasks. • The preliminary assessment reveals that large multimodal models (LMMs) can achieve promising results, especially in text recognition. Yuliang Liu, Zhang Li, et al., On the Hidden Mystery of OCR in Large Multimodal Models, arXiv 2023.
  • 71. (3) CLIP for Scene Text Detection/Recognition • Introducing large multimodal models (e.g. CLIP) to help improve text detection and recognition performance is another promising direction in recent research. W. Yu, Y. Liu, ..., Xiang Bai*, Turning a CLIP Model into a Scene Text Detector, CVPR 2023. Shuai Zhao, et al., CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model, arXiv, 2023.05.23.
  • 72. (4) Toward a Unified Document Understanding Model (UDOP) • Five major problem categories, more than 10 Document AI subtasks • SOTA on 8 Document AI tasks • Designing a unified OCR model that can handle different tasks is also an important research topic. Zineng Tang, Ziyi Yang, et al., Unifying Vision, Text, and Layout for Universal Document Processing, CVPR 2023.
  • 73. Conclusion • OCR has been an active research field for over 40 years. It has become one of the core AI technologies in many industry applications. • Many problems are still not completely solved: complex layout analysis, and understanding and reconstruction of complex documents; end-to-end Visual Information Extraction (VIE); table recognition & understanding in the wild; chart analysis and understanding; TextVQA and DocVQA in the wild; robust tampered text detection in document images. • Large-scale pre-trained models and SSL for OCR are important future research topics: OCR large model / OCR foundation model, i.e. a universal OCR model that can deal with various OCR tasks.
  • 74. Thank you! August 26, 2023 • Lianwen Jin • Email: eelwjin@scut.edu.cn (primary); lianwen.jin@gmail.com (secondary) • URL: http://www.dlvc-lab.net/lianwen/ • Lab of Deep Learning & Vision Computing, South China University of Technology • ICDAR WML 2023