This paper proposes using recurrent connectionist language models to improve LSTM-based Arabic text recognition in videos. It trains RNN and RNNME language models on a large Arabic text corpus and integrates them into an LSTM-CTC optical character recognition system using a modified beam search decoding scheme. Experimental results show the connectionist language models outperform n-gram models, improving word recognition rate by over 16% compared to the baseline model without a language model. The full system also outperforms a commercial OCR engine by over 35% word recognition rate.
3. Motivation
• Automatically recognizing such texts can avoid a large part of the manual video annotation effort
• Only very few works have addressed the problem of Arabic video OCR, even though many major Arabic news channels have appeared in the last two decades and more than half a billion people worldwide use the Arabic language
4. Contribution
• Different recurrent connectionist language models to improve LSTM-based Arabic text recognition in videos.
• An efficient joint decoding paradigm using language model and LSTM responses.
• Additional decoding hyper-parameters, extensively evaluated, that improve recognition results and optimize running time.
• A significant recognition improvement by integrating connectionist language models, whose contribution outperforms that of n-grams.
• A final Arabic OCR system that significantly outperforms a commercial OCR engine.
5. Related work
• Language Model
• N-grams have been considered the state-of-the-art LM for many years.
• Their most important drawback is their inability to represent long-context patterns.
• Even with massive amounts of training data, a large part of the patterns cannot be effectively represented and discovered during training.
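As a toy illustration of this limitation (a hypothetical example, not from the paper): a character 7-gram conditions on at most the six preceding characters, so any earlier context is simply discarded.

```python
# Minimal sketch of the fixed-context limitation of n-grams:
# a character n-gram of order 7 sees only the last 6 characters.

def ngram_history(text: str, order: int = 7) -> str:
    """Return the context a character n-gram actually conditions on."""
    return text[-(order - 1):]

line = "the satellite channel broadcast"
# Only the last 6 characters survive as conditioning context:
print(ngram_history(line))  # -> "adcast"
```

Everything before those six characters is invisible to the model, which is why long-range patterns cannot be captured regardless of the amount of training data.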
6. Related work
• Neural Networks
• NN-based LMs were introduced more than a decade ago by Elman and Bengio et al., and have been successfully used for automatic speech recognition as well as for offline HTR.
• The main drawback of these models remains their high computational complexity.
• The strength of the RNN-based LM lies in its representation of the history:
• Unlike the previously mentioned models, context patterns are learned from data.
• The history is represented recurrently by the hidden layer of the network.
• RNN language modelers can therefore handle arbitrarily long contexts.
7. Proposed Methodology
In this work, the authors focus on two main factors to reach better improvements:
(1) what type of language model to choose
(2) how to integrate it into the decoding scheme
8. Proposed Methodology
1. Arabic OCR system
2. Language modeling
• RNN-based language modeling
• RNNME: joint learning of RNN and ME LMs
3. Decoding scheme
9. Proposed Methodology
Arabic OCR system
• It takes as input a text image, without any pre-processing or prior segmentation
• It transforms Arabic text images into sequences of relevant learned features
• Text transcription is then performed using the BLSTM-CTC scheme
10. Proposed Methodology
Arabic OCR system - Feature extraction
• We apply a multi-scale scanning of the input text image using 4 sliding windows with different aspect ratios
• Each window is then transformed into a set of learned features using a Convolutional Neural Network (ConvNet)
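The scan above can be sketched as follows; the window widths, aspect ratios, and step size are illustrative assumptions (the paper fixes 4 windows, but these exact values are not given here), and each resulting crop would then be fed to the ConvNet feature extractor.

```python
# Sketch of multi-scale scanning with 4 sliding windows of different
# aspect ratios (values are assumptions for illustration).

def window_positions(img_width, img_height,
                     aspect_ratios=(0.5, 1.0, 1.5, 2.0), step=4):
    """Yield (x, w, h) crops: each window spans the full image height
    and has width = aspect_ratio * height."""
    crops = []
    for ar in aspect_ratios:
        w = max(1, int(ar * img_height))
        for x in range(0, img_width - w + 1, step):
            crops.append((x, w, img_height))
    return crops

crops = window_positions(100, 36)
print(len(crops), crops[0])
```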
11. Proposed Methodology
Language modeling
• RNN-based language modeling
• The network is trained using truncated BPTT to avoid two extreme cases during training:
• the first case leads to high computational complexity without giving the network the opportunity to store relevant context information that can be useful in the future
• the second case is bottlenecked by the vanishing gradient problem
• RNNME: joint learning of RNN and ME LMs
• In a classical Maximum Entropy model, the features are usually hand-designed
• Weight setup
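Truncated BPTT can be sketched on a toy scalar RNN (illustrative only, not the paper's network): gradients are propagated back through at most `tau` time steps, which is exactly the knob that trades context length against cost.

```python
import math

# Sketch of truncated BPTT on a scalar RNN with forward pass
# h_t = tanh(w * h_{t-1} + x_t): the backward chain is cut after
# at most `tau` steps.

def truncated_bptt_grad(xs, w, tau):
    """Gradient of the final hidden state w.r.t. w, unrolled tau steps."""
    hs = [0.0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + x))
    grad, chain = 0.0, 1.0
    T = len(xs)
    for t in range(T, max(T - tau, 0), -1):   # walk back at most tau steps
        local = 1.0 - hs[t] ** 2              # tanh'
        grad += chain * local * hs[t - 1]     # d h_t / d w contribution
        chain *= local * w                    # propagate through h_{t-1}
    return grad

g_short = truncated_bptt_grad([0.5, 0.1, -0.2, 0.3], w=0.8, tau=1)
g_long = truncated_bptt_grad([0.5, 0.1, -0.2, 0.3], w=0.8, tau=4)
print(g_short, g_long)
```

A small `tau` gives a cheap but short-sighted gradient; a large `tau` captures more context at higher cost while the `chain` factor shrinks multiplicatively, which is the vanishing-gradient effect mentioned above.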
12. Proposed Methodology
Decoding schema
• Goal: the decoding stage aims at finding the most probable transcription given the network outputs
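A minimal sketch of joint decoding with an LM inside beam search, assuming toy per-frame character posteriors and ignoring CTC blanks and repeat merging (the paper's modified beam search handles these); `lm_weight` plays the role of the LM weights tuned later, and `uniform_lm` is a stand-in for a trained RNN LM.

```python
import math

# Toy beam search: each prefix extension combines the optical
# log-probability with an LM log-probability scaled by lm_weight.

def beam_search(frames, lm, beam_width=3, lm_weight=0.5):
    """frames: list of {char: prob}; lm(prefix, char) -> prob."""
    beams = {"": 0.0}                       # prefix -> log score
    for dist in frames:
        scored = {}
        for prefix, score in beams.items():
            for ch, p in dist.items():
                s = score + math.log(p) + lm_weight * math.log(lm(prefix, ch))
                cand = prefix + ch
                if cand not in scored or s > scored[cand]:
                    scored[cand] = s
        beams = dict(sorted(scored.items(), key=lambda kv: -kv[1])[:beam_width])
    return max(beams, key=beams.get)

uniform_lm = lambda prefix, ch: 0.5         # stand-in for a real LM
frames = [{"a": 0.6, "b": 0.4}, {"a": 0.3, "b": 0.7}]
print(beam_search(frames, uniform_lm))  # -> "ab"
```

With a uniform LM the optical scores dominate; a trained character LM would rescore the extensions and can flip the ranking of visually confusable hypotheses.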
13. Experimental setup
1. OCR system setup
2. Language models set-up
3. Primary results
4. Tuning decoding parameters
5. Final results
14. OCR system setup
• Goals: study the contribution of language modeling to optical text recognition, and the effect of its integration paradigm into the decoding stage
15. OCR system setup
• Dataset: the ALIF dataset, composed of Arabic text images extracted from Arabic TV broadcasts
• BLSTM-CTC component: trained using 7,673 text images (the 4,152 examples of the ALIF_Train subset, augmented to 7,673 examples by applying image processing operations such as color inversion and blurring)
• Algorithm: stochastic gradient descent, with learning rate = 10^-4, momentum = 0.9, and random initial weights in [-0.1, 0.1]
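The update rule implied by this setup (SGD with momentum) can be sketched as follows; the quadratic loss is a stand-in for illustration, since the real objective is the CTC loss.

```python
# SGD with momentum, matching the hyper-parameters above:
# v <- momentum * v - lr * grad;  w <- w + v.

def sgd_momentum_step(w, v, grad, lr=1e-4, momentum=0.9):
    v = momentum * v - lr * grad
    return w + v, v

# Toy usage on the stand-in loss L(w) = w^2 (so grad = 2w):
w, v = 0.1, 0.0
for _ in range(3):
    w, v = sgd_momentum_step(w, v, grad=2 * w)
print(w)  # moves toward the minimum at 0
```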
16. OCR system setup
• ConvNet feature extraction: 20,571 character images are initially extracted from these images; scaling and color inversion operations are applied to obtain a set of 46,689 single-character images, used to train and evaluate the ConvNet model
• BLSTM network training: all Arabic letter shapes have been considered; the OCR component distinguishes the different shapes of a letter depending on its position in a word, while for Arabic language modeling and final text transcription atomic Arabic letters are considered
17. Language models set-up
• Goal: choose the language model
18. Language models set-up – Building the language dataset
• Dataset sources:
• Ajdir Corpora (from Arabic newspapers)
• Watan-2004 / Khaleej-2004 Corpora (from Arabic newspapers)
• Open Source Arabic Corpora (includes texts collected from the websites of Arabic TV channels such as BBC and CNN)
• To fit the context, the obtained text (dedicated to the LM) is cut into text lines with a limited number of words
Note: the dataset contains 52.08M characters in total
19. Language models set-up – Building the language dataset
• Other processing steps: removing non-Arabic characters, digits, extra punctuation marks, and a large number of text repetitions
• Text lines are then split into individual characters
• The space between words is replaced by a specific label
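A minimal sketch of these steps; the Unicode range and the `<sp>` label are assumptions, since the paper only says that a "specific label" replaces the space.

```python
import re

# Keep Arabic characters and spaces, drop digits/punctuation, then
# split a line into character tokens with spaces replaced by a label.

ARABIC = re.compile(r"[\u0600-\u06FF ]")   # Arabic block + space (assumed)

def line_to_tokens(line: str, space_label: str = "<sp>"):
    kept = "".join(ch for ch in line if ARABIC.match(ch))
    kept = re.sub(r" +", " ", kept).strip()
    return [space_label if ch == " " else ch for ch in kept]

print(line_to_tokens("قناة 24 الجزيرة!"))  # digits and '!' are dropped
```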
20. Language models set-up – Training methodology
• Dataset split: text lines totalling 44.47M characters are randomly selected to train the LMs; the remaining lines are split into two subsets: 4.29M characters for validation and 3.32M characters (denoted TEXT_Test) for testing
• Algorithm: the different RNN and RNNME models are trained using stochastic gradient descent with learning rate = 0.1, and entropy is evaluated on the validation set
Note: the entropy reflects how well the LM, as a probabilistic model, predicts samples
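With toy probabilities (not real LM output), the per-character entropy H = -(1/N) Σ log2 p(c_i | history) used as the evaluation criterion can be sketched as:

```python
import math

# Per-character entropy in bits: lower means the LM predicts the
# held-out text better (probabilities below are illustrative).

def entropy_bits(probs):
    return -sum(math.log2(p) for p in probs) / len(probs)

confident = [0.9, 0.8, 0.95, 0.85]   # a model that predicts well
uncertain = [0.3, 0.2, 0.25, 0.3]    # a model that predicts poorly
print(entropy_bits(confident), entropy_bits(uncertain))
```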
21. Language models set-up – Training methodology
• In parallel, n-gram LMs are trained using the SRILM toolkit
• The models are smoothed using Witten-Bell discounting and the order is tuned on the validation set
• The best entropy results are obtained with a 7-gram LM
22. Language models set-up – Supplement
• For tuning the joint decoding, the WRR criterion is used on a separate development set, denoted DEV_Set (made up of 1,225 text images)
• Final OCR system test: text images from the ALIF_Test1 and ALIF_Test2 sets, containing 827 and 1,175 text images respectively
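The WRR formula is not restated here; the sketch below assumes the usual definition (correctly recognized words over total reference words), with a simplistic position-wise match standing in for a proper alignment.

```python
# Word recognition rate (assumed definition) over parallel lists of
# reference and hypothesis transcriptions.

def wrr(references, hypotheses):
    total = correct = 0
    for ref, hyp in zip(references, hypotheses):
        ref_w, hyp_w = ref.split(), hyp.split()
        total += len(ref_w)
        correct += sum(r == h for r, h in zip(ref_w, hyp_w))
    return 100.0 * correct / total

refs = ["قناة الجزيرة", "نشرة الاخبار"]
hyps = ["قناة الجزيرة", "نشرة الاحبار"]   # one word misrecognized
print(wrr(refs, hyps))  # percentage of exactly matched words
```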
23. Primary results
• The entropy result of each LM in the OCR scheme is presented in Fig. 5
• Dataset: TEXT_Test set
24. Primary results
• Dataset: DEV_Set
• The decoding schemes are evaluated in terms of WRR on the DEV_Set text images
• Models with lower entropy yield better recognition rates when integrated into the decoding
• Connectionist LMs outperform n-grams both in terms of entropy and WRR
25. Primary results
• Entropy and WRR during training (RNN-700 LM)
• Result: the entropy reduction corresponds to a progressive improvement in WRR when the joint decoding is applied at each of these epochs, which again confirms the soundness of the proposed joint decoding scheme
28. Tuning decoding parameters – LM weights
• Based on experimental results: (ω1, ω2) = (0.7, 0.55)
• The map clearly shows the effect of the LM weights on the recognition results
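The exact role of (ω1, ω2) is defined in the paper; the sketch below only assumes, for illustration, that one weight scales the LM log-probability and the other acts as a per-character insertion bonus in the joint score.

```python
# Hypothetical joint score combining optical and LM evidence with the
# tuned weights (interpretation of (w1, w2) is an assumption).

def joint_score(optical_logp, lm_logp, n_chars, w1=0.7, w2=0.55):
    return optical_logp + w1 * lm_logp + w2 * n_chars

# Same optical score, different LM scores: the LM weight decides how
# strongly linguistic evidence reorders hypotheses.
a = joint_score(optical_logp=-4.0, lm_logp=-1.0, n_chars=5)
b = joint_score(optical_logp=-4.0, lm_logp=-3.0, n_chars=5)
print(a, b)
```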
29. Tuning decoding parameters – Beam width
• To analyze the impact of the beam width on decoding results, the two best previously performing models are considered: RNN-700 and RNNME-300
• Research point: the impact of the beam width in terms of WRR and average processing time per word
• (a) beam width vs. WRR: RNNME-300 performs better
• (b) beam width vs. average processing time
30. Final results
• The improvement reaches almost 16% with the BS-RNN-700 scheme on both datasets
• The character-level connectionist LMs still outperform the n-gram LM in terms of WRR
• In terms of speed, BS-7-gram is faster than the other LM-based schemes
31. Final results – Performance of the proposed text recognizer
• A well-known commercial OCR engine, "ABBYY FineReader 12", was chosen
• The Arabic OCR component of this engine was applied to the selected images of ALIF_Test1
• Examples with digits and punctuation marks were excluded
• Our methods largely outperform the ABBYY system, by more than 35 points in terms of WRR
32. Final results – Other observations
• Finally, Fig. 10 illustrates some OCR system outputs corrected by joint decoding using the RNN-700 LM
• The linguistic information is able to correct confusions between similar characters such as 'Saad' and 'Ayn'
33. Result – Models used in this paper
• Language models (8 in total):
• RNN-based: RNN-100, RNN-300, RNN-500, RNN-700
• RNNME-based: RNNME-100, RNNME-300, RNNME-500
• N-gram-based: 7-gram LM
• Tuning decoding parameters (RNNME-300, RNN-700):
• LM weights (0.7, 0.55)
• Beam width (20, 6.5)
• Score pruning (BS-RNN-700, BS-RNNME-300)
• Comparison with a commercial OCR engine: BS-NO-LM, BS-RNN-700, BS-RNN-300
34. Result
1. LMs are used for Arabic text recognition in videos
2. RNNs are used to model long-range dependencies in the language (two types of character-level language models are built: one based on RNNs, the other based on joint learning of Maximum Entropy and RNN models)
3. A modified version of beam search is used as the decoding scheme
4. Hyper-parameters are introduced into the decoding to reach a better balance between recognition results and response time
5. The whole paradigm is extensively evaluated on the public ALIF dataset
6. The paper's language models outperform n-grams by more than 4 points in terms of WRR
7. With the contribution of these models and the impact of the decoding scheme, WRR improves by nearly 16 points compared to the BLSTM-CTC OCR alone (BS-RNN-700 vs. BS-NO-LM)
8. Result: the methods used in the paper achieve good recognition rates, outperforming a commercial OCR engine by nearly 36% (BS-RNN-700 vs. ABBYY FineReader 12)