Tesseract. Recognizing Errors in Recognition Software

Tesseract. Recognizing Errors in Recognition
Software
Author: Andrey Karpov
Date: 21.05.2014
Tesseract is a free software program for text recognition developed by Google. According to the project
description, "Tesseract is probably the most accurate open source OCR engine available". And what if
we try to catch some bugs there with the help of the CppCat analyzer?
Tesseract
Tesseract is an optical character recognition engine for various operating systems and is free software
originally developed as proprietary software in Hewlett Packard labs between 1985 and 1994, with
some more changes made in 1996 to port to Windows, and some migration from C to C++ in 1998. A lot
of the code was written in C, and then some more was written in C++. Since then all the code has been
converted to at least compile with a C++ compiler. Very little work was done in the following decade. It
was then released as open source in 2005 by Hewlett Packard and the University of Nevada, Las Vegas
(UNLV). Tesseract development has been sponsored by Google since 2006. [taken from Wikipedia]
The source code of the project is available at Google Code: https://code.google.com/p/tesseract-ocr/
The size of the source code is about 16 Mbytes.
CppCat
I used the CppCat analyzer to check the project. This tool is a lightweight version of the PVS-Studio
analyzer, but it is quite enough for small projects like Tesseract, so I had no problems checking it with
CppCat.
If you are hesitating about what to download first - CppCat or PVS-Studio - choose the former. And only
if you find CppCat insufficient for you task or feel you're missing some functionality, move on to PVS-Studio.
To learn more about the differences between the two, see the article "An Alternative to PVS-Studio
at $250".
Analysis results
Below I will cite those code fragments that caught my attention while examining CppCat's analysis
report. I could have probably missed something, so Tesseract's authors should carry out their own

analysis. The trial version is active through 7 days, which is more than enough for such a small project. It
will be then up to them to decide if they want to use the tool regularly and catch typos or not.
As usual, let me remind you the basic law: the static analysis methodology is all about using it regularly,
not on rare occasions.
Poor division
void LanguageModel::FillConsistencyInfo(....)
{
....
float gap_ratio = expected_gap / actual_gap;
if (gap_ratio < 1/2 || gap_ratio > 2) {
consistency_info->num_inconsistent_spaces++;
....
}
CppCat's diagnostic messages: V636 The '1 / 2' expression was implicitly casted from 'int' type to 'float'
type. Consider utilizing an explicit type cast to avoid the loss of a fractional part. An example: double A =
(double)(X) / Y;. language_model.cpp 1163
The programmer wanted to compare the 'gap_ratio' variable with the value 0.5. Unfortunately, he chose
a poor way to write 0.5. 1/2 is integer division and evaluates to 0.
The correct code should look like this:
if (gap_ratio < 1.0f/2 || gap_ratio > 2) {
or this:
if (gap_ratio < 0.5f || gap_ratio > 2) {
There are some other fragments with suspicious integer division. Some of them may also contain really
unpleasant errors.
The following are the code fragments that need to be checked:
• baselinedetect.cpp 110
• bmp_8.cpp 983
• cjkpitch.cpp 553
• cjkpitch.cpp 564
• mfoutline.cpp 392
• mfoutline.cpp 393
• normalis.cpp 454
Typo in a comparison
uintmax_t streamtoumax(FILE* s, int base) {
int d, c = 0;
....
c = fgetc(s);
if (c == 'x' && c == 'X') c = fgetc(s);

....
}
CppCat's diagnostic message: V547 Expression 'c == 'x' && c == 'X'' is always false. Probably the '||'
operator should be used here. scanutils.cpp 135
The fixed check:
if (c == 'x' || c == 'X') c = fgetc(s);
Undefined behavior
I have discovered one interesting construct I have never seen before:
void TabVector::Evaluate(....) {
....
int num_deleted_boxes = 0;
....
++num_deleted_boxes = true;
....
}
CppCat's diagnostic message: V567 Undefined behavior. The 'num_deleted_boxes' variable is modified
while being used twice between sequence points. tabvector.cpp 735
It's not clear what the author meant by this code; it must be the result of a typo.
The result of this expression can't be predicted: the variable 'num_deleted_boxes' may be incremented
both before and after the assignment. The reason is that the variable changes twice in one sequence
point.
Other errors causing undefined behavior are related to shifts. For example:
void Dawg::init(....)
{
....
letter_mask_ = ~(~0 << flag_start_bit_);
....
}
Diagnostic message V610 Undefined behavior. Check the shift operator '<<. The left operand '~0' is
negative. dawg.cpp 187
The '~0' expression is of the 'int' type and evaluates to '-1'. Shifting negative values causes undefined
behavior, so it is just pure luck that the program works well. To fix the bug, we need to make '0'
unsigned:
letter_mask_ = ~(~0u << flag_start_bit_);
But that's not all. This line also triggers one more warning:

V629 Consider inspecting the '~0 << flag_start_bit_' expression. Bit shifting of the 32-bit value with a
subsequent expansion to the 64-bit type. dawg.cpp 187
The point is that the variable 'letter_mask_' is of the 'uinT64' type. As far as I understand, it may be
needed to write ones into the most significant 32 bits. In this case, the implemented expression is
incorrect because it can handle only the least significant bits.
We need to make '0' of a 64-bit type:
letter_mask_ = ~(~0ull << flag_start_bit_);
Here is a list of other code fragments where negative numbers are shifted:
• dawg.cpp 188
• intmatcher.cpp 172
Suspicious double assignment
TESSLINE* ApproximateOutline(....) {
EDGEPT *edgept;
....
edgept = edgesteps_to_edgepts(c_outline, edgepts);
fix2(edgepts, area);
edgept = poly2 (edgepts, area); // 2nd approximation.
....
}
CppCat's diagnostic message: V519 The 'edgept' variable is assigned values twice successively. Perhaps
this is a mistake. Check lines: 76, 78. polyaprx.cpp 78
Another similar error:
inT32 row_words2(....)

{
....
this_valid = blob_box.width () >= min_width;
this_valid = TRUE;
....
}
CppCat's diagnostic message: V519 The 'this_valid' variable is assigned values twice successively.
Perhaps this is a mistake. Check lines: 396, 397. wordseg.cpp 397
Incorrect order of class member initialization
Let's examine the 'MasterTrainer' class first. Notice that the 'samples_' member is written before the
'fontinfo_table_' member:
class MasterTrainer {
....
TrainingSampleSet samples_;
....
FontInfoTable fontinfo_table_;
....
};
According to the standard, class members are initialized in the constructor in the same order as they are
declared inside the class. It means that 'samples_' will be initialized PRIOR to 'fontinfo_table_'.
Now let's examine the constructor:
MasterTrainer::MasterTrainer(NormalizationMode norm_mode,
bool shape_analysis,
bool replicate_samples,
int debug_level)
: norm_mode_(norm_mode), samples_(fontinfo_table_),
junk_samples_(fontinfo_table_),
verify_samples_(fontinfo_table_),
charsetsize_(0),
enable_shape_anaylsis_(shape_analysis),
enable_replication_(replicate_samples),
fragments_(NULL), prev_unichar_id_(-1),
debug_level_(debug_level)
{
}

The trouble is about using a yet uninitialized variable 'fontinfo_table_' to initialize 'samples_'.
A similar problem in this class is with initializing the fields 'junk_samples_' and 'verify_samples_'.
I cannot say for sure what to do with this class. Perhaps it would be sufficient just to move the
declaration of 'fontinfo_table_' into the very beginning of the class.
Typo in a condition
This typo is not clearly seen, but the analyzer is always alert.
class ScriptDetector {
....
int korean_id_;
int japanese_id_;
int katakana_id_;
int hiragana_id_;
int han_id_;
int hangul_id_;
int latin_id_;
int fraktur_id_;
....
};
void ScriptDetector::detect_blob(BLOB_CHOICE_LIST* scores) {
....
if (prev_id == katakana_id_)
osr_->scripts_na[i][japanese_id_] += 1.0;
if (prev_id == hiragana_id_)
osr_->scripts_na[i][japanese_id_] += 1.0;
if (prev_id == hangul_id_)
osr_->scripts_na[i][korean_id_] += 1.0;
if (prev_id == han_id_)
osr_->scripts_na[i][korean_id_] += kHanRatioInKorean;
if (prev_id == han_id_) <<<<====
osr_->scripts_na[i][japanese_id_] += kHanRatioInJapanese;
....
}

CppCat's diagnostic message: V581 The conditional expressions of the 'if' operators situated alongside
each other are identical. Check lines: 551, 553. osdetect.cpp 553
The very last comparison is very likely to look like this:
if (prev_id == japanese_id_)
Unnecessary checks
There is no need to check the return result of the 'new' operator. If memory cannot be allocated, it will
throw an exception. You can, of course, implement a special 'new' operator that returns null pointers,
but that is a special case (learn more).
Keeping that in mind, we can simplify the following function:
void SetLabel(char_32 label) {
if (label32_ != NULL) {
delete []label32_;
}
label32_ = new char_32[2];
if (label32_ != NULL) {
label32_[0] = label;
label32_[1] = 0;
}
}
CppCat's diagnostic message: V668 There is no sense in testing the 'label32_' pointer against null, as the
memory was allocated using the 'new' operator. The exception will be generated in the case of memory
allocation error. char_samp.h 73
There are 101 other fragments where a pointer returned by the 'new' operator is checked. I don't find it
reasonable to enumerate them all here - you'd better launch CppCat and find them yourself.
Conclusion
Please use static analysis regularly - it will help you save much time to spend on solving more useful
tasks than catching silly mistakes and typos.
And don't forget to follow me on Twitter: @Code_Analysis. I regularly publish links to interesting articles
on C++ there.

Tesseract. Recognizing Errors in Recognition Software

More Related Content

What's hot

Viewers also liked

Similar to Tesseract. Recognizing Errors in Recognition Software

More from Andrey Karpov

Recently uploaded

Tesseract. Recognizing Errors in Recognition Software