Tesseract. Recognizing Errors in Recognition Software 
Author: Andrey Karpov 
Date: 21.05.2014 
Tesseract is a free software program for text recognition developed by Google. According to the project 
description, "Tesseract is probably the most accurate open source OCR engine available". And what if 
we try to catch some bugs there with the help of the CppCat analyzer? 
Tesseract 
Tesseract is an optical character recognition engine for various operating systems and is free software 
originally developed as proprietary software in Hewlett Packard labs between 1985 and 1994, with 
some more changes made in 1996 to port to Windows, and some migration from C to C++ in 1998. A lot 
of the code was written in C, and then some more was written in C++. Since then all the code has been 
converted to at least compile with a C++ compiler. Very little work was done in the following decade. It 
was then released as open source in 2005 by Hewlett Packard and the University of Nevada, Las Vegas 
(UNLV). Tesseract development has been sponsored by Google since 2006. [taken from Wikipedia] 
The source code of the project is available at Google Code: https://code.google.com/p/tesseract-ocr/ 
The size of the source code is about 16 Mbytes. 
CppCat 
I used the CppCat analyzer to check the project. This tool is a lightweight version of the PVS-Studio 
analyzer, but it is quite enough for small projects like Tesseract, so I had no problems checking it with 
CppCat. 
If you are unsure which to download first, CppCat or PVS-Studio, choose the former. Move on to 
PVS-Studio only if you find CppCat insufficient for your task or feel you're missing some functionality. 
To learn more about the differences between the two, see the article "An Alternative to PVS-Studio 
at $250". 
Analysis results 
Below I cite the code fragments that caught my attention while examining CppCat's analysis 
report. I may well have missed something, so Tesseract's authors should carry out their own 
analysis. The trial version is active for 7 days, which is more than enough for such a small project. It 
will then be up to them to decide whether they want to use the tool regularly to catch typos. 
As usual, let me remind you of the basic rule: static analysis pays off when it is used regularly, 
not on rare occasions. 
Poor division 
void LanguageModel::FillConsistencyInfo(....) 
{ 
.... 
float gap_ratio = expected_gap / actual_gap; 
if (gap_ratio < 1/2 || gap_ratio > 2) { 
consistency_info->num_inconsistent_spaces++; 
.... 
} 
CppCat's diagnostic message: V636 The '1 / 2' expression was implicitly casted from 'int' type to 'float' 
type. Consider utilizing an explicit type cast to avoid the loss of a fractional part. An example: double A = 
(double)(X) / Y;. language_model.cpp 1163 
The programmer wanted to compare the 'gap_ratio' variable with the value 0.5 but chose a poor way 
to write 0.5: 1/2 is integer division and evaluates to 0, so the first comparison is 'gap_ratio < 0'. 
The correct code should look like this: 
if (gap_ratio < 1.0f/2 || gap_ratio > 2) { 
or this: 
if (gap_ratio < 0.5f || gap_ratio > 2) { 
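To make the effect visible, here is a minimal sketch that puts the buggy and the fixed comparison side by side. The two function names are hypothetical and exist only for this illustration; they are not part of Tesseract.

```cpp
#include <cassert>

// Hypothetical stand-ins for the check in FillConsistencyInfo().

// Buggy form: 1 / 2 is integer division, so the lower bound is 0, not 0.5.
bool IsGapInconsistentBuggy(float expected_gap, float actual_gap) {
  assert(actual_gap != 0.0f);  // keep the sketch away from division by zero
  float gap_ratio = expected_gap / actual_gap;
  return gap_ratio < 1 / 2 || gap_ratio > 2;
}

// Fixed form: the lower bound is written as a floating-point literal.
bool IsGapInconsistentFixed(float expected_gap, float actual_gap) {
  assert(actual_gap != 0.0f);
  float gap_ratio = expected_gap / actual_gap;
  return gap_ratio < 0.5f || gap_ratio > 2.0f;
}
```

With an expected gap of 1 and an actual gap of 4 the ratio is 0.25. The buggy version does not report it as inconsistent, while the fixed version does.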
There are some other fragments with suspicious integer division. Some of them may also contain really 
unpleasant errors. 
The following are the code fragments that need to be checked: 
• baselinedetect.cpp 110 
• bmp_8.cpp 983 
• cjkpitch.cpp 553 
• cjkpitch.cpp 564 
• mfoutline.cpp 392 
• mfoutline.cpp 393 
• normalis.cpp 454 
Typo in a comparison 
uintmax_t streamtoumax(FILE* s, int base) { 
int d, c = 0; 
.... 
c = fgetc(s); 
if (c == 'x' && c == 'X') c = fgetc(s);
.... 
} 
CppCat's diagnostic message: V547 Expression 'c == 'x' && c == 'X'' is always false. Probably the '||' 
operator should be used here. scanutils.cpp 135 
The fixed check: 
if (c == 'x' || c == 'X') c = fgetc(s); 
Undefined behavior 
I have discovered one interesting construct I have never seen before: 
void TabVector::Evaluate(....) { 
.... 
int num_deleted_boxes = 0; 
.... 
++num_deleted_boxes = true; 
.... 
} 
CppCat's diagnostic message: V567 Undefined behavior. The 'num_deleted_boxes' variable is modified 
while being used twice between sequence points. tabvector.cpp 735 
It's not clear what the author meant by this code; it is most likely the result of a typo. 
The result of this expression can't be relied on: the variable 'num_deleted_boxes' is modified twice, 
once by the increment and once by the assignment, without a sequence point between the two 
modifications. 
Other errors causing undefined behavior are related to shifts. For example: 
void Dawg::init(....) 
{ 
.... 
letter_mask_ = ~(~0 << flag_start_bit_); 
.... 
} 
CppCat's diagnostic message: V610 Undefined behavior. Check the shift operator '<<'. The left operand 
'~0' is negative. dawg.cpp 187 
The '~0' expression is of the 'int' type and evaluates to '-1'. Shifting negative values causes undefined 
behavior, so it is just pure luck that the program works well. To fix the bug, we need to make '0' 
unsigned: 
letter_mask_ = ~(~0u << flag_start_bit_); 
But that's not all. This line also triggers one more warning:
V629 Consider inspecting the '~0 << flag_start_bit_' expression. Bit shifting of the 32-bit value with a 
subsequent expansion to the 64-bit type. dawg.cpp 187 
The point is that the variable 'letter_mask_' is of the 'uinT64' type. As far as I understand, the mask 
may need to have ones in the most significant 32 bits. In that case the expression is incorrect, because 
it is evaluated in 32 bits and can set only the least significant bits. 
We need to make '0' a 64-bit value: 
letter_mask_ = ~(~0ull << flag_start_bit_); 
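The difference between the 32-bit and the 64-bit form can be sketched as follows. The helper names are hypothetical and not taken from Tesseract, and the sketch assumes that 'unsigned int' is 32 bits wide and 'unsigned long long' is 64 bits wide.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers for illustration only.
// Assumption: unsigned int is 32 bits, unsigned long long is 64 bits.

// Mask with the low `bits` bits set, computed entirely in 64 bits.
std::uint64_t LowBitsMask(unsigned bits) {
  assert(bits < 64);  // shifting by the full width would be undefined
  return ~(~0ull << bits);
}

// Mask with all bits from `bits` upward set, computed in 32 bits and only
// then widened: the upper 32 bits of the result are always zero.
std::uint64_t HighBitsMask32(unsigned bits) {
  assert(bits < 32);
  return ~0u << bits;
}

// The same mask computed in 64 bits: the upper 32 bits are set as well.
std::uint64_t HighBitsMask64(unsigned bits) {
  assert(bits < 64);
  return ~0ull << bits;
}
```

For a shift of 8, the 32-bit version yields 0x00000000FFFFFF00 and the 64-bit version yields 0xFFFFFFFFFFFFFF00. The 64-bit form also stays defined for shift counts of 32 and more, for example LowBitsMask(40).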
Here is a list of other code fragments where negative numbers are shifted: 
• dawg.cpp 188 
• intmatcher.cpp 172 
• intmatcher.cpp 174 
• intmatcher.cpp 176 
• intmatcher.cpp 178 
• intmatcher.cpp 180 
• intmatcher.cpp 182 
• intmatcher.cpp 184 
• intmatcher.cpp 186 
• intmatcher.cpp 188 
• intmatcher.cpp 190 
• intmatcher.cpp 192 
• intmatcher.cpp 194 
• intmatcher.cpp 196 
• intmatcher.cpp 198 
• intmatcher.cpp 200 
• intmatcher.cpp 202 
• intmatcher.cpp 323 
• intmatcher.cpp 347 
• intmatcher.cpp 366 
Suspicious double assignment 
TESSLINE* ApproximateOutline(....) { 
EDGEPT *edgept; 
.... 
edgept = edgesteps_to_edgepts(c_outline, edgepts); 
fix2(edgepts, area); 
edgept = poly2 (edgepts, area); // 2nd approximation. 
.... 
} 
CppCat's diagnostic message: V519 The 'edgept' variable is assigned values twice successively. Perhaps 
this is a mistake. Check lines: 76, 78. polyaprx.cpp 78 
Another similar error: 
inT32 row_words2(....)
{ 
.... 
this_valid = blob_box.width () >= min_width; 
this_valid = TRUE; 
.... 
} 
CppCat's diagnostic message: V519 The 'this_valid' variable is assigned values twice successively. 
Perhaps this is a mistake. Check lines: 396, 397. wordseg.cpp 397 
Incorrect order of class member initialization 
Let's examine the 'MasterTrainer' class first. Notice that the 'samples_' member is declared before the 
'fontinfo_table_' member: 
class MasterTrainer { 
.... 
TrainingSampleSet samples_; 
.... 
FontInfoTable fontinfo_table_; 
.... 
}; 
According to the standard, class members are initialized in the constructor in the same order as they are 
declared inside the class. It means that 'samples_' will be initialized PRIOR to 'fontinfo_table_'. 
Now let's examine the constructor: 
MasterTrainer::MasterTrainer(NormalizationMode norm_mode, 
bool shape_analysis, 
bool replicate_samples, 
int debug_level) 
: norm_mode_(norm_mode), samples_(fontinfo_table_), 
junk_samples_(fontinfo_table_), 
verify_samples_(fontinfo_table_), 
charsetsize_(0), 
enable_shape_anaylsis_(shape_analysis), 
enable_replication_(replicate_samples), 
fragments_(NULL), prev_unichar_id_(-1), 
debug_level_(debug_level) 
{ 
}
The trouble is that 'fontinfo_table_', which has not been initialized yet, is used to initialize 'samples_'. 
The fields 'junk_samples_' and 'verify_samples_' are initialized from 'fontinfo_table_' in the same way. 
I cannot say for sure what to do with this class. Perhaps it would be sufficient just to move the 
declaration of 'fontinfo_table_' to the very beginning of the class. 
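A minimal sketch of the suggested reordering is shown below. All types here are hypothetical simplifications, not the real Tesseract classes. In particular, the sketch assumes that the dependent member reads the table inside its constructor, which is the case where declaration order matters; the article's fragment does not show whether 'TrainingSampleSet' does so.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for FontInfoTable.
struct Table {
  std::vector<std::string> fonts{"default"};
};

// Hypothetical stand-in for TrainingSampleSet.
// Assumption: it reads the table while being constructed.
struct SampleSet {
  explicit SampleSet(const Table& table) : font_count(table.fonts.size()) {}
  std::size_t font_count;
};

class Trainer {
 public:
  // The order in the initializer list does not matter;
  // members are constructed in declaration order.
  Trainer() : samples_(table_) {}
  std::size_t font_count() const { return samples_.font_count; }

 private:
  Table table_;        // declared first, so it is constructed first
  SampleSet samples_;  // can safely read table_ in its constructor
};
```

If the two declarations were swapped, 'samples_' would be constructed from a 'table_' whose lifetime has not started yet, and reading it would be undefined behavior.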
Typo in a condition 
This typo is easy to overlook, but the analyzer is always alert. 
class ScriptDetector { 
.... 
int korean_id_; 
int japanese_id_; 
int katakana_id_; 
int hiragana_id_; 
int han_id_; 
int hangul_id_; 
int latin_id_; 
int fraktur_id_; 
.... 
}; 
void ScriptDetector::detect_blob(BLOB_CHOICE_LIST* scores) { 
.... 
if (prev_id == katakana_id_) 
osr_->scripts_na[i][japanese_id_] += 1.0; 
if (prev_id == hiragana_id_) 
osr_->scripts_na[i][japanese_id_] += 1.0; 
if (prev_id == hangul_id_) 
osr_->scripts_na[i][korean_id_] += 1.0; 
if (prev_id == han_id_) 
osr_->scripts_na[i][korean_id_] += kHanRatioInKorean; 
if (prev_id == han_id_) // <<<<==== 
osr_->scripts_na[i][japanese_id_] += kHanRatioInJapanese; 
.... 
}
CppCat's diagnostic message: V581 The conditional expressions of the 'if' operators situated alongside 
each other are identical. Check lines: 551, 553. osdetect.cpp 553 
The last comparison was most likely meant to look like this: 
if (prev_id == japanese_id_) 
Unnecessary checks 
There is no need to check the pointer returned by the 'new' operator. If memory cannot be allocated, the 
operator throws an exception (std::bad_alloc). You can, of course, implement a special 'new' operator 
that returns null pointers, but that is a special case. 
Keeping that in mind, we can simplify the following function: 
void SetLabel(char_32 label) { 
if (label32_ != NULL) { 
delete []label32_; 
} 
label32_ = new char_32[2]; 
if (label32_ != NULL) { 
label32_[0] = label; 
label32_[1] = 0; 
} 
} 
CppCat's diagnostic message: V668 There is no sense in testing the 'label32_' pointer against null, as the 
memory was allocated using the 'new' operator. The exception will be generated in the case of memory 
allocation error. char_samp.h 73 
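One possible simplified form is sketched below. The wrapper class is hypothetical, and 'char32_t' stands in for the project's 'char_32' type, whose definition is not shown here. The sketch also allocates the new buffer before releasing the old one, which is a design choice of this example rather than something the original code does.

```cpp
#include <cassert>

// Hypothetical wrapper around the label buffer, for illustration only.
// Assumption: char32_t is used in place of the project's char_32 type.
class CharSampSketch {
 public:
  CharSampSketch() = default;
  CharSampSketch(const CharSampSketch&) = delete;
  CharSampSketch& operator=(const CharSampSketch&) = delete;
  ~CharSampSketch() { delete[] label32_; }

  void SetLabel(char32_t label) {
    // new[] throws std::bad_alloc on failure, so no null check is needed.
    char32_t* fresh = new char32_t[2];
    fresh[0] = label;
    fresh[1] = 0;
    delete[] label32_;  // delete[] on a null pointer does nothing
    label32_ = fresh;
  }

  const char32_t* label() const {
    assert(label32_ != nullptr);  // SetLabel() must have been called
    return label32_;
  }

 private:
  char32_t* label32_ = nullptr;
};
```

Both null checks from the original function disappear: the one before 'delete[]' is redundant because deleting a null pointer is a no-op, and the one after 'new' can never fail.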
There are 101 other fragments where a pointer returned by the 'new' operator is checked. Listing them 
all here would not be reasonable; it is easier to run CppCat and find them yourself. 
Conclusion 
Please use static analysis regularly. It will save you a lot of time, which you can spend on more useful 
tasks than hunting for silly mistakes and typos. 
And don't forget to follow me on Twitter: @Code_Analysis. I regularly publish links to interesting articles 
on C++ there.
