An unusual bug in Lucene.Net

Author: Ilya Ivanov
Date: 14.03.2016
When they hear about static analysis, some programmers say they don't really need it: their code is
fully covered by unit tests, and that is enough to catch all the bugs. Recently I found a bug that unit
tests could theoretically catch, but unless you already know it is there, writing a test that exposes it
is almost impossible.
Introduction
Lucene.Net is a port of the Lucene search engine library, written in C# and targeted at .NET runtime
users. The source code is open and available on the project website https://lucenenet.apache.org/.
The PVS-Studio analyzer detected only 5 suspicious fragments. This is explained by the slow pace of
development, the small size of the project, and the fact that it is widely used in other projects for
full-text search [1].
To be honest, I didn't expect to find more bugs. One of these errors seemed especially interesting to me,
so I decided to tell our readers about it in our blog.
About the bug found
We have a diagnostic, V3035, for the error where a programmer mistakenly writes =+ instead of +=, so
that + becomes a unary plus. When I was writing it by analogy with the V588 diagnostic for C++, I
wondered: can a programmer really make the same mistake in C#? In C++ it is understandable: people use
various text editors instead of an IDE, and a typo can easily go unnoticed. But can you overlook such a
misprint in Visual Studio, which automatically formats the code as soon as you type a semicolon? It
turns out that you can. Such a bug was found in Lucene.Net. It is of great interest to us, mostly
because it is rather hard to detect by means other than static analysis. Let's take a look at the code:
protected virtual void Substitute( StringBuilder buffer )
{
  substCount = 0;
  for ( int c = 0; c < buffer.Length; c++ )
  {
    ....
    // Take care that at least one character
    // is left left side from the current one
    if ( c < buffer.Length - 1 )
    {
      // Masking several common character combinations
      // with an token
      if ( ( c < buffer.Length - 2 ) && buffer[c] == 's' &&
        buffer[c + 1] == 'c' && buffer[c + 2] == 'h' )
      {
        buffer[c] = '$';
        buffer.Remove(c + 1, 2);
        substCount =+ 2;
      }
      ....
      else if ( buffer[c] == 's' && buffer[c + 1] == 't' )
      {
        buffer[c] = '!';
        buffer.Remove(c + 1, 1);
        substCount++;
      }
      ....
    }
  }
}

This method belongs to the GermanStemmer class, which cuts off the suffixes of German words to extract
a common root. It works as follows: first, the Substitute method replaces certain letter combinations
with other symbols so that they are not confused with a suffix. The substitutions include 'sch' to '$'
and 'st' to '!' (you can see them in the code example). At the same time, the number of characters by
which these changes shorten the word is stored in the substCount variable. Next, the Strip method cuts
off the suffixes, and finally the Resubstitute method performs the reverse substitution: '$' to 'sch',
'!' to 'st'. For instance, for the word "kapitalistischen" (capitalistic), the stemmer does the
following: kapitalistischen => kapitali!i$en (Substitute) => kapitali!i$ (Strip) => kapitalistisch
(Resubstitute).
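The substitution round trip can be illustrated with a minimal sketch. It is not the Lucene.Net implementation: the class name StemmerSketch is hypothetical, only the two rules quoted above ('sch' and 'st') are modelled, and the Strip step is left out entirely.

```csharp
using System;
using System.Text;

// Hypothetical, simplified model of the substitution steps described above.
// Only the 'sch' -> '$' and 'st' -> '!' rules are covered.
public static class StemmerSketch
{
    // Masks "sch" with '$' and "st" with '!', shortening the word.
    public static string Substitute(string word)
    {
        var buffer = new StringBuilder(word);
        for (int c = 0; c < buffer.Length; c++)
        {
            if (c < buffer.Length - 2 && buffer[c] == 's' &&
                buffer[c + 1] == 'c' && buffer[c + 2] == 'h')
            {
                buffer[c] = '$';
                buffer.Remove(c + 1, 2); // drop "ch"
            }
            else if (c < buffer.Length - 1 &&
                     buffer[c] == 's' && buffer[c + 1] == 't')
            {
                buffer[c] = '!';
                buffer.Remove(c + 1, 1); // drop "t"
            }
        }
        return buffer.ToString();
    }

    // Reverses the masking: '$' -> "sch", '!' -> "st".
    public static string Resubstitute(string word)
    {
        return word.Replace("$", "sch").Replace("!", "st");
    }
}
```

With this sketch, StemmerSketch.Substitute("kapitalistischen") returns "kapitali!i$en", and StemmerSketch.Resubstitute("kapitali!i$") returns "kapitalistisch", matching the first and last steps of the example above.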
Because of this typo, when 'sch' is replaced with '$', substCount is assigned 2 instead of being
increased by 2. This error is really hard to find by methods other than static analysis. That is the
answer to those who ask, "Do I need static analysis if I have unit tests?"
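The difference between the two spellings is easy to see in isolation. The following is a small illustration only; the class and method names are hypothetical, and the local variable merely mirrors the name of the field from the snippet above.

```csharp
// Hypothetical demo of how C# treats "=+" versus "+=".
public static class UnaryPlusDemo
{
    public static int WithTypo()
    {
        int substCount = 1;
        substCount =+ 2;   // parsed as: substCount = (+2), the old value is lost
        return substCount; // 2
    }

    public static int Intended()
    {
        int substCount = 1;
        substCount += 2;   // compound assignment: substCount = substCount + 2
        return substCount; // 3
    }
}
```

Both methods compile, since the unary plus is a valid operator on int, which is why the typo can slip through unnoticed.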
To catch such a bug with unit tests, one would have to test Lucene.Net on German texts using
GermanStemmer; the tests would have to index a word that contains 'sch' together with one more letter
combination that is also substituted. Moreover, that combination must come before 'sch' in the word, so
that substCount is nonzero by the time the expression substCount =+ 2 is executed. Quite an unusual
combination for a test, especially if you can't see the bug.
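To show how narrow the set of revealing inputs is, here is a simplified counter that can run with or without the typo. It is a sketch under stated assumptions: SubstCountSketch and its useTypo flag are hypothetical, only the 'sch' and 'st' rules are modelled, and the real GermanStemmer applies further substitutions that are not shown here.

```csharp
using System;
using System.Text;

// Hypothetical model of the substCount bookkeeping, reduced to two rules.
public static class SubstCountSketch
{
    // Returns the number of characters removed by the substitutions.
    // useTypo = true reproduces the "=+" behaviour, false the intended "+=".
    public static int CountSubstitutions(string word, bool useTypo)
    {
        var buffer = new StringBuilder(word);
        int substCount = 0;
        for (int c = 0; c < buffer.Length; c++)
        {
            if (c < buffer.Length - 2 && buffer[c] == 's' &&
                buffer[c + 1] == 'c' && buffer[c + 2] == 'h')
            {
                buffer[c] = '$';
                buffer.Remove(c + 1, 2);
                if (useTypo)
                    substCount =+ 2; // overwrites whatever was counted before
                else
                    substCount += 2;
            }
            else if (c < buffer.Length - 1 &&
                     buffer[c] == 's' && buffer[c + 1] == 't')
            {
                buffer[c] = '!';
                buffer.Remove(c + 1, 1);
                substCount++;
            }
        }
        return substCount;
    }
}
```

In this sketch, "schule" gives 2 with and without the typo, and so does any word where 'st' follows 'sch', such as "schwester" (3 in both cases). Only a word like "kapitalistischen", where 'st' precedes 'sch', makes the results diverge: 2 with the typo, 3 without it.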
Conclusion
Unit tests and static analysis do not exclude each other; they complement each other as methods of
software development [2]. I suggest downloading the PVS-Studio static analyzer and finding the bugs
that unit testing did not detect.
Additional links
1. Andrey Karpov. Reasons why the error density is low in small programs
2. Andrey Karpov. How to complement TDD with static analysis
