An ideal static analyzer, or why ideals are unachievable

Author: Evgeniy Ryzhkov
Date: 15.03.2012
Inspired by Eugene Kaspersky's post about an ideal antivirus, I decided to write a similar post
about an ideal static analyzer and, along the way, to consider how far our PVS-Studio is from that ideal.
An ideal static analyzer's characteristics
If you are not familiar with the notion of static code analysis, please follow the link. For everyone
else, let me enumerate the characteristics right away:
• 100% detection of all the types of programming errors;
• 0% false positives;
• high performance - "whooosh, and the code is analyzed completely almost in the same
instant";
• integration with my favorite (i.e. every) IDE; ability to work under my favorite (i.e. every)
operating system; analysis of code in my favorite (i.e. any) programming language;
• free (freeware, open source);
• high-quality and prompt customer support.
Of course, this ideal can never be achieved, but it shows the direction in which companies
developing solutions in this field can move.
100% detection of all the types of programming errors
You should understand that no static analyzer will ever provide 100% error detection. Why?
Well, if only because some error types are better detected by dynamic analyzers, and it's pointless to
try to compete with them in this area. Likewise, dynamic analyzers cannot compete with static ones
on some rule types.
It's difficult to reach 100% error detection even for diagnostics typical of static analysis. First, any
living programming language keeps developing, acquiring new syntax and therefore new ways of
making an error. Second, even old syntax can, over time, come to be used in unusual ways that
analyzer developers did not think of.
Finally, a static analyzer has no knowledge of what a program SHOULD contain; it doesn't
have AI. If a program says "A is equal to B" while the correct statement is "A is not equal to B",
static analysis won't help you with that.
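To make this concrete, here is a minimal Python sketch. The function names and the intended rule are
hypothetical and invented only for illustration. The first function is well-formed and consistently
typed, so a tool that does not know the programmer's intent has little to object to, yet it does the
opposite of what was meant.

```python
def passwords_match(entered: str, stored: str) -> bool:
    """Intended rule (known only to the programmer): True when both are equal."""
    # Bug: "is not equal" was written where "is equal" was meant.
    # The expression is valid code, so only knowledge of the intent reveals the error.
    return entered != stored


def passwords_match_fixed(entered: str, stored: str) -> bool:
    """What the programmer actually meant."""
    return entered == stored
```

Only a test or a reviewer who knows the requirement can tell these two functions apart.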
That's why the only real way out is to constantly create new diagnostic rules. It will never give you 100%
error detection but will keep you close to it all the time.

0% false positives
Any static analyzer produces false positives because, in the end, only the programmer KNOWS what
exactly the code is SUPPOSED to do. An analyzer sees what the code really DOES and tries to UNDERSTAND
what it SHOULD do.
Returning to the previous section about "100% detection of all the errors", one can make a naïve
suggestion: "Why, let's detect everything that moves and we'll be happy!" That is, let's report everything
that looks even slightly like an error. But this approach is wrong because the number of false positives
will get out of hand. And there is an opinion that when a user sees 10 false positives in a row, he/she
closes the tool and never comes back to it.
We have the following ways to reduce the number of false positives:
1. Constantly reworking existing rules to refine their formulations. For example, if in a test project a
rule "was triggered" 100 times and 50 of them were false positives, refining the rule can reduce
the number of false positives to 10. You may lose 1 or 2 real warnings along the way, but that is
the eternal issue of making compromises.
2. Dropping rules which are no longer relevant. If you only add new rules and never remove (turn
off) obsolete ones, some of your diagnostics lose their relevance with time.
3. Having useful tools to handle false positives. For instance, PVS-Studio provides a mechanism to
suppress false positives. Once you mark a message as a false positive, you won't see it next time.
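The trade-off in item 1 can be checked with simple arithmetic. The sketch below uses the numbers from
that example and assumes that "10" is the number of false positives left after refinement and that
2 real warnings are lost; the helper name `precision` is my own.

```python
def precision(real_warnings: int, false_positives: int) -> float:
    """Share of reported warnings that point at real defects."""
    total = real_warnings + false_positives
    return real_warnings / total if total else 0.0


# Before refinement: 100 warnings, 50 of them false.
before = precision(real_warnings=50, false_positives=50)

# After refinement (assumed): 10 false positives remain, 2 real warnings are lost.
after = precision(real_warnings=48, false_positives=10)

print(f"before: {before:.0%}, after: {after:.0%}")  # before: 50%, after: 83%
```

Under these assumptions the user reads 58 messages instead of 100, and about five out of six of them
are real, at the price of two missed defects.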
High performance
Everybody wants software to work fast, but it's not always possible. Code analysis usually
requires more resources than compilation, because the compiler checks only for very crude errors, while
the analyzer's aim is to perform fine-grained analysis. Of course, it needs more data for that. The more
data it has, the deeper the analysis is and the more interesting the errors it can find.
An obvious way to improve performance is to use several processor cores when
analyzing the code. This is rather easy to implement in static analyzers: each file is checked separately,
and then the results are simply combined.
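The "check each file separately, then combine" scheme can be sketched in a few lines of Python. This
is only an illustration under stated assumptions: the single toy diagnostic (an identifier compared
with itself), the function names and the in-memory "files" are all invented here, and a thread pool
is used for brevity, whereas a real CPU-bound analyzer would more likely run separate processes.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Toy diagnostic (hypothetical): an operand compared with itself, e.g. "a == a".
SELF_COMPARISON = re.compile(r"\b(\w+)\s*==\s*\1\b")


def analyze_file(name: str, text: str) -> list:
    """Check one file in isolation and return its warnings."""
    warnings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if SELF_COMPARISON.search(line):
            warnings.append((name, line_no, "operand compared with itself"))
    return warnings


def analyze_project(sources: dict, workers: int = 4) -> list:
    """Analyze every file independently, then merge the per-file results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_file = pool.map(lambda item: analyze_file(*item), sources.items())
        combined = [w for file_warnings in per_file for w in file_warnings]
    return sorted(combined)  # stable report order regardless of scheduling
```

Because no file's result depends on another file's, the only shared step is the final merge, which is
why this kind of parallelism is cheap to add.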
A less obvious approach is to check just a code fragment instead of a whole compilation unit (a file). This
is a very complicated task and, in the general case, quite difficult to solve (for any language). You have to
find and "calculate" data types, analyze the classes being used and so on. The cost of "extracting" the part
you need to analyze might be even higher than the cost of simply analyzing the whole code.
Integration with my favorite (i.e. every) IDE; ability to work under my
favorite (i.e. every) operating system; analysis of code in my favorite (i.e.
any) programming language
Support for a particular operating system, development environment or analyzable
programming language is an important factor in choosing between static analysis tools. To my great
surprise, programmers, being the main users of static analyzers, often do not appreciate how difficult
it is to support the whole zoo of operating systems they want. But let's take things one at a
time.

Supported (analyzable) programming languages
Programming errors detected by code analyzers can certainly occur in every programming language, and
these errors have common features: in every language programmers forget to initialize variables,
make typos when writing code and so on. But parsing and analyzing a program is VERY different
from language to language.
If an analyzer is announced to support several programming languages, it
most likely contains several analysis modules too. This can even be hidden from users! I'm
writing this just so people understand that the phrase "Why don't you make the SAME but for
C#/PHP/Java?" implies a great deal of work.
Supported operating systems
It's very naïve to think that a code analyzer "just" handles text and therefore can work in any operating
system. Of course, different programming languages are "tied" to the environment to various extents:
some more, like C++; others less, like PHP.
Where does this difference come from? The point is that large and powerful languages like C++ have
several compilers, each with its own differences and subtleties in the language syntax. The
code written for Windows-based compilers is slightly but noticeably different from the code written for
Linux-based compilers. Though this difference is not crucial from the user code's viewpoint, it may
be important from the viewpoint of a static analyzer: if the code being analyzed contains keywords
specific to one particular compiler, the analyzer needs to be "taught" them. In this sense, supporting
one more compiler and supporting one more operating system are, generally speaking, equivalent tasks.
Note that this task is easier for languages simpler than C++.
Thus, "supported operating systems" means not only the platforms the analyzer's executable runs on, but
also the platforms whose code the analyzer can "understand".
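As a rough illustration of what "teaching" an analyzer compiler-specific keywords means, here is a small
Python sketch. The profile table and the function `unknown_extensions` are hypothetical, the keyword sets
are tiny subsets chosen for the example, and treating every identifier that starts with a double
underscore as a probable vendor extension is a simplifying assumption, not how any particular analyzer
works.

```python
import re

# Standard keywords: a tiny subset, for illustration only.
STANDARD_KEYWORDS = {"int", "return", "inline", "void", "const"}

# Vendor extensions each profile has been "taught" (small subsets).
COMPILER_KEYWORDS = {
    "msvc": {"__declspec", "__forceinline", "__int64", "__stdcall"},
    "gcc": {"__attribute__", "__restrict__", "__typeof__", "__extension__"},
}

IDENTIFIER = re.compile(r"[A-Za-z_]\w*")


def unknown_extensions(code: str, compiler: str) -> set:
    """Return double-underscore identifiers the chosen profile does not know."""
    known = STANDARD_KEYWORDS | COMPILER_KEYWORDS[compiler]
    return {
        token
        for token in IDENTIFIER.findall(code)
        if token.startswith("__") and token not in known
    }
```

For the line `__declspec(dllexport) __forceinline int f(void) { return 0; }` the "msvc" profile reports
nothing unknown, while the "gcc" profile reports `__declspec` and `__forceinline`: the same text is
parseable under one profile and not under the other.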
Supported IDEs
There are a lot of development tools for different languages. What is important for users is this:
• a static analyzer should be able to integrate with their favorite development environment;
• the tool should be able to run in automatic mode at night;
• the analyzer should be able to integrate into continuous integration systems.
The last two points are often called "command line version support", but they have little to do with
the command line itself. No one nowadays finds it interesting to watch white letters on a black screen
instead of a conveniently organized report which can be converted into a text file and sent via e-mail or
written into the build system's log.
Supporting different IDEs is a difficult, effort-intensive task, as each IDE imposes certain restrictions on
its plugins. These restrictions often vary from one system to another.
Free (freeware, open source) and high-quality customer support
I've combined two sections into one because they are closely connected.

Static code analysis tools belong to the kind of software for which quality and continuous support are
very important. Yes, there are a few tools distributed for free, but I believe they will never catch up
with the market leaders (Coverity, Klocwork, Parasoft).
Generally speaking, a static analysis tool can become free and open-source if the developer company is
purchased by some giant like Google, Microsoft or Intel, but this is a special case.
Static analysis tools are usually sold under an annually renewable license model. Some users
might not like it, but I will try to explain why this scheme is the best. And please forgive me if you
came to the "Free" section and are now reading about licensing schemes.
As I've already said, customer support is very important for static analysis tools. In the field of static
analysis, support means, first of all, handling cases when the analyzer cannot parse user code (because of
complex C++ templates, non-standard compiler extensions, etc.). In these cases you need to promptly
(within a few days) improve the analyzer so that it can parse the customer's code. User support also
includes help with integrating the tool into the customer's development process. Implementing customer
requests that make the tool more convenient to use is also necessary.
All this costs money. That's why you cannot sell a license once and support your users for free for the
rest of your life.
One could sell new major releases, for example, versions v3, v4, v5... What is bad about this scheme is
that it makes the developer "hold back" cool new capabilities of the tool until the next major version
instead of releasing them as soon as they are ready.
Thus, it appears that annual license renewal is the best way. Some developer companies set
the renewal price at 100% of the initial price, while others set a lower price (giving a discount for
renewal). The latter case can be explained this way: the first year's price includes the additional
cost of teaching the customer to work with the tool.
So, it appears that a quality tool with quality support cannot be free, unless it is being developed by
a giant company, but in that case you can forget about targeted individual customer support.
Conclusion
In this article I've tried to show you what characteristics an ideal static code analysis tool should possess;
how users want it to look. And it is users, of course, who decide how much this or that tool really
corresponds to this ideal.
