Static analysis and regular expressions




Author: Andrey Karpov
Date: 09.12.2010

I develop the PVS-Studio static code analyzer intended for analyzing C/C++ software. After we implemented general analysis in PVS-Studio 4.00, we received a lot of responses, both positive and negative. By the way, you are welcome to download a new version of PVS-Studio where we have fixed a lot of errors and defects thanks to users who told us about them.

While discussing PVS-Studio 4.00, the question was raised again whether we could implement most checks using regular expressions, and whether we are overcomplicating the matter by insisting that we must build and handle a parse tree during analysis. This is not the first time this question has come up, so I decided to write an article to explain why it is a very bad idea to try to use regular expressions for C/C++ code analysis.

Those familiar with compiler theory certainly understand that the C++ language can be parsed only with grammars and not with regular expressions. But most programmers are not familiar with this theory, and they keep telling us over and over again about using regular expressions to search for errors in software code.

Let me say right away that we can find some issues using regular expressions. There are even several static analyzers that use this principle. But their capabilities are very restricted and mostly come down to messages like "There is the "strcpy" function being used, you'd better replace it with a safer one".

Having thought over how to show the community the weakness of the regular expression method, I decided to do the following simple thing. I will take the first ten diagnostic messages of general analysis implemented in PVS-Studio and show, using each of them as an example, what restrictions the regular expression method involves.

Diagnosis 0

Once I started describing V501, I recalled that no type of analysis would provide me with sufficient information as long as #defines remain unexpanded.
The error might hide inside a macro, but it remains an error all the same. It is rather simple to create a preprocessed file, so assume we already have i-files. Now we encounter the first trouble: we must determine which code fragments refer to system files and which refer to user code. If we analyze system library functions, it will significantly reduce the speed of analysis and cause a lot of unnecessary diagnostic messages. Thus, if we use regular expressions, we must parse the following lines:

    #line 27 "C:\\Program Files (x86)\\Microsoft Visual Studio 8\\VC\\atlmfc\\include\\afx.h"
    #line 1008 ".\\mytestfile.cpp"

and understand which of them refer to our program and which refer to Visual Studio. But that's not the half of it: we must also keep track of relative line numbers inside i-files, since we must report not the absolute number of the line with the error in the preprocessed i-file but the number of the line in the original c/cpp-file we are analyzing.
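The bookkeeping described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not how PVS-Studio does it: the names Origin, parseLineDirective, mapOrigins and isUserFile are hypothetical, and the sketch assumes that every directive has the plain form #line N "file" and that system headers can be recognized by a directory prefix supplied by the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Where a line of the preprocessed i-file originally came from.
struct Origin {
    std::string file;  // file named by the most recent #line directive
    int line = 0;      // line number inside that file (0 = unknown)
};

// Hypothetical helper: recognizes `#line N "file"` and fills `out`.
// Returns false for any other text.
bool parseLineDirective(const std::string& text, Origin& out) {
    const std::string prefix = "#line ";
    if (text.compare(0, prefix.size(), prefix) != 0) return false;
    const std::size_t pos = prefix.size();
    const std::size_t digitsEnd = text.find_first_not_of("0123456789", pos);
    if (digitsEnd == pos || digitsEnd == std::string::npos) return false;
    const std::size_t open = text.find('"', digitsEnd);
    const std::size_t close = text.rfind('"');
    if (open == std::string::npos || close <= open) return false;
    out.line = std::stoi(text.substr(pos, digitsEnd - pos));
    out.file = text.substr(open + 1, close - open - 1);
    return true;
}

// Maps every line of the preprocessed text to its original file and line.
// Directive lines themselves get an empty Origin.
std::vector<Origin> mapOrigins(const std::string& preprocessed) {
    std::vector<Origin> result;
    std::istringstream in(preprocessed);
    std::string text;
    Origin current;  // unknown until the first directive is seen
    while (std::getline(in, text)) {
        Origin directive;
        if (parseLineDirective(text, directive)) {
            current = directive;  // the NEXT line has this number
            result.push_back(Origin());
            continue;
        }
        result.push_back(current);
        ++current.line;
    }
    return result;
}

// Assumption: a file is "system" if its path starts with one of the
// directory prefixes supplied by the caller.
bool isUserFile(const std::string& file,
                const std::vector<std::string>& systemDirs) {
    for (const std::string& dir : systemDirs) {
        if (file.compare(0, dir.size(), dir) == 0) return false;
    }
    return true;
}
```

Even this small sketch needs state that lives outside any single pattern match: the current file and a running line counter must be carried from one line to the next.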
So, we have not even started, yet we already face a whole lot of difficulties.

Diagnosis 1

V501. There are identical sub-expressions to the left and to the right of the 'foo' operator.

In order not to overload the text, I suggest that readers consult the description of this error and its samples in the PVS-Studio documentation. The point of this rule is to detect constructs of this type:

    if (X > 0 && X > 0)

At first sight, we could easily find such constructs using a regular expression when identical expressions stand to the left and to the right of operators &&, ||, ==, etc. For example: we search for the && operator. If there is something looking identical in parentheses to the right and to the left of &&, we certainly have an error. But it won't work because one could write it this way:

    if (A == A && B)

The error is still here, but a naive match sees different text to the left and to the right of ==: A on one side and A && B on the other. It means that we must introduce the notion of operator precedence. Then we must cut off boundaries on lower-priority operators such as && if we have ==; and vice versa: if it is &&, then we must capture the == operators on both sides, up to the enclosing parentheses, to find the error in a case like this:

    if (A == 0 && A == 0)

In the same way, we must provide logic for all the combinations of operators with different priorities. And by the way, you cannot fully rely on parentheses either, because you may encounter cases like this:

    if ( '(' == A && '(' == B )
      b = X > 0 && X > 0;

It is very difficult to provide for all the possible ways using regular expressions. We will have too many of them, with a lot of exceptions. And still it won't be safe, since we will never be sure that all the possible constructs have been taken into account.

Now compare all this with the elegance with which I can find this error when I have a syntax tree. If I have found operators &&, ==, ||, etc., I only have to compare the left and the right branches of the tree to each other.
I will do this in the following way:

    if (Equal(left, right))
    {
      // Error!
    }

That is all. You don't have to think about operator priorities, and you don't have to fear that you will encounter a bracket in a text like this: b = '(' == x && x == ')';. You can simply compare the left and the right tree branches.

Diagnosis 2
V502. Perhaps the '?:' operator works in a different way than it was expected. The '?:' operator has a lower priority than the 'foo' operator.

This rule searches for confusion concerning operator priorities (see the error description for details). We must detect a text like this:

    int a;
    bool b;
    int c = a + b ? 0 : 1;

Let's leave the question of operator priorities aside for now: regular expressions are too weak a tool for this purpose. But what is worse, you must know the TYPE OF EVERY VARIABLE for this and many other rules.

You must derive the type of each variable. You must force your way through the maze of typedefs. You must look into classes to understand what vector<int>::size_type is. You must take scopes into consideration, as well as the various using namespace std; directives. You must even derive the type of the X variable from the expression "auto X = 1 + 2;" in C++0x.

The question is: how can we do all that using regular expressions? The answer is: no way. Regular expressions are orthogonal to this task. You must either write a complicated mechanism of type derivation, i.e. create a syntactic code analyzer, or use regular expressions without knowing the types of variables and expressions.

The conclusion is: if we use regular expressions to handle a C/C++ application, we do not know the types of variables and expressions. Note this great limitation.

Diagnosis 3

V503. This is a nonsensical comparison: pointer < 0.

This rule is very simple. Comparison of a pointer with zero using < and > looks suspicious. For example:

    CMeshBase *pMeshBase = getCutMesh(Idx);
    if (pMeshBase < 0)
      return NULL;

Refer to the error description to learn how we got this code.

To implement this diagnosis, we only need to know the type of the pMeshBase variable. It was explained above why that is impossible.

This diagnosis cannot be implemented relying on regular expressions.

Diagnosis 4

V504. It is highly probable that the semicolon ';' is missing after 'return' keyword.

    void Foo();
    void Foo2(int *ptr)
    {
      if (ptr == NULL)
        return
      Foo();
      ...
    }

We could well diagnose constructs of this type using regular expressions. But we would have too many false alarms. We are interested only in those cases when the function returns void. Well, we could find that out using regular expressions as well. But it will not be very clear where the function starts and ends. Try to invent a regular expression that finds the start of a function yourself. Trust me, you will enjoy this task, especially once you realize that one could write something like this:

    int Foo()
    {
      ...
      char c[] =
      "void MyFoo(int x) {"
      ;
      ...
    }

If we have a complete syntax tree with diverse information, everything becomes much simpler. You can find out the return type of the function this way (the sample is taken right out of PVS-Studio):

    SimpleType funcReturnType;
    EFunctionReturnType fType;
    if (!env->LookupFunctionReturnType(fType, funcReturnType))
      return;
    if (funcReturnType != ST_VOID)
      return;

Diagnosis 5

V505. The 'alloca' function is used inside the loop. This can quickly overflow stack.

Yes, we could try to implement this rule relying on regular expressions.
But I wouldn't try to find out where the loop starts and ends, for one could think up so many funny situations with curly brackets in comments and strings.

    {
      for (int i = 0; i < 10; i++)
      {
        //A cool comment. There you are { - try to solve it. :)
        char *x = "You must be careful here too {";
      }
      p = _alloca(10); // Are we inside the loop or not?
    }

Diagnosis 6

V506. Pointer to local variable 'X' is stored outside the scope of this variable. Such a pointer will become invalid.

We must handle the scope of variables to detect these errors. We must also know the types of variables.

This diagnosis cannot be implemented relying on regular expressions.

Diagnosis 7

V507. Pointer to local array 'X' is stored outside the scope of this array. Such a pointer will become invalid.

This diagnosis cannot be implemented relying on regular expressions.

Diagnosis 8

V508. The use of 'new type(n)' pattern was detected. Probably meant: 'new type[n]'.

It is good to detect misprints of this kind:

    float *p = new float(10);

Everything looks simple, and it seems we could implement this diagnosis using regular expressions if we knew the type of the object being created. No way. Once you change the text a bit, regular expressions become useless:

    typedef float MyReal;
    ...
    MyReal *p = new MyReal(10);

This diagnosis cannot be implemented relying on regular expressions.

Diagnosis 9
V509. The 'throw' operator inside the destructor should be placed within the try..catch block. Raising exception inside the destructor is illegal.

Yes, we could try to make this check using regular expressions. Destructors are usually small functions, and we will hardly run into trouble with curly brackets there.

But you will have to sweat over regular expressions to find the destructor function, its beginning and end, and to find out whether it contains a throw which is caught in a catch. Can you imagine the whole amount of work? Can you do a thing like that?

Well, I can. This is how I did it, quite simply, in PVS-Studio (the rule is given in full):

    void ApplyRuleG_509(VivaWalker &walker,
      Environment *env, const Ptree *srcPtree)
    {
      SimpleType returnType;
      EFunctionReturnType fType;
      bool res = env->LookupFunctionReturnType(fType, returnType);
      if (res == false || returnType != ST_UNKNOWN)
        return;
      // Only destructors are of interest.
      if (fType != DESTRUCTOR)
        return;

      // A throw located inside a try block is not reported.
      ptrdiff_t tryLevel = OmpUtil::GetLevel_TRY(env);
      if (tryLevel != -1)
        return;
      string error = VivaErrors::V509();
      walker.AddError(error, srcPtree, 509, DATE_1_SEP_2010(), Level_1);
    }

Diagnosis 10

V510. The 'Foo' function is not expected to receive class-type variable as 'N' actual argument.

This rule concerns passing classes of the std::string type and the like as arguments into functions of the printf type. We need types. That is, this diagnosis cannot be implemented relying on regular expressions either.

Summary
I hope I have made the situation with regular expressions, syntax trees and static code analysis clearer to you. Thank you for your attention. Once again, I ask you to download and try PVS-Studio. I would also appreciate your questions, but I do not intend to get into debates about what regular expressions can give us and what they cannot. It is not interesting. They do allow us to get much, but they do not allow us to get even more. C++ can be successfully parsed only by using the mathematical apparatus of formal grammars.
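To make the tree comparison from Diagnosis 1 concrete, here is a minimal sketch of the idea on a toy expression tree. It is not PVS-Studio's Ptree or its Equal function: Node, leaf, binary, equal and hasIdenticalOperands are hypothetical names invented for this illustration, and the tree is built by hand rather than by a real parser.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Toy expression node: a leaf (identifier or literal) or a binary operator.
struct Node {
    std::string text;             // operator symbol or leaf text
    std::unique_ptr<Node> left;   // null for leaves
    std::unique_ptr<Node> right;  // null for leaves
};

std::unique_ptr<Node> leaf(const std::string& text) {
    std::unique_ptr<Node> n(new Node);
    n->text = text;
    return n;
}

std::unique_ptr<Node> binary(const std::string& op,
                             std::unique_ptr<Node> l,
                             std::unique_ptr<Node> r) {
    assert(l && r);  // a binary node needs both operands
    std::unique_ptr<Node> n(new Node);
    n->text = op;
    n->left = std::move(l);
    n->right = std::move(r);
    return n;
}

// Structural comparison of two subtrees.
bool equal(const Node* a, const Node* b) {
    if (a == nullptr || b == nullptr) return a == b;
    return a->text == b->text &&
           equal(a->left.get(), b->left.get()) &&
           equal(a->right.get(), b->right.get());
}

// True if some &&, || or == node in the tree has identical operands.
// Precedence is already encoded in the tree shape, so no special
// handling of priorities or parentheses is needed here.
bool hasIdenticalOperands(const Node* n) {
    if (n == nullptr || n->left == nullptr) return false;  // leaf
    const bool checkedOp =
        n->text == "&&" || n->text == "||" || n->text == "==";
    if (checkedOp && equal(n->left.get(), n->right.get())) return true;
    return hasIdenticalOperands(n->left.get()) ||
           hasIdenticalOperands(n->right.get());
}
```

With such a tree, both X > 0 && X > 0 and A == A && B are caught by the same few lines, because the parser has already decided which operands belong to which operator.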