Difference of code analysis approaches in compilers and specialized tools


Compilers and third-party static code analyzers have one common task: to detect dangerous code fragments. However, there is a great difference in the types of analysis performed by each kind of these tools. I will try to show you the differences between these two approaches (and explain their source) by the example of the Intel C++ compiler and PVS-Studio analyzer.

Published in: Technology

Author: Andrey Karpov
Date: 01.11.2010

This time, it is the Notepad++ 5.8.2 project that we chose for the test.

Notepad++

First, a couple of words about the project we have chosen. Notepad++ is a free, open-source source code editor that supports many languages and serves as a substitute for the standard Notepad. It works in the Microsoft Windows environment and is released under the GPL license. What I liked about this project is that it is written in C++ and is small: just 73000 lines of code. But most importantly, it is a rather tidy project: it is compiled with the /W4 switch in the project settings and with the /WX switch, which makes the compiler treat each warning as an error.

Static analysis by compiler

Now let's study the analysis procedure from the viewpoints of a compiler and a separate specialized tool. The compiler is always inclined to generate warnings after processing only very small local code fragments. This preference is a consequence of the very strict performance requirements imposed on the compiler. It is no coincidence that tools for distributed project builds exist. The time needed to compile medium and large projects is a significant factor influencing the choice of development methodology. So if developers can get a 5% performance gain out of the compiler, they will do it.

Such optimization makes the compiler more monolithic, and steps such as preprocessing, building the AST and code generation are actually not so distinct. For instance, judging by some indirect signs, I may say that Visual C++ uses different preprocessor algorithms when compiling projects and when generating preprocessed "*.i" files. The compiler also does not need to store the whole AST (it is even harmful for it). Once the code for some particular nodes is generated and they are no longer needed, they get destroyed right away. During the compilation process, the AST may never exist in its full form. There is simply no need for that: we parse a small code fragment, generate the code and go further. This saves memory and cache and therefore increases speed.

The result of this approach is the "locality" of warnings. The compiler deliberately saves on various structures that could help it detect higher-level errors. Let's see in practice what local warnings Intel C++ will generate for the Notepad++ project. Let me remind you that the Notepad++ project is built with the Visual C++ compiler without any warnings with the /W4 switch enabled. But the Intel C++ compiler certainly has a different set of warnings, and I also set a specific switch, /W5 [Intel C++]. Moreover, I would like to have a look at what the Intel C++ compiler calls a "remark".

Let's see what kinds of messages we get from Intel C++. Here it found four similar errors where the CharUpper function is being handled (SEE NOTE AT THE END). Note the "locality" of the diagnosis: the compiler found just a very dangerous type conversion. Let's study the corresponding code fragment:

    wchar_t *destStr = new wchar_t[len+1];
    ...
    for (int j = 0 ; j < nbChar ; j++)
    {
      if (Case == UPPERCASE)
        destStr[j] =
          (wchar_t)::CharUpperW((LPWSTR)destStr[j]);
      else
        destStr[j] =
          (wchar_t)::CharLowerW((LPWSTR)destStr[j]);
    }

Here we see strange type conversions. The Intel C++ compiler warns us: "#810: conversion from "LPWSTR={WCHAR={__wchar_t} *}" to "__wchar_t" may lose significant bits". Let's look at the CharUpper function's prototype.

    LPTSTR WINAPI CharUpper(
      __inout  LPTSTR lpsz
    );

The function handles a string and not separate characters at all. But here a character is cast to a pointer and some memory area is modified through this pointer. How horrible.

Well, actually this is the only horrible issue detected by Intel C++. All the rest are much more boring and point to sloppy code rather than error-prone code. But let's study some other warnings too.

The compiler generated a lot of #1125 warnings:

"#1125: function "Window::init(HINSTANCE, HWND)" is hidden by "TabBarPlus::init" -- virtual function override intended?"
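
The situation this warning describes can be reproduced with two simplified classes. This is only a sketch: `BaseWindow`, `TabBarLike`, `initThroughBase` and the integer parameters are hypothetical stand-ins, not the real Notepad++ declarations.

```cpp
#include <cassert>

// Simplified, hypothetical stand-ins; not the real Notepad++ classes.
struct BaseWindow {
    virtual ~BaseWindow() = default;
    // Two-argument initializer declared in the base class.
    virtual int init(int instance, int parent) { return instance + parent; }
};

struct TabBarLike : BaseWindow {
    // Same name, different parameter list: this declaration hides
    // BaseWindow::init instead of overriding it.
    int init(int instance, int parent, bool vertical) {
        return vertical ? -1 : BaseWindow::init(instance, parent);
    }
};

// A call through a base reference still reaches the base version,
// because nothing in TabBarLike overrides it.
int initThroughBase(BaseWindow& window) { return window.init(1, 2); }
```

With these definitions, calling `init(1, 2)` directly on a `TabBarLike` object does not compile unless the derived class adds `using BaseWindow::init;`. That is the kind of surprise the compiler is hinting at.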

These are not errors but just poor naming of functions. We are interested in this message for a different reason: although the check seems to involve several classes, the compiler does not keep any special data for it. It must store diverse information about base classes anyway, which is why this diagnosis is implemented.

The next sample. The message "#186: pointless comparison of unsigned integer with zero" is generated for meaningless comparisons:

    static LRESULT CALLBACK hookProcMouse(
      UINT nCode, WPARAM wParam, LPARAM lParam)
    {
      if(nCode < 0)
      {
        ...
        return 0;
      }
      ...
    }

The "nCode < 0" condition is always false. It is a good example of good local diagnosis. You may easily find an error this way.

Let's consider the last warning by Intel C++ and be done with it. I think you have understood the concept of "locality".

    void ScintillaKeyMap::showCurrentSettings() {
      int i = ::SendDlgItemMessage(...);
      ...
      for (size_t i = 0 ; i < nrKeys ; i++)
      {
        ...
      }
    }

Again we have no error here. It is just poor naming of variables. The "i" variable has the "int" type at first. Then a new "i" variable of the "size_t" type is defined in the "for()" operator and is used for a different purpose. At the moment when "size_t i" is defined, the compiler knows that a variable with the same name already exists and generates the warning. Again, this did not require the compiler to store any additional data: it must remember anyway that the "int i" variable is available until the end of the function's body.
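
Both local diagnostics can be tried out in isolation. The sketch below uses two hypothetical helpers, `asUnsignedCode` and `outerValueAfterLoop`, which are not part of Notepad++. The first shows why a "less than zero" test on an unsigned parameter can never fire: a negative code converted to an unsigned type becomes a large positive number. The second mirrors the two variables named "i". It assumes a compiler in C++14 mode or later.

```cpp
#include <cstddef>
#include <limits>

// Hypothetical helper: the value a negative hook code takes once it is
// stored in an unsigned parameter such as "UINT nCode".
constexpr unsigned int asUnsignedCode(int code) {
    return static_cast<unsigned int>(code);
}

// Hypothetical helper mirroring the two variables named "i".
constexpr int outerValueAfterLoop() {
    int i = 42;                              // outer "i"
    std::size_t sum = 0;
    for (std::size_t i = 0; i < 3; ++i) {    // inner "i" hides the outer one
        sum += i;
    }
    return sum == 3 ? i : -1;                // the outer "i" is visible again
}
```

Here `asUnsignedCode(-1)` equals `std::numeric_limits<unsigned int>::max()`, so the branch guarded by "nCode < 0" is dead code. The Windows documentation declares this parameter of a hook procedure as `int`, which is presumably what the author of the code intended.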

Third-party static code analyzers

Now let's consider specialized static code analyzers. They do not have such severe speed restrictions, since they are launched about ten times less frequently than compilers. They may work tens of times slower than compilation, but this is not crucial: for instance, the programmer may work with the compiler during the day and launch a static code analyzer at night to get a report about suspicious fragments in the morning. It is quite a reasonable approach.

While paying for it with slower operation, static code analyzers can store the whole code tree, traverse it several times and store a lot of additional information. This lets them find "spread-out" and high-level errors.

Let's see what the PVS-Studio static analyzer can find in Notepad++. Note that I am using a pilot version that is not available for download yet. We will present the new free general-purpose rule set in 1-2 months within the scope of PVS-Studio 4.00.

Of course, the PVS-Studio analyzer also finds errors that can be called "local", as in the case of Intel C++. This is the first sample:

    bool _isPointXValid;
    bool _isPointYValid;
    bool isPointValid() {
      return _isPointXValid && _isPointXValid;
    };

The PVS-Studio analyzer informs us: "V501: There are identical sub-expressions to the left and to the right of the && operator: _isPointXValid && _isPointXValid".

I think the error is clear to you (the second operand was presumably meant to be _isPointYValid), and we will not dwell upon it. The diagnosis is "local" because it is enough to analyze one expression to perform the check.

Here is one more local error, which causes incomplete clearing of the _iContMap array:

    #define CONT_MAP_MAX 50
    int _iContMap[CONT_MAP_MAX];
    ...
    DockingManager::DockingManager()
    {
      ...
      memset(_iContMap, -1, CONT_MAP_MAX);
      ...
    }
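
A small sketch shows what such a call actually does. The third argument of memset is a number of bytes, not a number of elements, so passing the element count fills only the beginning of an int array. The helper `elementsSetToMinusOne` and the constant `kContMapMax` are hypothetical names introduced for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

constexpr std::size_t kContMapMax = 50;  // mirrors CONT_MAP_MAX

// Hypothetical helper: fill `byteCount` bytes of a zeroed int array with
// 0xFF bytes and report how many elements ended up equal to -1.
std::size_t elementsSetToMinusOne(std::size_t byteCount) {
    int map[kContMapMax] = {};
    assert(byteCount <= sizeof(map));  // never write past the array
    std::memset(map, -1, byteCount);
    std::size_t count = 0;
    for (int value : map) {
        if (value == -1) ++count;
    }
    return count;
}
```

Called with `kContMapMax`, the helper reports only `kContMapMax / sizeof(int)` elements set to -1 (12 of 50 when int is 4 bytes wide). Called with `kContMapMax * sizeof(int)`, it reports all 50.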

Here we have the warning "V512: A call of the memset function will lead to a buffer overflow or underflow". This is the correct code:

    memset(_iContMap, -1, CONT_MAP_MAX * sizeof(int));

And now let's go over to more interesting issues. This is the code where we must analyze two branches simultaneously to see that there is something wrong:

    void TabBarPlus::drawItem(
      DRAWITEMSTRUCT *pDrawItemStruct)
    {
      ...
      if (!_isVertical)
        Flags |= DT_BOTTOM;
      else
        Flags |= DT_BOTTOM;
      ...
    }

PVS-Studio generates the message "V523: The then statement is equivalent to the else statement". If we review the code nearby, we may conclude that the author intended to write this text:

    if (!_isVertical)
      Flags |= DT_VCENTER;
    else
      Flags |= DT_BOTTOM;

And now brace yourself for a trial in the form of the following code fragment:

    void KeyWordsStyleDialog::updateDlg()
    {
      ...
      Style & w1Style =
        _pUserLang->_styleArray.getStyler(STYLE_WORD1_INDEX);
      styleUpdate(w1Style, _pFgColour[0], _pBgColour[0],
        IDC_KEYWORD1_FONT_COMBO, IDC_KEYWORD1_FONTSIZE_COMBO,
        IDC_KEYWORD1_BOLD_CHECK, IDC_KEYWORD1_ITALIC_CHECK,
        IDC_KEYWORD1_UNDERLINE_CHECK);

      Style & w2Style =
        _pUserLang->_styleArray.getStyler(STYLE_WORD2_INDEX);
      styleUpdate(w2Style, _pFgColour[1], _pBgColour[1],
        IDC_KEYWORD2_FONT_COMBO, IDC_KEYWORD2_FONTSIZE_COMBO,
        IDC_KEYWORD2_BOLD_CHECK, IDC_KEYWORD2_ITALIC_CHECK,
        IDC_KEYWORD2_UNDERLINE_CHECK);

      Style & w3Style =
        _pUserLang->_styleArray.getStyler(STYLE_WORD3_INDEX);
      styleUpdate(w3Style, _pFgColour[2], _pBgColour[2],
        IDC_KEYWORD3_FONT_COMBO, IDC_KEYWORD3_FONTSIZE_COMBO,
        IDC_KEYWORD3_BOLD_CHECK, IDC_KEYWORD3_BOLD_CHECK,
        IDC_KEYWORD3_UNDERLINE_CHECK);

      Style & w4Style =
        _pUserLang->_styleArray.getStyler(STYLE_WORD4_INDEX);
      styleUpdate(w4Style, _pFgColour[3], _pBgColour[3],
        IDC_KEYWORD4_FONT_COMBO, IDC_KEYWORD4_FONTSIZE_COMBO,
        IDC_KEYWORD4_BOLD_CHECK, IDC_KEYWORD4_ITALIC_CHECK,
        IDC_KEYWORD4_UNDERLINE_CHECK);
      ...
    }

I can say that I am proud of our analyzer PVS-Studio, which managed to find an error here. I think you have hardly noticed it, or have just skipped the whole fragment to see the explanation. Code review is almost helpless before this code. But the static analyzer is patient and pedantic: "V525: The code containing the collection of similar blocks. Check items 7, 7, 6, 7 in lines 576, 580, 584, 588".

I will abridge the text to point out the most interesting fragment:

    styleUpdate(...
      IDC_KEYWORD1_BOLD_CHECK, IDC_KEYWORD1_ITALIC_CHECK,
      ...);
    styleUpdate(...
      IDC_KEYWORD2_BOLD_CHECK, IDC_KEYWORD2_ITALIC_CHECK,
      ...);
    styleUpdate(...
      IDC_KEYWORD3_BOLD_CHECK, !!! IDC_KEYWORD3_BOLD_CHECK !!!,
      ...);
    styleUpdate(...
      IDC_KEYWORD4_BOLD_CHECK, IDC_KEYWORD4_ITALIC_CHECK,
      ...);

This code was most likely written by the Copy-Paste method. As a result, IDC_KEYWORD3_BOLD_CHECK is used instead of IDC_KEYWORD3_ITALIC_CHECK. The warning looks a bit strange with its numbers 7, 7, 6, 7. Unfortunately, the analyzer cannot generate a clearer message. These numbers arise from macros like these:

    #define IDC_KEYWORD1_ITALIC_CHECK (IDC_KEYWORD1 + 7)
    #define IDC_KEYWORD3_BOLD_CHECK (IDC_KEYWORD3 + 6)

The last sample is especially significant because it demonstrates that the PVS-Studio analyzer processed a whole large code fragment at once, detected repetitive structures in it and managed to suspect something wrong by relying on a heuristic method. This is a very significant difference in the levels of information processing performed by compilers and static analyzers.

Some figures

Let's touch upon one more consequence of the "local" analysis performed by compilers and the more global analysis performed by specialized tools. With "local" analysis, it is difficult to tell whether some issue is really dangerous or not. As a result, there are ten times more false alarms. Let me explain this by example.

When we analyzed the Notepad++ project, PVS-Studio generated only 10 warnings. 4 of them indicated real errors. The result is modest, but general-purpose analysis in PVS-Studio is only beginning to develop. It will become one of the best in time.

When analyzing the Notepad++ project, the Intel C++ compiler generated 439 warnings and 3139 remarks. I do not know how many of them point to real errors, but I found the strength to review some part of these warnings and saw only 4 real issues, all related to CharUpper (see the description above).

3578 messages are too many for a close investigation of each of them.
It turns out that the compiler asks me to examine every 20th line of the program (73000 / 3578 ≈ 20). Come on, that is not serious. When you are dealing with a general-purpose analyzer, you must cut off as much unnecessary stuff as possible.

Those who tried the Viva64 rule set (included in PVS-Studio) may notice that it produces the same huge number of false alarms. But we have a different case there: we must detect all the suspicious type conversions. It is more important not to miss an error than to avoid a false alarm. Besides, the tool's settings provide flexible filtering of false alarms.

UPDATE: Note

It turned out that I had written a wrong thing here. There is no error in the sample with CharUpperW, but nobody corrected me. I noticed it myself when I decided to implement a similar rule in PVS-Studio.

The point is that CharUpperW can handle both strings and individual characters. If the high-order part of the pointer is zero, the value is treated as a character and not as a pointer. Of course, the poor design of the Win API interface in this place disappointed me, but the code in Notepad++ is correct.

By the way, it now turns out that Intel C++ has not found any errors at all.
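
The convention described in this note can be imitated in portable code. The following is only a sketch under stated assumptions: `charUpperLike` and `upperChar` are hypothetical names, the "high-order part" is taken to be everything above the low 16 bits, and `std::towupper` stands in for the real case conversion. It is not the Windows implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <cwctype>

// Hypothetical, portable imitation of the calling convention described
// above. It is NOT the Windows implementation of CharUpperW.
wchar_t* charUpperLike(wchar_t* arg) {
    const std::uintptr_t bits = reinterpret_cast<std::uintptr_t>(arg);
    if ((bits >> 16) == 0) {
        // High-order part is zero: the argument is a character, not a pointer.
        const std::wint_t upper = std::towupper(static_cast<std::wint_t>(bits));
        return reinterpret_cast<wchar_t*>(static_cast<std::uintptr_t>(upper));
    }
    // Otherwise treat it as a null-terminated string and convert it in place.
    for (wchar_t* p = arg; *p != L'\0'; ++p) {
        *p = static_cast<wchar_t>(std::towupper(static_cast<std::wint_t>(*p)));
    }
    return arg;
}

// Single-character form, written the same way as the Notepad++ fragment:
// the character is packed into a pointer and unpacked from the result.
wchar_t upperChar(wchar_t c) {
    const std::uintptr_t packedBits = static_cast<std::uintptr_t>(c);
    assert((packedBits >> 16) == 0);  // must fit into the low-order part
    wchar_t* packed = reinterpret_cast<wchar_t*>(packedBits);
    return static_cast<wchar_t>(
        reinterpret_cast<std::uintptr_t>(charUpperLike(packed)));
}
```

Seen this way, the casts in the Notepad++ loop are just the packing and unpacking steps of `upperChar`, which is why the "#810" warning is a false alarm there.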