Data flow analysis is a technology for source code analysis, widely used in various development tools: compilers, linters, IDE. We'll talk about it exemplifying with design of a static analyzer. The talk covers classification and various kinds of data flow analysis, neighbouring technologies supporting each other, obstacles arising during development, surprises from C++ language when one tries to analyze the code. In this talk, some errors, detected in real projects using this technology, are shown in detail.
How data flow analysis operates in a static code analyzer
1. How data flow analysis
operates in a
static code analyzer
Pavel Belikov
C++ Developer, PVS-Studio
belikov@viva64.com
2. / 50
PVS-Studio
• Static analyzer for C, C++, C# code
• It works on Windows, Linux, macOS
• Plugin for Visual Studio
• Integrates into SonarQube and Jenkins
• Quick start (Standalone, pvs-studio-analyzer)
2
3. / 50
Contents:
• Types and objectives of Data Flow Analysis
• Analysis of conditions
• Analysis of loops
• Symbolic execution
• Examples of errors found in real projects
3
4. / 50
What is data flow analysis
•Calculate a set of values for expression or its properties
•Numbers
•Null/non-null pointer
•Strings
•The size and contents of containers/optional
• Determine state of variables
4
5. / 50
The main objectives
• Set of values must be a superset of real values
• Time is limited
• Number of false positives must be minimized
5
8. / 50
The basic equation
• b – a code block
• in/out - a state of variables when entering and exiting the block
• trans - a function that transforms the state of variables in the block
• join - a function that merges the state of variables in different paths of execution
8
10. / 50
Example
int a = 3; in = {}, out = {a=3}
if (something)
a = 4; in = {a=3}, out = {a=4}
std::cout << a; in = {a=3}∪{a=4}={a=[3;4]}
10
11. / 50
Flow sensitivity
• Flow-sensitive analysis depends on the order of expressions in code
• An example of a flow-insensitive analysis: searching for modified variables in a block
• A way for code traversal is needed
11
12. / 50
Flow sensitivity
• Data Flow works with Control Flow Graph
• In practice you can use AST (abstract syntax tree)
• AST is simpler and more understandable to most developers
• There are more tools for AST, parsers can generate AST
• CFG can be simulated on top of the AST
12
13. / 50
Flow sensitivity
• Forward analysis
• Pass the information to the block B from the preceding blocks
• It suits well for calculating the values of variables and determining reaching
definitions
• Backward analysis
• Pass the information from the block B to the preceding blocks
• It suits well for live variable analysis
13
14. / 50
Example of backward analysis
__private_extern__ void
YSHA1Transform(u_int32_t state[5],
const unsigned char buffer[64])
{
u_int32_t a, b, c, d, e;
....
state[0] += a;
state[1] += b;
state[2] += c;
state[3] += d;
state[4] += e;
/* Wipe variables */
a = b = c = d = e = 0;
}
XNU kernel
V1001 CWE-563 The 'a' variable is assigned but is not used until the end of the function. sha1mod.c 120
14
15. / 50
Example of forward analysis
• Reaching definitions
• REACH - a set of variable definitions that can be read in the expression S
• GEN - new definitions
• KILL - "killed" definitions
15
16. / 50
Example of forward analysis
ParseResult ParseOption (string option, ref string[] args , CompilerSettings settings) {
AssemblyResource res = null; GEN={res0}
switch (s.Length) {
case 1:
res = new AssemblyResource (s[0], Path.GetFileName (s[0])); GEN={res1}, KILL={res0}
break;
case 2:
res = new AssemblyResource (s[0], s[1]); GEN={res2}, KILL={res0}
break;
default:
report.Error (-2005, "Wrong number of arguments for option '{0}'", option);
return ParseResult.Error;
}
if (res != null) { ... } REACH={res1, res2}
}
ILSpy
V3022 Expression 'res != null' is always true. settings.cs 827
16
17. / 50
Must vs may
• Must
• Data flow fact must be true for all paths
• It’s expressed through the intersection of sets
• May
• Fact should be correct at least for one path
• It is expressed through the union of sets
17
18. / 50
Must vs may
• Static analysis often works with may
• No one writes
int *p = nullptr;
if (something) p = nullptr;
else if (something_else) p = nullptr;
else p = nullptr;
*p = 42;
18
19. / 50
Must vs may
STDMETHODIMP sdnAccessible::get_computedStyle(
BSTR __RPC_FAR* aStyleProperties,
BSTR __RPC_FAR* aStyleValues,
unsigned short __RPC_FAR* aNumStyleProperties)
{
if (!aStyleProperties || aStyleValues || !aNumStyleProperties)
return E_INVALIDARG;
....
aStyleValues[realIndex] = ::SysAllocString(value.get());
....
}
Mozilla Thunderbird
V522 Dereferencing of the null pointer ‘aStyleValues’ might take place. sdnaccessible.cpp 252
19
20. / 50
Path-sensitive analysis
• May in one of the paths is not enough
• What if the path is impossible?
• We need to analyze the conditions!
20
21. / 50
Path-sensitive analysis
enum {
Runesync = 0x80,
Runeself = 0x80,
};
char* utfrune(const char *s, int c) {
....
if (c < Runesync) return strchr(s, c); // c: then [INT_MIN; 0x79] else [0x80; INT_MAX]
for(;;) {
c1 = *(unsigned char*)s;
if (c1 < Runeself) { // c1: then [0; 0x79]
if (c1 == 0) return 0; // c1: then 0 else [1; 0x79]
if (c1 == c) return (char*)s; // if ([1; 0x79] == [0x80; INT_MAX])
....
}
....
}
return 0;
}
RE2 V547 CWE-570 Expression 'c1 == c' is always false. rune.cc 247
21
23. / 50
Short circuit
23
x = [0; INT_MAX]
x = [INT_MIN; -1]
if ( x >= 0 && x <= 10 ) {
} else {
}
24. / 50
Short circuit
24
x = [0; INT_MAX]
x = [INT_MIN; -1]
x = [0; 10]
x = [11; INT_MAX]
then: x = [0; 10]
else: x = [INT_MIN; -1] ∪ [11; INT_MAX]
x = [0; INT_MAX]
x = [INT_MIN; -1]
if ( x >= 0 && x <= 10 ) {
} else {
}
26. / 50
Join problem
int *p;
if (condition) {
p = new int;
} else {
p = nullptr;
}
// p - nullable
if (condition) {
*p = 42; // null dereference?
}
26
27. / 50
Join problem
• We lose the information when we unite the paths
• It is better to postpone the merging of states for as long as possible
• But there is a problem with path explosion
27
28. / 50
Join problem
int *p;
if (condition) {
p = new int; // p = non null if condition
} else {
p = nullptr; // p = null if !condition
}
// p = non null if condition
// ∪ null if !condition
if (condition) {
// p = non null
*p = 42;
}
28
29. / 50
Join problem
int arr[4];
int a, b;
if (condition) {
a = 1;
b = 2;
} else {
a = 2;
b = 1;
}
return arr[a + b]; // a = 1 if condition ∪ 2 if !condition
// b = 2 if condition ∪ 1 if !condition
// a + b = 3 if condition ∪ 3 if !condition
// a + b = 3
29
31. / 50
Try-catch
try {
SomeClass c(someFunction(), 42);
c.foo();
return c + “abc”;
} catch (...) {
}
• call of someFunction()
• constructor of c variable
• call of foo() method
• constructors of temporary objects
• operator +
• constructor for returned object
• destructors for temporary objects
• destructor of c variable
31
32. / 50
Loop analysis
•In general case, it is difficult and slow to analyze
•Analyze the first iteration separately
•"Kill" all new definitions of variables after a loop
32
Are you stuck in
an infinite loop?
YesNo
33. / 50
Loop invariants
public final R getSomeBuildWithWorkspace() {
int cnt=0; // <= variable definition outside of the loop
for (R b = getLastBuild(); cnt<5 && b!=null; b=b.getPreviousBuild()) {
FilePath ws = b.getWorkspace();
if (ws!=null) return b;
}
return null;
}
Jenkins
V6022 Expression 'cnt < 5' is always true AbstractProject.java 557
33
34. / 50
The first iteration
void Measure::read(XmlReader& e, int staffIdx) {
Segment* segment = 0;
....
while (e.readNextStartElement()) {
const QStringRef& tag(e.name());
if (tag == "move")
e.initTick(e.readFraction().ticks() + tick());
....
else if (tag == "sysInitBarLineType") {
....
segment = getSegmentR(SegmentType::BeginBarLine, 0); // !!!
segment->add(barLine); // <= OK
}
....
else if (tag == "Segment")
segment->read(e); // <= ERROR
....
}
}
MuseScore V522 Dereferencing of the null pointer 'segment' might take place. measure.cpp 2220
34
35. / 50
Loop control flow
SkOpSpan* SkOpContour::undoneSpan() {
SkOpSegment* testSegment = &fHead;
bool allDone = true;
do {
if (testSegment->done()) {
continue;
}
allDone = false;
return testSegment->undoneSpan();
} while ((testSegment = testSegment->next()));
if (allDone) {
fDone = true;
}
return nullptr;
}
Skia Graphics Engine
V547 CWE-571 Expression 'allDone' is always true. skopcontour.cpp 43
35
36. / 50
Loop control flow
SkOpSpan* SkOpContour::undoneSpan() {
SkOpSegment* testSegment = &fHead;
bool allDone = true;
do {
if (testSegment->done()) {
continue;
}
allDone = false; // <= we don’t take into account this path
return testSegment->undoneSpan();
} while ((testSegment = testSegment->next()));
if (allDone) {
fDone = true;
}
return nullptr;
}
Skia Graphics Engine
V547 CWE-571 Expression 'allDone' is always true. skopcontour.cpp 43
36
37. / 50
Loop counter analysis
for (int i = 0; i < 10; ++i)
{
// i = [INT_MIN; 9] ?
// i = [0; 9] !!!
}
37
38. / 50
Loop counter analysis
#define AE_IDLE_TIMEOUT 100
static void
ae_stop_rxmac(ae_softc_t *sc)
{
int i;
....
/*
* Wait for IDLE state.
*/
for (i = 0; i < AE_IDLE_TIMEOUT; i--) { // <=
val = AE_READ_4(sc, AE_IDLE_REG);
if ((val & (AE_IDLE_RXMAC | AE_IDLE_DMAWRITE)) == 0)
break;
DELAY(100);
}
....
}
FreeBSD Kernel
V621 Consider inspecting the 'for' operator. It's possible that the loop will be executed incorrectly or won't be executed at all. if_ae.c 1663
38
39. / 50
There is a problem
for (int i = 0; i < n; ++i) {
for (int j = i + 1; j < n; ++j) {
// j - i
}
}
39
40. / 50
There is a problem
int i = /* [0; 42] */;
int j = i + 1; // [1; 43]
int r = j - i; // [-43; 41]???
40
42. / 50
Symbolic execution
• Calculate everything in symbolic expressions
• Create a system of equations
• Upload it into SMT solver
• ???
• PROFIT
42
43. / 50
Symbolic execution
public static MMMethodKind valueOf(....) {
MMMethodKind result = OTHER;
for (MMMethodKind k : values()) {
if (k.detector.test(method) && result.level < k.level) {
if (result.level == k.level) {
throw new SpoonException(....);
}
result = k;
}
}
return result;
}
Spoon
MMMethodKind.java:129: V6007 Expression 'result.level == k.level' is always false.
43
44. / 50
Context sensitive
•foo();
•We can reset all of the accumulated information
•Annotate popular libraries
•Enjoy the 10 ways to pass a variable to a function
44
45. / 50
Context sensitive
•analysis of a function considering the context of the caller
•scales poorly
•useful for analyzing small functions (getters/setters, for example)
45
46. / 50
Context sensitive
void foo(int *p) { // analyze two times
*p = 42;
}
void bar() {
int *p = something ? new int : nullptr;
foo(p); // repeatedly analyze foo and find a bug
}
46
47. / 50
Context insensitive
void foo(int *p) { // p != nullptr
*p = 42;
}
void bar() {
int *p = something ? new int : nullptr;
foo(p); // p != nullptr contract is violated, found a bug
}
47
48. / 50
Context insensitive
• Analyze the body of a function, compose an annotation for it
• Contract for arguments
• Presence of a global state
• Returned value
• And much more
• contracts proposal
void foo(const std::vector<int> &indices)
[[expects: !indices.empty()]];
48
49. / 50
Conclusions
• Data flow analysis is a useful technique for finding errors
• To find bugs one has to operate large and sometimes strange set of properties
• The combination of different techniques allows to increase the reliability of analysis
results
• Various heuristics and assumptions allow finding more bugs
• Every significant static analyzer must use data flow analysis
49
50. / 50
Answering your questions
PVS-Studio: http://www.viva64.com/ru/pvs-studio/
50