This document discusses whether static analysis needs machine learning. It begins with an introduction to static analysis and outlines existing static analysis solutions like DeepCode, Infer, SapFix, Embold, Source{d}, Clever-Commit, and CodeGuru. It then addresses problems with learning manually or from real large code bases, like outdated code and lack of documentation. Finally, it discusses promising approaches like analyzing code style, collecting additional metrics, and best practices for specific frameworks.
Does static analysis need machine learning
1. Does static analysis need
machine learning?
Anti-Talk
Victoria Khanieva
PVS-Studio
2. Speaker
Victoria Khanieva
• C++ developer at PVS-Studio
• Worked on support for the MISRA standard
• Wrote articles about checking open-source projects
khanieva@viva64.com
www.viva64.com
3. Introduction to static analysis
Existing solutions and the approaches they implement
Problems and pitfalls when creating an analyzer:
When learning "manually"
When learning on a real large code base
Most promising approaches
Agenda
8. A way to find errors and flaws in program source code.
Detect errors in programs
Get tips on code formatting
Compute metrics
….
Static analysis
21. Java, C, C++, Objective-C
By Facebook
Open-source code
You can try Infer on your projects
Based on Hoare logic, separation logic,
bi-abduction, and abstract interpretation
theory
Infer
23. Platform to analyze code quality
Suggests edits
Uses NLP to search for dependencies
between functions and methods
Embold
24. Open-source
Related posts
Repository with dataset for learning
Code-style detection
Platform for collecting metrics and statistics
Source{d}
25. Fixing code style in Source{d}
Based on the article “STYLE-ANALYZER: fixing code style inconsistencies
with interpretable unsupervised algorithms”
26. By Mozilla and Ubisoft
Searches for suspicious commits
Based on the publication “CLEVER: Combining Code Metrics with Clone
Detection for Just-In-Time Fault Prevention and Resolution in Large
Industrial Projects”
Clever-Commit
27. Java
By Amazon
Recommendations on best practices from the
documentation and code base
CodeGuru
29. Analyze code to search for errors
Analyze code to search for deviations from best
practices
Analyze artifacts’ code
Collect metrics and data on code
Suggest code-style fixes
Main directions
30. Selected base of open-source repositories
Dataset selected manually
Own project base
Ways to learn
32. We need to find:
if (A == A)
What it may look like:
• if (X && A == A)
• if (A + 1 == A + 1)
• if (A[i] == A[i])
• if ((A) == (A))
• …
"Manual" dataset selection
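The variants listed on this slide all reduce to the same rule once the operands are normalized, which is why a hand-written check does not need an example for every spelling. Below is a minimal sketch of that idea, not the way any particular analyzer implements it: it works on the condition text (without the surrounding `if (...)`) instead of a syntax tree, and the class and method names are hypothetical.

```java
public class SelfComparisonCheck {
    // Remove whitespace and redundant outer parentheses: " ( (A) ) " -> "A".
    static String normalize(String operand) {
        String s = operand.replaceAll("\\s+", "");
        while (s.length() >= 2 && s.startsWith("(") && s.endsWith(")") && wrapsWhole(s)) {
            s = s.substring(1, s.length() - 1);
        }
        return s;
    }

    // True if the opening parenthesis at index 0 is closed only by the last character.
    static boolean wrapsWhole(String s) {
        int depth = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == '(') depth++;
            else if (s.charAt(i) == ')') depth--;
            if (depth == 0 && i < s.length() - 1) return false;
        }
        return depth == 0;
    }

    // Toy simplification: split on && and || textually, then compare both sides of ==.
    // A real checker would compare subtrees of the parsed expression instead.
    static boolean hasSelfComparison(String condition) {
        for (String clause : condition.split("&&|\\|\\|")) {
            String[] sides = clause.split("==");
            if (sides.length == 2 && normalize(sides[0]).equals(normalize(sides[1]))) {
                return true;
            }
        }
        return false;
    }
}
```

Calling `SelfComparisonCheck.hasSelfComparison("A[i] == A[i]")` reports the pattern, while `"A == B"` does not. Logical operators nested inside parentheses are beyond this textual sketch.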
35. We need to find:
int y = x / 0;
In practice
What it may look like:
template <class T> class numeric_limits {
  ....
};
namespace boost {
  ....
}
namespace boost {
  namespace hash_detail {
    template <class T> void dsizet(size_t x) {
      size_t length = x / (limits<int>::digits - 31);
    }
  }
}
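In the snippet above the zero is not written literally; it only appears after the divisor is evaluated. A minimal sketch of how a checker might fold constants in a divisor follows. It assumes that `limits<int>::digits` is known to be 31 (as `std::numeric_limits<int>::digits` is for a 32-bit `int`); the expression model and all names are hypothetical.

```java
import java.util.Map;
import java.util.OptionalLong;

public class DivisorFolding {
    // An expression that can be folded to a constant if all its parts are known.
    interface Expr {
        OptionalLong fold(Map<String, Long> constants);
    }

    static Expr lit(long value) {
        return constants -> OptionalLong.of(value);
    }

    // A named compile-time constant; unknown names cannot be folded.
    static Expr name(String id) {
        return constants -> constants.containsKey(id)
                ? OptionalLong.of(constants.get(id))
                : OptionalLong.empty();
    }

    static Expr sub(Expr left, Expr right) {
        return constants -> {
            OptionalLong a = left.fold(constants);
            OptionalLong b = right.fold(constants);
            return a.isPresent() && b.isPresent()
                    ? OptionalLong.of(a.getAsLong() - b.getAsLong())
                    : OptionalLong.empty();
        };
    }

    // Warn only when the divisor is provably zero.
    static boolean divisorIsZero(Expr divisor, Map<String, Long> constants) {
        OptionalLong value = divisor.fold(constants);
        return value.isPresent() && value.getAsLong() == 0;
    }
}
```

With `limits<int>::digits` mapped to 31, the divisor `sub(name("limits<int>::digits"), lit(31))` folds to zero and is reported; if the constant is unknown, the sketch stays silent instead of guessing.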
36. @Override
public String getText(Mode mode) {
  StringBuilder sb = new StringBuilder();
  ....
  if (filter.getMessage()
            .toLowerCase(Locale.ENGLISH)
            .startsWith("Each ")) {
    sb.append(" has base power and toughness ");
  } else {
    sb.append(" have base power and toughness ");
  }
  ....
  return sb.toString();
}
Data flow analysis
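In the `getText` snippet the string is lowercased before being compared with a prefix that contains an uppercase letter, so the `if` branch can never be taken. The sketch below reproduces that behaviour and shows one way a data-flow rule could phrase it; the class and method names are hypothetical, and the lowercase prefix in the "fixed" variant is an assumption about the intended behaviour.

```java
import java.util.Locale;

public class LowerCasePrefixCheck {
    // The condition as written on the slide.
    static boolean startsWithEachAsWritten(String message) {
        return message.toLowerCase(Locale.ENGLISH).startsWith("Each ");
    }

    // Assumed intent: compare with a lowercase prefix.
    static boolean startsWithEachFixed(String message) {
        return message.toLowerCase(Locale.ENGLISH).startsWith("each ");
    }

    // Rule sketch: a non-empty prefix that changes when lowercased
    // can never match the start of a lowercased string.
    static boolean prefixCanNeverMatch(String prefixLiteral) {
        return !prefixLiteral.isEmpty()
                && !prefixLiteral.equals(prefixLiteral.toLowerCase(Locale.ENGLISH));
    }
}
```

For the input `"Each creature"` the first method returns `false` and the second returns `true`, and `prefixCanNeverMatch("Each ")` is `true`.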
37. Data flow analysis
uint32_t* BnNew() {
  uint32_t* result = new uint32_t[kBigIntSize];
  memset(result, 0, kBigIntSize * sizeof(uint32_t));
  return result;
}
std::string AndroidRSAPublicKey(crypto::RSAPrivateKey* key) {
  ....
  uint32_t* n = BnNew();
  ....
  RSAPublicKey pkey;
  ....
  if (pkey.n0inv == 0)
    return kDummyRSAPublicKey; // <= n is not released on this path
  ....
}
38. "So many projects on GitHub! The analyzer will learn from their
repositories and commits" turns into collecting and labeling
commits.
If a manually collected learning base is unreliable, what can we
expect from an automatically collected one?
Learning on many projects
39. Check out the commit with the word "fix":
Learning on many projects
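Selecting commits by the word "fix" is the kind of automatic collection this slide questions. A minimal sketch of such a filter is shown below, assuming commit messages are already available as plain strings; the class name and the sample messages in the usage note are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class FixCommitFilter {
    // Matches "fix", "fixes", "fixed" as whole words, ignoring case.
    private static final Pattern FIX_WORD =
            Pattern.compile("\\bfix(es|ed)?\\b", Pattern.CASE_INSENSITIVE);

    // Keep every commit message that mentions a fix, whatever was actually fixed.
    static List<String> selectFixCommits(List<String> messages) {
        List<String> selected = new ArrayList<>();
        for (String message : messages) {
            if (FIX_WORD.matcher(message).find()) {
                selected.add(message);
            }
        }
        return selected;
    }
}
```

Given the hypothetical messages "Fix typo in README", "fix null dereference in parser", and "Add feature", the filter keeps the first two. Only one of them describes a code defect, so the selected commits would still need manual labeling before they could serve as training data.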
40. The analyzer has to keep up with the language
it checks
Most projects use outdated standards
Most projects don’t use new language constructs
Outdated code
45. Code example:
ObjectOutputStream out = new ObjectOutputStream(....);
SerializedObject obj = new SerializedObject();
obj.state = 100;
out.writeObject(obj);
obj.state = 200;
out.writeObject(obj);
out.close();
Why documentation matters
46. The analyzer suggests:
ObjectOutputStream out = new ObjectOutputStream(....);
SerializedObject obj = new SerializedObject();
obj.state = 100;
out.writeObject(obj);
obj = new SerializedObject(); // Add this line
obj.state = 200;
out.writeObject(obj);
out.close();
Why documentation matters
47. What happens without the edit:
ObjectOutputStream out = new ObjectOutputStream(....);
SerializedObject obj = new SerializedObject();
obj.state = 100;
out.writeObject(obj); // stores the object with the state = 100
obj.state = 200;
out.writeObject(obj); // stores the object with the state = 100
out.close();
Why documentation matters
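The behaviour on the last three slides can be reproduced with an in-memory stream. The sketch below writes the same object twice and reads both copies back. `SerializedObject` is a stand-in for the class on the slides, and the optional `reset()` call is an alternative taken from the `ObjectOutputStream` API, not from the talk.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationHandleDemo {
    // Stand-in for the class used on the slides.
    static class SerializedObject implements Serializable {
        private static final long serialVersionUID = 1L;
        int state;
    }

    // Writes one object twice (state 100, then 200) and returns the two states read back.
    static int[] writeTwiceAndReadBack(boolean resetBetweenWrites) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                SerializedObject obj = new SerializedObject();
                obj.state = 100;
                out.writeObject(obj);
                if (resetBetweenWrites) {
                    out.reset(); // forget objects already written to the stream
                }
                obj.state = 200;
                out.writeObject(obj); // without reset: only a back-reference is written
            }
            try (ObjectInputStream in =
                         new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                int first = ((SerializedObject) in.readObject()).state;
                int second = ((SerializedObject) in.readObject()).state;
                return new int[] {first, second};
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

`writeTwiceAndReadBack(false)` yields 100 and 100, matching the comments on the slide, while `writeTwiceAndReadBack(true)` yields 100 and 200. This behaviour is described in the documentation, which is hard to infer from a code base alone.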
52. The reason for getting a warning may be unclear.
The reason for NOT getting a warning may be unclear as well.
How to fix?
Additional learning (will it help?)
A mechanism to hide warnings (not universal)
False positives