Static code analysis is a useful tool that can help detect bugs early in the software development life cycle. I will explain the basics of static analysis and show the challenges we face when analyzing Python code. I will introduce a data-driven approach to code analysis that makes use of public code and example-based learning, and show how it can be applied to analyzing Python code.
4. Tools & Techniques for Ensuring Code Quality
Static, manual: manual code reviews
Static, automated: static analysis / automated code reviews
Dynamic, manual: debugging, profiling, ...
Dynamic, automated: unit testing, system testing, integration testing
5. Discovering problems in code
def encode(obj):
    """
    Encode a (possibly nested)
    dictionary containing complex values
    into a form that can be serialized
    using JSON.
    """
    e = {}
    for key,value in obj:
        if isinstance(value,dict):
            e[key] = encode(value)
        elif isinstance(value,complex):
            e[key] = {'type' : 'complex',
                      'r' : value.real,
                      'i' : value.imaginary}
    return e

d = {'a' : 1j+4,'s' : {'d' : 4+5j}}
print encode(d)
Iterating over obj yields only the keys of the dictionary (obj.items() is needed).
value.imaginary does not exist (value.imag would be correct).
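For reference, here is a sketch of the function with both fixes applied. It is written in Python 3 syntax (print as a function), which is an assumption, since the slides use Python 2. As in the original, values that are neither dicts nor complex numbers are silently dropped.

```python
def encode(obj):
    """
    Encode a (possibly nested) dictionary containing complex values
    into a form that can be serialized using JSON.
    """
    e = {}
    for key, value in obj.items():          # fix 1: iterate over (key, value) pairs
        if isinstance(value, dict):
            e[key] = encode(value)
        elif isinstance(value, complex):
            e[key] = {'type': 'complex',
                      'r': value.real,
                      'i': value.imag}      # fix 2: the attribute is called .imag
    return e


d = {'a': 1j + 4, 's': {'d': 4 + 5j}}
print(encode(d))
```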
6. Dynamic Analysis (e.g. unit testing)
def test_encode():
    d = {'a' : 1j+4,
         's' : {'d' : 4+5j}}
    r = encode(d) #this will fail...
    assert r['a'] == {'type' : 'complex',
                      'r' : 4,
                      'i' : 1}
    assert r['s']['d'] == {'type' : 'complex',
                           'r' : 4,
                           'i' : 5}
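A small sketch of why dynamic analysis finds these problems one at a time: the first bug raises before the second one is ever reached. The helper names encode_buggy, encode_half_fixed and first_error are hypothetical and only serve this illustration; the exception types are those raised under Python 3.

```python
def encode_buggy(obj):
    e = {}
    for key, value in obj:                  # bug 1: iterating a dict yields keys only
        if isinstance(value, dict):
            e[key] = encode_buggy(value)
        elif isinstance(value, complex):
            e[key] = {'type': 'complex', 'r': value.real,
                      'i': value.imaginary}  # bug 2: no such attribute
    return e


def encode_half_fixed(obj):
    e = {}
    for key, value in obj.items():          # bug 1 fixed
        if isinstance(value, dict):
            e[key] = encode_half_fixed(value)
        elif isinstance(value, complex):
            e[key] = {'type': 'complex', 'r': value.real,
                      'i': value.imaginary}  # bug 2 still present
    return e


def first_error(func, arg):
    """Run func(arg) and return the name of the exception it raises, if any."""
    try:
        func(arg)
    except Exception as exc:
        return type(exc).__name__
    return None


d = {'a': 1j + 4, 's': {'d': 4 + 5j}}
print(first_error(encode_buggy, d))       # unpacking the key 'a' fails first
print(first_error(encode_half_fixed, d))  # only now does the attribute error surface
```

A test run therefore has to be repeated after each fix, and it only covers the code paths the test data actually exercises.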
7. Static Analysis (for humans)
encode is a function with 1 parameter which always returns a dict.
(I): obj should be an iterator/list of tuples with two elements.
encode gets called with a dict, which does not satisfy (I).
A value of type complex does not have an .imaginary attribute!
encode is called with a dict, which again does not satisfy (I).
8. How static analysis tools work (short version)
1. Parse the code into a data structure, typically an abstract syntax tree (AST).
2. (Optionally) annotate it with additional information to make analysis easier.
3. Traverse the AST to find problems.
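The steps above can be sketched with Python's standard ast module. This is a minimal illustration, not how any particular tool is implemented: the function name find_attribute_uses is hypothetical, and step 2 (annotation, e.g. type inference) is skipped, so the check flags every use of the attribute name regardless of the type of the object.

```python
import ast

SOURCE = '''
def encode(obj):
    e = {}
    for key, value in obj:
        if isinstance(value, complex):
            e[key] = {'r': value.real, 'i': value.imaginary}
    return e
'''


def find_attribute_uses(source, attr_name):
    """Return the line numbers where `<something>.<attr_name>` is accessed."""
    tree = ast.parse(source)              # step 1: source text -> AST
    hits = []
    for node in ast.walk(tree):           # step 3: traverse all nodes
        if isinstance(node, ast.Attribute) and node.attr == attr_name:
            hits.append(node.lineno)
    return hits


print(find_attribute_uses(SOURCE, "imaginary"))
```

Note that the code is never executed; the problem is found purely by inspecting the tree.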
9. Python Tools for Static Analysis
PyLint (most comprehensive tool)
http://www.pylint.org/
PyFlakes (smaller, less verbose)
https://pypi.python.org/pypi/pyflakes
Pep8 (style and some structural checks)
https://pypi.python.org/pypi/pep8
(... and many others)
14. Our approach
1. Code is data! Let's not keep it in text files but store it in a useful form that we can work with easily (e.g. a graph).
2. Make it super-easy to specify errors and bad code patterns.
3. Make it possible to learn from user feedback and publicly available code.
15. Building the Code Graph
[Figure: the encode example next to its syntax tree, drawn as a graph. Node types shown: functiondef, assign, for, name, dict. Edge labels shown: body, targets, value, iterator.]
17. Building the Code Graph
[Figure: the same graph with attributes added. The functiondef node carries {name: 'encode', args: [...]}, a name node carries {id: 'e'}, positions are marked {i: 0} and {i: 1}, nodes are identified by hashes (e4fa76b..., a76fbc41..., c51fa291..., 74af219...), and a $type: dict annotation is shown.]
19. Advantages
- Simple detection of (exact) duplicates
- Semantic diffing of modules, classes, functions, ...
- Semantic code search on the whole tree
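As an illustration of the first advantage, here is a sketch of hash-based duplicate detection on syntax trees. The hashing scheme (SHA-1 over ast.dump of the function body) and the function names are assumptions made for this example, not the scheme used on the slides. Because the hash is computed on the tree rather than the text, differences in whitespace and comments do not matter.

```python
import ast
import hashlib
from collections import defaultdict


def body_hash(func):
    """Hash the statements of a function body, ignoring layout and comments."""
    dumped = "".join(ast.dump(stmt) for stmt in func.body)
    return hashlib.sha1(dumped.encode("utf-8")).hexdigest()


def find_duplicate_bodies(source):
    """Group function names whose bodies have identical syntax trees."""
    groups = defaultdict(list)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            groups[body_hash(node)].append(node.name)
    return [names for names in groups.values() if len(names) > 1]


SOURCE = (
    "def f(x):\n"
    "    return x + 1\n"
    "\n"
    "def g(x):\n"
    "    return x+1   # same body, different spacing\n"
    "\n"
    "def h(x):\n"
    "    return x + 2\n"
)

print(find_duplicate_bodies(SOURCE))
```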
21. Code issues = patterns on the graph
[Figure: the subgraph for value.imaginary in the encode example: an attribute node whose value edge leads to a name node {id : value} with $type complex, and whose attr edge leads to a name node {id : imaginary}.]
22. Using YAML to describe graph patterns
node_type: attribute
value:
  $type: complex
attr: imaginary
23. Generalizing patterns
node_type: attribute
value:
  $type: complex
attr:
  $not:
    $or: [real, imag]
25. "else" in for loop without break statement
node_type: for
body:
  $not:
    $anywhere:
      node_type: break
orelse:
  $anything: {}

values = ["foo", "bar", ... ]
for i,value in enumerate(values):
    if value == 'baz':
        print "Found it!"
else:
    print "didn't find 'baz'!"
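To make the idea concrete, here is a sketch of how such a pattern could be evaluated against a syntax tree. The semantics given to $not, $or, $anywhere and $anything are assumptions inferred from the slides, and to_dict, descendants, matches and find_matches are hypothetical helpers, not the actual matching engine. The pattern is written as a Python dict mirroring the YAML.

```python
import ast


def to_dict(node):
    """Convert an ast node into nested dicts/lists with a node_type key."""
    if isinstance(node, ast.AST):
        d = {"node_type": type(node).__name__.lower(),
             "lineno": getattr(node, "lineno", None)}
        for field, value in ast.iter_fields(node):
            d[field] = to_dict(value)
        return d
    if isinstance(node, list):
        return [to_dict(item) for item in node]
    return node


def descendants(tree):
    """Yield tree itself and everything nested inside it."""
    yield tree
    if isinstance(tree, dict):
        children = list(tree.values())
    elif isinstance(tree, list):
        children = tree
    else:
        children = []
    for child in children:
        yield from descendants(child)


def matches(pattern, tree):
    if isinstance(pattern, dict):
        if "$not" in pattern:
            return not matches(pattern["$not"], tree)
        if "$or" in pattern:
            return any(matches(p, tree) for p in pattern["$or"])
        if "$anywhere" in pattern:
            return any(matches(pattern["$anywhere"], t) for t in descendants(tree))
        if "$anything" in pattern:
            return bool(tree)             # assumed meaning: any non-empty value
        return isinstance(tree, dict) and all(
            key in tree and matches(sub, tree[key]) for key, sub in pattern.items())
    return pattern == tree                # plain values must be equal


def find_matches(pattern, source):
    """Return the line numbers of all nodes matching the pattern."""
    tree = to_dict(ast.parse(source))
    return [t["lineno"] for t in descendants(tree)
            if isinstance(t, dict) and matches(pattern, t)]


PATTERN = {
    "node_type": "for",
    "body": {"$not": {"$anywhere": {"node_type": "break"}}},
    "orelse": {"$anything": {}},
}

SUSPICIOUS = (
    "for value in values:\n"
    "    if value == 'baz':\n"
    "        print('Found it!')\n"
    "else:\n"
    "    print(\"didn't find 'baz'!\")\n"
)

WITH_BREAK = (
    "for value in values:\n"
    "    if value == 'baz':\n"
    "        print('Found it!')\n"
    "        break\n"
    "else:\n"
    "    print(\"didn't find 'baz'!\")\n"
)

print(find_matches(PATTERN, SUSPICIOUS))
print(find_matches(PATTERN, WITH_BREAK))
```

Because the pattern is plain data, a new check can be added by writing another dict (or YAML document) without touching the matching code.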
26. Learning from false positives (I)
values = ["foo", "bar", ... ]
for i,value in enumerate(values):
    if value == 'baz':
        print "Found it!"
        return value
else:
    print "didn't find 'baz'!"

node_type: for
body:
  $not:
    $or:
      - $anywhere:
          node_type: break
      - $anywhere:
          node_type: return
orelse:
  $anything: {}
27. Learning from false positives (II)
node_type: for
body:
  $not:
    $or:
      - $anywhere:
          node_type: break
          exclude:
            node_type:
              $or: [while,for]
      - $anywhere:
          node_type: return
orelse:
  $anything: {}

values = ["foo", "bar", ... ]
for i,value in enumerate(values):
    if value == 'baz':
        print "Found it!"
    for j in ...:
        #...
        break
else:
    print "didn't find 'baz'!"
28. patterns vs. code
(no exception type specified)
node_type: tryexcept
handlers:
  node_type: excepthandler
  type: null

(empty exception handler)
node_type: tryexcept
handlers:
  - node_type: excepthandler
    body:
      - node_type: pass
29. Summary & Feedback
1. Storing code as a graph opens up many interesting possibilities. Let's stop thinking of code as text!
2. We can learn from user feedback or even use machine learning to create and adapt code patterns!
3. Everyone can write code checkers! => crowd-source code quality!
First, we transform the code into a so-called "abstract syntax tree". This is a representation that can be easily manipulated programmatically.
We store all syntax trees of the project in a graph database (either on-disk or in-memory) to be able to perform queries on the graph and store it for later analysis. Nodes in modules can be linked, e.g. to point from a function call in a given module to the definition of that function in another module.
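A minimal in-memory sketch of this idea, assuming a plain dict of nodes and a list of labelled edges instead of a real graph database. The function name build_graph and the edge layout (source id, field name, position, target id) are hypothetical, and cross-module links from calls to definitions are not implemented here.

```python
import ast


def build_graph(source):
    """Turn the syntax tree of `source` into (nodes, edges).

    nodes: {node_id: node_type}
    edges: [(source_id, field_name, position, target_id), ...]
    """
    nodes, edges = {}, []

    def visit(node):
        node_id = len(nodes)
        nodes[node_id] = type(node).__name__.lower()
        for field, value in ast.iter_fields(node):
            children = value if isinstance(value, list) else [value]
            for i, child in enumerate(children):
                if isinstance(child, ast.AST):
                    edges.append((node_id, field, i, visit(child)))
        return node_id

    visit(ast.parse(source))
    return nodes, edges


nodes, edges = build_graph("e = {}\n")
for src, field, i, dst in edges:
    print(nodes[src], "--%s[%d]-->" % (field, i), nodes[dst])
```

Once the tree is in this form, queries such as "all assign nodes whose value is a dict" become simple filters over the edge list.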