Learning from others' mistakes: Data-driven code analysis

Data-driven code analysis:
Learning from others' mistakes
Andreas Dewes (@japh44)
andreas@quantifiedcode.com
13.04.2015
PyCon 2015 – Montreal

About
Physicist and Python enthusiast
CTO of a spin-off of the
University of Munich (LMU):
We develop software for data-driven code
analysis.

Tools & Techniques for Ensuring Code Quality
Manual, dynamic: debugging, profiling, ...
Manual, static: manual code reviews
Automated, static: static analysis / automated code reviews
Automated, dynamic: unit testing, system testing, integration testing

Discovering problems in code
def encode(obj):
    """
    Encode a (possibly nested)
    dictionary containing complex values
    into a form that can be serialized
    using JSON.
    """
    e = {}
    for key,value in obj:
        if isinstance(value,dict):
            e[key] = encode(value)
        elif isinstance(value,complex):
            e[key] = {'type' : 'complex',
                      'r' : value.real,
                      'i' : value.imaginary}
    return e

d = {'a' : 1j+4,'s' : {'d' : 4+5j}}
print encode(d)

Problem 1: iterating over obj yields only the keys of the dictionary.
(obj.items() is needed)
Problem 2: value.imaginary does not exist.
(value.imag would be correct)
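
For reference, a sketch of encode with the two fixes named above applied. It is written for Python 3 (the slides use Python 2 print statements), and the json.dumps call at the end is added here for illustration only.

```python
import json


def encode(obj):
    """Encode a (possibly nested) dictionary containing complex values
    into a form that can be serialized using JSON."""
    e = {}
    for key, value in obj.items():  # fix 1: iterate over (key, value) pairs
        if isinstance(value, dict):
            e[key] = encode(value)
        elif isinstance(value, complex):
            e[key] = {'type': 'complex',
                      'r': value.real,
                      'i': value.imag}  # fix 2: the attribute is called .imag
        # As in the slide's version, values of any other type are dropped.
    return e


d = {'a': 1j + 4, 's': {'d': 4 + 5j}}
print(json.dumps(encode(d), sort_keys=True))
```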

Dynamic Analysis (e.g. unit testing)
def test_encode():
    d = {'a' : 1j+4,
         's' : {'d' : 4+5j}}
    r = encode(d) #this will fail...
    assert r['a'] == {'type' : 'complex',
                      'r' : 4,
                      'i' : 1}
    assert r['s']['d'] == {'type' : 'complex',
                           'r' : 4,
                           'i' : 5}
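
A sketch of what running such a test reveals, assuming Python 3. The function names encode_original, encode_items_fixed and first_error are made up for this illustration. The point it tries to show: a test run surfaces only the first problem on the executed path, so the second bug stays hidden until the first one is fixed.

```python
def encode_original(obj):
    """The slide's encode(), unchanged apart from its name."""
    e = {}
    for key, value in obj:
        if isinstance(value, dict):
            e[key] = encode_original(value)
        elif isinstance(value, complex):
            e[key] = {'type': 'complex', 'r': value.real, 'i': value.imaginary}
    return e


def encode_items_fixed(obj):
    """The same code with only the first bug fixed (obj.items())."""
    e = {}
    for key, value in obj.items():
        if isinstance(value, dict):
            e[key] = encode_items_fixed(value)
        elif isinstance(value, complex):
            e[key] = {'type': 'complex', 'r': value.real, 'i': value.imaginary}
    return e


def first_error(func, arg):
    """Call func(arg) and return the name of the exception raised, or None."""
    try:
        func(arg)
    except Exception as exc:
        return type(exc).__name__
    return None


d = {'a': 1j + 4, 's': {'d': 4 + 5j}}
errors = [first_error(encode_original, d), first_error(encode_items_fixed, d)]
print(errors)
```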

Static Analysis (for humans)
- encode is a function with one parameter that always returns a dict.
- (I): obj should be an iterable of two-element tuples.
- encode gets called with a dict, which does not satisfy (I).
- A value of type complex does not have an .imaginary attribute!
- encode is called with a dict again, which also does not satisfy (I).

How static analysis tools work (short version)
1. Parse the code into a data structure, typically an abstract syntax tree (AST).
2. (Optionally) annotate it with additional information to make analysis easier.
3. Traverse the (annotated) AST to find problems.
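
The steps above can be sketched with Python's standard ast module (Python 3 assumed). The helper name find_attribute_uses is made up for this example, and step 2 (annotation) is skipped, so the check is type-blind: it reports every access to the given attribute name, whether or not the object is a complex number.

```python
import ast

SOURCE = '''
def encode(obj):
    e = {}
    for key, value in obj:
        e[key] = {'r': value.real, 'i': value.imaginary}
    return e
'''


def find_attribute_uses(source, attr_name):
    """Return (line, column) positions where `.attr_name` is accessed."""
    tree = ast.parse(source)          # step 1: source text -> AST
    hits = []
    for node in ast.walk(tree):       # step 3: traverse the tree
        if isinstance(node, ast.Attribute) and node.attr == attr_name:
            hits.append((node.lineno, node.col_offset))
    return hits


print(find_attribute_uses(SOURCE, 'imaginary'))
```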

Python Tools for Static Analysis
PyLint (most comprehensive tool)
http://www.pylint.org/
PyFlakes (smaller, less verbose)
https://pypi.python.org/pypi/pyflakes
Pep8 (style and some structural checks)
https://pypi.python.org/pypi/pep8
(... and many others)

Limitations of current tools & technologies

Checks are hard to create / modify...
(example: PyLint code for analyzing 'try/except' statements)

Rethinking code analysis for Python

Our approach
1. Code is data! Let's not keep it in text
files but store it in a useful form that we
can work with easily (e.g. a graph).
2. Make it super-easy to specify errors
and bad code patterns.
3. Make it possible to learn from user
feedback and publicly available code.

Building the Code Graph
[Figure, built up over several slides: the encode() example drawn as a graph. Node types shown: functiondef, for, assign, name, dict. Edge labels shown: body, targets, iterator, value. Properties shown: {name: 'encode', args : [...]}, {id : 'e'}, {i : 0}, {i : 1}. The last slide adds the annotation $type: dict and hash-like identifiers (e4fa76b..., a76fbc41..., c51fa291..., 74af219...).]
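
A rough sketch of how such a graph could be derived from Python's own AST (Python 3, standard ast module). This is not the tool's implementation: build_graph, the lower-cased node_type property and the 'field[i]' edge labels are choices made for this illustration.

```python
import ast


def build_graph(source):
    """Turn the AST of `source` into (nodes, edges).

    nodes: list of property dicts, each with a lower-cased 'node_type'
    edges: list of (parent_index, label, child_index) tuples; the label is
           the AST field name, with the list position appended for lists
    """
    nodes, edges = [], []

    def add(node):
        index = len(nodes)
        props = {'node_type': type(node).__name__.lower()}
        nodes.append(props)
        for field, value in ast.iter_fields(node):
            if isinstance(value, ast.AST):
                edges.append((index, field, add(value)))
            elif isinstance(value, list):
                for i, item in enumerate(value):
                    if isinstance(item, ast.AST):
                        edges.append((index, '%s[%d]' % (field, i), add(item)))
            elif isinstance(value, (str, int, float, complex)):
                props[field] = value   # simple values become node properties
        return index

    add(ast.parse(source))
    return nodes, edges


nodes, edges = build_graph("e = {}")
for parent, label, child in sorted(edges):
    print(nodes[parent]['node_type'], '--%s-->' % label, nodes[child])
```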

Example: Tornado Project
[Figure: graph of 10 modules from the Tornado project. Legend: modules, classes, functions.]

Advantages
- Simple detection of (exact) duplicates
- Semantic diffing of modules, classes, functions, ...
- Semantic code search on the whole tree
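
As an illustration of the first point, exact duplicates can be found by hashing a canonical form of each subtree. The sketch below (Python 3) uses ast.dump and SHA-1 as stand-ins; the talk does not say which hashing scheme the tool uses, and function_hashes is a name invented here.

```python
import ast
import hashlib
from collections import defaultdict


def function_hashes(source):
    """Group function names by a hash of their arguments and body."""
    groups = defaultdict(list)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            # The function's own name is left out of the hash. ast.dump()
            # omits line numbers by default, so layout and comments do not
            # influence the result either.
            text = ast.dump(node.args) + ''.join(ast.dump(s) for s in node.body)
            digest = hashlib.sha1(text.encode('utf-8')).hexdigest()
            groups[digest].append(node.name)
    return groups


SOURCE = '''
def area(w, h):
    return w * h

def surface(w, h):
    # same body, different name and spacing
    return w*h

def perimeter(w, h):
    return 2 * (w + h)
'''

duplicates = [names for names in function_hashes(SOURCE).values()
              if len(names) > 1]
print(duplicates)
```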

Describing Code Errors / Anti-Patterns

Code issues = patterns on the graph
[Figure: the encode() example next to the subgraph for value.imaginary. Labels shown: an attribute node with the edges attr (leading to {id : imaginary}) and value (leading to a name node {id : value}), which is annotated with $type complex.]

Using YAML to describe graph patterns
node_type: attribute
value:
  $type: complex
attr: imaginary

Generalizing patterns
node_type: attribute
value:
  $type: complex
attr:
  $not:
    $or: [real, imag]
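
To make the semantics of such a pattern concrete, here is a minimal matcher sketch in Python 3. It assumes that graph nodes are plain dicts and that $type is stored as an ordinary key on the node. The function matches and the pattern below (which uses imag, the attribute name that actually exists on complex) are written for this illustration and are not the tool's API.

```python
def matches(node, pattern):
    """Return True if `node` (a dict or a plain value) satisfies `pattern`."""
    if not isinstance(pattern, dict):
        return node == pattern                 # literal comparison
    for key, expected in pattern.items():
        if key == '$not':
            if matches(node, expected):
                return False
        elif key == '$or':
            if not any(matches(node, option) for option in expected):
                return False
        elif key == '$anything':
            continue                           # matches whatever is there
        else:                                  # ordinary key, including '$type'
            if not isinstance(node, dict) or key not in node:
                return False
            if not matches(node[key], expected):
                return False
    return True


PATTERN = {
    'node_type': 'attribute',
    'value': {'$type': 'complex'},
    'attr': {'$not': {'$or': ['real', 'imag']}},
}

bad = {'node_type': 'attribute', 'attr': 'imaginary',
       'value': {'node_type': 'name', 'id': 'value', '$type': 'complex'}}
good = dict(bad, attr='imag')

print(matches(bad, PATTERN), matches(good, PATTERN))
```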

Learning from feedback / false positives

"else" in for loop without break statement
node_type: for
body:
  $not:
    $anywhere:
      node_type: break
orelse:
  $anything: {}

values = ["foo", "bar", ... ]
for i,value in enumerate(values):
    if value == 'baz':
        print "Found it!"
else:
    print "didn't find 'baz'!"
Learning from false positives (I)
    if value == 'baz':
        print "Found it!"
        return value
else:

node_type: for
body:
  $not:
    $or:
      - $anywhere:
          node_type: break
      - $anywhere:
          node_type: return
orelse:
  $anything: {}

Learning from false positives (II)
node_type: for
body:
  $not:
    $or:
      - $anywhere:
          node_type: break
          exclude:
            node_type:
              $or: [while,for]
      - $anywhere:
          node_type: return
orelse:
  $anything: {}

    if value == 'baz':
        print "Found it!"
        for j in ...:
            #...
            break
else:

patterns vs. code
(no exception type specified)
handlers:
  node_type: excepthandler
  type: null
node_type: tryexcept

(empty exception handler)
handlers:
  - body:
      - node_type: pass
    node_type: excepthandler
node_type: tryexcept
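
For comparison, the same two checks written as code against Python's standard ast module (Python 3, where the relevant node is ast.ExceptHandler). This is only a sketch for contrast with the patterns above, not the Pylint checker mentioned earlier; check_handlers is a made-up name.

```python
import ast


def check_handlers(source):
    """Return sorted (line, message) pairs for bare and empty except handlers."""
    issues = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ExceptHandler):
            if node.type is None:
                issues.append((node.lineno, 'no exception type specified'))
            if all(isinstance(stmt, ast.Pass) for stmt in node.body):
                issues.append((node.lineno, 'empty exception handler'))
    return sorted(issues)


SOURCE = '''
try:
    risky()
except:
    pass

try:
    risky()
except ValueError:
    pass

try:
    risky()
except KeyError as exc:
    log(exc)
'''

for line, message in check_handlers(SOURCE):
    print(line, message)
```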

Summary & Feedback
1. Storing code as a graph opens up many
interesting possibilities. Let's stop thinking of
code as text!
2. We can learn from user feedback or even
use machine learning to create and adapt
code patterns!
3. Everyone can write code checkers!
=> crowd-source code quality!

Thanks!
www.quantifiedcode.com
https://github.com/quantifiedcode
@quantifiedcode
Andreas Dewes (@japh44)
andreas@quantifiedcode.com
Visit us at booth 629!
