Data-driven code analysis:
Learning from others' mistakes
Andreas Dewes (@japh44)
andreas@quantifiedcode.com
13.04.2015
PyCon 2015 – Montreal
About
Physicist and Python enthusiast
CTO of a spin-off of the
University of Munich (LMU):
We develop software for data-driven code
analysis.
Our mission
Tools & Techniques for Ensuring Code Quality
            static                      dynamic
automated   Static analysis /           Unit testing,
            automated code reviews      System testing,
                                        Integration testing
manual      Manual code reviews         Debugging, Profiling, ...
Discovering problems in code
def encode(obj):
    """
    Encode a (possibly nested)
    dictionary containing complex values
    into a form that can be serialized
    using JSON.
    """
    e = {}
    for key,value in obj:
        if isinstance(value,dict):
            e[key] = encode(value)
        elif isinstance(value,complex):
            e[key] = {'type' : 'complex',
                      'r' : value.real,
                      'i' : value.imaginary}
    return e

d = {'a' : 1j+4,'s' : {'d' : 4+5j}}
print encode(d)
Iterating over obj yields only the
keys of the dictionary.
(obj.items() is needed)
value.imaginary does not exist.
(value.imag would be correct)
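Applying both fixes gives a version that runs. This is a minimal sketch written for Python 3 (the slides use Python 2 print statements); apart from the two corrections it keeps the behavior of the slide's function, including the fact that values which are neither dicts nor complex numbers are dropped.

```python
def encode(obj):
    """
    Encode a (possibly nested) dictionary containing complex values
    into a form that can be serialized using JSON.
    """
    e = {}
    # Fix 1: iterate over (key, value) pairs, not over the keys alone.
    for key, value in obj.items():
        if isinstance(value, dict):
            e[key] = encode(value)
        elif isinstance(value, complex):
            e[key] = {'type': 'complex',
                      'r': value.real,
                      # Fix 2: the attribute is called .imag.
                      'i': value.imag}
    return e


d = {'a': 1j + 4, 's': {'d': 4 + 5j}}
print(encode(d))
```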
Dynamic Analysis (e.g. unit testing)
def test_encode():
    d = {'a' : 1j+4,
         's' : {'d' : 4+5j}}
    r = encode(d) #this will fail...
    assert r['a'] == {'type' : 'complex',
                      'r' : 4,
                      'i' : 1}
    assert r['s']['d'] == {'type' : 'complex',
                           'r' : 4,
                           'i' : 5}
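What "this will fail" means in practice can be sketched as follows. A test run stops at the first exception, so dynamic analysis reveals the two bugs one at a time. The code is Python 3; the names encode_from_slide and outcome and the fix_iteration switch are hypothetical and exist only to show both runs side by side.

```python
def encode_from_slide(obj, fix_iteration=False):
    """The slide's function; fix_iteration repairs only the first bug."""
    e = {}
    items = obj.items() if fix_iteration else obj
    for key, value in items:
        if isinstance(value, dict):
            e[key] = encode_from_slide(value, fix_iteration)
        elif isinstance(value, complex):
            e[key] = {'type': 'complex',
                      'r': value.real,
                      'i': value.imaginary}  # second bug, left in place
    return e


def outcome(fix_iteration):
    """Run the function on the test data and name the exception raised."""
    d = {'a': 1j + 4, 's': {'d': 4 + 5j}}
    try:
        encode_from_slide(d, fix_iteration)
    except Exception as exc:
        return type(exc).__name__
    return 'ok'


print(outcome(False))  # unpacking the key 'a' into (key, value) fails
print(outcome(True))   # the loop now works, so .imaginary is reached
```

The first run hides the second bug completely; it only surfaces once the iteration has been repaired and the test is run again.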
Static Analysis (for humans)
- encode is a function with 1 parameter which always returns a dict.
- (I): obj should be an iterable (e.g. a list) of tuples with two elements.
- The recursive call encode(value) passes a dict, which does not satisfy (I).
- A value of type complex does not have an .imaginary attribute!
- encode(d) is called with a dict, which again does not satisfy (I).
How static analysis tools work (short version)
1. Compile the code into a data structure, typically an abstract syntax tree (AST).
2. (Optionally) annotate it with additional information to make analysis easier.
3. Traverse the (AST) data to find problems.
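The three steps above can be sketched with Python's standard ast module. This is an illustration, not how any particular tool is implemented; the helper find_attribute_uses and the parent annotation are names chosen here.

```python
import ast

SOURCE = """
def encode(obj):
    e = {}
    for key, value in obj:
        if isinstance(value, complex):
            e[key] = {'r': value.real, 'i': value.imaginary}
    return e
"""

# Step 1: compile the source text into an abstract syntax tree.
tree = ast.parse(SOURCE)

# Step 2: annotate every node with a link to its parent node.
for parent in ast.walk(tree):
    for child in ast.iter_child_nodes(parent):
        child.parent = parent


# Step 3: traverse the tree and collect suspicious attribute accesses.
def find_attribute_uses(tree, attr_name):
    """Return (line, parent node type) for each access to attr_name."""
    hits = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Attribute) and node.attr == attr_name:
            hits.append((node.lineno, type(node.parent).__name__))
    return hits


print(find_attribute_uses(tree, 'imaginary'))
```

Note that this check knows nothing about types: it reports every .imaginary access, whether or not the object is a complex number.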
Python Tools for Static Analysis
PyLint (most comprehensive tool)
http://www.pylint.org/
PyFlakes (smaller, less verbose)
https://pypi.python.org/pypi/pyflakes
Pep8 (style and some structural checks)
https://pypi.python.org/pypi/pep8
(... and many others)
Limitations of current tools & technologies
Checks are hard to create / modify...
(example: PyLint code for analyzing 'try/except' statements)
Long feedback cycles
Rethinking code analysis for Python
Our approach
1. Code is data! Let's not keep it in text
files but store it in a useful form that we
can work with easily (e.g. a graph).
2. Make it super-easy to specify errors
and bad code patterns.
3. Make it possible to learn from user
feedback and publicly available code.
Building the Code Graph
(Figure: the encode function shown earlier, drawn as a syntax-tree graph. Node types: functiondef, for, assign, dict, name. Edge labels: body, targets, iterator, value.)
(Figure: the same graph with attributes added: {name: 'encode', args : [...]}, {id : 'e'}, and index labels {i : 0} and {i : 1}.)
(Figure: the same graph with nodes labeled by hashes (e4fa76b..., a76fbc41..., c51fa291..., 74af219...) and the dict node annotated with $type: dict.)
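A rough sketch of how such a graph could be built from Python's own ast module is shown below. The representation (a list of attribute dicts plus labeled edges, held in memory) and the function name build_graph are assumptions made for illustration; the talk does not specify the storage format.

```python
import ast


def build_graph(source):
    """Return (nodes, edges) for the syntax tree of `source`.

    nodes: list of dicts such as {'node_type': 'name', 'id': 'e'}
    edges: list of (parent_index, (field, position), child_index)
    """
    nodes, edges = [], []

    def visit(node):
        index = len(nodes)
        attributes = {'node_type': type(node).__name__.lower()}
        nodes.append(attributes)
        for field, value in ast.iter_fields(node):
            is_list = isinstance(value, list)
            for position, child in enumerate(value if is_list else [value]):
                if isinstance(child, ast.expr_context):
                    continue  # skip Load/Store markers to keep the graph small
                if isinstance(child, ast.AST):
                    label = (field, position if is_list else None)
                    edges.append((index, label, visit(child)))
                elif child is not None and not is_list:
                    attributes[field] = child  # plain value, e.g. a name
        return index

    visit(ast.parse(source))
    return nodes, edges


nodes, edges = build_graph("e = {}")
for parent, label, child in edges:
    print(nodes[parent]['node_type'], label, nodes[child]['node_type'])
```

For the statement e = {} this yields an assign node with a targets edge to a name node carrying {id: 'e'} and a value edge to a dict node, which corresponds to the fragment drawn on the slides.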
Example: Tornado Project
(Figure: code graph of 10 modules from the tornado project. Legend: modules, classes, functions.)
Advantages
- Simple detection of (exact) duplicates
- Semantic diffing of modules, classes, functions, ...
- Semantic code search on the whole tree
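The first advantage can be illustrated with a small sketch: if every subtree gets a hash computed from its node type, its plain fields and the hashes of its children, exact duplicates end up with identical hashes. The hashing scheme and the function names below are assumptions for illustration, not the scheme used in the talk.

```python
import ast
import hashlib
from collections import defaultdict


def node_hash(node):
    """Hash a subtree from its node type, plain fields and child hashes."""
    parts = [type(node).__name__]
    for field, value in ast.iter_fields(node):
        values = value if isinstance(value, list) else [value]
        for item in values:
            if isinstance(item, ast.AST):
                parts.append(field + ':' + node_hash(item))
            else:
                parts.append(field + '=' + repr(item))
    return hashlib.sha1('|'.join(parts).encode('utf-8')).hexdigest()


def find_duplicate_functions(source):
    """Group function names whose arguments and bodies are identical."""
    groups = defaultdict(list)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            # Hash arguments and body only, so the function name is ignored.
            key = (node_hash(node.args),) + tuple(node_hash(s) for s in node.body)
            groups[key].append(node.name)
    return [names for names in groups.values() if len(names) > 1]


SOURCE = """
def a(x):
    return x + 1

def b(x):
    return x + 1

def c(x):
    return x + 2
"""

print(find_duplicate_functions(SOURCE))
```

Because line and column numbers are not part of the hashed fields, two functions count as duplicates no matter where they appear in a file.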
Describing Code Errors / Anti-Patterns
Code issues = patterns on the graph
(Figure: the expression value.imaginary from the encode function as a graph pattern. An attribute node has an attr edge to a name node {id : imaginary} and a value edge to a name node {id : value}, whose $type is complex.)
Using YAML to describe graph patterns
node_type: attribute
value:
  $type: complex
attr: imaginary
Generalizing patterns
node_type: attribute
value:
  $type: complex
attr:
  $not:
    $or: [real, imag]
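How such a pattern might be evaluated can be sketched in a few lines. The matcher below is a hypothetical illustration, not the tool's implementation: patterns and nodes are written as plain Python dicts instead of YAML to avoid a third-party parser, and the $type annotation is assumed to have been added to the node by an earlier type-inference step.

```python
def matches(pattern, node):
    """Return True if `node` satisfies `pattern`."""
    if isinstance(pattern, dict):
        for key, sub in pattern.items():
            if key == '$not':
                if matches(sub, node):
                    return False
            elif key == '$or':
                if not any(matches(alt, node) for alt in sub):
                    return False
            elif not (isinstance(node, dict) and key in node
                      and matches(sub, node[key])):
                return False
        return True
    # Anything that is not a dict is compared literally.
    return pattern == node


PATTERN = {
    'node_type': 'attribute',
    'value': {'$type': 'complex'},
    'attr': {'$not': {'$or': ['real', 'imag']}},
}


def attribute_node(attr, value_type):
    """Build a dict-encoded attribute node for the demonstration."""
    return {'node_type': 'attribute',
            'value': {'node_type': 'name', 'id': 'value', '$type': value_type},
            'attr': attr}


print(matches(PATTERN, attribute_node('imaginary', 'complex')))
print(matches(PATTERN, attribute_node('imag', 'complex')))
```

The generalized pattern flags any attribute other than real and imag on a complex value, so it also catches misspellings that nobody has listed explicitly.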
Learning from feedback / false positives
"else" in for loop without break statement
node_type: for
body:
  $not:
    $anywhere:
      node_type: break
orelse:
  $anything: {}

values = ["foo", "bar", ... ]
for i,value in enumerate(values):
    if value == 'baz':
        print "Found it!"
else:
    print "didn't find 'baz'!"
Learning from false positives (I)
values = ["foo", "bar", ... ]
for i,value in enumerate(values):
    if value == 'baz':
        print "Found it!"
        return value
else:
    print "didn't find 'baz'!"

node_type: for
body:
  $not:
    $or:
      - $anywhere:
          node_type: break
      - $anywhere:
          node_type: return
orelse:
  $anything: {}
Learning from false positives (II)
node_type: for
body:
  $not:
    $or:
      - $anywhere:
          node_type: break
          exclude:
            node_type:
              $or: [while,for]
      - $anywhere:
          node_type: return
orelse:
  $anything: {}

values = ["foo", "bar", ... ]
for i,value in enumerate(values):
    if value == 'baz':
        print "Found it!"
        for j in ...:
            #...
            break
else:
    print "didn't find 'baz'!"
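For comparison, the refined rule can be written directly against Python's ast module. This is a sketch under assumptions: the function names are invented here, nested function and class definitions are skipped as well (a choice not stated on the slides), and, like the pattern, it ignores everything inside nested loops when looking for a break.

```python
import ast

LOOPS = (ast.For, ast.While)
SCOPES = (ast.FunctionDef, ast.AsyncFunctionDef, ast.Lambda, ast.ClassDef)


def contains(statements, wanted, stop_at):
    """True if a `wanted` node occurs in `statements`, without looking
    inside nodes whose type is listed in `stop_at`."""
    stack = list(statements)
    while stack:
        node = stack.pop()
        if isinstance(node, wanted):
            return True
        if not isinstance(node, stop_at):
            stack.extend(ast.iter_child_nodes(node))
    return False


def useless_for_else(source):
    """Line numbers of for loops whose else branch nothing can skip."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.For) and node.orelse:
            # A break inside a nested loop does not leave this loop.
            has_break = contains(node.body, ast.Break, LOOPS + SCOPES)
            has_return = contains(node.body, ast.Return, SCOPES)
            if not (has_break or has_return):
                hits.append(node.lineno)
    return hits


NESTED_BREAK = """
for value in values:
    if value == 'baz':
        for j in range(3):
            break
else:
    print('not found')
"""

WITH_RETURN = """
def find(values):
    for value in values:
        if value == 'baz':
            return value
    else:
        print('not found')
"""

print(useless_for_else(NESTED_BREAK))
print(useless_for_else(WITH_RETURN))
```

The first snippet is flagged because its only break belongs to the inner loop; the second is accepted because of the return. Each refinement that took a few lines in the YAML pattern requires changes to the traversal logic here.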
patterns vs. code
(no exception type specified)
handlers:
  node_type: excepthandler
  type: null
node_type: tryexcept

(empty exception handler)
handlers:
  - body:
      - node_type: pass
    node_type: excepthandler
node_type: tryexcept
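As a point of comparison, the two patterns above could be written as ordinary code against Python's ast module roughly as follows. This is an illustrative sketch, not the PyLint checker mentioned earlier; the function name and messages are chosen here. It is Python 3, where the try statement is a single Try node instead of the Python 2 tryexcept node used in the patterns.

```python
import ast


def check_handlers(source):
    """Return (line, message) pairs for the two handler patterns."""
    issues = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ExceptHandler):
            continue
        if node.type is None:
            issues.append((node.lineno, 'no exception type specified'))
        if len(node.body) == 1 and isinstance(node.body[0], ast.Pass):
            issues.append((node.lineno, 'empty exception handler'))
    return issues


SOURCE = """
try:
    risky()
except:
    pass
"""

for line, message in check_handlers(SOURCE):
    print(line, message)
```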
Summary & Feedback
1. Storing code as a graph opens up many
interesting possibilities. Let's stop thinking of
code as text!
2. We can learn from user feedback or even
use machine learning to create and adapt
code patterns!
3. Everyone can write code checkers!
=> crowd-source code quality!
Thanks!
www.quantifiedcode.com
https://github.com/quantifiedcode
@quantifiedcode
Andreas Dewes (@japh44)
andreas@quantifiedcode.com
Visit us at booth 629!


Editor's Notes

  • #16, #17, #18 First, we transform the code into a so-called "abstract syntax tree". This is a representation that can be easily manipulated programmatically.
  • #19 We store all syntax trees of the project in a graph database (either on-disk or in-memory) to be able to perform queries on the graph and store it for later analysis. Nodes in modules can be linked, e.g. to point from a function call in a given module to the definition of that function in another module.