
Learning from others' mistakes: Data-driven code analysis

Static code analysis is a useful tool that can help detect bugs early in the software development life cycle. I will explain the basics of static analysis and show the challenges we face when analyzing Python code. I will then introduce a data-driven approach to code analysis that makes use of public code and example-based learning, and show how it can be applied to Python code.

Published in: Technology

  1. Data-driven code analysis: Learning from others' mistakes. Andreas Dewes (@japh44), andreas@quantifiedcode.com. 13.04.2015, PyCon 2015, Montreal
  2. About: Physicist and Python enthusiast. CTO of a spin-off of the University of Munich (LMU). We develop software for data-driven code analysis.
  3. Our mission
  4. Tools & Techniques for Ensuring Code Quality. [Figure: chart with the axes static / dynamic and automated / manual, placing these techniques: debugging, profiling, manual code reviews, static analysis / automated code reviews, unit testing, system testing, integration testing, ...]
  5. Discovering problems in code

         def encode(obj):
             """
             Encode a (possibly nested) dictionary containing complex values
             into a form that can be serialized using JSON.
             """
             e = {}
             for key,value in obj:
                 if isinstance(value,dict):
                     e[key] = encode(value)
                 elif isinstance(value,complex):
                     e[key] = {'type' : 'complex',
                               'r' : value.real,
                               'i' : value.imaginary}
             return e

         d = {'a' : 1j+4,'s' : {'d' : 4+5j}}
         print encode(d)

     Problems in this code:
     - Iterating over obj yields only the keys of the dictionary (obj.items() is needed).
     - value.imaginary does not exist (value.imag would be correct).
  6. Dynamic Analysis (e.g. unit testing), applied to the encode example from slide 5:

         def test_encode():
             d = {'a' : 1j+4, 's' : {'d' : 4+5j}}
             r = encode(d) #this will fail...
             assert r['a'] == {'type' : 'complex', 'r' : 4, 'i' : 1}
             assert r['s']['d'] == {'type' : 'complex', 'r' : 4, 'i' : 5}
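For comparison, here is a minimal sketch of the same function with the two problems from slide 5 fixed, together with the test from slide 6. It is written for Python 3 (the slides use Python 2 syntax), and the variable name `encoded` is our own choice. Like the original, it silently drops values that are neither dicts nor complex numbers.

```python
def encode(obj):
    """Encode a (possibly nested) dict containing complex values
    into a form that can be serialized using JSON."""
    encoded = {}
    # Fix 1: iterate over (key, value) pairs, not over the keys alone.
    for key, value in obj.items():
        if isinstance(value, dict):
            encoded[key] = encode(value)
        elif isinstance(value, complex):
            # Fix 2: the imaginary part of a complex number is `.imag`.
            encoded[key] = {'type': 'complex',
                            'r': value.real,
                            'i': value.imag}
    return encoded


def test_encode():
    d = {'a': 1j + 4, 's': {'d': 4 + 5j}}
    r = encode(d)
    assert r['a'] == {'type': 'complex', 'r': 4, 'i': 1}
    assert r['s']['d'] == {'type': 'complex', 'r': 4, 'i': 5}
```

With both fixes in place, `test_encode()` runs without raising.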
  7. Static Analysis (for humans): what a reader can deduce from the encode example on slide 5 without running it
     - encode is a function with 1 parameter which always returns a dict.
     - (I): obj should be an iterator/list of tuples with two elements.
     - encode gets called with a dict, which does not satisfy (I).
     - A value of type complex does not have an .imaginary attribute!
     - encode is called with a dict, which again does not satisfy (I).
  8. How static analysis tools work (short version)
     1. Compile the code into a data structure, typically an abstract syntax tree (AST).
     2. (Optionally) annotate it with additional information to make analysis easier.
     3. Analyze the (AST) data to find problems.
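The three steps on slide 8 can be sketched with Python's standard `ast` module. This is an illustration only, not how PyLint or any other tool is implemented: the helper names `find_attribute_uses` and `enclosing_function` and the `parent` annotation are assumptions made for this example, and the check is purely name-based (no type inference).

```python
import ast

SOURCE = '''
def encode(obj):
    e = {}
    for key, value in obj:
        if isinstance(value, complex):
            e[key] = {'r': value.real, 'i': value.imaginary}
    return e
'''


def enclosing_function(node):
    """Follow the (hypothetical) `parent` links up to the nearest function."""
    while node is not None and not isinstance(node, ast.FunctionDef):
        node = getattr(node, 'parent', None)
    return node.name if node is not None else None


def find_attribute_uses(source, attr_name):
    # Step 1: compile the code into an abstract syntax tree.
    tree = ast.parse(source)
    # Step 2: annotate every node with a link to its parent.
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            child.parent = parent
    # Step 3: walk the tree and report every `<something>.<attr_name>`.
    return [(node.lineno, enclosing_function(node))
            for node in ast.walk(tree)
            if isinstance(node, ast.Attribute) and node.attr == attr_name]


print(find_attribute_uses(SOURCE, 'imaginary'))
```

A real tool would also need to know that `value` is a complex number at that point, which is where the extra annotations of step 2 become important.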
  9. Python Tools for Static Analysis
     - PyLint (most comprehensive tool): http://www.pylint.org/
     - PyFlakes (smaller, less verbose): https://pypi.python.org/pypi/pyflakes
     - Pep8 (style and some structural checks): https://pypi.python.org/pypi/pep8
     - (... and many others)
  10. Limitations of current tools & technologies
  11. Checks are hard to create / modify... (example: PyLint code for analyzing 'try/except' statements)
  12. Long feedback cycles
  13. Rethinking code analysis for Python
  14. Our approach
      1. Code is data! Let's not keep it in text files but store it in a useful form that we can work with easily (e.g. a graph).
      2. Make it super-easy to specify errors and bad code patterns.
      3. Make it possible to learn from user feedback and publicly available code.
  15. Building the Code Graph (shows the encode example from slide 5 as source text)
  16. Building the Code Graph (the encode example from slide 5 next to its graph). [Figure: graph with node labels functiondef, for, assign, name, name, dict and edge labels body, body, body, targets, iterator, value]
  17. Building the Code Graph (same graph as on slide 16, with additional labels). [Figure: property labels {name: 'encode', args : [...]}, {id : 'e'}, {i:0}, {i : 1}; hash labels e4fa76b..., a76fbc41..., c51fa291..., 74af219...; annotation $type: dict]
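One way to picture the hashed code graph from slides 15 to 17 is a store that maps a content hash to each syntax node. The sketch below is an assumption about how such a graph could be built, not the implementation used by the authors: the function name `add_to_graph`, the use of SHA-1, and the node layout (`node_type`, `props`, `edges`) are all chosen for illustration.

```python
import ast
import hashlib


def add_to_graph(node, graph):
    """Store `node` and all its descendants in `graph`; return the node's hash."""
    props = {}   # plain values such as names and constants
    edges = {}   # field name -> hash (or list of hashes) of child nodes
    for field, value in ast.iter_fields(node):
        if isinstance(value, ast.AST):
            edges[field] = add_to_graph(value, graph)
        elif isinstance(value, list):
            edges[field] = [add_to_graph(item, graph)
                            if isinstance(item, ast.AST) else repr(item)
                            for item in value]
        else:
            props[field] = repr(value)
    node_type = type(node).__name__.lower()
    # The hash covers the node type, its properties and its children's hashes,
    # but not line numbers, so it depends only on the structure of the code.
    payload = repr((node_type, sorted(props.items()), sorted(edges.items())))
    digest = hashlib.sha1(payload.encode("utf-8")).hexdigest()
    graph[digest] = {"node_type": node_type, "props": props, "edges": edges}
    return digest


graph = {}
root = add_to_graph(ast.parse("def f():\n    return 1\n\ndef f():\n    return 1\n"), graph)
first, second = graph[root]["edges"]["body"]
print(first == second)  # both function definitions map to the same node
```

Because identical subtrees end up under the same hash, finding exact duplicates reduces to comparing hashes, and formatting or comment changes do not affect the result.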
  18. Example: Tornado Project. [Figure: graph of 10 modules from the tornado project, showing modules, classes and functions]
  19. Advantages
      - Simple detection of (exact) duplicates
      - Semantic diffing of modules, classes, functions, ...
      - Semantic code search on the whole tree
  20. Describing Code Errors / Anti-Patterns
  21. Code issues = patterns on the graph (shown on the encode example from slide 5). [Figure: subgraph for value.imaginary with node labels attribute, name, name, complex; edge labels value, attr, $type; property labels {id : value}, {id : imaginary}]
  22. Using YAML to describe graph patterns (pattern for the value.imaginary problem in the encode example from slide 5)

          node_type: attribute
          value:
            $type: complex
          attr: imaginary
  23. Generalizing patterns (again for the encode example from slide 5)

          node_type: attribute
          value:
            $type: complex
          attr:
            $not:
              $or: [real, imagin]
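To make the idea of matching such patterns concrete, here is a small sketch of a matcher that works on Python's standard `ast` nodes and takes patterns as plain dicts (what a YAML loader would produce). It is an assumption for illustration, not the authors' pattern engine: it supports only `node_type`, field patterns, `$not` and `$or`, and because it has no type inference it replaces the `$type: complex` constraint with a check on the variable name.

```python
import ast


def matches(node, pattern):
    """Return True if `node` (an AST node or a plain value) satisfies `pattern`."""
    if not isinstance(pattern, dict):
        return node == pattern                     # literal comparison
    for key, expected in pattern.items():
        if key == '$not':
            if matches(node, expected):
                return False
        elif key == '$or':
            if not any(matches(node, alternative) for alternative in expected):
                return False
        elif key == 'node_type':
            if not isinstance(node, ast.AST):
                return False
            if type(node).__name__.lower() != expected:
                return False
        else:                                      # pattern on a node field
            if not isinstance(node, ast.AST) or not hasattr(node, key):
                return False
            if not matches(getattr(node, key), expected):
                return False
    return True


def find(source, pattern):
    """Return all nodes in `source` that match `pattern`."""
    return [node for node in ast.walk(ast.parse(source))
            if matches(node, pattern)]


# Stand-in for the pattern on slide 23: any attribute of the variable
# `value` other than the two that complex numbers actually provide.
PATTERN = {
    'node_type': 'attribute',
    'value': {'node_type': 'name', 'id': 'value'},
    'attr': {'$not': {'$or': ['real', 'imag']}},
}

hits = find("value.real\nvalue.imag\nvalue.imaginary\nother.foo\n", PATTERN)
print([node.attr for node in hits])
```

Since the pattern is plain data, changing a check means editing the pattern and not the matcher.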
  24. Learning from feedback / false positives
  25. "else" in for loop without break statement

          node_type: for
          body:
            $not:
              $anywhere:
                node_type: break
          orelse:
            $anything: {}

      Code matched by the pattern:

          values = ["foo", "bar", ... ]
          for i,value in enumerate(values):
              if value == 'baz':
                  print "Found it!"
          else:
              print "didn't find 'baz'!"
  26. Learning from false positives (I)

      Code (leaves the loop with return):

          values = ["foo", "bar", ... ]
          for i,value in enumerate(values):
              if value == 'baz':
                  print "Found it!"
                  return value
          else:
              print "didn't find 'baz'!"

      Pattern:

          node_type: for
          body:
            $not:
              $or:
                - $anywhere:
                    node_type: break
                - $anywhere:
                    node_type: return
          orelse:
            $anything: {}
  27. Learning from false positives (II)

      Pattern:

          node_type: for
          body:
            $not:
              $or:
                - $anywhere:
                    node_type: break
                  exclude:
                    node_type:
                      $or: [while,for]
                - $anywhere:
                    node_type: return
          orelse:
            $anything: {}

      Code (the break belongs to a nested loop):

          values = ["foo", "bar", ... ]
          for i,value in enumerate(values):
              if value == 'baz':
                  print "Found it!"
                  for j in ...:
                      #...
                      break
          else:
              print "didn't find 'baz'!"
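The check that slides 25 to 27 arrive at can also be written directly against Python's `ast` module. The sketch below is our own approximation of the final pattern, not the authors' code: the names `has_exit` and `suspicious_for_else` are hypothetical, and we additionally assume that a `return` inside a nested function or class should not count as leaving the loop.

```python
import ast

LOOPS = (ast.For, ast.While, ast.AsyncFor)
SCOPES = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef, ast.Lambda)


def has_exit(node, in_nested_loop=False):
    """True if `node` contains a `return`, or a `break` that leaves the outer loop."""
    if isinstance(node, ast.Return):
        return True
    if isinstance(node, ast.Break):
        # A break inside a nested loop only leaves that nested loop.
        return not in_nested_loop
    if isinstance(node, SCOPES):
        return False
    if isinstance(node, LOOPS):
        # The body of a nested loop is excluded for `break`; its else
        # branch still belongs to the surrounding loop.
        return (any(has_exit(child, True) for child in node.body) or
                any(has_exit(child, in_nested_loop) for child in node.orelse))
    return any(has_exit(child, in_nested_loop)
               for child in ast.iter_child_nodes(node))


def suspicious_for_else(source):
    """Return line numbers of for loops whose else branch always runs."""
    return [node.lineno
            for node in ast.walk(ast.parse(source))
            if isinstance(node, ast.For) and node.orelse
            and not any(has_exit(stmt) for stmt in node.body)]


EXAMPLE = '''
values = ["foo", "bar"]
for i, value in enumerate(values):
    if value == 'baz':
        print("Found it!")
else:
    print("didn't find 'baz'!")
'''
print(suspicious_for_else(EXAMPLE))
```

Each false positive from the slides corresponds to one branch in `has_exit`, which is the hand-written equivalent of adding a clause to the pattern.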
  28. patterns vs. code

      (no exception type specified)

          handlers:
            node_type: excepthandler
            type: null
          node_type: tryexcept

      (empty exception handler)

          handlers:
            - body:
                - node_type: pass
              node_type: excepthandler
          node_type: tryexcept
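For comparison with the two patterns on slide 28, the same checks can be hand-coded against Python's `ast` module as sketched below. This is an illustration under our own assumptions, not PyLint's implementation: the function name `check_handlers` and the message strings are made up for this example. In Python 3 the relevant node types are `ast.Try` and `ast.ExceptHandler` (the `tryexcept` node in the slides comes from Python 2).

```python
import ast


def check_handlers(source):
    """Report except clauses without an exception type or with an empty body."""
    issues = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ExceptHandler):
            continue
        if node.type is None:
            issues.append((node.lineno, 'no exception type specified'))
        # "Empty" here means the handler body consists only of `pass`.
        if all(isinstance(stmt, ast.Pass) for stmt in node.body):
            issues.append((node.lineno, 'empty exception handler'))
    return issues


print(check_handlers("try:\n    f()\nexcept:\n    pass\n"))
```

The handler at line 3 is reported twice, once per check.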
  29. Summary & Feedback
      1. Storing code as a graph opens up many interesting possibilities. Let's stop thinking of code as text!
      2. We can learn from user feedback or even use machine learning to create and adapt code patterns!
      3. Everyone can write code checkers! => crowd-source code quality!
  30. Thanks! www.quantifiedcode.com, https://github.com/quantifiedcode, @quantifiedcode. Andreas Dewes (@japh44), andreas@quantifiedcode.com. Visit us at booth 629!
