Data-driven code analysis:
Learning from other's mistakes
Andreas Dewes (@japh44)
andreas@quantifiedcode.com
13.04.2015
PyCon 2015 – Montreal
About
Physicist and Python enthusiast
CTO of a spin-off of the
University of Munich (LMU):
We develop software for data-driven code
analysis.
Our mission
Tools & Techniques for Ensuring Code Quality
             static                     dynamic
manual       Manual code reviews        Debugging, Profiling, ...
automated    Static analysis /          Unit testing, System testing,
             automated code reviews     Integration testing
Discovering problems in code
def encode(obj):
    """
    Encode a (possibly nested)
    dictionary containing complex values
    into a form that can be serialized
    using JSON.
    """
    e = {}
    for key,value in obj:
        if isinstance(value,dict):
            e[key] = encode(value)
        elif isinstance(value,complex):
            e[key] = {'type' : 'complex',
                      'r' : value.real,
                      'i' : value.imaginary}
    return e

d = {'a' : 1j+4,'s' : {'d' : 4+5j}}
print encode(d)
Iterating over obj yields only the
keys of the dictionary
(obj.items() is needed).
value.imaginary does not exist
(value.imag would be correct).
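Applying the two fixes named above gives a version that runs. This is a minimal sketch in Python 3 syntax (the slide code uses the Python 2 print statement); everything else is left as on the slide, including the fact that values which are neither dicts nor complex numbers are dropped.

```python
def encode(obj):
    """Encode a (possibly nested) dict of complex values into a JSON-serializable form."""
    e = {}
    for key, value in obj.items():  # fix 1: iterate over (key, value) pairs
        if isinstance(value, dict):
            e[key] = encode(value)
        elif isinstance(value, complex):
            e[key] = {'type': 'complex',
                      'r': value.real,
                      'i': value.imag}  # fix 2: the attribute is called .imag
    return e

d = {'a': 1j + 4, 's': {'d': 4 + 5j}}
print(encode(d))
```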
Dynamic Analysis (e.g. unit testing)
def test_encode():
    d = {'a' : 1j+4,
         's' : {'d' : 4+5j}}
    r = encode(d) #this will fail...
    assert r['a'] == {'type' : 'complex',
                      'r' : 4,
                      'i' : 1}
    assert r['s']['d'] == {'type' : 'complex',
                           'r' : 4,
                           'i' : 5}
Static Analysis (for humans)
encode is a function with one parameter
that always returns a dict.
(I) obj should be an iterable (e.g. a list) of tuples
with two elements.
encode(value) is called with a
dict, which does not satisfy (I).
A value of type complex does not
have an .imaginary attribute!
encode(d) is called with a dict, which
again does not satisfy (I).
How static analysis tools work (short version)
1. Parse the code into a data
structure, typically an abstract syntax
tree (AST).
2. (Optionally) annotate it with
additional information to make
analysis easier.
3. Traverse the (annotated) AST to find problems.
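The steps above can be sketched with Python's standard ast module. This is only an illustration, not how PyLint or any other tool is implemented: the helper name find_imaginary and the sample source are hypothetical, step 2 (annotation) is skipped, and without type information the check flags every .imaginary access regardless of the object's type.

```python
import ast

SOURCE = """
def encode(value):
    return {'r': value.real, 'i': value.imaginary}
"""

def find_imaginary(source):
    """Return (line, column) positions of every `.imaginary` attribute access."""
    tree = ast.parse(source)        # step 1: source text -> AST
    hits = []
    for node in ast.walk(tree):     # step 3: traverse the tree looking for a problem
        if isinstance(node, ast.Attribute) and node.attr == 'imaginary':
            hits.append((node.lineno, node.col_offset))
    return hits

print(find_imaginary(SOURCE))
```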
Python Tools for Static Analysis
PyLint (most comprehensive tool)
http://www.pylint.org/
PyFlakes (smaller, less verbose)
https://pypi.python.org/pypi/pyflakes
Pep8 (style and some structural checks)
https://pypi.python.org/pypi/pep8
(... and many others)
Limitations of current tools & technologies
Checks are hard to create / modify...
(example: PyLint code for analyzing 'try/except' statements)
Long feedback cycles
Rethinking code analysis for Python
Our approach
1. Code is data! Let's not keep it in text
files but store it in a useful form that we
can work with easily (e.g. a graph).
2. Make it super-easy to specify errors
and bad code patterns.
3. Make it possible to learn from user
feedback and publicly available code.
Building the Code Graph
[Figure: the encode function from above drawn as a graph. Labels shown: functiondef, for, assign, dict, name (twice), body, targets, iterator.]
Building the Code Graph
[Figure: the same graph with properties added. Labels shown: value, {id : 'e'}, {name: 'encode', args : [...]}, {i : 0}, {i : 1}.]
Building the Code Graph
[Figure: the same graph with hash labels (e4fa76b..., a76fbc41..., c51fa291..., 74af219...) and a type annotation ($type: dict).]
Example: Tornado Project
[Figure: code graph of 10 modules from the tornado project. Legend: Modules, Classes, Functions.]
Advantages
- Simple detection of (exact) duplicates
- Semantic diffing of modules, classes, functions, ...
- Semantic code search on the whole tree
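One way exact-duplicate detection could work on such a tree is to hash every node from its type, its fields, and the hashes of its children, so that structurally identical subtrees get identical hashes. The sketch below uses the standard ast and hashlib modules. The hashing scheme and the names node_hash and duplicate_functions are assumptions made for illustration, not the scheme used in the talk. Function names are deliberately left out of the hash so that renamed copies still match.

```python
import ast
import hashlib
from collections import defaultdict

def node_hash(node):
    """Hash a node from its type, its plain fields and its children's hashes."""
    h = hashlib.sha1(type(node).__name__.encode())
    # iter_fields skips position attributes (lineno, col_offset), so the
    # hash depends on structure only, not on where the code sits in the file.
    for field, value in ast.iter_fields(node):
        h.update(field.encode())
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, ast.AST):
                h.update(node_hash(item).encode())
            else:
                h.update(repr(item).encode())
    return h.hexdigest()

def duplicate_functions(source):
    """Group the names of functions whose arguments and bodies are identical."""
    groups = defaultdict(list)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            key = node_hash(node.args) + ''.join(node_hash(s) for s in node.body)
            groups[key].append(node.name)
    return [names for names in groups.values() if len(names) > 1]

SOURCE = """
def f(x):
    return x + 1

def g(x):
    return x + 1

def h(x):
    return x + 2
"""
print(duplicate_functions(SOURCE))
```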
Describing Code Errors / Anti-Patterns
Code issues = patterns on the graph
[Figure: the value.imaginary access in encode drawn as a graph pattern. Labels shown: attribute, value, attr, name (twice), {id : imaginary}, {id : value}, $type, complex.]
Using YAML to describe graph patterns
node_type: attribute
value:
  $type: complex
attr: imaginary
Generalizing patterns
node_type: attribute
value:
  $type: complex
attr:
  $not:
    $or: [real, imag]
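To make the pattern semantics concrete, here is a minimal sketch of a matcher that evaluates such a pattern (written as a Python dict) against nodes from the standard ast module. The function names matches and find are hypothetical, and the $type check is a toy stand-in that only recognizes literal constants; real type inference is out of scope here, and this is not the matcher used by the tool in the talk.

```python
import ast

def matches(node, pattern):
    """Check an AST node (or a plain value) against a nested dict pattern."""
    if not isinstance(pattern, dict):
        return node == pattern                  # literal comparison
    for key, expected in pattern.items():
        if key == '$not':
            if matches(node, expected):
                return False
        elif key == '$or':
            if not any(matches(node, alt) for alt in expected):
                return False
        elif key == '$type':
            # Assumption: only literal constants carry a known type.
            if not (isinstance(node, ast.Constant)
                    and type(node.value).__name__ == expected):
                return False
        elif key == 'node_type':
            if type(node).__name__.lower() != expected:
                return False
        elif not matches(getattr(node, key, None), expected):
            return False                        # descend into a named field
    return True

PATTERN = {
    'node_type': 'attribute',
    'value': {'$type': 'complex'},
    'attr': {'$not': {'$or': ['real', 'imag']}},
}

def find(source, pattern):
    """Return the line numbers of all nodes in `source` that match `pattern`."""
    return [n.lineno for n in ast.walk(ast.parse(source)) if matches(n, pattern)]

print(find("a = (4j).imaginary\nb = (4j).imag\n", PATTERN))
```

As written, the pattern also flags legitimate members other than real and imag (for example conjugate), which is the kind of false positive the following slides deal with.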
Learning from feedback / false positives
"else" in for loop without break statement
node_type: for
body:
  $not:
    $anywhere:
      node_type: break
orelse:
  $anything: {}

values = ["foo", "bar", ... ]
for i,value in enumerate(values):
    if value == 'baz':
        print "Found it!"
else:
    print "didn't find 'baz'!"
Learning from false positives (I)
values = ["foo", "bar", ... ]
for i,value in enumerate(values):
    if value == 'baz':
        print "Found it!"
        return value
else:
    print "didn't find 'baz'!"

node_type: for
body:
  $not:
    $or:
      - $anywhere:
          node_type: break
      - $anywhere:
          node_type: return
orelse:
  $anything: {}
Learning from false positives (II)
node_type: for
body:
  $not:
    $or:
      - $anywhere:
          node_type: break
          exclude:
            node_type:
              $or: [while,for]
      - $anywhere:
          node_type: return
orelse:
  $anything: {}

values = ["foo", "bar", ... ]
for i,value in enumerate(values):
    if value == 'baz':
        print "Found it!"
    for j in ...:
        #...
        break
else:
    print "didn't find 'baz'!"
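For comparison, the refined check can be written by hand against the standard ast module. This is a sketch under assumptions: the names exits_loop and for_else_without_break are hypothetical, and two details go beyond the pattern above (nested function and class definitions are skipped, and a break in a nested loop's own else clause is counted as leaving the outer loop).

```python
import ast

def exits_loop(node, in_nested_loop=False):
    """True if `node` contains a return, or a break that leaves the outer loop."""
    if isinstance(node, ast.Return):
        return True
    if isinstance(node, ast.Break):
        return not in_nested_loop       # a break in a nested loop does not count
    if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
        return False                    # nested definitions have their own control flow
    if isinstance(node, (ast.For, ast.AsyncFor, ast.While)):
        in_body = any(exits_loop(c, True) for c in node.body)
        # The else clause of a nested loop runs outside that nested loop.
        in_else = any(exits_loop(c, in_nested_loop) for c in node.orelse)
        return in_body or in_else
    return any(exits_loop(c, in_nested_loop) for c in ast.iter_child_nodes(node))

def for_else_without_break(source):
    """Return line numbers of for loops with an else clause but no break/return."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.For) and node.orelse:
            if not any(exits_loop(stmt) for stmt in node.body):
                hits.append(node.lineno)
    return hits

FLAGGED = """
for value in values:
    if value == 'baz':
        print('Found it!')
    for j in range(3):
        break
else:
    print("didn't find 'baz'!")
"""

OK = """
for value in values:
    if value == 'baz':
        break
else:
    print("didn't find 'baz'!")
"""

print(for_else_without_break(FLAGGED))
print(for_else_without_break(OK))
```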
patterns vs. code
(no exception type specified)
node_type: tryexcept
handlers:
  node_type: excepthandler
  type: null

(empty exception handler)
node_type: tryexcept
handlers:
  - body:
      - node_type: pass
    node_type: excepthandler
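The "code" side of this comparison could look roughly like the following hand-written checker for the same two issues, using the standard ast module. It is a sketch, not the PyLint implementation referred to earlier; the function name check_except_handlers is hypothetical, and "empty" is taken to mean a handler body consisting of a single pass statement.

```python
import ast

def check_except_handlers(source):
    """Return (line, message) pairs for bare and empty exception handlers."""
    issues = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ExceptHandler):
            continue
        if node.type is None:                       # "except:" without a type
            issues.append((node.lineno, 'no exception type specified'))
        if len(node.body) == 1 and isinstance(node.body[0], ast.Pass):
            issues.append((node.lineno, 'empty exception handler'))
    return issues

SOURCE = """
try:
    risky()
except:
    pass
"""
print(check_except_handlers(SOURCE))
```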
Summary & Feedback
1. Storing code as a graph opens up many
interesting possibilities. Let's stop thinking of
code as text!
2. We can learn from user feedback or even
use machine learning to create and adapt
code patterns!
3. Everyone can write code checkers!
=> crowd-source code quality!
Thanks!
www.quantifiedcode.com
https://github.com/quantifiedcode
@quantifiedcode
Andreas Dewes (@japh44)
andreas@quantifiedcode.com
Visit us at booth 629!

More Related Content

What's hot

STCW Basic Safety Training
STCW Basic Safety TrainingSTCW Basic Safety Training
STCW Basic Safety TrainingMatthew Peck
 
Web Cache Deception Attack
Web Cache Deception AttackWeb Cache Deception Attack
Web Cache Deception AttackOmer Gil
 
Advanced Functions Unit 1
Advanced Functions Unit 1Advanced Functions Unit 1
Advanced Functions Unit 1leefong2310
 
Presentation of "On the effectiveness of route-based packet filtering for dis...
Presentation of "On the effectiveness of route-based packet filtering for dis...Presentation of "On the effectiveness of route-based packet filtering for dis...
Presentation of "On the effectiveness of route-based packet filtering for dis...Jammy Wang
 
Future Maritime Security Challenges: What to Expect and How To Prepare?
Future Maritime Security Challenges: What to Expect and How To Prepare?Future Maritime Security Challenges: What to Expect and How To Prepare?
Future Maritime Security Challenges: What to Expect and How To Prepare?Heiko Borchert
 
2022 APIsecure_Method for exploiting IDOR on nodejs+mongodb based backend
2022 APIsecure_Method for exploiting IDOR on nodejs+mongodb based backend2022 APIsecure_Method for exploiting IDOR on nodejs+mongodb based backend
2022 APIsecure_Method for exploiting IDOR on nodejs+mongodb based backendAPIsecure_ Official
 

What's hot (7)

STCW Basic Safety Training
STCW Basic Safety TrainingSTCW Basic Safety Training
STCW Basic Safety Training
 
Web Cache Deception Attack
Web Cache Deception AttackWeb Cache Deception Attack
Web Cache Deception Attack
 
Advanced Functions Unit 1
Advanced Functions Unit 1Advanced Functions Unit 1
Advanced Functions Unit 1
 
Shellscripting
ShellscriptingShellscripting
Shellscripting
 
Presentation of "On the effectiveness of route-based packet filtering for dis...
Presentation of "On the effectiveness of route-based packet filtering for dis...Presentation of "On the effectiveness of route-based packet filtering for dis...
Presentation of "On the effectiveness of route-based packet filtering for dis...
 
Future Maritime Security Challenges: What to Expect and How To Prepare?
Future Maritime Security Challenges: What to Expect and How To Prepare?Future Maritime Security Challenges: What to Expect and How To Prepare?
Future Maritime Security Challenges: What to Expect and How To Prepare?
 
2022 APIsecure_Method for exploiting IDOR on nodejs+mongodb based backend
2022 APIsecure_Method for exploiting IDOR on nodejs+mongodb based backend2022 APIsecure_Method for exploiting IDOR on nodejs+mongodb based backend
2022 APIsecure_Method for exploiting IDOR on nodejs+mongodb based backend
 

Viewers also liked

Let's build a quantum computer!
Let's build a quantum computer!Let's build a quantum computer!
Let's build a quantum computer!Andreas Dewes
 
PVS-Studio and static code analysis technique
PVS-Studio and static code analysis techniquePVS-Studio and static code analysis technique
PVS-Studio and static code analysis techniqueAndrey Karpov
 
Quantum Computing: Welcome to the Future
Quantum Computing: Welcome to the FutureQuantum Computing: Welcome to the Future
Quantum Computing: Welcome to the FutureVernBrownell
 
Quantum computing - Introduction
Quantum computing - IntroductionQuantum computing - Introduction
Quantum computing - Introductionrushmila
 

Viewers also liked (6)

Let's build a quantum computer!
Let's build a quantum computer!Let's build a quantum computer!
Let's build a quantum computer!
 
PVS-Studio and static code analysis technique
PVS-Studio and static code analysis techniquePVS-Studio and static code analysis technique
PVS-Studio and static code analysis technique
 
Quantum Computing: Welcome to the Future
Quantum Computing: Welcome to the FutureQuantum Computing: Welcome to the Future
Quantum Computing: Welcome to the Future
 
Quantum computing - Introduction
Quantum computing - IntroductionQuantum computing - Introduction
Quantum computing - Introduction
 
Quantum computer ppt
Quantum computer pptQuantum computer ppt
Quantum computer ppt
 
Static Code Analysis
Static Code AnalysisStatic Code Analysis
Static Code Analysis
 

Similar to Learning from other's mistakes: Data-driven code analysis

Declarative Data Modeling in Python
Declarative Data Modeling in PythonDeclarative Data Modeling in Python
Declarative Data Modeling in PythonJoshua Forman
 
Sphinx autodoc - automated api documentation - PyCon.MY 2015
Sphinx autodoc - automated api documentation - PyCon.MY 2015Sphinx autodoc - automated api documentation - PyCon.MY 2015
Sphinx autodoc - automated api documentation - PyCon.MY 2015Takayuki Shimizukawa
 
json.ppt download for free for college project
json.ppt download for free for college projectjson.ppt download for free for college project
json.ppt download for free for college projectAmitSharma397241
 
Building DSLs with Xtext - Eclipse Modeling Day 2009
Building DSLs with Xtext - Eclipse Modeling Day 2009Building DSLs with Xtext - Eclipse Modeling Day 2009
Building DSLs with Xtext - Eclipse Modeling Day 2009Heiko Behrens
 
Sphinx autodoc - automated API documentation (EuroPython 2015 in Bilbao)
Sphinx autodoc - automated API documentation (EuroPython 2015 in Bilbao)Sphinx autodoc - automated API documentation (EuroPython 2015 in Bilbao)
Sphinx autodoc - automated API documentation (EuroPython 2015 in Bilbao)Takayuki Shimizukawa
 
Sphinx autodoc - automated API documentation (PyCon APAC 2015 in Taiwan)
Sphinx autodoc - automated API documentation (PyCon APAC 2015 in Taiwan)Sphinx autodoc - automated API documentation (PyCon APAC 2015 in Taiwan)
Sphinx autodoc - automated API documentation (PyCon APAC 2015 in Taiwan)Takayuki Shimizukawa
 
Sphinx autodoc - automated api documentation - PyCon.KR 2015
Sphinx autodoc - automated api documentation - PyCon.KR 2015Sphinx autodoc - automated api documentation - PyCon.KR 2015
Sphinx autodoc - automated api documentation - PyCon.KR 2015Takayuki Shimizukawa
 
Python Workshop - Learn Python the Hard Way
Python Workshop - Learn Python the Hard WayPython Workshop - Learn Python the Hard Way
Python Workshop - Learn Python the Hard WayUtkarsh Sengar
 
Demystifying Shapeless
Demystifying Shapeless Demystifying Shapeless
Demystifying Shapeless Jared Roesch
 
Writing a compiler in go
Writing a compiler in goWriting a compiler in go
Writing a compiler in goYusuke Kita
 
Code is not text! How graph technologies can help us to understand our code b...
Code is not text! How graph technologies can help us to understand our code b...Code is not text! How graph technologies can help us to understand our code b...
Code is not text! How graph technologies can help us to understand our code b...Andreas Dewes
 
RedisConf17 - Redis as a JSON document store
RedisConf17 - Redis as a JSON document storeRedisConf17 - Redis as a JSON document store
RedisConf17 - Redis as a JSON document storeRedis Labs
 
descriptive programming
descriptive programmingdescriptive programming
descriptive programmingAnand Dhana
 
AST - the only true tool for building JavaScript
AST - the only true tool for building JavaScriptAST - the only true tool for building JavaScript
AST - the only true tool for building JavaScriptIngvar Stepanyan
 
Academy PRO: Elasticsearch. Data management
Academy PRO: Elasticsearch. Data managementAcademy PRO: Elasticsearch. Data management
Academy PRO: Elasticsearch. Data managementBinary Studio
 

Similar to Learning from other's mistakes: Data-driven code analysis (20)

Declarative Data Modeling in Python
Declarative Data Modeling in PythonDeclarative Data Modeling in Python
Declarative Data Modeling in Python
 
Go之道
Go之道Go之道
Go之道
 
Sphinx autodoc - automated api documentation - PyCon.MY 2015
Sphinx autodoc - automated api documentation - PyCon.MY 2015Sphinx autodoc - automated api documentation - PyCon.MY 2015
Sphinx autodoc - automated api documentation - PyCon.MY 2015
 
json.ppt download for free for college project
json.ppt download for free for college projectjson.ppt download for free for college project
json.ppt download for free for college project
 
Building DSLs with Xtext - Eclipse Modeling Day 2009
Building DSLs with Xtext - Eclipse Modeling Day 2009Building DSLs with Xtext - Eclipse Modeling Day 2009
Building DSLs with Xtext - Eclipse Modeling Day 2009
 
Sphinx autodoc - automated API documentation (EuroPython 2015 in Bilbao)
Sphinx autodoc - automated API documentation (EuroPython 2015 in Bilbao)Sphinx autodoc - automated API documentation (EuroPython 2015 in Bilbao)
Sphinx autodoc - automated API documentation (EuroPython 2015 in Bilbao)
 
Sphinx autodoc - automated API documentation (PyCon APAC 2015 in Taiwan)
Sphinx autodoc - automated API documentation (PyCon APAC 2015 in Taiwan)Sphinx autodoc - automated API documentation (PyCon APAC 2015 in Taiwan)
Sphinx autodoc - automated API documentation (PyCon APAC 2015 in Taiwan)
 
Sphinx autodoc - automated api documentation - PyCon.KR 2015
Sphinx autodoc - automated api documentation - PyCon.KR 2015Sphinx autodoc - automated api documentation - PyCon.KR 2015
Sphinx autodoc - automated api documentation - PyCon.KR 2015
 
Python dictionaries
Python dictionariesPython dictionaries
Python dictionaries
 
Python Workshop - Learn Python the Hard Way
Python Workshop - Learn Python the Hard WayPython Workshop - Learn Python the Hard Way
Python Workshop - Learn Python the Hard Way
 
CSV JSON and XML files in Python.pptx
CSV JSON and XML files in Python.pptxCSV JSON and XML files in Python.pptx
CSV JSON and XML files in Python.pptx
 
Demystifying Shapeless
Demystifying Shapeless Demystifying Shapeless
Demystifying Shapeless
 
Oop java
Oop javaOop java
Oop java
 
Writing a compiler in go
Writing a compiler in goWriting a compiler in go
Writing a compiler in go
 
Code is not text! How graph technologies can help us to understand our code b...
Code is not text! How graph technologies can help us to understand our code b...Code is not text! How graph technologies can help us to understand our code b...
Code is not text! How graph technologies can help us to understand our code b...
 
RedisConf17 - Redis as a JSON document store
RedisConf17 - Redis as a JSON document storeRedisConf17 - Redis as a JSON document store
RedisConf17 - Redis as a JSON document store
 
descriptive programming
descriptive programmingdescriptive programming
descriptive programming
 
AST - the only true tool for building JavaScript
AST - the only true tool for building JavaScriptAST - the only true tool for building JavaScript
AST - the only true tool for building JavaScript
 
Python basic
Python basicPython basic
Python basic
 
Academy PRO: Elasticsearch. Data management
Academy PRO: Elasticsearch. Data managementAcademy PRO: Elasticsearch. Data management
Academy PRO: Elasticsearch. Data management
 

More from Andreas Dewes

Fairness and Transparency in Machine Learning
Fairness and Transparency in Machine LearningFairness and Transparency in Machine Learning
Fairness and Transparency in Machine LearningAndreas Dewes
 
Type Annotations in Python: Whats, Whys and Wows!
Type Annotations in Python: Whats, Whys and Wows!Type Annotations in Python: Whats, Whys and Wows!
Type Annotations in Python: Whats, Whys and Wows!Andreas Dewes
 
Analyzing data with docker v4
Analyzing data with docker   v4Analyzing data with docker   v4
Analyzing data with docker v4Andreas Dewes
 
Say "Hi!" to Your New Boss
Say "Hi!" to Your New BossSay "Hi!" to Your New Boss
Say "Hi!" to Your New BossAndreas Dewes
 
Demonstrating Quantum Speed-Up with a Two-Transmon Quantum Processor Ph.D. d...
Demonstrating Quantum Speed-Up  with a Two-Transmon Quantum Processor Ph.D. d...Demonstrating Quantum Speed-Up  with a Two-Transmon Quantum Processor Ph.D. d...
Demonstrating Quantum Speed-Up with a Two-Transmon Quantum Processor Ph.D. d...Andreas Dewes
 
Python for Scientists
Python for ScientistsPython for Scientists
Python for ScientistsAndreas Dewes
 

More from Andreas Dewes (6)

Fairness and Transparency in Machine Learning
Fairness and Transparency in Machine LearningFairness and Transparency in Machine Learning
Fairness and Transparency in Machine Learning
 
Type Annotations in Python: Whats, Whys and Wows!
Type Annotations in Python: Whats, Whys and Wows!Type Annotations in Python: Whats, Whys and Wows!
Type Annotations in Python: Whats, Whys and Wows!
 
Analyzing data with docker v4
Analyzing data with docker   v4Analyzing data with docker   v4
Analyzing data with docker v4
 
Say "Hi!" to Your New Boss
Say "Hi!" to Your New BossSay "Hi!" to Your New Boss
Say "Hi!" to Your New Boss
 
Demonstrating Quantum Speed-Up with a Two-Transmon Quantum Processor Ph.D. d...
Demonstrating Quantum Speed-Up  with a Two-Transmon Quantum Processor Ph.D. d...Demonstrating Quantum Speed-Up  with a Two-Transmon Quantum Processor Ph.D. d...
Demonstrating Quantum Speed-Up with a Two-Transmon Quantum Processor Ph.D. d...
 
Python for Scientists
Python for ScientistsPython for Scientists
Python for Scientists
 

Recently uploaded

What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 

Recently uploaded (20)

What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 

Learning from other's mistakes: Data-driven code analysis

  • 1. Data-driven code analysis: Learning from other's mistakes Andreas Dewes (@japh44) andreas@quantifiedcode.com 13.04.2015 PyCon 2015 – Montreal
  • 2. About Physicist and Python enthusiast CTO of a spin-off of the University of Munich (LMU): We develop software for data-driven code analysis.
  • 4. Tools & Techniques for Ensuring Code Quality static dynamic automated manual Debugging Profiling ... Manual code reviews Static analysis / automated code reviews Unit testing System testing Integration testing
  • 5. Discovering problems in code def encode(obj): """ Encode a (possibly nested) dictionary containing complex values into a form that can be serialized using JSON. """ e = {} for key,value in obj: if isinstance(value,dict): e[key] = encode(value) elif isinstance(value,complex): e[key] = {'type' : 'complex', 'r' : value.real, 'i' : value.imaginary} return e d = {'a' : 1j+4,'s' : {'d' : 4+5j}} print encode(d) obj returns only the keys of the dictionary. (obj.items() is needed) value.imaginary does not exist. (value.imag would be correct)
  • 6. Dynamic Analysis (e.g. unit testing) def encode(obj): """ Encode a (possibly nested) dictionary containing complex values into a form that can be serialized using JSON. """ e = {} for key,value in obj: if isinstance(value,dict): e[key] = encode(value) elif isinstance(value,complex): e[key] = {'type' : 'complex', 'r' : value.real, 'i' : value.imaginary} return e d = {'a' : 1j+4,'s' : {'d' : 4+5j}} print encode(d) def test_encode(): d = {'a' : 1j+4, 's' : {'d' : 4+5j}} r = encode(d) #this will fail... assert r['a'] == {'type' : 'complex', 'r' : 4, 'i' : 1} assert r['s']['d'] == {'type' : 'complex', 'r' : 4, 'i' : 5}
  • 7. Static Analysis (for humans) encode is a function with 1 parameter which always returns a dict. I: obj should be an iterator/list of tuples with two elements. encode gets called with a dict, which does not satisfy (I). a value of type complex does not have an .imaginary attribute! encode is called with a dict, which again does not satisfy (I). def encode(obj): """ Encode a (possibly nested) dictionary containing complex values into a form that can be serialized using JSON. """ e = {} for key,value in obj: if isinstance(value,dict): e[key] = encode(value) elif isinstance(value,complex): e[key] = {'type' : 'complex', 'r' : value.real, 'i' : value.imaginary} return e d = {'a' : 1j+4,'s' : {'d' : 4+5j}} print encode(d)
  • 8. How static analysis tools work (short version) 1. Compile the code into a data structure, typically an abstract syntax tree (AST). 2. (Optionally) annotate it with additional information to make analysis easier. 3. Traverse the (AST) data to find problems.
  • 9. Python Tools for Static Analysis: PyLint (most comprehensive tool), http://www.pylint.org/; PyFlakes (smaller, less verbose), https://pypi.python.org/pypi/pyflakes; Pep8 (style and some structural checks), https://pypi.python.org/pypi/pep8; (... and many others)
  • 10. Limitations of current tools & technologies
  • 11. Checks are hard to create / modify... (example: PyLint code for analyzing 'try/except' statements)
  • 14. Our approach 1. Code is data! Let's not keep it in text files but store it in a useful form that we can work with easily (e.g. a graph). 2. Make it super-easy to specify errors and bad code patterns. 3. Make it possible to learn from user feedback and publicly available code.
  • 15. Building the Code Graph def encode(obj): """ Encode a (possibly nested) dictionary containing complex values into a form that can be serialized using JSON. """ e = {} for key,value in obj: if isinstance(value,dict): e[key] = encode(value) elif isinstance(value,complex): e[key] = {'type' : 'complex', 'r' : value.real, 'i' : value.imaginary} return e d = {'a' : 1j+4,'s' : {'d' : 4+5j}} print encode(d)
  • 16. Building the Code Graph def encode(obj): """ Encode a (possibly nested) dictionary containing complex values into a form that can be serialized using JSON. """ e = {} for key,value in obj: if isinstance(value,dict): e[key] = encode(value) elif isinstance(value,complex): e[key] = {'type' : 'complex', 'r' : value.real, 'i' : value.imaginary} return e d = {'a' : 1j+4,'s' : {'d' : 4+5j}} print encode(d) [Figure: the syntax tree of this code drawn as a graph; labels shown: functiondef, body, assign, targets, name, value, dict, for, iterator]
  • 17. Building the Code Graph def encode(obj): """ Encode a (possibly nested) dictionary containing complex values into a form that can be serialized using JSON. """ e = {} for key,value in obj: if isinstance(value,dict): e[key] = encode(value) elif isinstance(value,complex): e[key] = {'type' : 'complex', 'r' : value.real, 'i' : value.imaginary} return e d = {'a' : 1j+4,'s' : {'d' : 4+5j}} print encode(d) [Figure: the same graph (functiondef, body, assign, targets, name, value, dict, for, iterator), now annotated with node properties ({name: 'encode', args : [...]}, {id : 'e'}, {i:0}, {i : 1}, $type: dict) and truncated node hashes (e4fa76b..., a76fbc41..., c51fa291..., 74af219...)]
  • 18. Example: Tornado Project [Figure: code graph of 10 modules from the tornado project, showing modules, classes and functions]
  • 19. Advantages - Simple detection of (exact) duplicates - Semantic diffing of modules, classes, functions, ... - Semantic code search on the whole tree
  • 20. Describing Code Errors / Anti-Patterns
  • 21. Code issues = patterns on the graph def encode(obj): """ Encode a (possibly nested) dictionary containing complex values into a form that can be serialized using JSON. """ e = {} for key,value in obj: if isinstance(value,dict): e[key] = encode(value) elif isinstance(value,complex): e[key] = {'type' : 'complex', 'r' : value.real, 'i' : value.imaginary} return e d = {'a' : 1j+4,'s' : {'d' : 4+5j}} print encode(d) [Figure: graph pattern for the faulty attribute access; labels shown: attribute, value, attr, name, $type, {id : imaginary}, {id : value}, complex]
  • 22. Using YAML to describe graph patterns def encode(obj): """ Encode a (possibly nested) dictionary containing complex values into a form that can be serialized using JSON. """ e = {} for key,value in obj: if isinstance(value,dict): e[key] = encode(value) elif isinstance(value,complex): e[key] = {'type' : 'complex', 'r' : value.real, 'i' : value.imaginary} return e d = {'a' : 1j+4,'s' : {'d' : 4+5j}} print encode(d) Pattern: {node_type: attribute, value: {$type: complex}, attr: imaginary}
  • 23. Generalizing patterns def encode(obj): """ Encode a (possibly nested) dictionary containing complex values into a form that can be serialized using JSON. """ e = {} for key,value in obj: if isinstance(value,dict): e[key] = encode(value) elif isinstance(value,complex): e[key] = {'type' : 'complex', 'r' : value.real, 'i' : value.imaginary} return e d = {'a' : 1j+4,'s' : {'d' : 4+5j}} print encode(d) Pattern: {node_type: attribute, value: {$type: complex}, attr: {$not: {$or: [real, imag]}}}
  • 24. Learning from feedback / false positives
  • 25. "else" in for loop without break statement. Pattern: {node_type: for, body: {$not: {$anywhere: {node_type: break}}}, orelse: {$anything: {}}} Code: values = ["foo", "bar", ... ] for i,value in enumerate(values): if value == 'baz': print "Found it!" else: print "didn't find 'baz'!"
  • 26. Learning from false positives (I). Code: values = ["foo", "bar", ... ] for i,value in enumerate(values): if value == 'baz': print "Found it!" return value else: print "didn't find 'baz'!" Pattern: {node_type: for, body: {$not: {$or: [{$anywhere: {node_type: break}}, {$anywhere: {node_type: return}}]}}, orelse: {$anything: {}}}
  • 27. Learning from false positives (II). Pattern: {node_type: for, body: {$not: {$or: [{$anywhere: {node_type: break, exclude: {node_type: {$or: [while, for]}}}}, {$anywhere: {node_type: return}}]}}, orelse: {$anything: {}}} Code: values = ["foo", "bar", ... ] for i,value in enumerate(values): if value == 'baz': print "Found it!" for j in ...: #... break else: print "didn't find 'baz'!"
  • 28. patterns vs. code. Pattern 1 (no exception type specified): {node_type: tryexcept, handlers: [{node_type: excepthandler, type: null}]} Pattern 2 (empty exception handler): {node_type: tryexcept, handlers: [{node_type: excepthandler, body: [{node_type: pass}]}]}
  • 29. Summary & Feedback 1. Storing code as a graph opens up many interesting possibilities. Let's stop thinking of code as text! 2. We can learn from user feedback or even use machine learning to create and adapt code patterns! 3. Everyone can write code checkers! => crowd-source code quality!
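
The "else in for loop without break" check from slides 25 to 27 can be approximated with Python's standard `ast` module, following the three steps on slide 8. This is only a sketch and not the graph-pattern engine described in the talk: the names `find_for_else_without_exit` and `_has_exit` are made up for illustration, and the sample code uses Python 3 syntax, whereas the slides use Python 2 `print` statements.

```python
import ast

# Sample inputs (hypothetical, modelled on slide 25)
BAD = '''\
values = ["foo", "bar"]
for i, value in enumerate(values):
    if value == "baz":
        print("Found it!")
else:
    print("didn't find 'baz'!")
'''

GOOD = '''\
values = ["foo", "bar"]
for i, value in enumerate(values):
    if value == "baz":
        print("Found it!")
        break
else:
    print("didn't find 'baz'!")
'''


def _has_exit(node, break_counts=True):
    """Return True if `node` contains a return, or a break that ends the outer loop."""
    if isinstance(node, ast.Return):
        return True
    if isinstance(node, ast.Break):
        return break_counts
    if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef,
                         ast.Lambda, ast.ClassDef)):
        # Statements in a nested scope cannot leave our loop.
        return False
    if isinstance(node, (ast.For, ast.AsyncFor, ast.While)):
        # A break in a nested loop's body only ends that inner loop
        # (the false positive from slide 27); its else clause still
        # belongs to the enclosing loop.
        return (any(_has_exit(n, False) for n in node.body)
                or any(_has_exit(n, break_counts) for n in node.orelse))
    return any(_has_exit(child, break_counts)
               for child in ast.iter_child_nodes(node))


def find_for_else_without_exit(source):
    """Return line numbers of for loops whose else clause always runs."""
    tree = ast.parse(source)          # step 1: build the AST
    hits = []
    for node in ast.walk(tree):       # step 3: search the tree
        if isinstance(node, ast.For) and node.orelse:
            if not any(_has_exit(stmt) for stmt in node.body):
                hits.append(node.lineno)
    return sorted(hits)


print(find_for_else_without_exit(BAD))
print(find_for_else_without_exit(GOOD))
```

The nested-loop and `return` handling mirror the two refinements the slides derive from false positives; a declarative pattern language, as proposed in the talk, would express the same conditions without hand-written traversal code.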

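For reference, here is a sketch of the `encode` example from slides 5 to 7 with the two problems named on slide 5 fixed, written in Python 3. The `else` branch is an assumption that goes beyond the slides: the original code silently drops values that are neither dictionaries nor complex numbers, which is probably not intended.

```python
import json


def encode(obj):
    """
    Encode a (possibly nested) dictionary containing complex values
    into a form that can be serialized using JSON.
    """
    e = {}
    for key, value in obj.items():          # fix 1: iterate over (key, value) pairs
        if isinstance(value, dict):
            e[key] = encode(value)
        elif isinstance(value, complex):
            e[key] = {'type': 'complex',
                      'r': value.real,
                      'i': value.imag}      # fix 2: the attribute is .imag
        else:
            e[key] = value                  # assumption: keep other values unchanged
    return e


d = {'a': 1j + 4, 's': {'d': 4 + 5j}}
print(json.dumps(encode(d)))
```

With these changes the assertions of `test_encode` on slide 6 hold, since `4 == 4.0` in Python.
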
Editor's Notes

  1. First, we transform the code into a so-called "abstract syntax tree". This is a representation that can be easily manipulated programmatically.
  4. We store all syntax trees of the project in a graph database (either on-disk or in-memory) to be able to perform queries on the graph and store it for later analysis. Nodes in modules can be linked, e.g. to point from a function call in a given module to the definition of that function in another module.
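
The idea of storing syntax trees as a graph of hashed nodes (slides 15 to 19) can be illustrated with a small in-memory sketch. Everything here is an assumption made for illustration: the talk does not say how its hashes are computed or which graph database is used, so `build_graph`, the SHA-1 scheme and the plain dictionary standing in for the database are hypothetical.

```python
import ast
import hashlib


def build_graph(source):
    """Return (root_hash, nodes); nodes maps a content hash to a node record."""
    nodes = {}

    def visit(node):
        props = {}   # scalar fields, e.g. a function's name
        edges = []   # (field name, index, hash of child node)
        for field, value in ast.iter_fields(node):
            if isinstance(value, ast.AST):
                edges.append((field, 0, visit(value)))
            elif isinstance(value, list):
                for i, item in enumerate(value):
                    if isinstance(item, ast.AST):
                        edges.append((field, i, visit(item)))
                    else:
                        props["%s[%d]" % (field, i)] = repr(item)
            else:
                props[field] = repr(value)
        node_type = type(node).__name__.lower()
        # The hash covers the node type, its properties and the hashes of
        # its children, but not line numbers, so identical code hashes
        # identically wherever it appears.
        payload = repr((node_type, sorted(props.items()), edges))
        digest = hashlib.sha1(payload.encode("utf-8")).hexdigest()
        nodes[digest] = {"node_type": node_type, "props": props, "edges": edges}
        return digest

    return visit(ast.parse(source)), nodes


module_a = "x = 1\n\ndef f(a):\n    return a + 1\n"
module_b = "def f(a):\n    return a + 1\n"

_, graph_a = build_graph(module_a)
_, graph_b = build_graph(module_b)

# Hashes present in both graphs point at exact duplicates.
shared = [h for h in graph_a
          if h in graph_b and graph_a[h]["node_type"] == "functiondef"]
print(len(shared))
```

Because each node is keyed by its content, the "simple detection of (exact) duplicates" listed on slide 19 reduces to comparing keys between graphs.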