Today, we almost exclusively think of code in software projects as a collection of text files. The tools that we use (version control systems, IDEs, code analyzers) also use text as the primary storage format for code. In fact, the belief that “code is text” is so deeply ingrained in our heads that we never question its validity or even become aware of the fact that there are other ways to look at code.
In my talk I will explain why treating code as text is a very bad idea which actively holds back our understanding and creates a range of problems in large software projects. I will then show how we can overcome (some of) these problems by treating and storing code as data, and more specifically as a graph. I will show specific examples of how we can use this approach to improve our understanding of large code bases, increase code quality and automate certain aspects of software development.
Finally, I will outline my personal vision of the future of programming, which is a future where we no longer primarily interact with code bases using simple text editors. I will also give some ideas on how we might get to that future.
Code is not text! How graph technologies can help us to understand our code better.
1. Code Is Not Text!
How graph technologies can help us to
understand our code better
Andreas Dewes (@japh44)
andreas@quantifiedcode.com
21.07.2015
EuroPython 2015 – Bilbao
2. About
Physicist and Python enthusiast
We are a spin-off of the University of Munich (LMU).
We develop software for data-driven code analysis.
5. Our Journey
1. Why graphs are interesting
2. How we can store code in a graph
3. What we can learn from the graph
4. How programmers can profit from this
6. Graphs explained in 30 seconds
node / vertex
edge
node_type: classdef
name: Foo
label: classdef
data: {...}
node_type: functiondef
name: foo
Old idea, many new solutions: Neo4j, OrientDB, ArangoDB, TitanDB, ... (+SQL, key/value stores)
7. Graphs in Programming
Used mostly within the
interpreter/compiler.
Use cases
• Code Optimization
• Code Annotation
• Rewriting of Code
• As Intermediate Language
8. Building the Code Graph
def encode(obj):
    """
    Encode a (possibly nested)
    dictionary containing complex values
    into a form that can be serialized
    using JSON.
    """
    e = {}
    for key, value in obj.items():
        if isinstance(value, dict):
            e[key] = encode(value)
        elif isinstance(value, complex):
            e[key] = {'type': 'complex',
                      'r': value.real,
                      'i': value.imag}
        else:
            # keep all other values unchanged
            e[key] = value
    return e
[Figure: the AST of encode() drawn as a graph. Node types shown: functiondef, for, assign, dict, name. Edge labels shown: body, targets, value, iterator.]
import ast
tree = ast.parse(" ")
...
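A minimal sketch of how the tree returned by `ast.parse` could be flattened into nodes and labeled edges, using only the standard library. The helper name `build_graph` and the dictionary layout (`node_type`, `data`) are assumptions made for illustration; the storage format used in the talk is not shown on the slides.

```python
import ast


def build_graph(source):
    """Flatten a Python AST into (nodes, edges).

    nodes: list of dicts with 'node_type' and 'data' (hypothetical layout)
    edges: list of (parent_index, label, child_index) tuples
    """
    nodes, edges = [], []

    def visit(node):
        index = len(nodes)
        data = {}
        nodes.append({'node_type': type(node).__name__.lower(), 'data': data})
        for field, value in ast.iter_fields(node):
            if isinstance(value, ast.AST):
                # the field name becomes the edge label
                edges.append((index, field, visit(value)))
            elif isinstance(value, list):
                for child in value:
                    if isinstance(child, ast.AST):
                        edges.append((index, field, visit(child)))
            else:
                data[field] = value  # plain attribute, e.g. a function name
        return index

    visit(ast.parse(source))
    return nodes, edges


nodes, edges = build_graph("def encode(obj):\n    e = {}\n    return e\n")
```

Each edge carries the name of the AST field it came from (`body`, `targets`, `value`), which corresponds to the edge labels in the figure above.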
9. Storing the Graph: Merkle Trees
https://en.wikipedia.org/wiki/Merkle_tree
https://git-scm.com/book/en/v2/Git-Internals-Git-Objects
https://en.bitcoin.it/wiki/Protocol_documentation#Merkle_Trees
Example: git (also Bitcoin)
/               4a7ef...   (tree)
/flask          79fe4...   (tree)
/docs           a77be...   (tree)
/docs/conf.py   9fa5a..    (blob)
/flask/app.py   7fa2a..    (blob)
...
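10. AST Example
[Figure: the ASTs of two functions, with hashes attached to the nodes. First tree: functiondef {name: 'encode', args: [...]}, name {id: 'e'}, data {i: 0} and {i: 1}; node types functiondef, dict, assign, for, name; edge labels body, targets, value, iterator; hashes shown: e4fa76b..., a76fbc41..., c51fa291.... Second tree: functiondef {name: 'decode', args: [...]}, name {id: 'f'}, data {i: 0} and {i: 1}; node types functiondef, dict, assign, name; edge labels body, targets, value; hashes shown: 5afacc..., ba4ffac..., 7faec44..., 74af219....]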
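The hashing idea can be sketched in a few lines: the hash of a node covers its type, its plain attributes and the hashes of its children, so identical subtrees get identical hashes and need to be stored only once. The function name `merkle_hash`, the use of SHA-1 and the serialization format are assumptions for illustration, not the scheme used in the talk.

```python
import ast
import hashlib


def merkle_hash(node, store):
    """Return a hash of `node` that covers its type, its plain
    attributes and the hashes of all of its children."""
    parts = [type(node).__name__]
    for field, value in ast.iter_fields(node):
        values = value if isinstance(value, list) else [value]
        for item in values:
            if isinstance(item, ast.AST):
                parts.append(field + ':' + merkle_hash(item, store))
            else:
                parts.append(field + '=' + repr(item))
    digest = hashlib.sha1('\n'.join(parts).encode('utf-8')).hexdigest()
    store[digest] = node  # identical subtrees share one entry
    return digest


store = {}
encode_tree = ast.parse("def encode(obj):\n    e = {}\n    return e\n")
decode_tree = ast.parse("def decode(obj):\n    e = {}\n    return e\n")
encode_hash = merkle_hash(encode_tree, store)
decode_hash = merkle_hash(decode_tree, store)
```

The two functions differ only in their name, so their root hashes differ while the hashes of their bodies are the same. Line and column numbers are not part of `ast.iter_fields`, so moving code around in a file does not change its hash.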
12. What this enables
• Store everything, not just condensed
meta-data (as IDEs do, for example)
• Store multiple projects together, to
reveal connections and similarities
• Store the whole git commit history of a
given project, to see changes across
time.
15. Querying & Navigation
1. Perform a query over some indexed field(s)
   to retrieve an initial set of nodes or edges.
       graph.filter({'node_type' : 'functiondef',...})
2. Traverse the resulting graph along its edges.
       for child in node.outV('body'):
           if child['node_type'] == ...
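To make the two steps concrete, here is a toy in-memory graph that supports a `filter` query and an `outV` traversal. The method names are taken from the slide, but the classes `Graph` and `Node`, the helpers `add` and `link`, and all of the implementation are assumptions; a real graph database would use indexes instead of a linear scan.

```python
class Node(dict):
    """A node is a dict of properties plus labeled outgoing edges."""

    def __init__(self, **properties):
        super().__init__(**properties)
        self.edges = []  # list of (label, target_node) pairs

    def link(self, label, target):
        self.edges.append((label, target))
        return target

    def outV(self, label=None):
        """Return the nodes reachable over outgoing edges, optionally by label."""
        return [target for edge_label, target in self.edges
                if label is None or edge_label == label]


class Graph:
    def __init__(self):
        self.nodes = []

    def add(self, **properties):
        node = Node(**properties)
        self.nodes.append(node)
        return node

    def filter(self, query):
        """Return all nodes whose properties match every key in `query`."""
        return [node for node in self.nodes
                if all(node.get(key) == value for key, value in query.items())]


graph = Graph()
func = graph.add(node_type='functiondef', name='encode')
func.link('body', graph.add(node_type='assign'))
func.link('body', graph.add(node_type='for'))

# Step 1: query an indexed field. Step 2: traverse along the 'body' edges.
for node in graph.filter({'node_type': 'functiondef'}):
    child_types = [child['node_type'] for child in node.outV('body')]
```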
16. Examples
Show all symbol names, sorted by usage.
graph.filter({'node_type' : {$in : ['functiondef','...']}})
.groupby('name',as = 'cnt').orderby('-cnt')
index 79
...
foo 7
...
bar 5
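A comparable ranking can be approximated without a graph database, using the standard `ast` module and `collections.Counter`. This sketch counts only function and class definitions; the helper name `count_definitions` and the sample source are made up for illustration.

```python
import ast
from collections import Counter


def count_definitions(source):
    """Count how often each function or class name is defined."""
    counts = Counter()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            counts[node.name] += 1
    return counts


source = '''
def index(): pass

class Foo:
    def index(self): pass
    def bar(self): pass
'''

# list of (name, count) pairs, most frequent first
ranking = count_definitions(source).most_common()
```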
17. Examples (contd.)
Show all versions of a given function.
graph.get_by_path('flask.helpers.url_for')
def url_for(endpoint, **values):
    """Generates a URL to the given endpoint with the method provided.
    Variable arguments that are unknown to the target endpoint are appended
    to the generated URL as query arguments. If the value of a query argument
    is ``None``, the whole pair is skipped. In case blueprints are active
    you can shortcut references to the same blueprint by prefixing the
    local endpoint with a dot (``.``).
    This will reference the index function local to the current blueprint::
        url_for('.index')
fa7fca...
3cdaf...
19. Example: Code Complexity
Graph Algorithm for Calculating the
Cyclomatic Complexity (the Python variety)
def walk(node, anchor=None):
    if node['node_type'] == 'functiondef':
        anchor = node
        anchor['cc'] = 1  # there is always one path
    elif node['node_type'] in ('for', 'if', 'ifexp', 'while', ...):
        if anchor:
            anchor['cc'] += 1
    for subnode in node.outV:
        walk(subnode, anchor=anchor)

node = root
walk(node)
# aggregate by function path to visualize
The cyclomatic complexity is a quantitative measure of the number of linearly
independent paths through a program's source code. It was developed by
Thomas J. McCabe, Sr. in 1976.
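The same walk can be tried out directly on Python's built-in AST, without a graph store. This is a sketch under simplifying assumptions: `BRANCH_NODES` is a deliberately short list of branching node types, results are keyed by function name only (so functions with the same name would overwrite each other), and the name `cyclomatic_complexity` is chosen here for illustration.

```python
import ast

# node types that open an additional path (an assumed, incomplete list)
BRANCH_NODES = (ast.For, ast.While, ast.If, ast.IfExp)


def cyclomatic_complexity(source):
    """Return {function name: complexity} for all functions in `source`."""
    result = {}

    def walk(node, anchor=None):
        if isinstance(node, ast.FunctionDef):
            anchor = node.name
            result[anchor] = 1  # there is always one path
        elif isinstance(node, BRANCH_NODES) and anchor is not None:
            result[anchor] += 1
        for child in ast.iter_child_nodes(node):
            walk(child, anchor)

    walk(ast.parse(source))
    return result


source = '''
def encode(obj):
    e = {}
    for key, value in obj.items():
        if isinstance(value, dict):
            e[key] = encode(value)
        elif isinstance(value, complex):
            e[key] = {'type': 'complex'}
    return e
'''

complexity = cyclomatic_complexity(source)  # {'encode': 4}
```

The `for` loop, the `if` and the `elif` (which the parser represents as a nested `if`) each add one path to the initial one, giving 4.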
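28. Basic Problem: Tree Similarity
Checking whether two trees are exactly isomorphic can be done in polynomial time. The hard part is approximate matching: computing the edit distance between unordered trees is NP-hard.
[Figure: two nearly identical ASTs. First tree: functiondef {name: 'encode', args: [...]} with name {id: 'e'}. Second tree: functiondef {name: '_encode', args: [...]} with name {id: 'ee'}. Both show the node types functiondef, dict, assign, for, name, the edge labels body, targets, value, iterator, and the data {i: 0} and {i: 1}.]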
29. Similar Problem: Chemical Similarity
https://en.wikipedia.org/wiki/Epigallocatechin_gallate
Epigallocatechin gallate
Solution(s):
Jaccard Fingerprints
Bloom Filters
...
Benzene
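The fingerprint idea carries over to code: reduce each tree to a set of small structural features and compare the sets with the Jaccard index. The sketch below uses (parent type, child type) pairs as features, which is one simple choice among many and an assumption made here; the function names and sample snippets are also made up for illustration.

```python
import ast


def fingerprint(source):
    """Return the set of (parent type, child type) pairs in the AST.

    Identifier names are ignored, so renaming does not change the set.
    """
    pairs = set()
    for parent in ast.walk(ast.parse(source)):
        for child in ast.iter_child_nodes(parent):
            pairs.add((type(parent).__name__, type(child).__name__))
    return pairs


def jaccard(a, b):
    """Jaccard index of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


encode = "def encode(obj):\n    e = {}\n    return e\n"
renamed = "def _encode(obj):\n    ee = {}\n    return ee\n"
other = "for i in range(3):\n    print(i)\n"

same = jaccard(fingerprint(encode), fingerprint(renamed))
different = jaccard(fingerprint(encode), fingerprint(other))
```

Here `same` is 1.0 because the two functions differ only in identifier names, while `different` is lower because the loop has a different structure. A set of features like this could also be packed into a Bloom filter to make comparisons cheaper.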
30. Applications
Detect duplicated code
e.g. "Duplicate code detection using anti-unification", P. Bulychev et al.
(CloneDigger)
Generate semantic diffs
e.g. "Change Distilling: Tree Differencing for Fine-Grained Source Code
Change Extraction", B. Fluri et al.
Detect plagiarism / copyrighted code
e.g. "PDE4Java: Plagiarism Detection Engine For Java Source Code: A
Clustering Approach", A. Jadalla et al.
31. Example: Semantic Diff
@mock.patch('django.db.migrations.questioner.MigrationQuestioner.ask_not_null_alteration',
            return_value='Some Name')
def test_alter_field_to_not_null_oneoff_default(self, mocked_ask_method):
    """
    #23609 - Tests autodetection of nullable to non-nullable alterations.
    """
    class CustomQuestioner(...)
    # Make state
    before = self.make_project_state([self.author_name_null])
    after = self.make_project_state([self.author_name])
    autodetector = MigrationAutodetector(before, after, CustomQuestioner())
    changes = autodetector._detect_changes()
    self.assertEqual(mocked_ask_method.call_count, 1)
    # Right number/type of migrations?
    self.assertNumberMigrations(changes, 'testapp', 1)
    self.assertOperationTypes(changes, 'testapp', 0, ["AlterField"])
    self.assertOperationAttributes(changes, "testapp", 0, 0, name="name", preserve_default=False)
    self.assertOperationFieldAttributes(changes, "testapp", 0, 0, default="Some Name")
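The simplest building block of a semantic diff is a check that ignores layout and comments: two versions of a file are treated as unchanged if they parse to the same AST. The sketch below uses `ast.dump` for that comparison; the helper name `same_ast` and the sample lines are assumptions, and a full semantic diff would additionally report which subtrees changed.

```python
import ast


def same_ast(old_source, new_source):
    """True if both sources parse to the same AST (layout and comments ignored)."""
    return ast.dump(ast.parse(old_source)) == ast.dump(ast.parse(new_source))


old = "changes = autodetector._detect_changes()  # run detection\n"
reformatted = "changes = (\n    autodetector._detect_changes()\n)\n"
edited = "changes = autodetector.changes()\n"

only_layout_changed = same_ast(old, reformatted)  # True
code_changed = not same_ast(old, edited)          # True
```

A text-based diff would flag both variants as changes, while the comparison at the AST level separates a pure reformatting from an actual edit.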
32. Summary: Text vs. Graphs
Text
+ Easy to write
+ Easy to display
+ Universal format
+ Interoperable
- Not normalized
- Hard to analyze
Graphs
+ Easy to analyze
+ Normalized
+ Easy to transform
- Hard to generate
- Not (yet) interoperable
The Future(?): Use text for small-scale manipulation of code,
graphs for large-scale visualization, analysis and transformation.
Advantages of treating code as data, not text:
- Semantic diffs (what changed on the semantic level?)
- Analysis & refactoring (editing graphs, not pieces of text)
- Finding code duplicates
- Finding copyrighted code
- Visualizing code structure
- Storing code efficiently
- Separating the content from the text
- Show code as text
- Show code as intermediate representation (AST)
First, we transform the code into a so-called "abstract syntax tree". This is a representation that can be easily manipulated programmatically.
We store all syntax trees of the project in a graph database (either on-disk or in-memory) to be able to perform queries on the graph and store it for later analysis. Nodes in modules can be linked, e.g. to point from a function call in a given module to the definition of that function in another module.