Graph Gurus Episode 37: Modeling for Kaggle COVID-19 Dataset | TigerGraph
Full Webinar: https://info.tigergraph.com/graph-gurus-37
In this Graph Gurus Episode, we:
-Learn how to process text and extract entities (words and phrases) as well as classes linking the entities using SciSpacy, a Natural Language Processing (NLP) tool.
-Import the output of NLP and semantically link it in TigerGraph
-Run advanced analytics queries with TigerGraph to analyze the relationships and deliver insights
Full Webinar: https://info.tigergraph.com/graph-gurus-21
In this Graph Gurus episode, we:
Explain the architecture and technical implementation for a TigerGraph + Spark graph-enhanced Machine Learning pipeline
Use TigerGraph both before training to extract (graph and non-graph) features and after training to apply the model on streaming data
Use Spark to train and tune machine learning models at scale
Present a solution in production at China Mobile that detects and prevents phone-based scams using machine learning with TigerGraph
Demo the data flow between Spark and TigerGraph via TigerGraph’s JDBC driver
Graph Databases and Machine Learning | November 2018 | TigerGraph
Graph Database and Machine Learning: Finding a Happy Marriage. Graph databases and machine learning both represent powerful tools for getting more value from data; learn how they can form a harmonious marriage to up-level machine learning.
Graph Gurus Episode 35: No Code Graph Analytics to Get Insights from Petabyte... | TigerGraph
Full Webinar: https://info.tigergraph.com/graph-gurus-35
By attending this webinar you will:
-Learn how to use TigerGraph’s no-code capabilities;
-Understand how TigerGraph is built for scale and performance;
-Get a deep dive into TigerGraph 3.0 feature enhancements.
Graph Gurus Episode 25: Unleash the Business Value of Your Data Lake with Gra... | TigerGraph
Full Webinar: https://info.tigergraph.com/graph-gurus-25
A new weapon is available for businesses wanting to accomplish more with Hadoop: native parallel graphs can reveal the connections across multiple domains and datasets in data lakes and provide powerful insights to deliver superior outcomes. In this webinar we will explain how native parallel graphs can analyze the information in data lakes to enable the following outcomes:
Recommending next best actions such as promoting a student loan to someone heading off to college, advocating life insurance to a newly married couple, and so on
Improving network utilization by analyzing petabytes of data collected from millions of IoT devices across a smart grid
Accelerating M&A activity by intelligently merging data lakes from multiple businesses.
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal... | Databricks
As data grows in size and connectedness dramatically in all dimensions, the potential for graph-enriched machine learning grows likewise, but scalable technologies are needed to both build models and apply them in real-time. Real-time deep-link graph pattern matching and analytics provides new opportunities for enriching your machine learning models with graph features.
In addition to the real-time deep-link aspect, the ability to process large datasets in a production pipeline provides a synergistic approach for the two distributed and performant platforms: Spark and TigerGraph. The TigerGraph graph database provides scalable real-time deep-link graph analytics and augments Spark with graph analytics and predictions for a wide range of Machine Learning use cases.
In this session, we will explain the architecture and technical implementation for a TigerGraph+Spark graph-enhanced Machine Learning pipeline: Use TigerGraph both before training to extract (graph and non-graph) features and after training to apply the model on streaming data; use Spark to train and tune machine learning models at scale. As an example, we will present a solution in production at China Mobile that detects and prevents phone-based scams using machine learning with TigerGraph.
Specifically, the solution generates 118 graph features for 600 million users, to feed a machine learning system which detects three types of unwanted phone calls. TigerGraph then helps to deploy the model by extracting these 118 features in real-time for up to 10,000 calls per second, to give customers a real-time diagnosis of their incoming calls.
Graph Gurus Episode 26: Using Graph Algorithms for Advanced Analytics Part 1 | TigerGraph
Full Webinar: https://info.tigergraph.com/graph-gurus-26
Have you ever wondered how routing apps like Google Maps find the best route from one place to another? Finding that route is solved by the Shortest Path graph algorithm. Today, graph algorithms are moving from the classroom to a host of important and valuable operational and analytical applications. This webinar will give you an overview of graph algorithms, how to use them, and the categories of problems they can solve, and then take a closer look at path algorithms. This webinar is the first part in a five-part series, each part examining a different type of problem to be solved.
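The Shortest Path algorithm the webinar references can be sketched in a few lines of Python. The following is a minimal Dijkstra over an adjacency-dict graph; the road network and its weights are illustrative, not from the webinar:

```python
import heapq

def dijkstra(graph, source):
    """Return shortest-path distances from source to every reachable node.

    graph: dict mapping node -> list of (neighbor, edge_weight) pairs.
    """
    dist = {source: 0}
    heap = [(0, source)]                    # (distance-so-far, node)
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue                        # stale entry, already improved
        for neighbor, weight in graph.get(node, []):
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist

# Toy road network: edge weights are travel minutes (made-up values).
roads = {
    "A": [("B", 5), ("C", 2)],
    "C": [("B", 1), ("D", 7)],
    "B": [("D", 3)],
}
print(dijkstra(roads, "A"))  # {'A': 0, 'B': 3, 'C': 2, 'D': 6}
```

Note how the detour A→C→B (cost 3) beats the direct edge A→B (cost 5), which is exactly the kind of improvement a routing app surfaces.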
Graph Gurus Episode 17: Seven Key Data Science Capabilities Powered by a Nati... | TigerGraph
This webinar will demonstrate seven key data science capabilities using TigerGraph’s intuitive GUI, GraphStudio and GSQL queries. In this episode, we:
-Share the capabilities and tie them to specific use cases across the healthcare, pharmaceutical, financial services, telecom, internet, and government industries.
-Walk you through a sample dataset, GraphStudio UI flow, and GSQL queries demonstrating the capabilities.
-Cover client case studies for Amgen, Intuit, China Mobile, Santa Clara County, and other enterprise customers
In the big data world, it's not always easy for Python users to move huge amounts of data around. Apache Arrow defines a common format for data interchange, while Arrow Flight, introduced in version 0.11.0, provides a means to move that data efficiently between systems. Arrow Flight is a framework for Arrow-based messaging built with gRPC. It enables data microservices where clients can produce and consume streams of Arrow data to share it over the wire. In this session, I'll give a brief overview of Arrow Flight from a Python perspective, and show that it's easy to build high performance connections when systems can talk Arrow. I'll also cover some ongoing work in using Arrow Flight to connect PySpark with TensorFlow - two systems with great Python APIs but very different underlying internal data representations.
A brief introduction to Cerved data, the role of the data scientist at Cerved, and how a data scientist can take advantage of a graph database.
Bio:
Stefano Gatti: Born in 1970, he has been involved for more than 15 years in several big-data and technology-driven projects at leading business information companies such as Lince and Cerved. He is very fond of agile methodologies and tries to apply them at all organizational levels. In recent years he has been strongly engaged in spreading innovation at Cerved and in taking advantage of new big and smart data technologies, especially from a business usage perspective. Datatelling, open innovation, and partnerships with smart actors in the worldwide data-driven innovation ecosystem are his current mantras. Nunzio Pellegrino: Data Scientist at Cerved, part of the Innovation team, focused on extracting value from data and solving problems with the latest available technologies. I have a degree in Statistics with a background in Machine Learning, and I worked primarily on Data Integration and Business Intelligence projects for 3 years. At the moment, I am product owner of a web application based on a graph database and am involved in Italian Open Data projects. I am an R enthusiast, a Python practitioner, and fascinated by the graph ecosystem.
Graph Gurus Episode 27: Using Graph Algorithms for Advanced Analytics Part 2 | TigerGraph
Full Webinar: https://info.tigergraph.com/graph-gurus-27
What does finding the best location for a warehouse/office/retail store have in common with finding the most influential person in a referral network? Answer: they are both Centrality problems and can be solved with graph algorithms. Join us for Part 2 of our five-part webinar series on using graph algorithms for advanced analytics.
By attending this webinar you will:
- Hear about use cases for centrality graph algorithms
- Learn how to select the right algorithm for your use case
- Be able to run and tailor GSQL graph algorithms
Using Graph Algorithms for Advanced Analytics - Part 2 Centrality | TigerGraph
What does finding the best location for a warehouse/office/retail store have in common with finding the most influential person in a referral network? Answer: they are both Centrality problems and can be solved with graph algorithms.
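As a concrete illustration of the centrality idea (a plain-Python sketch, not TigerGraph's GSQL implementation), closeness centrality can be computed with breadth-first search; the tiny "referral network" below is made up:

```python
from collections import deque

def closeness(graph, node):
    """Closeness centrality: (n-1) / sum of BFS distances to all other nodes.

    graph: dict mapping node -> set of neighbors (undirected, unweighted).
    Assumes the graph is connected.
    """
    dist = {node: 0}
    queue = deque([node])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return (len(graph) - 1) / sum(dist.values())

# A hub-and-spoke referral network: 'hub' should score highest.
g = {
    "hub": {"a", "b", "c"},
    "a": {"hub"},
    "b": {"hub"},
    "c": {"hub", "d"},
    "d": {"c"},
}
scores = {n: closeness(g, n) for n in g}
print(max(scores, key=scores.get))  # hub
```

The node with the smallest total distance to everyone else wins, which is the same intuition behind placing a warehouse centrally.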
Applying graph analytics on data stored in relational databases can provide tremendous value in many application domains. We discuss the importance of leveraging these analyses, and the challenges in enabling them. We present a tool, called GraphGen, that allows users to visually explore, and rapidly analyze (using NetworkX) different graph structures present in their databases.
Predicting Influence and Communities Using Graph Algorithms | Databricks
Relationships are one of the most predictive indicators of behavior and preferences. Community detection based on relationships is a powerful tool for inferring similar preferences in peer groups, anticipating future behavior, estimating group resiliency, finding hierarchies, and preparing data for other analysis. Centrality measures based on relationships identify the most important items in a network and help us understand group dynamics such as influence, accessibility, the speed at which things spread, and bridges between groups. Data scientists use graph algorithms to identify groups and estimate important entities based on their interactions. In this session, we'll cover the common uses of community detection and centrality measures and how some of the iconic graph algorithms compute values. We'll show examples of how to run community detection and centrality algorithms in Apache Spark, including using the AggregateMessages function to add your own algorithms. You'll learn best practices and tips for tricky situations. For those who want to run graph algorithms in a graph platform, we'll also illustrate a few examples in Neo4j. Some of the community detection and centrality algorithms included:
- Triangle Count and Clustering Coefficient to estimate network cohesiveness
- Strongly Connected Components and Connected Components to find clusters
- Label Propagation to quickly infer groups and clean data with semi-supervised learning
- Louvain Modularity to uncover group hierarchies
- Balanced Triad to identify unstable groups
- PageRank to reveal influencers
- Betweenness Centrality to predict bottlenecks and bridges
Authors: Amy Hodler, Sören Reichardt
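Of the algorithms listed above, Label Propagation lends itself to a compact sketch. Below is a toy semi-supervised variant in plain Python: seed labels spread to unlabeled neighbors by majority vote. The two-triangle graph and the seed labels are invented, and this is not the Spark AggregateMessages implementation the talk demonstrates:

```python
from collections import Counter

def propagate(graph, seeds, max_iters=10):
    """Spread seed labels through the graph: each unlabeled node adopts the
    most common label among its labeled neighbors. Ties break by label sort
    order so the toy example stays deterministic."""
    labels = dict(seeds)
    for _ in range(max_iters):
        changed = False
        for node in sorted(graph):
            if node in seeds:
                continue                      # seed labels stay fixed
            counts = Counter(labels[v] for v in graph[node] if v in labels)
            if not counts:
                continue                      # no labeled neighbor yet
            top = max(counts.values())
            best = min(l for l, c in counts.items() if c == top)
            if labels.get(node) != best:
                labels[node] = best
                changed = True
        if not changed:
            break                             # converged
    return labels

# Two triangles joined by a bridge edge, with one seed in each triangle.
g = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "x"},
    "x": {"y", "z", "c"}, "y": {"x", "z"}, "z": {"x", "y"},
}
labels = propagate(g, seeds={"a": "red", "y": "blue"})
print(labels)
```

Each triangle ends up uniformly labeled from its own seed, illustrating why the talk pairs Label Propagation with semi-supervised data cleaning.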
In this session you will learn how H&M has created a reference architecture for deploying their machine learning models on Azure using Databricks, following DevOps principles. The architecture is currently used in production and has been iterated on multiple times to solve some of the discovered pain points. The presenting team is currently responsible for ensuring that best practices are implemented across all H&M use cases, covering hundreds of models across the entire H&M group. This architecture not only lets data scientists use notebooks for exploration and modeling, but also gives engineers a way to build robust, production-grade code for deployment. The session will also cover topics such as lifecycle management, traceability, automation, scalability, and version control.
AI on Spark for Malware Analysis and Anomalous Threat Detection | Databricks
At Avast, we believe everyone has the right to be safe. We are dedicated to creating a world that provides safety and privacy for all, no matter where you are, who you are, or how you connect. With over 1.5 billion attacks stopped and 30 million new executable files monthly, big data pipelines are crucial for the security of our customers. At Avast we are leveraging Apache Spark machine learning libraries and TensorFlowOnSpark for a variety of tasks ranging from marketing and advertisement, through network security, to malware detection. This talk will cover our main cybersecurity use cases for Spark. After describing our cluster environment, we will first demonstrate anomaly detection on time series of threats. With thousands of types of attacks and malware, AI helps human analysts select and focus on the most urgent or dire threats. We will walk through our setup for distributed training of deep neural networks with TensorFlow, through to deploying and monitoring a streaming anomaly detection application with the trained model. Next we will show how we use Spark for analysis and clustering of malicious files and for large-scale experimentation to automatically process and handle changes in malware. Finally, we will give a comparison with other tools we used for solving these problems.
Swift Parallel Scripting for High-Performance Workflow | Daniel S. Katz
The Swift scripting language was created to provide a simple, compact way to write parallel scripts that run many copies of ordinary programs concurrently in various workflow patterns, reducing the need for complex parallel programming or arcane scripting to achieve this common high-level task. The result was a highly portable programming model based on implicitly parallel functional dataflow. The same Swift script runs on multi-core computers, clusters, grids, clouds, and supercomputers, and is thus a useful tool for moving workflow computations from laptop to distributed and/or high performance systems.
Swift has proven to be very general and is in use in domains ranging from earth systems to bioinformatics to molecular modeling. It has more recently been adapted to serve as a programming model for much finer-grained in-memory workflows on extreme-scale systems, where it can sustain task rates in the millions to billions per second.
In this talk, we describe the state of Swift's implementation, present several Swift applications, and discuss ideas for the future evolution of the programming model on which it's based.
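The "implicitly parallel functional dataflow" idea is not tied to Swift's syntax. As a rough stdlib-Python analogy (a sketch of the pattern, not Swift itself; the two stage functions are invented stand-ins), futures let many copies of an ordinary function run concurrently while ordering comes only from data dependencies:

```python
from concurrent.futures import ThreadPoolExecutor

def simulate(param):
    """Stand-in for an ordinary program a workflow script runs many times."""
    return param * param

def summarize(results):
    """Stand-in for a downstream stage that consumes all prior outputs."""
    return sum(results)

with ThreadPoolExecutor() as pool:
    # The map expresses a parallel foreach over independent inputs; the
    # only ordering constraint is the hand-off from stage 1 to stage 2,
    # exactly the dataflow style the talk describes.
    squares = list(pool.map(simulate, range(10)))
    total = summarize(squares)

print(total)  # 285
```

The same script shape works whether the pool is threads on a laptop or, in Swift's case, tasks spread across a cluster or supercomputer.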
Cloud Technologies for Microsoft Computational Biology Tools | ijait
Executing a large number of independent tasks, or tasks with minimal inter-task communication, in parallel is a common requirement in many domains. In this paper, we present our experience applying two new Microsoft technologies, Dryad and Azure, to three bioinformatics applications. We also compare with traditional MPI and Apache Hadoop MapReduce implementations in one example. The applications are an EST (Expressed Sequence Tag) sequence assembly program, the PhyloD statistical package to identify HLA-associated viral evolution, and a pairwise Alu gene alignment application. We give a detailed performance discussion on a 768-core Windows HPC Server cluster and an Azure cloud. All the applications start with a "doubly data parallel step" involving independent data chosen from two similar (EST, Alu) or two different databases (PhyloD). There are different structures for the final stages of each application.
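The "doubly data parallel step" pattern, independent computations over every pair drawn from two datasets, can be sketched generically in Python. The similarity function below is a toy placeholder, not the paper's actual alignment code:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def score(pair):
    """Placeholder pairwise computation: count positions where two
    sequences agree (a stand-in for a real alignment score)."""
    a, b = pair
    return (a, b, sum(1 for x, y in zip(a, b) if x == y))

left = ["GATTACA", "GATTTCA"]
right = ["GATCACA", "CATTACA"]

# Every (left, right) pair is independent, so the entire cross product
# can be farmed out at once -- this is what makes the step "doubly"
# data parallel.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(score, product(left, right)))

for a, b, s in results:
    print(a, b, s)
```

With |left| x |right| independent work items, frameworks like Dryad, MapReduce, or MPI differ mainly in how they schedule and collect this grid, which is what the paper's comparison measures.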
Abstract: The processing power of computing devices has increased with the number of available cores. This paper presents an approach to clustering categorical data on a multi-core platform. The k-modes algorithm is used for clustering categorical data; it uses a simple matching dissimilarity measure for distance computation. The multi-core approach aims to achieve a speedup in processing. OpenMP (Open Multi-Processing) is used to parallelize the k-modes algorithm; OpenMP is a shared-memory API that manages threads using the fork-join model. The dataset used for the experiment is the Congressional Voting dataset from the UCI repository, which contains members' votes in categorical form, provided as CSV. The experiment is performed for an increasing number of clusters and increasing dataset sizes.
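A minimal serial version of the k-modes step described above (simple matching dissimilarity plus per-attribute mode update) can be sketched in plain Python; the OpenMP speedup in the paper comes from parallelizing the assignment loop, which this sketch keeps serial, and the toy votes below are invented, not the UCI data:

```python
from collections import Counter

def dissimilarity(a, b):
    """Simple matching dissimilarity: number of attributes that differ."""
    return sum(x != y for x, y in zip(a, b))

def k_modes(records, modes, iters=10):
    """Assign each record to its nearest mode, then recompute each mode as
    the per-attribute majority of its cluster. `modes` seeds the clusters."""
    for _ in range(iters):
        clusters = [[] for _ in modes]
        for rec in records:
            best = min(range(len(modes)),
                       key=lambda i: dissimilarity(rec, modes[i]))
            clusters[best].append(rec)
        new_modes = []
        for cluster, old in zip(clusters, modes):
            if not cluster:
                new_modes.append(old)         # keep an empty cluster's mode
                continue
            new_modes.append(tuple(
                Counter(col).most_common(1)[0][0] for col in zip(*cluster)))
        if new_modes == modes:
            break                             # converged
        modes = new_modes
    return modes, clusters

# Toy categorical votes ("y"/"n" on three bills), loosely in the spirit
# of the Congressional Voting dataset.
votes = [("y", "y", "n"), ("y", "y", "y"), ("y", "y", "n"),
         ("n", "n", "n"), ("n", "n", "y")]
modes, clusters = k_modes(votes, modes=[("y", "y", "y"), ("n", "n", "n")])
```

The inner `for rec in records` loop is embarrassingly parallel, which is exactly where the paper applies an OpenMP parallel-for.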
Comparison of Cost Estimation Methods using Hybrid Artificial Intelligence on... | IJERA Editor
Cost estimating at the schematic design stage, as the basis of project evaluation, engineering design, and cost management, plays an important role in project decisions under a limited definition of scope, constraints on available information and time, and the presence of uncertainties. The purpose of this study is to compare the performance of cost estimation models built with two different hybrid artificial intelligence approaches: regression analysis with an adaptive neuro-fuzzy inference system (RANFIS) and case-based reasoning with a genetic algorithm (CBR-GA). The models were developed from the same 50 low-cost apartment project datasets in Indonesia. Tested on another five testing datasets, the models were proven to perform very well in terms of accuracy. The CBR-GA model was found to be the best performer but suffered from the disadvantage of needing 15 cost drivers, compared to only 4 cost drivers required by RANFIS for on-par performance.
This gives a characterization of the machine learning computations and brings out the deficiencies of Hadoop 1.0. It gives the motivation for Hadoop YARN and a brief view of YARN architecture. It illustrates the power of specialized processing frameworks over YARN, such as Spark and GraphLab. In short, Hadoop YARN allows your data to be stored in HDFS and specialized processing frameworks may be used to process the data in various ways.
Addresses streaming data challenges in sampling rates, cache maintenance, deductive reasoning, and the surrounding Semantic Web framework. Using a fixed-size cache, the challenge is to identify and preserve assertions within a stream. Deductive reasoning will continuously be performed over the cache to draw relevant conclusions as quickly as possible. The use of a cache differentiates our work from state-of-the-art works in deductive stream reasoning in that the cache enables us to temporarily store propositions that are no longer in the stream window.
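The fixed-size cache described above can be sketched with a least-recently-used eviction policy (a simplification; the work's actual heuristic for deciding which assertions to preserve may differ):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Fixed-size cache of streamed assertions with LRU eviction.
public class AssertionCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public AssertionCache(int capacity) {
        // accessOrder = true makes iteration order least-recently-used first
        super(16, 0.75f, true);
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Evict the least-recently-used entry once the fixed size is exceeded
        return size() > capacity;
    }
}
```

Deductive reasoning would then run continuously over the cache's current entries rather than the raw stream window, so recently used assertions outlive their window.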
Synergy of Human and Artificial Intelligence in Software EngineeringTao Xie
Keynote Talk by Tao Xie at International NSF sponsored Workshop on Realizing Artificial Intelligence Synergies in Software Engineering (RAISE 2013) http://promisedata.org/raise/2013/
A Modified Technique For Performing Data Encryption & Data DecryptionIJERA Editor
In this age of universal electronic connectivity, of viruses and hackers, of electronic eavesdropping and electronic fraud, there is indeed a need to store information securely. This, in turn, has led to a heightened awareness of the need to protect data and resources from disclosure, to guarantee the authenticity of data and messages, and to protect systems from network-based attacks. Information security via encryption-decryption techniques has been a very popular research area for many years. This paper elaborates the basic concepts of cryptography, especially public-key and private-key cryptography. It also contains a review of some popular encryption-decryption algorithms, and a modified method is proposed that is fast in comparison to existing methods.
The Future is Big Graphs: A Community View on Graph Processing SystemsNeo4j
Alexandru Iosup, Full Professor, Vrije Universiteit Amsterdam (VU Amsterdam)
Angela Bonifati, Full Professor of Computer Science, Université de Lyon
Hannes Voigt, Software Engineer, Neo4j
This deck was presented at the Spark meetup at Bangalore. The key idea behind the presentation was to focus on limitations of Hadoop MapReduce and introduce both Hadoop YARN and Spark in this context. An overview of the other aspects of the Berkeley Data Analytics Stack was also provided.
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables ranks to be calculated in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It comes, however, with the precondition that the input graph contain no dead ends. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph whose vertices were split by component. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
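The levelwise scheme can be sketched as follows: ranks are iterated one topological level at a time, with contributions from already-finalized earlier levels held constant. This is an illustrative simplification (helper names are hypothetical; the SCC decomposition and the report's dead-end handling strategy are omitted):

```java
import java.util.Arrays;
import java.util.List;

public class LevelwisePageRank {
    // out[u] lists u's out-neighbours; levels holds vertex ids grouped into
    // topologically ordered blocks (assumed precomputed from the SCC
    // decomposition). Damping d = 0.85; dead-end handling is omitted.
    public static double[] compute(int[][] out, List<int[]> levels, int iters) {
        int n = out.length;
        double d = 0.85;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);
        for (int[] level : levels) {            // process one level at a time
            for (int it = 0; it < iters; it++) {
                double[] next = rank.clone();
                for (int v : level) next[v] = (1 - d) / n;
                for (int u = 0; u < n; u++)
                    for (int v : out[u])
                        if (inLevel(level, v))  // only current-level ranks change
                            next[v] += d * rank[u] / out[u].length;
                for (int v : level) rank[v] = next[v];
            }
        }
        return rank;
    }

    private static boolean inLevel(int[] level, int v) {
        for (int x : level) if (x == v) return true;
        return false;
    }
}
```

Because each level only reads the (already converged) ranks of earlier levels, levels can in principle be iterated without per-iteration communication, which is the property the abstract highlights.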
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...pchutichetpong
M Capital Group (“MCG”) expects demand to grow and supply to evolve, facilitated by institutional investment rotating out of offices and into work-from-home (“WFH”) arrangements, while the need for data storage keeps expanding as global internet usage grows, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as advancing cloud services and edge sites, allowing the industry to expect strong annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, as illustrated by the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments; MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment, will drive market momentum forward. The continuous injection of capital by alternative investment firms, as well as growing infrastructure investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x by value in 2026, will likely help propel data-center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
As Europe's leading economic powerhouse and the fourth-largest #economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like #Russia and #China, #Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in #cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to #AdvancedPersistentThreats (#APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Plume - A Code Property Graph Extraction and Analysis Library
1. Plume
A Code Property Graph Extraction and
Analysis Library
S.D. Baker Effendi, A.B. van der Merwe, & W. Visser
Stellenbosch University
Using Code Property Graphs and Pushdown
Systems for Static Analysis
2. | GRAPHAIWORLD.COM | #GRAPHAIWORLD |
❏ Introduction to Plume
❏ Background
❏ Code Property Graph
❏ Data-Flow Analysis
❏ Pushdown Systems
❏ How Plume works
❏ The future of Plume
Overview
3.
❏ Plume is an open-source, static
analysis library
❏ A code property graph is extracted
from JVM bytecode
❏ This code property graph is stored in a
graph database backend
❏ Data-flow analysis is run on the graph
database by using graph queries
❏ Written using Kotlin which is
interoperable with Java
Introduction
5.
The Code Property Graph
F Yamaguchi, et al. introduced the code property graph (CPG) that merges the
❏ abstract syntax tree (AST),
❏ control flow graph (CFG), and
❏ program dependence graph (PDG)
into a joint data structure.
Illustration of a code property graph from the original paper “Modeling and Discovering Vulnerabilities with Code Property Graphs”
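The merged structure can be illustrated with a toy model in which a single vertex set carries three kinds of labeled edges, one per merged subgraph (a deliberate simplification of the real CPG schema; class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

public class Cpg {
    public enum EdgeKind { AST, CFG, PDG }     // the three merged views

    public record Edge(int from, int to, EdgeKind kind) {}

    private final List<String> vertices = new ArrayList<>(); // vertex labels
    private final List<Edge> edges = new ArrayList<>();

    public int addVertex(String label) {
        vertices.add(label);
        return vertices.size() - 1;            // vertex id
    }

    public void addEdge(int from, int to, EdgeKind kind) {
        edges.add(new Edge(from, to, kind));
    }

    // Successors of v restricted to one of the merged subgraphs
    public List<Integer> successors(int v, EdgeKind kind) {
        List<Integer> out = new ArrayList<>();
        for (Edge e : edges)
            if (e.from() == v && e.kind() == kind) out.add(e.to());
        return out;
    }
}
```

The point of the joint structure is exactly this: one query can mix syntax (AST), control flow (CFG), and dependence (PDG) edges over the same vertices.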
6.
The Code Property Graph
❏ The CPG is independent of the
programming language
❏ Software vulnerabilities can be
identified from the CPG
❏ Graph patterns of known
vulnerabilities are then matched
❏ ShiftLeft have commercialized the
CPG for DevSecOps
Illustration of a CPG projection from ShiftLeft.io
Yamaguchi, Fabian, et al. "Modeling and discovering vulnerabilities with code property graphs." 2014 IEEE Symposium on Security and Privacy. IEEE, 2014.
7.
Data-Flow Analysis
❏ Data-flow analysis is a technique for
gathering information about the
possible set of values calculated at
various points in a program
❏ The control flow graph is used to
determine where a particular value
might propagate
Sagiv, Mooly, Thomas Reps, and Susan Horwitz. "Precise interprocedural dataflow analysis with applications to constant propagation." Theoretical Computer Science 167.1-2 (1996): 131-170.
The supergraph is annotated with the dataflow functions for the “possibly-uninitialized variables” problem.
8.
Data-Flow Analysis
❏ A procedure is a small section of a program
that performs a specific task
❏ Intraprocedural analysis looks at analyzing a
single procedure
❏ Interprocedural analysis uses calling
relationships among multiple procedures
❏ Example analyses are:
❏ reaching definitions
❏ liveness analysis
❏ constant propagation
Reps, Thomas, Susan Horwitz, and Mooly Sagiv. "Precise interprocedural dataflow analysis via graph reachability." Proceedings of the 22nd ACM SIGPLAN-SIGACT symposium on Principles of
programming languages. 1995.
The exploded super-graph that corresponds to the instance of the
possibly-uninitialized variables problem shown in the last figure.
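As an illustration of one such analysis, here is a minimal iterative reaching-definitions solver over a toy CFG. It is a sketch only (at most one definition per node, names hypothetical), not the IFDS/graph-reachability formulation of the cited papers:

```java
import java.util.*;

public class ReachingDefinitions {
    // def[n] is the variable defined at node n ("" if none); succ[n] lists
    // n's CFG successors. Returns, for each node, the set of definition
    // sites (node ids) reaching that node's entry.
    public static List<Set<Integer>> solve(String[] def, int[][] succ) {
        int n = def.length;
        List<Set<Integer>> in = new ArrayList<>();
        List<Set<Integer>> out = new ArrayList<>();
        for (int i = 0; i < n; i++) { in.add(new HashSet<>()); out.add(new HashSet<>()); }
        boolean changed = true;
        while (changed) {                       // iterate to a fixed point
            changed = false;
            for (int node = 0; node < n; node++) {
                Set<Integer> newOut = new HashSet<>(in.get(node));
                if (!def[node].isEmpty()) {
                    int nd = node;
                    // kill earlier definitions of the same variable, then gen
                    newOut.removeIf(d -> def[d].equals(def[nd]));
                    newOut.add(node);
                }
                if (!newOut.equals(out.get(node))) { out.set(node, newOut); changed = true; }
                for (int s : succ[node])
                    if (in.get(s).addAll(out.get(node))) changed = true;
            }
        }
        return in;
    }
}
```

For the two-statement chain x=1; x=2; use x, only the second definition reaches the use, since the first is killed.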
9.
Data-Flow Analysis
❏ Reps, Horwitz and Sagiv introduced
frameworks for a general way of
solving these problems in polynomial
time
❏ E Bodden created a generic IFDS/IDE
solver on top of Soot
❏ This enabled a wider range of
analyses such as typestate
and information-flow
Bodden, Eric. "Inter-procedural data-flow analysis with IFDS/IDE and Soot." Proceedings of the ACM SIGPLAN International Workshop on State of the Art in Java Program analysis. 2012.
Exploded super-graph for an IFDS information-flow analysis.
10.
Soot
❏ Soot is a Java optimization framework
originally developed by the Sable Research
Group of McGill University
❏ Soot provides a range of analyses such as:
❏ call-graph construction
❏ points-to analysis
❏ data-flow analysis with IFDS/IDE
❏ Soot transforms programs into an intermediate
representation (IR) which is then analyzed
Soot - A framework for analyzing and transforming Java and Android applications https://soot-oss.github.io/soot
11.
IDE in Typestate Analysis
❏ Typestates define valid sequences of operations that can be performed upon
an instance of a given type
❏ Aliasing refers to the situation where the same memory location can be
accessed using different names
❏ Späth et al. presented an alias-aware extension of the IDE framework,
IDEal, which improved the efficiency and precision of typestate analysis
File a = new File();
File b = a;
b.open();
a.close();
Späth, J., Ali, K., & Bodden, E. (2017). IDEal: efficient and precise alias-aware dataflow analysis. Proc. ACM Program. Lang., 1(OOPSLA), 99-1.
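The aliasing problem in the snippet above can be made concrete with a toy alias-aware typestate tracker (illustrative only, not IDEal's actual algorithm): because a and b refer to the same abstract object, b.open() followed by a.close() is a valid open/close sequence, which a per-variable-name analysis would miss.

```java
import java.util.HashMap;
import java.util.Map;

public class TypestateCheck {
    public enum State { CLOSED, OPEN, ERROR }

    // env maps variable names to abstract objects; states maps each
    // abstract object to its current typestate.
    private final Map<String, Integer> env = new HashMap<>();
    private final Map<Integer, State> states = new HashMap<>();
    private int nextObj = 0;

    public void newFile(String var) {           // var = new File()
        env.put(var, nextObj);
        states.put(nextObj, State.CLOSED);
        nextObj++;
    }

    public void alias(String to, String from) { // to = from
        env.put(to, env.get(from));             // alias-aware: share the object
    }

    public void open(String var)  { step(var, State.CLOSED, State.OPEN); }
    public void close(String var) { step(var, State.OPEN, State.CLOSED); }

    private void step(String var, State expect, State next) {
        int obj = env.get(var);                 // resolve through aliases
        states.put(obj, states.get(obj) == expect ? next : State.ERROR);
    }

    public State stateOf(String var) { return states.get(env.get(var)); }
}
```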
12.
Data-Flow Analysis Limitations
Rice’s theorem
Any non-trivial, semantic property of a program is
undecidable.
A semantic property concerns a program’s behaviour
e.g. does a program terminate for all inputs?
To ensure an analysis terminates, we need to
bound the data-flow domain, which ultimately leads
to imprecision. One technique is to limit
field-access paths to length k.
If we have an algorithm that decides a non-trivial property, we can
construct a Turing machine that decides the halting problem.
By Booyabazooka - Own work, Public Domain,
https://commons.wikimedia.org/w/index.php?curid=5407483
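k-limiting itself is a one-line operation: any field-access path longer than k is truncated, so paths sharing a k-prefix become indistinguishable (class name hypothetical):

```java
import java.util.List;

public class KLimiting {
    // Truncate a field-access path such as [f, g, h] (i.e. x.f.g.h) to at
    // most k fields; the dropped suffix is the source of imprecision.
    public static List<String> limit(List<String> accessPath, int k) {
        return accessPath.size() <= k ? accessPath : accessPath.subList(0, k);
    }
}
```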
13.
Pushdown Systems
❏ A pushdown automaton (PDA) is a
finite-state automaton with extra memory
called a stack
❏ Each state is called a control location
❏ This class of automata recognizes
context-free languages (CFLs)
❏ A CFL is generated by a context-free
grammar (CFG)
A diagram of a pushdown automaton.
By Jochgem - Own work, CC BY-SA 3.0,
https://commons.wikimedia.org/w/index.php?curid=4983792
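A minimal example of the extra stack memory at work: a one-state PDA recognising the context-free language of balanced brackets, which no finite-state automaton can accept (names illustrative):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class BracketPda {
    // Push on '[', pop on ']', accept when the stack is empty at the end.
    public static boolean accepts(String input) {
        Deque<Character> stack = new ArrayDeque<>(); // the PDA's extra memory
        for (char c : input.toCharArray()) {
            if (c == '[') {
                stack.push(c);
            } else if (c == ']') {
                if (stack.isEmpty()) return false;   // no matching '['
                stack.pop();
            } else {
                return false;                        // not in the input alphabet
            }
        }
        return stack.isEmpty();
    }
}
```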
14.
Pushdown Systems
❏ Context- and field-sensitivity can be
expressed as CFL reachability problems
❏ Späth et al. introduced the notion of
synchronized pushdown systems (SPDS)
to efficiently solve any single
CFL-reachability problem
❏ An SPDS is a combination of two
flow-sensitive pushdown systems: a call-PDS
and a field-PDS
Späth, Johannes, Karim Ali, and Eric Bodden. "Context-, flow-, and field-sensitive data-flow analysis using synchronized pushdown systems." Proceedings of the ACM on
Programming Languages 3.POPL (2019): 1-29.
A points-to analysis can be formulated as a reachability problem under a Dyck language, i.e. a language of balanced parentheses.
Yuan, Hao, and Patrick Eugster. "An efficient algorithm for solving the
dyck-cfl reachability problem on trees." European Symposium on
Programming. Springer, Berlin, Heidelberg, 2009.
15.
Pushdown System of Calls
Späth, Johannes, Karim Ali, and Eric Bodden. "Context-, flow-, and field-sensitive data-flow analysis using synchronized pushdown systems." Proceedings of the ACM on
Programming Languages 3.POPL (2019): 1-29.
Data-flow example for a simple recursive program.
Automaton computed with the post* algorithm.
The structure of a call-PDS:
❏ Control locations are program
variables
❏ The stack alphabet is the set of
program statements
❏ The rule set models the data-flow
effect of a variable at a statement
This automaton provides
context-sensitivity.
16.
Pushdown System of Fields
The structure of a field-PDS:
❏ Control locations are pairs of a variable
and a statement
❏ The stack alphabet is the set of all fields
of a program
❏ The rule set models the data-flow
within the access paths
This automaton provides field-sensitivity.
Späth, Johannes, Karim Ali, and Eric Bodden. "Context-, flow-, and field-sensitive data-flow analysis using synchronized pushdown systems." Proceedings of the ACM on
Programming Languages 3.POPL (2019): 1-29.
Data-flow example for a simple if-else statement with field accesses.
Automaton computed with the post* algorithm.
17.
Synchronized Pushdown Systems
Property | Pushdown System of Calls | Pushdown System of Fields | SPDS
Flow-sensitive | ✔ | ✔ | ✔
Context-sensitive | ✔ | ✘ | ✔
Field-sensitive | ✘ | ✔ | ✔
Späth, Johannes, Karim Ali, and Eric Bodden. "Context-, flow-, and field-sensitive data-flow analysis using synchronized pushdown systems." Proceedings of the ACM on
Programming Languages 3.POPL (2019): 1-29.
Both pushdown systems can answer reachability queries and handle recursive
structures.
Each PDS has a precision advantage over the other so by combining them we
get the precision benefits of both.
18.
SPDS Advantages
Späth, Johannes, Karim Ali, and Eric Bodden. "Context-, flow-, and field-sensitive data-flow analysis using synchronized pushdown systems." Proceedings of the ACM on
Programming Languages 3.POPL (2019): 1-29.
❏ The PDA of fields is a concise and finite
representation of (potentially infinitely many)
access paths
❏ No need to resort to k-limiting - preserves
precision!
❏ In pointer-analysis, SPDS avoids exponential
growth of the abstract domain by using
PDS-based encoding
❏ Typestate information can be encoded as
weights to any of the PDAs
A PDA of fields and its finite representation of an infinite set of
access paths.
19.
SPDS Limitations
Späth, Johannes, Karim Ali, and Eric Bodden. "Context-, flow-, and field-sensitive data-flow analysis using synchronized pushdown systems." Proceedings of the ACM on
Programming Languages 3.POPL (2019): 1-29.
SPDS over-approximates in corner cases where a
context-insensitive data-flow path occurs at the
same time as a field-sensitive path or vice versa.
These typically arise only in synthetic examples
and, based on Späth et al.'s empirical evaluation,
such situations do not arise in practice.
Thus, an improperly matched call site does not
induce a properly matched field access.
21.
Features of Plume
Code Property Graph
+
Synchronized Pushdown Systems
+
Graph Database
=
❏ Language independent analysis on the CPG
❏ Provides flow-, context-, field- sensitive and
alias-aware dataflow analysis
❏ Provides the ability to perform static analysis
incrementally and store results in the graph
database
❏ Partial updates to the CPG when
source-code is updated
❏ Scales for large programs by leveraging a
graph database backend
22.
How does Plume work?
Plume is a Kotlin library divided into 3 parts
❏ Driver: connects to the database of choice
❏ Extractor: creates a CPG from bytecode
❏ Analyser: performs data-flow analysis on the CPG
The three parts represent the separation of concerns between the different
stages and requirements of the CPG-driven analysis pipeline.
Connect to Graph Database Extract Code Property Graph
Graph Icons from graph theory tree by Ecem Afacan from the Noun Project
Analyze Code Property Graph
.java
.py
.js
23.
How does Plume communicate?
Plume’s driver aims to be graph
database agnostic, so that all supported
graph databases can eventually be
benchmarked against each other in the
application of data-flow analysis.
The driver provides a generic interface
through which the extractor and
analyzer interact.
More graph databases will be
supported in the future.
<<interface>>
IDriver
+ exists(PlumeVertex): boolean
+ addVertex(PlumeVertex)
+ addEdge(PlumeVertex, PlumeVertex, EdgeType)
...
TinkerGraph JanusGraph TigerGraph Amazon Neptune
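The driver abstraction sketched on the slide might look as follows in Java (a simplification: Plume's real PlumeVertex and EdgeType carry more structure, and the concrete drivers talk to actual database backends such as TinkerGraph or JanusGraph):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified stand-ins for Plume's vertex and edge types (assumed shapes)
record PlumeVertex(String label) {}
enum EdgeType { AST, CFG, REF }
record PlumeEdge(PlumeVertex from, PlumeVertex to, EdgeType type) {}

// Generic driver interface, as sketched on the slide
interface IDriver {
    boolean exists(PlumeVertex v);
    void addVertex(PlumeVertex v);
    void addEdge(PlumeVertex from, PlumeVertex to, EdgeType type);
}

// In-memory driver standing in for a real database backend
class InMemoryDriver implements IDriver {
    private final Set<PlumeVertex> vertices = new HashSet<>();
    public final List<PlumeEdge> edges = new ArrayList<>();

    public boolean exists(PlumeVertex v) { return vertices.contains(v); }

    public void addVertex(PlumeVertex v) { vertices.add(v); }

    public void addEdge(PlumeVertex from, PlumeVertex to, EdgeType type) {
        addVertex(from);                 // ensure both endpoints exist
        addVertex(to);
        edges.add(new PlumeEdge(from, to, type));
    }
}
```

Because the extractor and analyzer see only IDriver, swapping backends is a matter of providing another implementation of the same interface.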
24.
Plume’s Extraction Process
❏ Soot is used to convert JVM bytecode to an IR called Jimple
❏ Jimple is based on three-address code and only uses 15 different
operations
❏ Jimple is then converted into Soot’s UnitGraph and CallGraph objects
❏ The extractor converts these two objects into a code property graph
❏ Plume supports compiling Python 2.7 and JavaScript 1.7 into JVM bytecode
using Jython and Mozilla Rhino respectively
Convert source code to class files
.java
.py
.js
.class .jimple
Extract Jimple and graphs using Soot
Graph Icons from graph theory tree by Ecem Afacan from the Noun Project
Store CPG in database
25.
Example
package intraprocedural.basic;

public class Basic1 {
    public static void main(String[] args) {
        int a = 3;
        int b = 2;
        int c = a + b;
    }
}
26.
Example
package intraprocedural.conditional;

public class Conditional1 {
    public static void main(String[] args) {
        int a = 1;
        int b = 2;
        if (a > b) {
            a -= b;
            b -= b;
        } else {
            b += a;
        }
    }
}
27.
What Plume can do
❏ Generate an intraprocedural
code property graph
❏ Connect to TinkerGraph,
JanusGraph, TigerGraph, and
Amazon Neptune
❏ Compile Java, Python 2.7 and
JavaScript 1.7 code
Plans for Plume
❏ Add interprocedural edges
❏ Include Neo4j
❏ Perform interprocedural
data-flow analysis algorithms
❏ Investigate soundness of
analysis for dynamic vs static
languages
❏ Investigate the use of GCNNs
for vulnerability detection
Plume Roadmap