2019 GDRR: Blockchain Data Analytics - ChainNet: Learning on Blockchain Graphs with Topological Features - Cunyet Gurcan Akcora, October 6, 2019

ChainNet: Learning on Blockchain Graphs with
Topological Features
Cüneyt Gürcan Akçora
Joint work with N. C. Abay, U. Islambekov, Y. Tian, B. Thuraisingham,
Yulia R. Gel, Murat Kantarcioglu
Depts. of Statistics and Computer Science
University of Texas at Dallas
October 6, 2019. Durham, North Carolina.
SAMSI Workshop on Foundations of Blockchain Data Analytics

2
Outline
• Blockchain transactions and the network
• A brief introduction to chainlets
• Graph features
• Occurrence and amount matrices
• Graph filtration
• Topological features
• Persistent homology
• Betti numbers on blockchains
• Betti derivatives
• Experiments

3
Starting point: Can the blockchain activity of
addresses in transactions and blocks be useful to
detect and predict phenomena on blockchains?
First question: how do we model on-chain data?

4
Transaction output (TXO) based blockchains
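[Figure: Transaction 1 spends a 3B input held by address a and creates two outputs, 0.8B and 2B; address b receives the 2B output, and the remaining 0.2B is the transaction fee.]
Next, if address b wants to spend its received 2B, it needs to show proof of funds:
“Use the 2B I received from Block 1, transaction 1 to pay 1.5B to c and 0.3B to d”.
[Figure: a second transaction spends b's 2B output and creates outputs of 1.5B to c and 0.3B to d; the transactions are laid out along a time axis.]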
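The bookkeeping on this slide can be sketched in a few lines of Python. This is only an illustration with the slide's numbers: the class and function names are hypothetical, the 0.8B output is assumed to return to address a, and real Bitcoin validation (signatures, scripts) is left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TxOutput:
    address: str
    amount: float  # in bitcoins


def fee(input_amounts, outputs):
    """Fee = total inputs minus total outputs (what is not re-assigned)."""
    total_in = sum(input_amounts)
    total_out = sum(o.amount for o in outputs)
    if total_out > total_in:
        raise ValueError("outputs exceed inputs")
    return round(total_in - total_out, 8)  # round away float noise


# Transaction 1: a spends a 3B input, creating outputs of 0.8B and 2B.
# Assumption: the 0.8B output goes back to a.
tx1_outputs = [TxOutput("a", 0.8), TxOutput("b", 2.0)]
tx1_fee = fee([3.0], tx1_outputs)

# Transaction 2: b points at the 2B output of transaction 1 as proof of funds.
funds_for_b = [o.amount for o in tx1_outputs if o.address == "b"]
tx2_outputs = [TxOutput("c", 1.5), TxOutput("d", 0.3)]
tx2_fee = fee(funds_for_b, tx2_outputs)
```

In both transactions the difference between inputs and outputs is 0.2B, which is what the slide labels as the transaction fee for Transaction 1.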

5
Blockchain Graph – Substructure mining
Definition [K-Chainlets]:
Let k-chainlet Gk = (Vk, Ek, B) be a substructure of G with k nodes of type {Transaction}.
If there exists an isomorphism between Gk and G’, G’ ∈ G, we say that there exists an
occurrence, or embedding of Gk in G.
If a Gk occurs more/less frequently than expected by chance, it is called
a Blockchain k-chainlet. A k-chainlet signature fG(Gk) is the number of occurrences of
Gk in G.
• Rather than individual edges or nodes, we use a
substructure as the building block in our Bitcoin
analysis.
• We use the term chainlet to refer to such
substructures.

6
Blockchain Chainlets
• Chainlets have distinct shapes that reflect
their role in the network.
• We aggregate these roles to analyze network
dynamics.
[Figure: example chainlets formed by transactions Tx 1 to Tx 4.]
Three distinct types of 1-chainlets!

7
Graph features – occurrence and amount
• Occurrence: how many transactions were created with a given type of chainlet?
• Amount: how many bitcoins were transferred with a given type of chainlet?
Coins are yet unspent.
Figure: Percentages of Bitcoin chainlets (by occurrence).

8
• Let’s take one-to-two chainlets, i.e., c1→2
Occurrence: 2 chainlets occur: t1 and t3.
Amount: 7.8Ƀ+6.6Ƀ=14.4 bitcoins are transferred by them.
• We can record one occurrence matrix and one amount
matrix per 24-hour window.

9
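Occurrence matrix:
\[
O = \begin{pmatrix} 2 & 2 & 0 \\ 1 & 0 & 1 \\ 0 & 0 & 0 \end{pmatrix}
\]
Amount matrix:
\[
A = \begin{pmatrix} 2.3 + 0.4 & 7.8 + 6.6 & 0 \\ 1.3 & 0 & 1.1 \\ 0 & 0 & 0 \end{pmatrix}
\]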
• Rows and columns indicate number of inputs and outputs, respectively.
• Information about one-to-two chainlets, i.e., c1→2, is given in the matrix cell [1,2].
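As a rough sketch, the two matrices can be built from a list of (inputs, outputs, amount) triples. The function name is hypothetical, and clipping transactions with more than n inputs or outputs into the last row or column is an assumption made here to keep the matrix size fixed. The toy transactions reproduce the numbers in the matrices above.

```python
import numpy as np


def chainlet_matrices(transactions, n=3):
    """Build n x n occurrence and amount matrices.

    transactions: iterable of (num_inputs, num_outputs, amount).
    Cell [i, j] on the slide is index [i - 1, j - 1] here.
    Assumption: counts above n are clipped into the last row/column.
    """
    occ = np.zeros((n, n), dtype=int)
    amt = np.zeros((n, n))
    for n_in, n_out, amount in transactions:
        i = min(n_in, n) - 1
        j = min(n_out, n) - 1
        occ[i, j] += 1
        amt[i, j] += amount
    return occ, amt


txs = [
    (1, 1, 2.3), (1, 1, 0.4),
    (1, 2, 7.8), (1, 2, 6.6),  # t1 and t3, the two one-to-two chainlets
    (2, 1, 1.3),
    (2, 3, 1.1),
]
occ, amt = chainlet_matrices(txs)
```

Here `occ[0, 1]` holds the two occurrences of c1→2 and `amt[0, 1]` the 14.4 bitcoins they transfer.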

10
Graph filtration – combining amount and occurrence
• Given the amount and occurrence information, a natural combination of them
entails
1. filtering the occurrence matrix with user defined thresholds on amounts,
2. filtering the amount matrix with user defined thresholds on occurrences.
• In both cases, the user defined threshold implies a heuristic aspect.
• We chose the first solution. At a given time period t,
1. chainlets of the time period are iterated over with a set of amount thresholds,
2. a chainlet c1→2’s occurrence is recorded in the associated occurrence matrix Oɛ
if the amount transferred by the chainlet amount(c1→2) ≥ ɛ
• The resulting set of matrices is used as the feature set of the 24-hour window.
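The filtering step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the threshold is applied to each transaction's amount, and the threshold values and function name are hypothetical.

```python
import numpy as np


def filtered_occurrence(transactions, thresholds, n=3):
    """Return {eps: O_eps}, counting only transactions with amount >= eps.

    transactions: iterable of (num_inputs, num_outputs, amount).
    Assumption: the threshold is checked per transaction.
    """
    result = {}
    for eps in thresholds:
        occ = np.zeros((n, n), dtype=int)
        for n_in, n_out, amount in transactions:
            if amount >= eps:
                occ[min(n_in, n) - 1, min(n_out, n) - 1] += 1
        result[eps] = occ
    return result


txs = [(1, 1, 2.3), (1, 1, 0.4), (1, 2, 7.8), (1, 2, 6.6), (2, 1, 1.3), (2, 3, 1.1)]
thresholds = [0, 1, 5]  # hypothetical amount thresholds, in bitcoins
mats = filtered_occurrence(txs, thresholds)

# One flat feature vector for the time window: all filtered matrices, concatenated.
features = np.concatenate([mats[eps].ravel() for eps in thresholds])
```

With the threshold at 5 bitcoins only the two one-to-two chainlets survive, so larger thresholds progressively empty the matrix.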

11
Topological features
• Topological Data Analysis (TDA) provides methods to systematically study the
topological and geometric structure underlying data.
• These structures are commonly analyzed via the multi-scale-based framework of
persistent homology.
• The primary idea is to assess which topological features remain persistent over a
larger set of scales and hence are likely to play a significant role in its functionality.

12
Topological features
• Let X = {X1,…, Xn} be a set of data points in a metric space (e.g., the Euclidean space).
• Select a scale ɛk and form a graph Gk with the associated adjacency matrix
\[
A = \mathbb{1}_{d_{ij} \le \epsilon_k},
\]
where dij is the distance between points Xi and Xj.
• Changing the scale values ɛ1 < ɛ2 <… < ɛN results in a hierarchical nested sequence of
graphs G1 ⊆ G2 ⊆…⊆ GN that is called a graph filtration.
• Next, to be able to glean the intrinsic geometry underlying the data from the graph
filtration, we associate an (abstract) simplicial complex with each Gk,
k = 1, …, N.
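A small sketch of the graph filtration, assuming Euclidean distance and a hand-picked set of scales (both are illustrative choices, as is the function name). Self-loops are dropped here for convenience, although the indicator as written would also set the diagonal to 1.

```python
import numpy as np


def graph_filtration(points, scales):
    """Return one 0/1 adjacency matrix per scale: A[i, j] = 1 if d_ij <= eps."""
    pts = np.asarray(points, dtype=float)
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))  # Euclidean d_ij
    adjs = []
    for eps in sorted(scales):
        adj = (dist <= eps).astype(int)
        np.fill_diagonal(adj, 0)  # drop self-loops
        adjs.append(adj)
    return adjs


points = [(0, 0), (1, 0), (0, 2), (5, 5)]
adjs = graph_filtration(points, scales=[1.0, 2.5, 8.0])
edge_counts = [int(a.sum()) // 2 for a in adjs]

# Nestedness: every edge present at one scale is still present at the next.
nested = all((adjs[k] <= adjs[k + 1]).all() for k in range(len(adjs) - 1))
```

The edge count grows from one edge to the complete graph as the scale increases, which is the nesting G1 ⊆ G2 ⊆…⊆ GN described above.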

13
Simplicial complexes
• The Vietoris-Rips (VR) simplicial complex is one of the most popular choices in TDA
due to its easy construction and computational advantages.
• A Vietoris-Rips complex at scale ɛ, denoted by VRɛ, consists of all k-element subsets
of X = {X1,…, Xn}, called (k − 1)-simplices, k = 1,…, K, whose points are pairwise within
distance of ɛ.
• A 0-simplex can be identified with a point, a 1-simplex with a segment, a 2-simplex
with a triangle, a 3-simplex with a tetrahedron, and so on.
• Armed with the associated VR filtration, VR1 ⊆ VR2 ⊆…⊆ VRN, we can track
qualitative topological features such as connected components, loops and voids that
appear and disappear as we move along the filtration.

14
Simplicial complexes
• We use the Betti sequences as summaries of persistent homology calculations
which encode the counts of these features at increasing scale values.
• Their individual elements are called the Betti numbers that are computed for each
value of the scale:
βp = (βp(ɛ1), βp(ɛ2),…, βp(ɛN )), p = 0, 1,…, K, where βp(ɛk) is the p-th Betti number of the
simplicial complex at scale ɛk.
• The Betti numbers for small p have a simple interpretation. For instance, β0 is the
number of connected components; β1 is the number of loops; β2 is the number of
voids.
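For small inputs, β0 and β1 of a Vietoris-Rips complex can be computed directly from boundary matrices. The sketch below is a textbook rank computation over the rationals, not the software used in the talk, and the function name is hypothetical. Only simplices up to dimension 2 are built, which is enough to determine β0 and β1.

```python
from itertools import combinations

import numpy as np


def betti_0_1(dist, eps):
    """Betti-0 and Betti-1 of the Vietoris-Rips complex at scale eps."""
    n = len(dist)
    edges = [(i, j) for i, j in combinations(range(n), 2) if dist[i][j] <= eps]
    edge_set = set(edges)
    triangles = [t for t in combinations(range(n), 3)
                 if all(pair in edge_set for pair in combinations(t, 2))]

    # Boundary matrix d1: maps each edge to its two end points.
    d1 = np.zeros((n, len(edges)))
    for col, (i, j) in enumerate(edges):
        d1[i, col], d1[j, col] = -1, 1

    # Boundary matrix d2: maps each triangle to its three edges.
    edge_index = {e: k for k, e in enumerate(edges)}
    d2 = np.zeros((len(edges), len(triangles)))
    for col, (i, j, k) in enumerate(triangles):
        d2[edge_index[(j, k)], col] = 1
        d2[edge_index[(i, k)], col] = -1
        d2[edge_index[(i, j)], col] = 1

    rank1 = int(np.linalg.matrix_rank(d1)) if edges else 0
    rank2 = int(np.linalg.matrix_rank(d2)) if triangles else 0
    beta0 = n - rank1                   # connected components
    beta1 = len(edges) - rank1 - rank2  # independent loops
    return beta0, beta1


# Four corners of a unit square: sides have length 1, diagonals about 1.414.
s = 1.414
square = [[0, 1, s, 1],
          [1, 0, 1, s],
          [s, 1, 0, 1],
          [1, s, 1, 0]]
betti_sequence = [betti_0_1(square, eps) for eps in (0.5, 1.0, 1.5)]
```

At scale 0.5 the four points are isolated, at scale 1.0 the four sides form a single loop, and at scale 1.5 the diagonals fill the loop in, so the loop appears and then disappears along the filtration.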

15
Persistent homology for blockchains
• Betti sequences provide a non-parametric solution, but the computational
complexity of Betti calculations prohibits their usage in large networks.
• For example, for simplicial complexes of dimension 2, “currently no upper bound
better than a constant times n³ is known” (Edelsbrunner, 2014).
• Every day brings more than 500K new addresses to the Bitcoin network.
• Betti number computations on such large networks are infeasible.

16
• We propose a novel approach that computes the Betti sequences on a network of
N × N nodes where N is the size of the amount matrix A.
• For each of the N² unique chainlets (e.g., C2→3), we create a node in a new
network, where edge distance between two nodes is computed with a suitable
‘distance’ d.
Figure: A graph from N² = 400 chainlet nodes.
In constructing the new network, we use and
hence retain the amount information from the
Blockchain network.
This way, we combine distance (computed from
transferred coins) with edge connectedness
while restricting the network size.

17
We describe the main steps as follows:
Given a heterogeneous Blockchain network with transferred bitcoins on edges,
1. All the transferred amounts are converted from Satoshis to bitcoins and
log-transformed: a’ = log(1 + a/10⁸).
2. For each chainlet of a given time period, we compute the sample q-quantiles
for the associated log-transformed amounts:
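a k-th q-quantile, k = 0, 1,…, q, is the amount Q(k) such that
\[
\sum_{i=1}^{\tau} \mathbb{1}_{y_i < Q(k)} \approx \frac{\tau k}{q}
\quad \text{and} \quad
\sum_{i=1}^{\tau} \mathbb{1}_{y_i > Q(k)} \approx \frac{\tau (q - k)}{q},
\]
where τ is the total number of transactions.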
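Steps 1 and 2 can be sketched with NumPy. The function names and toy amounts are hypothetical, and `np.quantile` with its default linear interpolation is only one of several sample-quantile estimators; the slides do not say which one was used.

```python
import numpy as np

SATOSHIS_PER_BITCOIN = 10 ** 8


def log_amounts(satoshis):
    """a' = log(1 + a / 10^8): convert Satoshis to bitcoins, then log-transform."""
    return np.log1p(np.asarray(satoshis, dtype=float) / SATOSHIS_PER_BITCOIN)


def sample_quantiles(values, q):
    """Return Q(0), ..., Q(q) for the sample, end points included.

    Assumption: NumPy's default (linear interpolation) estimator.
    """
    return np.quantile(values, np.linspace(0.0, 1.0, q + 1))


# Toy amounts of one chainlet type in one time period, in Satoshis.
amounts = [50_000_000, 100_000_000, 250_000_000, 780_000_000, 660_000_000]
Q = sample_quantiles(log_amounts(amounts), q=4)
```

`Q[0]` and `Q[4]` are the smallest and largest log-transformed amounts, and `Q[2]` is the median.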

18
3. The (dis)similarity metric dij between chainlet nodes i and j is defined as the
quantile-based distance
\[
d_{ij} = \sum_{k=0}^{q} \left[ Q_i(k) - Q_j(k) \right]^2
\]
4. We construct a sequence of scales ɛ1 < ɛ2 < . . . < ɛS covering a range of
distances during the entire 365-day period.
5. For each ɛk, we build the corresponding VR complex whose 0-simplices are
single chainlets and 1-simplices are pairs of chainlets with distance ≤ ɛk.
As a result, we obtain the filtration of VR complexes VR1 ⊆ VR2 ⊆… ⊆ VRS.
We then compute xt = {β0(ɛ1), . . . , β0(ɛS); β1(ɛ1),… , β1(ɛS)}.
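Steps 3 to 5 can be sketched for the β0 part of the feature vector. This is an illustration under several assumptions: the distance is taken as the sum of squared quantile differences, the scales are evenly spaced over the distances of a single toy period rather than a 365-day range, and β0 is obtained with a simple union-find instead of a persistent homology library. All names are hypothetical.

```python
import numpy as np


def quantile_distance(Q):
    """Q holds one row of quantiles Q_i(0..q) per chainlet node; returns all d_ij."""
    Q = np.asarray(Q, dtype=float)
    diff = Q[:, None, :] - Q[None, :, :]
    return (diff ** 2).sum(axis=-1)


def betti0_sequence(dist, scales):
    """Number of connected components of the graph {d_ij <= eps} at each scale."""
    n = len(dist)
    seq = []
    for eps in scales:
        parent = list(range(n))

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path compression
                x = parent[x]
            return x

        for i in range(n):
            for j in range(i + 1, n):
                if dist[i][j] <= eps:
                    parent[find(i)] = find(j)
        seq.append(len({find(x) for x in range(n)}))
    return seq


# Toy quantile rows for three chainlet nodes (q = 2).
Q = [[0.0, 1.0, 2.0], [0.0, 1.0, 2.5], [3.0, 4.0, 5.0]]
dist = quantile_distance(Q)
scales = np.linspace(0.0, dist.max(), 5)
beta0 = betti0_sequence(dist, scales)
```

The first two nodes merge at a small scale and the third joins only at the largest one, so β0 drops from 3 to 1 along the filtration.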

19
Betti derivatives
• The graph of the p-th Betti sequence is often referred to as the p-th Betti curve.
• Analysis of the Betti curves allows us to assess dynamics of essential topological
features as a function of the scale.
• To assess the rate of changes in topological features of the Blockchain graph, we
introduce a novel concept of Betti derivatives up to order l> 0 on VR filtrations:
\[
\Delta^{l} \beta_p(\epsilon_k) = \Delta^{l-1} \beta_p(\epsilon_{k+1}) - \Delta^{l-1} \beta_p(\epsilon_k),
\]
where k = 1, 2,…, S−1; the values p = 0, 1,… are determined by how many Betti
numbers we choose to use; and S is the number of filtration steps.
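Read as repeated forward differences of a Betti sequence, and assuming the zeroth-order derivative is the sequence itself, the definition maps directly onto `np.diff`. The function name and the toy sequence are hypothetical.

```python
import numpy as np


def betti_derivatives(beta, max_order):
    """Return {l: l-th order forward difference of the Betti sequence}."""
    beta = np.asarray(beta)
    return {l: np.diff(beta, n=l) for l in range(1, max_order + 1)}


beta0 = [5, 3, 2, 2, 1]  # toy Betti-0 sequence over S = 5 scales
derivs = betti_derivatives(beta0, max_order=2)
```

Each order shortens the sequence by one entry, so the l-th derivative has S − l values.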

20
Experiments
• Problem Statement: Let xt ∈ ℝ^d be a set of features computed on the Bitcoin
blockchain.
Let (x1, y1),... ,(xt, yt) be the observed data where Y = {y1,..., yt} are the
corresponding Bitcoin prices in dollars. At a time point t, estimate the Bitcoin
price yt’ where t’ > t.
• We downloaded and parsed the entire Bitcoin transaction graph from January
2009 to December 2018.
• Using a time interval of 24 hours, we extracted daily transactions on the
network and created the Bitcoin graph.
• Our Bitcoin price (USD) data is downloaded from blockchain.com which
aggregates prices from worldwide online exchanges.

21
Experiments
• In addition to the FL (filtration) and Betti-related features, we use: past price,
transaction count, mean degree of addresses, number of new addresses, mean and
total coin amount transferred in transactions, and the average clustering coefficient
of the address network.
• We used ARIMAx, Random Forest, XGBT, Gaussian Process based Regression,
and Elastic Net.
• We use a time-window-based approach in price prediction.

22
Baseline Experiments
• We assess model performance with root mean squared error in the predicted
price.
[Figure panels: Window = 3, Window = 5, Window = 7.]
• The simplest baseline for ChainNet can be constructed by training models on past
price and past total transaction count in a sliding window prediction scheme.
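One way to set up such a sliding window scheme is sketched below. It assumes that the features of the last w days are flattened into one input row and paired with the price h days ahead; the function name, the layout, and the toy numbers are illustrative and not taken from the talk.

```python
import numpy as np


def sliding_windows(features, prices, w, h):
    """Pair the features of days t-w+1..t with the price at day t+h.

    features: array of shape (T, d); prices: array of shape (T,).
    Returns X of shape (T-w-h+1, w*d) and y of shape (T-w-h+1,).
    """
    features = np.asarray(features, dtype=float)
    prices = np.asarray(prices, dtype=float)
    T = len(prices)
    X, y = [], []
    for t in range(w - 1, T - h):
        X.append(features[t - w + 1 : t + 1].ravel())
        y.append(prices[t + h])
    return np.array(X), np.array(y)


# Baseline features: past price and total transaction count (toy numbers).
prices = np.array([10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0])
tx_counts = np.array([100, 110, 120, 130, 140, 150, 160])
baseline = np.column_stack([prices, tx_counts])
X, y = sliding_windows(baseline, prices, w=3, h=1)
```

Any regression model can then be fitted on `X` and `y`; adding further feature columns to `baseline` leaves the windowing unchanged.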

23
Experiments
• We report the percentage predictive gain, or decrease in RMSE, for a specific machine
learning model m w.r.t. its baseline model m0 as
\[
\Delta_m(w, h) = 100 \times \left( 1 - \frac{RMSE_m(w, h)}{RMSE_{m_0}(w, h)} \right),
\]
where $RMSE_{m_0}(w, h)$ and $RMSE_m(w, h)$ are delivered by a baseline model m0 and a
competing model m, respectively.
[Figure panels: Window = 3, Window = 5, Window = 7.]
Figure: Gain over the best model, XGBT, is given for three windows and multiple horizons.
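The gain formula is simple enough to check by hand. The sketch below uses made-up RMSE values, not results from the talk, and the function names are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted prices."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def predictive_gain(rmse_model, rmse_baseline):
    """Delta_m(w, h) = 100 * (1 - RMSE_m / RMSE_m0)."""
    return 100.0 * (1.0 - rmse_model / rmse_baseline)


# Hypothetical RMSE values for one (window, horizon) pair.
gain = predictive_gain(rmse_model=60.0, rmse_baseline=100.0)
loss = predictive_gain(rmse_model=110.0, rmse_baseline=100.0)
```

A positive value means the competing model has a lower RMSE than its baseline; a negative value means it does worse.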

24
Conclusions
• We achieve the best results with a training length of 100 days; longer training
histories degrade the predictions.
• An important result is that next-day predictions (h = 1) do not improve significantly
(i.e., by at most 2%) with ChainNet features.
• Our heuristic approach, FL, shows an interesting trend: its use in models leads to better
gains at longer horizons. Betti models, on the other hand, achieve better gains at
short horizons.
• Our results on the full Bitcoin network show that, for prediction horizons of less than
7 days, topological models bring a prediction gain of almost 40% over baseline approaches.

Thanks for attending!
Cuneyt.Akcora@UManitoba.ca
See our survey “Blockchain: A Graph Primer”:
…. Without assuming any reader expertise, our aim is to provide a concise but
complete description of the Blockchain technology.
https://arxiv.org/abs/1708.08749
