The Problem
● Phishing scams in Ethereum involve malicious actors creating fake
addresses or platforms to deceive users into transferring funds to
them.
● The Ethereum blockchain records all transactions, and these are
represented as a graph where each address is a node, and each
transaction is an edge connecting two nodes.
● In this graph, phishing addresses are nodes that represent
fraudulent accounts. The challenge in detecting phishing lies in
understanding the complex relationships between
nodes—specifically how legitimate and phishing addresses are
connected.
● Traditional phishing detection methods often fail to capture the
intricate patterns in these relationships, making it difficult to
distinguish between legitimate and fraudulent addresses.
● We propose a solution based on Graph Convolutional Networks
(GCNs), which are well-suited for handling large, sparse graphs like
Ethereum’s.
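As a minimal sketch of the graph representation described above, assuming a toy list of transactions (the addresses and values here are hypothetical placeholders, not real Ethereum accounts):

```python
from collections import defaultdict

# Toy transaction list: (sender, receiver, value in ETH).
# Addresses are hypothetical placeholders, not real accounts.
transactions = [
    ("0xA", "0xB", 1.5),
    ("0xA", "0xC", 0.2),
    ("0xC", "0xB", 3.0),
]

# Directed graph: each address is a node, each transaction a weighted edge.
out_edges = defaultdict(list)
nodes = set()
for sender, receiver, value in transactions:
    out_edges[sender].append((receiver, value))
    nodes.update((sender, receiver))

print(len(nodes), sum(len(v) for v in out_edges.values()))  # 3 nodes, 3 edges
```

An adjacency-list layout like this keeps memory proportional to the number of transactions, which matters for a sparse graph of this scale.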
Dataset Overview
● The dataset used in the study contains Ethereum transaction
data, including both legitimate and fraudulent activities.
● With roughly 13.5 million edges (transactions) and 3 million
nodes (addresses) but only about 1,165 phishing nodes, the
Ethereum dataset is highly imbalanced.
● This imbalance makes it challenging for detection models to
identify phishing addresses effectively, as the majority of
addresses are legitimate.
● To tackle the imbalance problem, we carefully curated the node
selection process and re-sampled transactions involving illicit
nodes to produce a more balanced dataset.
● The dataset also highlights the challenge of dealing with sparse
graphs, where most nodes have only a few connections, making
it harder to detect patterns indicative of phishing.
Dataset  | Nodes     | Edges      | Illicit Nodes
Ethereum | 2,973,489 | 13,551,303 | 1,165
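One common way to realize the re-sampling step described above is to keep every illicit node and undersample the legitimate class to a fixed ratio. A minimal sketch on hypothetical labels (the 10:1 ratio is an illustrative assumption, not the paper's setting):

```python
import random

random.seed(0)

# Hypothetical labels: 1 = phishing (rare), 0 = legitimate.
labels = {f"addr{i}": (1 if i < 5 else 0) for i in range(1000)}

illicit = [a for a, y in labels.items() if y == 1]
licit = [a for a, y in labels.items() if y == 0]

# Keep every phishing node; undersample legitimate nodes to a 10:1 ratio.
ratio = 10
sampled = illicit + random.sample(licit, ratio * len(illicit))
print(len(sampled))  # 55 nodes instead of 1,000
```

The resulting subset preserves all minority-class examples while shrinking the majority class, which is what lets a classifier actually see the phishing patterns during training.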
Key Node Properties
Objective: Identify patterns and trends for each node.
Key Features Extracted and Used:
● Indegree:
○ Number of transactions received by the node.
● Outdegree:
○ Number of transactions sent by the node.
● Degree:
○ Total number of transactions in which the node is involved.
● Instrength:
○ Total amount of cryptocurrency received by the node.
● Outstrength:
○ Total amount of cryptocurrency sent by the node.
● Strength:
○ Total amount of cryptocurrency transacted.
● Number of Neighbours:
○ The number of other nodes interacting with this node.
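All seven features above can be computed in one pass over the edge list. A small self-contained sketch on a toy graph (edge values are illustrative):

```python
from collections import defaultdict

# Toy edge list: (sender, receiver, value). Purely illustrative.
edges = [("A", "B", 2.0), ("B", "C", 1.0), ("A", "C", 0.5), ("C", "A", 0.25)]

feat = defaultdict(lambda: {"in_deg": 0, "out_deg": 0,
                            "in_str": 0.0, "out_str": 0.0,
                            "neigh": set()})
for s, r, v in edges:
    feat[s]["out_deg"] += 1          # one more transaction sent
    feat[s]["out_str"] += v          # value sent
    feat[r]["in_deg"] += 1           # one more transaction received
    feat[r]["in_str"] += v           # value received
    feat[s]["neigh"].add(r)          # neighbours are counted
    feat[r]["neigh"].add(s)          # regardless of direction

def node_features(a):
    f = feat[a]
    return {
        "indegree": f["in_deg"],
        "outdegree": f["out_deg"],
        "degree": f["in_deg"] + f["out_deg"],
        "instrength": f["in_str"],
        "outstrength": f["out_str"],
        "strength": f["in_str"] + f["out_str"],
        "n_neighbours": len(f["neigh"]),
    }
```

For node "A" this yields indegree 1, outdegree 2, and strength 2.75, matching the definitions in the bullets.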
Previous Methods: RiWalk
What is RiWalk?
● A random-walk-based embedding method that captures structural
and contextual information of nodes.
● These walks generate feature vectors that represent the node’s
connections and its local environment within the graph.
● RiWalk was chosen over other embedding algorithms, such as
node2vec, because it produces high-quality, structure-aware
embeddings before any neural network or classifier is trained.
● It is also highly effective in handling large, sparse graphs like
Ethereum’s transaction network.
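The walk-generation idea can be sketched with a plain uniform random walk; note that this is a simplified stand-in, since RiWalk additionally relabels visited nodes by structural role before feeding walks to a skip-gram model. The toy adjacency list is an assumption for illustration:

```python
import random

random.seed(42)

# Toy undirected adjacency list standing in for the transaction graph.
adj = {"A": ["B", "C"], "B": ["A"], "C": ["A", "D"], "D": ["C"]}

def random_walk(start, length):
    """Plain uniform random walk; each step moves to a random
    neighbour of the current node."""
    walk = [start]
    for _ in range(length - 1):
        walk.append(random.choice(adj[walk[-1]]))
    return walk

# A few walks per node become the "sentences" used to train embeddings.
walks = [random_walk(node, 5) for node in adj for _ in range(3)]
```

Each walk records a node together with its local environment, which is exactly the contextual information the embeddings capture.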
Embedding Integration Workflow:
Step 1: Generate node embeddings using RiWalk.
Step 2: Merge these embeddings with engineered node
features.
Step 3: Input the combined features into classifiers.
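The three steps above reduce to a feature concatenation. A minimal sketch with random stand-ins for the embeddings and engineered features (the dimensions are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: hypothetical RiWalk embeddings for 6 nodes (4 dims each).
embeddings = rng.normal(size=(6, 4))
# Step 2: engineered features (e.g. indegree, outdegree, strength).
engineered = rng.normal(size=(6, 3))
# Step 3: concatenate into the classifier's input matrix.
X = np.concatenate([embeddings, engineered], axis=1)
print(X.shape)  # (6, 7)
```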
Baseline Models - RF and LR
● RiWalk embeddings of the Ethereum transaction graph were fed
into two classic classifiers:
○ Logistic Regression model (linear, predicting the probability
of an address being phishing)
○ Random Forest (50 trees, max_depth=5, max_features=10)
● Logistic Regression achieved 96.8% overall accuracy but
performed poorly on the rare phishing class (61.5% precision,
just 13.7% recall, F1 = 0.225), meaning it missed over 86% of
actual phishers.
● Random Forest raised overall accuracy to 97.2% and phishing
precision to 76.1% with 23.2% recall (F1 = 0.355), yet it still
failed to detect more than three-quarters of phishing addresses.
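A sketch of the two baseline classifiers using scikit-learn. The Random Forest hyperparameters come from the slide; the data here is a synthetic stand-in for the real embedding matrix, so the scores will not match the reported numbers:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Synthetic stand-in for the combined embedding + feature matrix.
X = rng.normal(size=(500, 12))
y = (X[:, 0] + 0.5 * X[:, 1] > 1.2).astype(int)  # rare positive class

# Linear model predicting the probability of an address being phishing.
lr = LogisticRegression(max_iter=1000).fit(X, y)
# Hyperparameters from the slide: 50 trees, depth 5, 10 features per split.
rf = RandomForestClassifier(n_estimators=50, max_depth=5,
                            max_features=10, random_state=0).fit(X, y)
```

Capping the depth at 5 and the features per split at 10 regularizes the forest, which helps on a dataset where the positive class is this scarce.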
Model               | Accuracy % | Precision % | Recall % | F1-Score
Logistic Regression | 96.8       | 61.5        | 13.7     | 0.225
Random Forest       | 97.2       | 76.1        | 23.2     | 0.355
Conclusion:
● Both models achieve high weighted accuracy thanks to
the dominant non-phishing class.
● However, neither can reliably recall the minority phishing
nodes—highlighting the need for graph-based approaches
that leverage structural information.
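The accuracy trap described above can be demonstrated in a few lines: with ~97% legitimate nodes, a classifier that predicts "legitimate" for every address scores 97% accuracy while catching zero phishers.

```python
# 3 phishing (1) and 97 legitimate (0) nodes; predict all-legitimate.
y_true = [1] * 3 + [0] * 97
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
recall = tp / (tp + fn)
print(accuracy, recall)  # 0.97 accuracy but 0.0 recall on phishing
```

This is why the F1-score on the minority class, not overall accuracy, is the metric that matters here.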
Graph Convolutional Networks
● GCN is a neural-network-based approach that works well with
graph data and takes a graph as its input
● Matrix multiplication is the core operation of GCNs
● Each layer captures the features of a node's surrounding
neighbourhood in the network
● GCNs can be used to extract embeddings, which can then be
passed to a neural network or any other ML algorithm
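The matrix-multiplication core of a single GCN layer can be written out in NumPy. This is a generic sketch of the standard propagation rule, not the exact architecture used in the study; the toy adjacency matrix and feature dimensions are assumptions:

```python
import numpy as np

# One GCN layer: H' = ReLU(D^{-1/2} (A + I) D^{-1/2} . H . W)
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])          # toy adjacency matrix
A_hat = A + np.eye(3)                 # add self-loops
d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]  # normalize

rng = np.random.default_rng(0)
H = rng.normal(size=(3, 4))           # input node features
W = rng.normal(size=(4, 2))           # learnable weight matrix
H_next = np.maximum(A_norm @ H @ W, 0.0)  # matrix multiply + ReLU
```

Each row of `H_next` mixes a node's own features with those of its neighbours, which is how the surrounding network is captured.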
Our Approach
● Just like the baseline approach, we use a GCN to process
the graph structure and obtain the embeddings
● We use the Random Forest and Logistic Regression algorithms
for an apples-to-apples comparison
● We also experiment with neural networks for the
classification task
Challenges: Architecture Selection and Training Time
● Selecting the right number of layers and tuning the
parameters for optimal results took considerable
experimentation and revision
● One major challenge during experimentation was the
slow training and testing speed of the model
○ To resolve this, we examined the architecture more
closely and found that for large sparse datasets,
treating the sparse adjacency matrix as dense during
multiplication increased compute time without adding
value to the model
Preferred Approach: Mid-size GCN + Sparse
Operations
After several rounds of experimentation:
● Architecture:
○ 3 convolutional layers
○ batch normalization applied after conv1 and
conv2
● Training Time:
○ The adjacency matrix representing the graph is
converted into a sparse tensor, and sparse operations
are applied for memory efficiency and faster
computation
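The sparse-tensor idea can be illustrated with SciPy (the study's implementation uses sparse tensors in its deep-learning framework; this is an equivalent sketch, with the matrix size and density chosen for illustration):

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
# ~0.1% density, mimicking the sparsity of the Ethereum graph.
A_dense = (rng.random((1000, 1000)) < 0.001).astype(float)
A_sparse = csr_matrix(A_dense)        # compressed sparse row storage

H = rng.normal(size=(1000, 16))
out = A_sparse @ H                    # sparse @ dense matmul
assert np.allclose(out, A_dense @ H)  # same result as the dense product
```

The sparse product skips every zero entry of the adjacency matrix, which is where the memory and compute savings come from.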
Results for Sparse Operations
● Reduced training and testing time by 16% in a
GPU environment.
● Had little effect on the evaluation scores.
Evaluating Performance by Comparing
RiWalk and GCN Embeddings

              | RiWalk + RF | RiWalk + LR | GCN + RF | GCN + LR
Test F1-Score | 0.36        | 0.23        | 0.62     | 0.57
Test Accuracy | 97.2%       | 96.8%       | 97%      | 96.5%
Although RiWalk embeddings slightly outperform GCN embeddings in
accuracy, the more important metric in this binary classification
problem is the F1-score, since it reflects performance across both
classes and therefore accounts for the imbalance in the dataset.
Best Model Performance - GCN + NN
We pass the GCN embeddings to a neural network with a
ReLU-activated hidden layer followed by a sigmoid output to
classify our nodes.
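A forward-pass sketch of such a classifier head in NumPy; the layer sizes are illustrative assumptions, and the weights here are random rather than trained:

```python
import numpy as np

rng = np.random.default_rng(0)

def classifier_head(emb, W1, b1, W2, b2):
    """ReLU-activated hidden layer, then a sigmoid output giving
    the probability that each node is a phishing address."""
    h = np.maximum(emb @ W1 + b1, 0.0)             # ReLU
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))    # sigmoid

emb = rng.normal(size=(5, 8))                      # hypothetical GCN embeddings
W1, b1 = rng.normal(size=(8, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 1)), np.zeros(1)
probs = classifier_head(emb, W1, b1, W2, b2)       # shape (5, 1), in [0, 1]
```

Thresholding `probs` at 0.5 (or a class-weighted threshold, given the imbalance) yields the final phishing / legitimate labels.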
Conclusion
● Understood the problem of phishing in
cryptocurrencies and why efficient detection methods
are required for global large-scale adoption of crypto
● Tackled the imbalance problem using sampling
techniques
● Explored RiWalk and GCN techniques to tackle
graph-based challenges
● GCN + NN outperforms the baselines even though RiWalk
embeddings gave slightly better accuracy
● The use of sparse operations reduced training and
testing time by 16% in a GPU environment