Robust Feature Learning with Deep Neural Networks
http://snu-primo.hosted.exlibrisgroup.com/primo_library/libweb/action/display.do?tabs=viewOnlineTab&doc=82SNU_INST21557911060002591
Wajdi Khattel presented a proposal for a terrorist-detection model in social networks. The model takes a multi-dimensional network as input and consists of three sub-models: a text classification model, an image classification model, and a general-information classification model. Each sub-model outputs a score, which a decision-making module compares against a threshold to classify a user as a terrorist or not. The implementation collected offline training data from banned Twitter accounts, Google Images, and a public dataset; online data was also collected from Facebook, Instagram, and Twitter using their APIs. Several machine learning models were tested for each sub-model, and the proposed full model uses a neural network for text and a CNN with data augmentation.
Explainable AI: Building trustworthy AI models? - Raheel Ahmad
Building trustworthy, transparent and unbiased machine learning models?
Get started with explainX, which brings state-of-the-art explainability techniques under one roof, accessible via a single line of code.
Learn the major modules within the explainX explainable AI and model interpretability framework.
These slides are taken from Raheel's presentation at UnpackAI's forum on Data Ethics in AI.
This document provides an overview of generative adversarial networks (GANs). It explains that GANs were introduced in 2014 and involve two neural networks, a generator and discriminator, that compete against each other. The generator produces synthetic data to fool the discriminator, while the discriminator learns to distinguish real from synthetic data. As they train, the generator improves at producing more realistic outputs that match the real data distribution. Examples of GAN applications discussed include image generation, text-to-image synthesis, and face aging.
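The adversarial game described above fits in a few lines of PyTorch. This is a minimal sketch under assumed sizes (a 64-dimensional latent, flattened 28x28 outputs), not the architecture from any specific talk summarized here:

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 64, 784  # assumed sizes, e.g. flattened 28x28 images

G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                  nn.Linear(256, data_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(data_dim, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real):
    """One GAN step; `real` is a (batch, data_dim) tensor of training data."""
    batch = real.size(0)
    fake = G(torch.randn(batch, latent_dim))

    # Discriminator: push real samples toward label 1, generated ones toward 0.
    d_loss = bce(D(real), torch.ones(batch, 1)) + \
             bce(D(fake.detach()), torch.zeros(batch, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator: try to make the discriminator label fakes as real.
    g_loss = bce(D(fake), torch.ones(batch, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```

As the two losses push against each other, the generator's samples drift toward the real data distribution, which is exactly the dynamic the summary describes.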
Can we use data to train machine learning models and perform statistical analysis without putting private data at risk? Tools and techniques such as Federated Learning, Differential Privacy, and Homomorphic Encryption enable safer work on the data.
Speaker: Yunjey Choi (master's student, Korea University)
Yunjey Choi majored in computer science at Korea University and is currently a master's student studying machine learning. He enjoys coding and sharing what he has learned with others. He studied deep learning with TensorFlow for a year and is now studying generative adversarial networks with PyTorch. He has implemented several papers in TensorFlow and published a PyTorch tutorial on GitHub.
Overview:
The Generative Adversarial Network (GAN), first proposed by Ian Goodfellow in 2014, is a generative model that estimates the distribution of real data through adversarial training. GANs have recently emerged as one of the most popular research areas, with countless related papers pouring out every day.
Finding it hard to keep up with the flood of GAN papers? That's fine: once you thoroughly understand the basic GAN, newly published papers become easy to follow.
In this talk I aim to share everything I know about GANs. It should suit those who are completely new to GANs, those curious about the theory behind them, and those wondering how GANs can be applied.
Video: https://youtu.be/odpjk7_tGY0
Federated learning makes it possible to build machine learning systems without direct access to the training data. The data remains in its original location, which helps ensure privacy, reduces network communication costs, and taps the computing resources of edge devices. The data-minimization principles established by the GDPR and the growing prevalence of smart sensors make its advantages all the more compelling. Federated learning is a great fit for smartphones, industrial and consumer IoT, industrial sensor applications, and healthcare and other privacy-sensitive use cases.
We’ll present the Fast Forward Labs team’s research on this topic and the accompanying prototype application, “Turbofan Tycoon”: a simplified working example of federated learning applied to a predictive maintenance problem. In the demo scenario, customers of an industrial turbofan manufacturer are unwilling to share with the manufacturer the details of how their components failed, but still want the manufacturer to provide them with a strategy for maintaining the part. Federated learning lets us satisfy the customers' privacy concerns while providing them with a model that leads to fewer costly failures and less maintenance downtime.
We’ll discuss the advantages and tradeoffs of taking the federated approach. We’ll assess the state of tooling for federated learning, circumstances in which you might want to consider applying it, and the challenges you’d face along the way.
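The core mechanic behind this, federated averaging, is simple to sketch: each client trains the shared model on data that never leaves its premises, and only the updated weights travel back to be combined. Below is a toy numpy version with linear-regression clients; the data, learning rate, and round count are illustrative assumptions, not the Turbofan Tycoon internals:

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One client's local training: plain linear regression by gradient descent."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def fedavg_round(global_w, clients):
    """clients: list of (X, y) held on-device; only weight updates leave."""
    updates = [local_update(global_w, X, y) for X, y in clients]
    sizes = [len(y) for _, y in clients]
    return np.average(updates, axis=0, weights=sizes)  # size-weighted mean

# Three 'customers' with private data, one shared model.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(3):
    X = rng.normal(size=(50, 2))
    clients.append((X, X @ true_w + rng.normal(scale=0.1, size=50)))

w = np.zeros(2)
for _ in range(20):
    w = fedavg_round(w, clients)
print(w)  # approaches [2.0, -1.0] without pooling any client's raw data
```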
Speaker
Chris Wallace
Data Scientist
Cloudera
[unofficial] Pyramid Scene Parsing Network (CVPR 2017) - Shunta Saito
Pyramid Scene Parsing Network introduces the Pyramid Pooling Module to improve semantic segmentation. The module captures context at different regions and scales by performing average pooling at different pyramid levels on the final convolutional feature map. Experiments on ADE20K and PASCAL VOC datasets show the Pyramid Pooling Module improves mean Intersection-over-Union by over 4% compared to global average pooling, achieving state-of-the-art performance.
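The module itself is compact enough to sketch in PyTorch. Channel counts and pyramid levels below follow the configuration commonly reported for PSPNet (levels 1, 2, 3, 6 on a 2048-channel backbone); treat it as an illustration rather than the authors' exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    def __init__(self, in_ch=2048, levels=(1, 2, 3, 6)):
        super().__init__()
        out_ch = in_ch // len(levels)  # each level contributes 1/N of the channels
        self.stages = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(s), nn.Conv2d(in_ch, out_ch, 1))
            for s in levels
        )

    def forward(self, x):
        h, w = x.shape[2:]
        # Pool at each pyramid scale, project, and upsample back to input size.
        pooled = [F.interpolate(stage(x), size=(h, w), mode="bilinear",
                                align_corners=False) for stage in self.stages]
        return torch.cat([x] + pooled, dim=1)  # local features + multi-scale context

feats = torch.randn(1, 2048, 60, 60)   # e.g. a ResNet stage-5 feature map
print(PyramidPooling()(feats).shape)   # torch.Size([1, 4096, 60, 60])
```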
The document discusses the potential applications of deep learning in healthcare. It begins by explaining that deep learning models can improve accuracy of diagnosis, prognosis, and risk prediction by analyzing large datasets. It then discusses how deep learning can optimize hospital processes like resource allocation and patient flow by early and accurate prediction of diseases. Finally, it mentions that deep learning can help identify patient subgroups for personalized and precision medicine approaches.
Generative Adversarial Networks (GANs) - Ian Goodfellow, OpenAI (WithTheBest)
This presentation explains how Generative Adversarial Networks (GANs) work and how they benefit the tech and dev industry. Although GANs still have room for improvement, they are important generative models that learn how to create realistic samples.
GANS
Ian Goodfellow, OpenAI Research Scientist
Machine learning is the subfield of computer science that, according to Arthur Samuel in 1959, gives "computers the ability to learn without being explicitly programmed." Evolved from the study of pattern recognition and computational learning theory in artificial intelligence, machine learning explores the study and construction of algorithms that can learn from and make predictions on data; such algorithms overcome strictly static program instructions by making data-driven predictions or decisions through building a model from sample inputs. Machine learning is employed in a range of computing tasks where designing and programming explicit algorithms with good performance is difficult or infeasible; example applications include email filtering, detection of network intruders or malicious insiders working toward a data breach, optical character recognition (OCR), learning to rank, and computer vision.
A sharing talk at Hsinchu Coders.
The materials (i.e. images) are from their respective owners:
https://research.googleblog.com/2017/04/federated-learning-collaborative.html
A short presentation on the emerging research on normalizing flows. The presentation follows two recent survey papers on the topic; a toy example of the change-of-variables idea follows the references below:
[1] Kobyzev, Ivan, Simon Prince, and Marcus Brubaker. Normalizing flows: An introduction and review of current methods, T-PAMI 2020.
[2] Papamakarios, George, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. Normalizing flows for probabilistic modeling and inference, arXiv preprint arXiv:1912.02762 (2019).
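To make the core idea concrete, here is a deliberately tiny flow: a single invertible element-wise affine map, with the change-of-variables formula log p_x(x) = log p_z(f(x)) + log |det J_f(x)| spelled out. The parameters s and b are stand-ins for learned quantities; real flows stack many richer bijections (coupling layers, autoregressive maps):

```python
import numpy as np

s, b = np.array([0.5, -0.3]), np.array([1.0, 2.0])  # would be learned in practice

def forward(x):
    """Map data to the base space: z = (x - b) * exp(-s), trivially invertible."""
    return (x - b) * np.exp(-s)

def log_prob(x):
    z = forward(x)
    log_base = -0.5 * np.sum(z**2 + np.log(2 * np.pi), axis=-1)  # N(0, I) density
    log_det = -np.sum(s)   # dz_i/dx_i = exp(-s_i), so log|det J| = -sum(s)
    return log_base + log_det

def sample(n, rng=np.random.default_rng(0)):
    """Invert the map: draw z from the base and push it to data space."""
    z = rng.normal(size=(n, 2))
    return z * np.exp(s) + b

x = sample(5)
print(log_prob(x))  # exact log-densities, which GANs and VAEs cannot provide
```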
In this project we classify images from the CIFAR-10 dataset, which consists of airplanes, dogs, cats, and other objects. We'll preprocess the images, then train a convolutional neural network on all the samples: the images need to be normalized and the labels one-hot encoded.
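Both preprocessing steps are one-liners; a numpy sketch, with shapes assuming the standard CIFAR-10 layout:

```python
import numpy as np

def normalize(images):
    """Scale uint8 pixel values from [0, 255] to floats in [0, 1]."""
    return images.astype(np.float32) / 255.0

def one_hot(labels, num_classes=10):
    """Map integer labels to one-hot rows, e.g. 3 -> [0,0,0,1,0,0,0,0,0,0]."""
    return np.eye(num_classes, dtype=np.float32)[labels]

images = np.random.randint(0, 256, size=(4, 32, 32, 3), dtype=np.uint8)
labels = np.array([0, 3, 7, 9])
print(normalize(images).min(), normalize(images).max())  # 0.0 ... 1.0
print(one_hot(labels).shape)                             # (4, 10)
```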
This thesis presents research on using deep learning methods for feature extraction from satellite imagery to identify landslide pixels. The objectives are to classify land cover using machine learning algorithms like SVM and random forests in Google Earth Engine, design and evaluate a deep neural network for landslide identification, and compare performance of deep learning models in MATLAB. Results show that a neural network achieved over 98% accuracy at identifying landslide pixels. Future work proposes developing new indices for improved identification and an automatic landslide monitoring platform.
This document provides an introduction to deep learning in medical imaging. It explains that artificial neural networks are modeled after biological neurons and use multiple hidden layers to approximate complex functions. Convolutional neural networks are commonly used for image data, applying filters over images to extract features. Modern deep learning platforms perform cross-correlation instead of convolution for efficiency. The key process for improving deep learning models is backpropagation, which calculates the gradient of the loss function to update weights and biases in a direction that reduces loss. Deep learning has applications in medical imaging modalities like MRI, ultrasound, CT, and PET.
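The cross-correlation point is worth seeing in code: frameworks slide the kernel over the image as-is, while true convolution flips the kernel first. A minimal numpy sketch (single channel, stride 1, no padding), not taken from the document itself:

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """What a 'convolutional' layer actually computes: no kernel flip."""
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def convolve2d(image, kernel):
    """True convolution: flip the kernel in both axes, then slide."""
    return cross_correlate2d(image, kernel[::-1, ::-1])

img = np.arange(25.0).reshape(5, 5)
k = np.array([[1.0, 0.0], [0.0, -1.0]])
print(cross_correlate2d(img, k))
print(convolve2d(img, k))  # differs unless the kernel is symmetric
```

Since the kernel weights are learned anyway, skipping the flip changes nothing about what the network can represent, which is why frameworks prefer the cheaper operation.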
The document discusses the role of a full-stack data scientist. It begins with an introduction of the author, Alexey Grigorev, as a data scientist. It then outlines the plan: the data science process, roles in a data science team, what defines a full-stack data scientist, and how to become one. It proceeds to explain the CRISP-DM process for data science projects and describes the different roles in a data science team, including product manager, data analyst, data engineer, data scientist, and ML engineer. It defines a full-stack data scientist as someone who can work across the entire data science lifecycle and discusses the breadth of skills required to become a full-stack data scientist.
Generative adversarial networks (GANs) are a class of machine learning frameworks where two neural networks, a generator and discriminator, compete against each other. The generator learns to generate new data with the same statistics as the training set to fool the discriminator, while the discriminator learns to better distinguish real samples from generated samples. GANs have applications in image generation, image translation between domains, and image completion. Training GANs can be challenging due to issues like mode collapse.
Chest X-ray Pneumonia Classification with Deep Learning - BaoTramDuong2
This document discusses using deep learning models to classify chest X-ray images as either normal or pneumonia. The authors obtained a dataset of over 5,800 pediatric chest X-rays from a Chinese hospital. Various deep learning models were explored, including multilayer perceptrons, convolutional neural networks, and transfer learning with VGG16, which achieved 92% validation accuracy. The document recommends future work such as distinguishing between viral and bacterial pneumonia and combining models with SVM. It also discusses recommendations to reduce childhood pneumonia prevalence.
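The transfer-learning setup the summary mentions is a standard pattern; a hedged Keras sketch, where the input size, head layers, and training details are assumptions rather than the original work's exact configuration:

```python
import tensorflow as tf

# ImageNet-pretrained VGG16 as a frozen feature extractor.
base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # P(pneumonia)
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
# model.fit(train_ds, validation_data=val_ds, epochs=10)  # with tf.data datasets
```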
Anomaly detection is a topic with many different applications. From social media tracking to cybersecurity, anomaly detection (or outlier detection) algorithms can have a huge impact in your organisation.
For the video please visit: https://www.youtube.com/watch?v=XEM2bYYxkTU
This slideshare has been produced by the Tesseract Academy (http://tesseract.academy), a company that educates decision makers in deep technical topics such as data science, analytics, machine learning and blockchain.
If you are interested in data science and related topics, make sure to also visit The Data Scientist: http://thedatascientist.com.
Explainability for Natural Language Processing - Yunyao Li
Final deck for our popular tutorial on "Explainability for Natural Language Processing" at KDD'2021. See links below for downloadable version (with higher resolution) and recording of the live tutorial.
Title: Explainability for Natural Language Processing
Presenters: Marina Danilevsky, Shipi Dhanorkar, Yunyao Li, Lucian Popa, Kun Qian, and Anbang Xu
Website: http://xainlp.github.io/
Recording: https://www.youtube.com/watch?v=PvKOSYGclPk&t=2s
Downloadable version with higher resolution: https://drive.google.com/file/d/1_gt_cS9nP9rcZOn4dcmxc2CErxrHW9CU/view?usp=sharing
@article{kdd2021xaitutorial,
  title = {Explainability for Natural Language Processing},
  author = {Marina Danilevsky and Shipi Dhanorkar and Yunyao Li and Lucian Popa and Kun Qian and Anbang Xu},
  journal = {KDD},
  year = {2021}
}
Abstract:
This lecture-style tutorial, which mixes in an interactive literature-browsing component, is intended for the many researchers and practitioners working with text data and on applications of natural language processing (NLP) in data science and knowledge discovery. The focus of the tutorial is on the issues of transparency and interpretability as they relate to building models for text and their applications to knowledge discovery. As black-box models have gained popularity for a broad range of tasks in recent years, both the research and industry communities have begun developing new techniques to render them more transparent and interpretable. Reporting from an interdisciplinary team of social science, human-computer interaction (HCI), and NLP/knowledge management researchers, our tutorial has two components: an introduction to explainable AI (XAI) in the NLP domain with a review of the state-of-the-art research, and findings from a qualitative interview study of individuals working on real-world NLP projects as applied to various knowledge extraction and discovery tasks at a large, multinational technology and consulting corporation. The first component introduces core concepts related to explainability in NLP; we then discuss explainability for NLP tasks and report on a systematic review of the state-of-the-art literature in AI, NLP, and HCI conferences. The second component reports on our qualitative interview study, which identifies practical challenges and concerns that arise in real-world development projects that require the modeling and understanding of text data.
This document discusses methods for SMS spam classification using natural language processing. It reviews approaches such as preprocessing text data, creating bag-of-words models, adding features like text length and profanity, and implementing machine learning classifiers like logistic regression, Naive Bayes, and gradient boosting. The key findings: preprocessing text by removing stopwords and lemmatizing improves accuracy; support vector machines perform best, with an accuracy of 98%; and spam texts tend to contain words like "call", "txt", and "prize" and to be longer, with less readable syntax, than non-spam texts.
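That recipe maps naturally onto a scikit-learn pipeline. A minimal sketch with two made-up messages; the 98% figure above comes from the paper's real corpus, not from anything this small:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["WINNER!! Call now to claim your prize",   # spam-like
         "are we still meeting for lunch today?"]   # ham-like
labels = [1, 0]

# stop_words="english" drops stopwords; a lemmatizer could be plugged in as a
# custom preprocessor to match the paper's pipeline more closely.
model = make_pipeline(
    TfidfVectorizer(stop_words="english", lowercase=True),
    LinearSVC(),
)
model.fit(texts, labels)
print(model.predict(["txt PRIZE to 80082 to win cash"]))  # expected: [1] (spam)
```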
Slides, thesis dissertation defense, deep generative neural networks for nove... - mehdi Cherti
In recent years, significant advances in deep neural networks enabled the creation of groundbreaking technologies such as self-driving cars and voice-enabled personal assistants. Almost all successes of deep neural networks are about prediction, whereas the initial breakthroughs came from generative models. Today, although we have very powerful deep generative modeling techniques, these techniques are essentially being used for prediction or for generating known objects (i.e., good-quality images of known classes): any generated object that is a priori unknown is considered a failure mode (Salimans et al., 2016) or spurious (Bengio et al., 2013b). In other words, when prediction seems to be the only possible objective, novelty is seen as an error that researchers have been trying hard to eliminate. This thesis defends the point of view that, instead of trying to eliminate these novelties, we should study them and the generative potential of deep nets to create useful novelty, especially given the economic and societal importance of creating new objects in contemporary societies.
The thesis sets out to study novelty generation in relationship with data-driven knowledge models produced by deep generative neural networks. Our first key contribution is the clarification of the importance of representations and their impact on the kinds of novelty that can be generated: a key consequence is that a creative agent might need to re-represent known objects to access various kinds of novelty. We then demonstrate that traditional objective functions of statistical learning theory, such as maximum likelihood, are not necessarily the best theoretical framework for studying novelty generation, and we propose several alternatives at the conceptual level. A second key result is the confirmation that current models, with traditional objective functions, can indeed generate unknown objects; this shows that even though objectives like maximum likelihood are designed to eliminate novelty, practical implementations do generate novelty. Through a series of experiments, we study the behavior of these models and the novelty they generate, and in particular we propose a new task setup and metrics for selecting good generative models. Finally, the thesis concludes with a series of experiments clarifying the characteristics of models that can exhibit novelty. Experiments show that sparsity, noise level, and restricting the capacity of the net eliminate novelty, and that models that are better at recognizing novelty are also better at generating it.
This document summarizes a research paper about blockchain technology from a sustainability perspective. It discusses how blockchain could help achieve the 17 sustainable development goals set by the UN, such as increasing transparency, reducing fraud and corruption, and enabling new funding opportunities. However, it also notes blockchain has sustainability drawbacks. The energy intensive "proof of work" algorithm used by Bitcoin requires massive electricity consumption from fossil fuel power sources, undermining climate goals. While blockchain aims to increase accessibility, its current infrastructure poses environmental risks that could threaten sustainability if left unaddressed.
Impact of big data congestion in IT: An adaptive knowledge-based Bayesian network - IJECEIAES
Progress on real-time systems in information technology is rapid and is proving important in every innovative field. Different IT applications simultaneously produce enormous amounts of data that must be handled. In this paper, a novel adaptive knowledge-based Bayesian network algorithm is proposed to deal with the impact of big data congestion in decision processing. A Bayesian network model is used to manage the knowledge structure for the decision-making process. Knowledge in Bayesian networks is typically expressed as an optimal structure, where the analysis task is to find a structure that maximizes a statistically motivated score. Generally, available data mining tools search for this optimal structure with ordinary search techniques; because this requires an enormous search space, it is a time-consuming approach, and the situation becomes critical once big data is involved in the search. An algorithm is introduced to achieve faster processing of the optimal structure by constraining the search space, using a recursive calculation over the query space. The results demonstrate that the proposed algorithm can handle big data in terms of processing time and achieves higher prediction rates.
Analysis of IT Monitoring Using Open Source Software Techniques: A Review - IJERD Editor
Network administrators usually rely on generic, built-in monitoring tools for network security. Ideally, the network infrastructure is supposed to have carefully designed strategies to scale up monitoring tools and techniques as the network grows over time. Without this, there can be network performance challenges, downtime due to failures and, most importantly, penetration attacks, which can lead to monetary losses as well as loss of reputation. Thus, there is a need for best practices to monitor network infrastructure in an agile manner. Network security monitoring involves collecting network packet data, segregating it among all 7 OSI layers, and applying intelligent algorithms to get answers to security-related questions. The purpose is to know in real time what is happening on the network at a detailed level, and to strengthen security by hardening processes, devices, appliances, software policies, etc. The Multi Router Traffic Grapher (MRTG) is free software for monitoring and measuring the traffic load on network links; it lets the user see the traffic load on a network over time in graphical form.
The document discusses the Internet of Things (IoT) and some of the key challenges. It notes that IoT data is multi-modal, distributed, heterogeneous, noisy and incomplete. It raises issues around data management, actuation and feedback, service descriptions, real-time analysis, and privacy and security. The document outlines research challenges around transforming raw data to actionable information, machine learning for large datasets, making data accessible and discoverable, and energy efficient data collection and communication. It emphasizes that IoT data integration requires solutions across physical, cyber and social domains.
Analytics of Performance and Data Quality for Mobile Edge Cloud Applications - Hong-Linh Truong
The document discusses performance and data quality analytics for mobile edge cloud applications. It presents MECCA, a mobile edge cloud application for providing cornering recommendations to cars. MECCA has a complex architecture using microservices and third party services. Analyzing MECCA's performance and data quality across different edge and cloud deployments is challenging due to dependencies between application parameters, streaming processing, and third party services. Future work aims to develop toolsets and datasets to better evaluate performance and data quality metrics for mobile edge cloud applications.
The document summarizes the evolution of the semantic grid from its origins in 2001 to the present. It describes how early work on the semantic grid aimed to close the gap between grid applications and the vision of global e-science collaboration. Key developments included linking grid services with semantic web technologies to enable automation and advanced functionality through machine-processable descriptions. The semantic grid is now seen as an important approach for virtual research environments that support both formal and informal scientific processes through collaborative tools and persistent representations of discussions.
The document provides an introduction to big data, including:
1) It defines big data and discusses its key characteristics of volume, velocity, and variety.
2) It describes sources of big data like sensors, social media, and purchase transactions.
3) It discusses big data analytics including descriptive, predictive, and prescriptive analytics and the stages of capture, organize, analyze, and act.
The document discusses using machine learning for efficient attack detection in IoT devices without feature engineering. It proposes a feature-engineering-less machine learning (FEL-ML) process that uses raw packet byte streams as input instead of engineered features. This approach is lighter weight and faster than traditional methods. The FEL-ML model is trained directly on unprocessed packet data to perform malware detection on resource-constrained IoT devices. Prior approaches that used engineered features or complex deep learning models are not suitable for IoT due to limits on memory and processing power. The proposed FEL-ML approach aims to enable effective network traffic security for IoT using minimal resources.
EFFICIENT ATTACK DETECTION IN IOT DEVICES USING FEATURE ENGINEERING-LESS MACH... - ijcsit
Through the generalization of deep learning, the research community has addressed critical challenges in the network security domain, like malware identification and anomaly detection. However, it has yet to discuss deploying these models on Internet of Things (IoT) devices for day-to-day operations. IoT devices are often limited in memory and processing power, rendering the compute-intensive deep learning environment unusable. This research proposes a way to overcome this barrier by bypassing feature engineering in the deep learning pipeline and using raw packet data as input. We introduce a feature-engineering-less machine learning (ML) process to perform malware detection on IoT devices. Our proposed model, "Feature-engineering-less ML (FEL-ML)," is a lighter-weight detection algorithm that expends no extra computation on "engineered" features, effectively accelerating the low-powered IoT edge. It is trained on unprocessed byte streams of packets. Aside from providing better results, it is quicker than traditional feature-based methods. FEL-ML facilitates resource-sensitive network traffic security, with the added benefit of eliminating the significant investment by subject matter experts in feature engineering.
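To make "raw packet bytes as input" concrete, here is a hedged PyTorch sketch of a small 1D CNN over byte streams; the layer sizes and the 1500-byte packet length are illustrative assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class RawPacketCNN(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(256, 8)  # one learned vector per byte value
        self.net = nn.Sequential(
            nn.Conv1d(8, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),       # makes the model length-independent
        )
        self.head = nn.Linear(64, n_classes)

    def forward(self, bytes_batch):        # (batch, length) ints in 0..255
        x = self.embed(bytes_batch).transpose(1, 2)  # -> (batch, 8, length)
        return self.head(self.net(x).squeeze(-1))

pkts = torch.randint(0, 256, (16, 1500))  # a batch of zero-padded packets
print(RawPacketCNN()(pkts).shape)         # torch.Size([16, 2])
```

Note there is no feature-extraction stage at all: the first layer consumes the packet bytes directly, which is the point of the feature-engineering-less design.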
Big Data in Distributed Analytics, Cybersecurity and Digital Forensics - SherinMariamReji05
This document provides an overview of big data and its applications in distributed analytics, cyber security, and digital forensics. It discusses how big data can reduce the processing time of large volumes of data in distributed computing environments using Hadoop. Examples of big data applications include using social media, search engine, and aircraft black box data for analysis. The document also outlines the challenges of traditional systems and how distributed big data architectures help address them by allowing data to be processed across clustered computers.
Concept Drift Identification using Classifier Ensemble Approach - IJECEIAES
Abstract: In internetworked systems, huge amounts of data are scattered, generated, and processed over the network. Data mining techniques are used to discover unknown patterns in the underlying data. A traditional classification model classifies data based on past labelled data. However, in many current applications data grows in size with fluctuating patterns, so new features may appear in the data. This occurs in applications such as sensor networks, banking and telecommunication systems, the financial domain, and electricity usage and pricing driven by demand and supply. Such changes in data distribution reduce classification accuracy: some patterns may be discovered as frequent while others tend to disappear and be wrongly classified. To mine such data, traditional classification techniques may not be suitable, since the distribution generating the items can change over time, and data from the past may become irrelevant or even false for the current prediction. To handle such varying patterns, concept drift mining is used to improve the accuracy of classification. In this paper we propose an ensemble approach for improving classifier accuracy. The ensemble classifier is applied on 3 different data sets; we investigated different features for different chunks of data, which are then given to the ensemble classifier, and observed that the proposed approach improves classifier accuracy across chunks.
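A minimal sketch of the chunk-wise ensemble idea with scikit-learn; the chunking, member cap, and recency weighting below are assumptions for illustration, not the paper's exact method:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class ChunkEnsemble:
    """Train one model per data chunk; old members age out as data drifts."""
    def __init__(self, max_members=5):
        self.members, self.max_members = [], max_members

    def fit_chunk(self, X, y):
        self.members.append(DecisionTreeClassifier(max_depth=5).fit(X, y))
        if len(self.members) > self.max_members:
            self.members.pop(0)            # drop the oldest model

    def predict(self, X):
        votes = np.stack([m.predict(X) for m in self.members])
        w = np.arange(1, len(self.members) + 1)[:, None]  # recent chunks weigh more
        return (np.sum(votes * w, axis=0) / w.sum() > 0.5).astype(int)

rng = np.random.default_rng(1)
ens = ChunkEnsemble()
for t in range(6):                         # a stream whose distribution drifts
    X = rng.normal(size=(200, 2)) + t * 0.3
    y = (X[:, 0] + X[:, 1] > t * 0.6).astype(int)
    ens.fit_chunk(X, y)
print(ens.predict(rng.normal(size=(5, 2)) + 1.8))
```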
This document discusses using data mining techniques to help with crime investigation by analyzing large amounts of crime data. It compares the performance of three data mining algorithms (J48, Naive Bayes, JRip) on a sample criminal database to identify the best performing algorithm. The best algorithm would then be used on the criminal database to help identify possible suspects for a crime based on evidence and attributes. The document provides details on each of the three algorithms and evaluates them based on classification accuracy and other metrics to select the best technique for the criminal investigation application.
Enhanced Privacy Preserving Access Control in Incremental Data using Microaggre... - rahulmonikasharma
In microdata releases, the main task is to protect the privacy of data subjects. Microaggregation is a disclosure-limitation technique for protecting the privacy of microdata. It is an alternative to generalization and suppression for generating k-anonymous data sets, in which the identity of each subject is hidden within a group of k subjects. Microaggregation perturbs the data, and additional masking allows refining data utility in several ways: increasing data granularity, avoiding discretization of numerical data, and reducing the impact of outliers. If the variability of the private data values in a group of k subjects is too small, k-anonymity does not protect against attribute disclosure. In this work, role-based access control is assumed: access control policies define selection predicates for roles, and an imprecision bound for each permission defines a threshold on the amount of imprecision that can be tolerated, so the proposed approach reduces the imprecision for each selection predicate. Whereas existing papers anonymize only static relational tables, here the privacy-preserving access control mechanism is applied to incremental data.
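Microaggregation itself is easy to sketch: sort the records, group them k at a time, and release each group's centroid in place of the raw values, so every released value is shared by at least k records. Real methods such as MDAV group by multivariate distance; this univariate toy shows only the core idea:

```python
import numpy as np

def microaggregate(values, k=3):
    order = np.argsort(values)
    out = np.empty(len(values))
    for start in range(0, len(values), k):
        group = order[start:start + k]
        if len(group) < k:                 # fold a short tail into the last group
            group = order[start - k:]
        out[group] = values[group].mean()  # centroid replaces the raw values
    return out

salaries = np.array([31000, 90000, 33000, 35000, 88000, 41000, 86000])
print(microaggregate(salaries, k=3))
# Each released salary now equals a group mean shared by >= 3 individuals.
```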
IRJET - Fault Detection and Prediction of Failure using Vibration Analysis - IRJET Journal
This document discusses fault detection and prediction of failures in rotating equipment using vibration analysis. It begins by introducing vibration analysis as a method to monitor machines and detect faults in rotating components that may cause failures. It then discusses how motor vibration is measured and analyzed using techniques like spectrum analysis to identify faults like unbalance, bearing issues, or broken rotor bars. The document proposes decomposing vibration signals using intrinsic mode functions and calculating the Gabor representation's frequency marginal to identify fault types using classifiers like support vector machines or random forests. It provides context on data mining techniques relevant to this type of fault prediction problem.
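A hedged sketch of the overall signal path (vibration signal -> magnitude spectrum -> band-energy features -> classifier); the synthetic "unbalance" signal with a dominant running-speed tone and the simple band-energy features are illustrative assumptions, cruder than the intrinsic-mode/Gabor analysis the document proposes:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

FS = 5000  # sampling rate in Hz (assumed)

def band_energies(signal, n_bands=16):
    spec = np.abs(np.fft.rfft(signal))     # magnitude spectrum
    return np.array([b.sum() for b in np.array_split(spec, n_bands)])

def make_signal(faulty, rng):
    t = np.arange(FS) / FS                 # one second of vibration data
    base = 0.1 * rng.normal(size=FS)       # background noise
    if faulty:                             # unbalance: strong 25 Hz (1x) component
        base += np.sin(2 * np.pi * 25 * t)
    return base

rng = np.random.default_rng(0)
X = np.array([band_energies(make_signal(f, rng)) for f in [0, 1] * 50])
y = np.array([0, 1] * 50)
clf = RandomForestClassifier(n_estimators=50).fit(X, y)
print(clf.score(X, y))                     # in-sample sanity check only
```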
Data Mining Framework for Network Intrusion Detection using Efficient Techniques - IJAEMSJORNAL
The implementation measures the classification accuracy on benchmark datasets after combining SIS and ANNs. In order to put a number on the gains made by using SIS as a strategic tool in data mining, extensive experiments and analyses are carried out. The predicted results of this investigation will have implications for both theoretical and applied settings. Predictive models in a wide variety of disciplines may benefit from the enhanced classification accuracy enabled by SIS inside ANNs. An invaluable resource for scholars and practitioners in the fields of AI and data mining, this study adds to the continuing conversation about how to maximize the efficacy of machine learning methods.
IDENTITY DISCLOSURE PROTECTION IN DYNAMIC NETWORKS USING KW-STRUCTURAL DIV... - IJITE
Data mining extracts accurate information for the requesting user after the raw data is analyzed. Among its many developments, data mining faces pressing issues of security, privacy, and integrity. One of the latest techniques, privacy-preserving data publishing (PPDP), enforces security for the digital information provided by governments, corporations, companies, and individuals in social networks. People are embarrassed when an adversary learns the sensitive information they share. Sensitive information is gathered through the vertex and multi-community identities of the user: vertex identity denotes the user's own information, such as name, address, and mobile number, while multi-community identity denotes the community groups in which the user participates. To prevent such identity disclosures, this paper proposes a KW-structural diversity anonymity technique for protecting against vertex and multi-community identity disclosure, where k is the privacy level applied to users and W is the adversary's monitoring time.
Rao Mikkilineni discusses the emergence of cognitive computing models and a new cognitive infrastructure. He argues that increasing data volumes and the need for real-time insights are driving the need for intelligent, sentient, and resilient systems. The new cognitive infrastructure will include a cognitive and infrastructure agnostic control overlay, composable services, and cognitive deep learning integration. It will enable a post-hypervisor cognitive computing era with intelligent, distributed systems.
Review Paper on Shared and Distributed Memory Parallel Algorithms to Solve Bi... - JIEMS Akkalkuwa
This document presents a review of parallel algorithms to solve big data problems in biological, social network, and spatial domains using shared and distributed memory. It discusses sequential and parallel algorithms for community detection in protein-protein interaction networks and social networks. It also discusses techniques for processing and analyzing large LiDAR point cloud data for applications like forest monitoring and 3D modeling. The document reviews relevant literature on algorithms for community detection, network partitioning, and LiDAR data reduction and interpolation. It then describes the BLLP algorithm for community detection in biological networks and discusses how it could be extended to distributed memory systems.
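As a concrete stand-in for the community-detection step, here is networkx's label propagation on a toy graph. This illustrates the general label-propagation technique, not the BLLP algorithm discussed in the review:

```python
import networkx as nx
from networkx.algorithms.community import label_propagation_communities

# A toy stand-in for a protein-protein interaction or social graph.
G = nx.Graph()
G.add_edges_from([("a", "b"), ("b", "c"), ("a", "c"),   # dense triangle 1
                  ("d", "e"), ("e", "f"), ("d", "f"),   # dense triangle 2
                  ("c", "d")])                          # weak bridge between them

# Each node repeatedly adopts its neighbors' majority label until stable;
# the procedure is randomized, so runs may differ on larger graphs.
for community in label_propagation_communities(G):
    print(sorted(community))
```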
Image Recognition Expert System based on deep learning - PRATHAMESH REGE
The document summarizes literature on image recognition expert systems and deep learning. It discusses two papers:
1. The Low-Power Image Recognition Challenge, which established a benchmark for comparing low-power image recognition solutions based on both accuracy and energy efficiency, using datasets like ILSVRC.
2. The role of knowledge-based systems and expert systems in automatic interpretation of aerial images. It discusses techniques like semantic networks, frames and logical inference used to solve ill-defined problems with limited information. Frameworks like the blackboard model, ACRONYM and SIGMA are discussed.
Integrated Analytics for IIoT Predictive Maintenance using IoT Big Data Cloud... - Hong-Linh Truong
For predictive maintenance of equipment with Industrial Internet of Things (IIoT) technologies, existing IoT Cloud systems provide strong monitoring and data analysis capabilities for detecting and predicting equipment status. However, we need to support complex interactions among different software components and human activities to provide integrated analytics, as software algorithms alone cannot deal with the complexity and scale of data collection and analysis and the diversity of equipment, due to the difficulty of capturing and modeling uncertainties and domain knowledge in predictive maintenance. In this paper, we describe how we design and augment complex IoT big data cloud systems for integrated analytics of IIoT predictive maintenance. Our approach is to identify various complex interactions for solving system incidents together with relevant critical analytics results about equipment. We incorporate humans into various parts of complex IoT Cloud systems to enable situational data collection, services management, and data analytics. We leverage serverless functions, cloud services, and domain knowledge to support dynamic interactions between humans and software for maintaining equipment. We use a real-world maintenance scenario for Base Transceiver Stations to illustrate our engineering approach, which we have prototyped with state-of-the-art cloud and IoT technologies such as Apache NiFi, Hadoop, Spark, and Google Cloud Functions.
Similar to "Detection of fraud in financial blockchain-based transactions through big data analytics":
The document describes research on social network monitoring and disinformation in Europe. The project seeks to develop a hybrid platform that uses deterministic techniques and artificial intelligence to classify and analyze network content, detect bots, and measure virality. The ultimate goal is to help verify information and combat disinformation.
This document discusses engineering digitalization through task automation and reuse in the development lifecycle. It proposes a knowledge-centric approach to systems engineering using a knowledge management strategy. This includes defining a controlled vocabulary, relating terms through relationships and clusters, representing textual patterns for matching, and combining rules and tasks to infer information. This knowledge graph could then enable capabilities like requirements extraction, model population, quality checking, and reuse of system artifacts. The approach aims to automate tasks, link different artifact types, and leverage semantics and AI/ML to better understand and exploit knowledge embedded in systems artifacts.
Presentation adapted from the ProSTEP symposium to present the concept and advances in the digitalization of the lifecycle with a focus on task automation and reuse.
1) The document discusses how systems engineering methods can be integrated with the AI/ML lifecycle to engineer intelligent systems. It identifies 10 major challenges for this integration, including describing AI/ML model needs and capabilities, integrating AI/ML into specification, verification, and other systems engineering processes.
2) The document proposes concepts for tackling each challenge, such as using standards to describe AI/ML model lifecycles and digital twin environments for verification. It also discusses opportunities like reusing existing AI/ML models and the need to educate new professionals.
3) Key points are that research is active in integrating systems engineering and AI/ML to build safer, more cost-effective cyber-physical systems, and
This document discusses digitalizing the engineering lifecycle through task automation and reuse. It proposes a knowledge-centric systems engineering approach using a knowledge management strategy called "Sailing the V". This involves defining a controlled vocabulary and formalizing relationships between terms, textual patterns, and rules to infer information and link system artifacts like requirements, models, and simulations. The goal is to automate tasks, enable reuse, ensure quality, and provide a more integrated environment for engineers. Future work will focus on data integration, semantics, artificial intelligence, and enhancing engineering methods.
This document presents an introduction to Deep Learning. It begins with an agenda that includes an overview of Deep Learning, Keras, and use-case examples. It then covers architectures and configurations of deep neural networks, including activation functions, loss functions, and example networks such as AlexNet and ResNet. It also describes the technology environment, including frameworks such as TensorFlow and Keras, and cloud infrastructure. Finally, it provides a working methodology and a list of practical examples.
This presentation is a keynote from the AI4SE International Workshop exploring the challenges and opportunities of bringing Systems Engineering to the development of AI/ML functions for safety-critical systems.
This is the presentation of the paper about the integration of artificial intelligence and the systems engineering lifecycle.
You can find more information in the following link: https://event.conflr.com/IS2019/sessiondetail_395325
The objective of this presentation to present some challenges and opportunities in the integration of Systems Engineering and the Artificial Intelligence/Machine Learning model lifecycle.
A presentation of the ongoing work on interoperability within the toolchain. A new domain, OSLC KM, is introduced; some experiments for reusing models are also presented, and some videos are used to present user stories.
This document introduces software architecture and provides examples using GitHub. It defines software architecture as the fundamental concepts or properties of a system embodied in its elements, relationships, and design principles. The document outlines Philippe Kruchten's 4+1 view model for describing software architecture, including logical, process, physical and development views in addition to scenarios. Diagrams for GitHub's class, component, sequence and deployment architectures are presented as examples.
This is the final degree project of Eduardo Cibrián, who developed a semantic system to generate news headlines for several sports based on a set of patterns.
In this presentation, an overview of blockchain foundations is presented. The presentation introduces the use of blockchain in the music industry. To do so, a good number of platforms are presented. It mainly reviews the use of blockchain for intellectual property management, digital identity, monetization, etc.
OJP data from firms like Vicinity Jobs have emerged as a complement to traditional sources of labour demand data, such as the Job Vacancy and Wages Survey (JVWS). Ibrahim Abuallail, PhD Candidate, University of Ottawa, presented research relating to bias in OJPs and a proposed approach to effectively adjust OJP data to complement existing official data (such as from the JVWS) and improve the measurement of labour demand.
Lecture slides titled Fraud Risk Mitigation, delivered as a webinar at the Society for West African Internal Audit Practitioners (SWAIAP) on Wednesday, November 8, 2023.
5 Tips for Creating Standard Financial Reports, by EasyReports
Well-crafted financial reports serve as vital tools for decision-making and transparency within an organization. By following the tips below, you can create standardized financial reports that effectively communicate your company's financial health and performance to stakeholders.
Economic Risk Factor Update: June 2024 [SlideShare], by Commonwealth
May’s reports showed signs of continued economic growth, said Sam Millette, director, fixed income, in his latest Economic Risk Factor Update.
For more market updates, subscribe to The Independent Market Observer at https://blog.commonwealth.com/independent-market-observer.
STREETONOMICS: Exploring the Uncharted Territories of Informal Markets through..., by sameer shah
Delve into the world of STREETONOMICS, where a team of 7 enthusiasts embarks on a journey to understand unorganized markets. By engaging with a coffee street vendor and crafting questionnaires, this project uncovers valuable insights into consumer behavior and market dynamics in informal settings.
BONKMILLON Unleashes Its Bonkers Potential on Solana, by coingabbar
Introducing BONKMILLON - The Most Bonkers Meme Coin Yet
Let's be real for a second – the world of meme coins can feel like a bit of a circus at times. Every other day, there's a new token promising to take you "to the moon" or offering some groundbreaking utility that'll change the game forever. But how many of them actually deliver on that hype?
Detection of fraud in financial blockchain-based transactions through big data analytics
1. Detection of fraud in financial blockchain-based transactions through big data analytics
Jessica Páez Bonilla
Director: Jose María Álvarez Rodríguez
Universidad Carlos III de Madrid
Master in Big Data Analytics
2017-2018
July 11, 2018
2. Overview
1 Introduction
2 Project Objectives
3 System Design
4 Implementation
5 Experiment
6 Project Budget and Plan
7 Legal Framework and socio-economic environment
8 Conclusions and Future works
3. Introduction
Using analytical techniques (data gathering, preprocessing, and model building), it could be possible to detect and prevent financial fraud.
The aim is to describe complex fraud in terms of patterns suitable for system-driven detection and analysis.
Network analysis can provide useful insight into large datasets based on the interconnectedness of the agents in the network being analyzed.
4. Introduction
Network: shows relationships among the blockchain users and the flow of money. It enables the discovery of fraud patterns.
Network graph analysis offers a method for capturing the context of fraud in a standard, machine-readable and transferable format.
Associations learned from visually observing fraudulent transactions could be used as knowledge input to create analytical models.
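As a minimal illustration of this idea (a sketch, not code from the thesis), the snippet below builds a directed transaction graph with networkx from an edge list of (sender, receiver, amount) tuples; the addresses and amounts are made up:

```python
# Hedged sketch: a directed transaction network in networkx.
# Addresses and amounts are placeholders for illustration only.
import networkx as nx

edges = [
    ("addr_A", "addr_B", 0.5),
    ("addr_B", "addr_C", 0.2),
    ("addr_A", "addr_C", 1.3),
]

G = nx.DiGraph()  # money flows are directed
for sender, receiver, amount in edges:
    G.add_edge(sender, receiver, weight=amount)
```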
5. Project Objectives
1 Research techniques used for fraud detection and explore blockchain data.
2 Design a system that could take into account the patterns surrounding the fraudulent transactions.
3 Implement a system using big data analytic tools like R and Python.
4 Experiment with and validate the designed system.
7. System Design - Network Metrics

Metric       Interpretation
Degree       Influence on the network
Closeness    How quickly other nodes in the network can be reached
Betweenness  Node location: does it lie on the shortest paths between other nodes?
Density      Level of linkage among the nodes
Modularity   How modular the network is
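As a hedged sketch of how these metrics can be computed (the built-in graph below is a stand-in for an undirected projection of the transaction network; the `community` package is python-louvain, listed on the implementation slide):

```python
# Hedged sketch (not thesis code): computing the metrics above with
# networkx and the python-louvain `community` package.
import networkx as nx
import community as community_louvain  # python-louvain

G = nx.karate_club_graph()  # placeholder graph

degree = dict(G.degree())                   # influence on the network
closeness = nx.closeness_centrality(G)      # how quickly other nodes are reached
betweenness = nx.betweenness_centrality(G)  # presence on shortest paths
density = nx.density(G)                     # level of linkage among nodes

partition = community_louvain.best_partition(G)         # Louvain communities
modularity = community_louvain.modularity(partition, G) # how modular the network is
```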
8. Implementation - Technology used
BigQuery, R (igraph) and Python have been used in the development of this system.

Table 1: Used Package Versions
Package     Version
matplotlib  1.5.1
pandas      0.19.2
networkx    1.11
community   0.9
numpy       1.11.3
scipy       0.18.1
9. Experiment - Steps
1 Data Exploration.
2 Network metrics and extraction of communities.
3 Features and ML algorithms selection.
4 Performance Measures.
5 Execution.
6 Analysis of Results.
7 Experiment Limitations.
10. Experiment - 1. Data Exploration
Bitcoin blockchain data was explored using BigQuery. A data segment containing fraudulent movements was chosen as the sample for analysis in this project.
Figure 1: Blocks over time
Figure 2: Transactions in the sample
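A minimal sketch of this kind of exploration with the BigQuery Python client is shown below; the public dataset and column names are assumptions for illustration, not taken from the slides:

```python
# Hedged sketch: querying public Bitcoin data with the BigQuery client.
# Dataset/table and columns below are assumed, not from the thesis.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT block_timestamp, `hash`, output_value
    FROM `bigquery-public-data.crypto_bitcoin.transactions`
    LIMIT 1000
"""
for row in client.query(query).result():
    print(row.block_timestamp, row.output_value)
```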
12. Experiment - 3. Features and ML algorithms selection
Figure 4: Selected features
ML Algorithms
1 Decision Tree
  1 White-box model; it can be interpreted.
  2 Performs well on imbalanced datasets.
2 Random Forest
  1 Ensemble: combines the predictions of several base estimators in order to improve robustness over a single estimator.
  2 Each tree in the ensemble is built from a sample drawn with replacement.
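A hedged scikit-learn sketch of the two classifiers above; synthetic placeholder data stands in for the thesis's network-metric features and suspicious/non-suspicious labels:

```python
# Hedged sketch (not thesis code): the two classifiers via scikit-learn,
# fitted on synthetic placeholder data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
```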
13. Experiment - 4. Performance Measures
Classification Accuracy
It gives the percentage of correct predictions.
Confusion Matrix
It is a 2x2 matrix that tells us the types of errors that the classifier is making.
AUC - Area Under the (ROC) Curve
It is a single-number summary of classifier performance, useful even when there is class imbalance.
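A hedged sketch of the three measures with sklearn.metrics, reusing the fitted `forest` and the test split from the previous sketch:

```python
# Hedged sketch: the three performance measures via sklearn.metrics.
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

y_pred = forest.predict(X_test)
print(accuracy_score(y_test, y_pred))         # percentage of correct predictions
print(confusion_matrix(y_test, y_pred))       # 2x2 matrix of the error types
y_score = forest.predict_proba(X_test)[:, 1]  # probability of the positive class
print(roc_auc_score(y_test, y_score))         # area under the ROC curve
```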
14. Experiment - 5. Execution
Once the features (transaction network metrics) are obtained, and the ML algorithms and their performance metrics are defined, two main tasks need to be run before fitting the system.
Observations Labeling
Analysis of a real fraudulent transaction.
Dataset Balancing
Once the dataset was labeled, there were many more observations of one class than of the other. An oversampling technique was applied in order to balance it.
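A minimal imbalanced-learn sketch of this balancing step, reusing the placeholder split from the earlier classifier sketch:

```python
# Hedged sketch: oversampling the minority class with imbalanced-learn.
# Current releases expose fit_resample; very old ones used fit_sample.
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=0)
X_balanced, y_balanced = ros.fit_resample(X_train, y_train)
```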
15. Experiment - 5.1. Analysis of a fraudulent transaction
Figure 5: Fraudster Neighbours
16. Experiment - 5.2. Dataset Balancing
The dataset used has around 30k observations in the training set and around 7k in the test set. The Python package Imbalanced-learn was used. It applies an oversampling on the minority class.

Table 2: Proportion of classes
Dataset  Class           Proportion
Train    Suspicious      0.498627
Train    Non-suspicious  0.501373
Test     Suspicious      0.500343
Test     Non-suspicious  0.499657
17. Experiment - 6. Analysis of Results
The obtained metrics of the selected ML algorithms are summarized in the table below:

Table 3: Classification Metrics Comparison
Model          Class. Accuracy  Sensitivity  AUC
Decision Tree  0.9989           0.9979       0.9994
Random Forest  0.9619           0.9752       0.9974

The Random Forest was selected, as it was the model that gave more weight to the different network metrics while still achieving a high accuracy.
18. Experiment - 6. Analysis of Results
The weight given to each feature by the Random Forest is presented in a bar chart (figure not reproduced here).
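Since the original chart is not reproduced in this transcript, here is a hedged matplotlib sketch of how such a feature-importance bar chart can be produced; the feature names are placeholders and `forest` is the fitted Random Forest from the earlier sketch:

```python
# Hedged sketch: feature-importance bar chart with matplotlib
# (listed among the used packages). Feature names are placeholders.
import matplotlib.pyplot as plt

feature_names = ["degree", "closeness", "betweenness", "density", "modularity"]
plt.bar(feature_names, forest.feature_importances_)
plt.ylabel("Feature weight")
plt.title("Random Forest feature importances")
plt.show()
```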
21. Experiment - 7. Limitations
By studying more known cases of fraud within the Bitcoin blockchain, it could be possible to expand the set of known fraudulent transaction patterns.
Having more data would also help to prevent overfitting with decision trees, as the tree design would not be able to cover all the training data.
22. Project Budget
A summary of the project budget is presented in the table.

Cost            Total (€)
Direct Costs    8,827.5
Indirect Costs  882.75
Total Costs     9,710.25
Profit (10%)    971.025
Cost + Profit   10,681.275
IVA (21%)       2,243.06
TOTAL + IVA     12,924.343
24. Legal Framework and socio-economic environment
Legal Framework: The Bitcoin blockchain data is now available for exploration with BigQuery, using Google Cloud services. The data is public and no licensing is required.
Socio-economic environment: Blockchain technology is rapidly evolving and will be widely used in the finance world in the coming years.
It has been forecast that 10% of world GDP will be stored in blockchains by 2020.
The IoT era also promotes the Fintech revolution.
This creates the challenge of developing and applying different sets of techniques in order to detect fraud on these new digital platforms.
25. Conclusions
1 Business: Detecting and flagging activity suspected of being fraudulent before it actually takes place could save billions annually in both developed and developing economies.
2 Technical: The proposed system can flag a suspicious blockchain transaction with high accuracy, taking into account network metrics resulting from modeling the giant components of the transaction network.
3 Personal: Learning about a growing sector ("Fintech") that combines finance and technology, as well as how analytic techniques can be applied to it.
26. Future works
1 Create a software platform that could access and integrate both the R and Python environments.
2 This platform could run continuously and flag, by means of a UI, whenever the model classifies a new observation as Suspicious.
3 Knowing more patterns of fraudulent transactions can help to avoid overfitting in the models.
4 Try other network metrics (such as mean neighbour degree, node correlation similarity, etc.) as features for the classification model; see the sketch below for the first of these.
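As a pointer, networkx already ships the mean neighbour degree metric; a minimal sketch with a placeholder graph:

```python
# Hedged sketch: mean neighbour degree via networkx on a placeholder graph.
import networkx as nx

G = nx.karate_club_graph()  # placeholder graph
mean_neighbour_degree = nx.average_neighbor_degree(G)  # dict: node -> value
```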
27. Thank you for your attention