Fraud Detection: Harnessing Data Science for Securing Transactions

CONFIDENTIAL: The information in this document belongs to Boston Institute of Analytics LLC. Any unauthorized sharing of this
material is prohibited and subject to legal action under breach of IP and confidentiality clauses.
Fraud Detection
Vaidehi Chandre
Batch: AND-JUN2024-DSAI-1
Room No: TR 2
Place: Mumbai

Agenda
1. Introduction
2. Problem Statement
3. Goal
4. Understanding the Dataset
5. Exploratory Data Analysis
6. Data Cleaning and Preprocessing
7. Feature Engineering
8. Feature Extraction and Correlation Analysis
9. Model Selection
10. Evaluation & Metrics
11. Metrics Result
12. Conclusion

Introduction
Online payments have transformed financial transactions but have also led to an increase in
fraudulent activities. Early detection of fraud can prevent significant financial and
reputational losses. Detecting fraudulent transactions promptly is critical to safeguard users
and businesses. Machine Learning (ML) offers a data-driven approach to identify fraudulent
patterns effectively.
This project leverages machine learning to
build a model for fraud detection.

Problem Statement
• Fraudulent transactions are rare but impactful,
requiring high sensitivity in detection.
• Manual review is infeasible due to the volume of
transactions.
• In the banking industry, credit card fraud
detection is not just a trend but a necessity:
banks need proactive monitoring and fraud
prevention mechanisms in place.
• In this project we detect fraudulent credit
card transactions with the help of machine
learning models, using customer-level
transaction data that has already been collected.

GOAL
• Build and evaluate multiple ML models
to classify transactions as fraud or not
fraud.
• Compare models using metrics like
accuracy, precision, recall, F1 score, and
AUC-ROC.
• Compare the performance of five
algorithms: Logistic Regression,
Decision Tree, Gaussian Naive Bayes,
K-Nearest Neighbours, and Random Forest.
• Select the best-performing model for
real-world deployment.

Understanding the dataset
Online Payments Fraud Detection Dataset.
Key Features:
• type: Type of transaction (e.g., CASH_OUT,
TRANSFER).
• amount: Transaction amount.
• isFraud: Target variable indicating fraud (1)
or not fraud (0).
• oldbalanceOrg and newbalanceOrig:
Balances before and after the transaction for
the origin account.
• oldbalanceDest and newbalanceDest:
Balances before and after the transaction for
the destination account.
• Dataset Shape: Rows: 246946, Columns: 11.

Exploratory Data
Analysis
•Identify the presence of outliers in the dataset.
•Outliers are extreme values that may skew model
performance.
•Visualization:
•A horizontal boxplot is created for all numerical
features in the dataset.
•The whiskers of the boxplot show the range of data,
and points outside the whiskers are potential
outliers.
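
The whisker rule behind the boxplot can be sketched in code. This is a minimal illustration, assuming the common 1.5 × IQR whisker convention (the default in most plotting libraries); the helper name iqr_outlier_mask and the sample amounts are hypothetical and not taken from the project.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside the boxplot whiskers."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr  # lower whisker limit
    upper = q3 + k * iqr  # upper whisker limit
    return (values < lower) | (values > upper)


# Hypothetical transaction amounts; the last value is deliberately extreme.
amounts = pd.Series([120.0, 80.0, 95.0, 110.0, 105.0, 90.0, 100.0, 5000.0])
mask = iqr_outlier_mask(amounts)
outliers = amounts[mask]
print(outliers.tolist())
```

In fraud data, extreme amounts can be exactly the cases of interest, so flagged points are usually inspected rather than dropped automatically.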

Exploratory Data
Analysis
EDA of the “type” Feature
df['type'].unique() –
Displays all unique transaction types in the type
column.
Visualization:
A countplot is created to show the number of
transactions for each type.
•X-axis: Transaction types (CASH_OUT,
PAYMENT, TRANSFER etc.).
•Y-axis: Number of transactions.
•The grid makes the visualization more readable.
•Helps identify which transaction types dominate
the dataset.

Exploratory Data
Analysis
• EDA of the "isFraud" Feature
df['isFraud'].unique().astype(int)
Lists the unique values in the isFraud column
(0: Not Fraud, 1: Fraud).
• Dropping unnecessary columns
df.drop(['nameOrig', 'nameDest'], axis=1, inplace=True)
Removes columns nameOrig and nameDest.
These are identifiers and do not contribute to fraud prediction.
Dropping these reduces dimensionality and computational
overhead
• Label Encoding for "type" Feature
df['type'] = df['type'].map({'CASH_OUT': 0,
'PAYMENT': 1, 'CASH_IN': 2,
'TRANSFER': 3, 'DEBIT': 4})
df['type'].value_counts()
Converts the categorical values in type into
numerical labels (0–4).
Necessary for feeding data into ML models
that require numerical inputs.
type column now contains integers instead of
strings, simplifying computation

Exploratory Data
Analysis
Visualization:
A barplot is created to visualize the counts
of isFraud.
X-axis: Fraud status (0 = Not Fraud, 1 =
Fraud).
Y-axis: Number of transactions.
Uses a vibrant color palette for
differentiation.
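
The class counts shown in the barplot can also be checked numerically. A small sketch, assuming a made-up label column with 3% fraud (the real fraud rate of the dataset is not stated here), showing why accuracy alone is misleading on imbalanced data:

```python
import pandas as pd

# Hypothetical labels: 97 legitimate transactions and 3 frauds.
labels = pd.Series([0] * 97 + [1] * 3, name="isFraud")

counts = labels.value_counts()  # number of rows per class
fraud_rate = labels.mean()      # share of rows labelled 1

# A trivial model that always predicts "not fraud" is right
# whenever the true label is 0.
baseline_accuracy = (labels == 0).mean()

print(counts.to_dict(), fraud_rate, baseline_accuracy)
```

The always-"not fraud" baseline scores high accuracy while catching no fraud at all, which is why precision, recall, and F1 are tracked alongside accuracy.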

Data Cleaning and Preprocessing
•Checked for missing values: None found.
•Removed irrelevant features: nameOrig and nameDest (do
not impact fraud detection).
•Split data into training (70%) and testing (30%) sets.
•Applied standardization using StandardScaler to normalize
numerical features.
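
The split and scaling steps can be sketched as follows. This is an illustration on synthetic data, not the project's code: the stratify argument and random_state are assumptions added here so that the rare class appears in both sets and the result is reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numerical features (hypothetical values).
rng = np.random.default_rng(0)
X = rng.normal(loc=1000.0, scale=250.0, size=(200, 3))
y = np.array([0] * 180 + [1] * 20)  # 10% positive class

# 70% training, 30% testing; stratify keeps the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# Fit the scaler on the training data only, then apply it to the test data,
# so no information from the test set leaks into preprocessing.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Fitting the scaler before splitting would let test-set statistics influence training, which tends to make evaluation results look better than they are.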

Feature Engineering
• Dropped low-relevance features: newbalanceOrig,
oldbalanceDest (low correlation with fraud).
• Focused on features with significant impact on isFraud, such as
type and amount.
• Processed categorical features (type) into numerical format for
modeling.
• Retained significant features to improve model performance and
reduce overfitting.

Feature Extraction and
Correlation Analysis
• Correlation Analysis
df.corr()
Calculates pairwise correlations between numerical
features and the target variable (isFraud). This helps to
understand relationships between features and to identify
highly correlated features to keep or drop.
• Visualization:
A heatmap visualizes correlations. Features with
strong positive or negative correlation to isFraud
are key for prediction.
Colors:
Dark red: High positive correlation.
Dark blue: High negative correlation.
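
Ranking features by their correlation with isFraud can be sketched like this. The tiny DataFrame is invented for illustration, so the resulting values say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical, already-encoded sample (not the project's data).
df = pd.DataFrame({
    "amount": [100, 200, 300, 400, 5000, 6000],
    "type": [1, 1, 0, 1, 3, 3],
    "isFraud": [0, 0, 0, 0, 1, 1],
})

corr = df.corr()  # pairwise Pearson correlations

# Correlation of each feature with the target, strongest first.
target_corr = (
    corr["isFraud"]
    .drop("isFraud")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```

Pearson correlation only captures linear relationships, so a low value is a hint for dropping a feature rather than proof that it is uninformative.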

Model Selection
• Logistic Regression
Logistic Regression is effective for binary classification when the relationship between features and the target is
approximately linear. The model’s probabilistic output helps in interpreting and ranking predictions for fraud detection.
• Decision Tree
Decision Trees tend to overfit, especially on datasets with imbalanced classes or noisy features.
Lack of regularization or pruning in the default setup means the model memorizes specific patterns in the training set
rather than learning generalizable rules.
• Naive Bayes
Naive Bayes assumes feature independence, which is rarely true in real-world datasets.
In fraud detection datasets, features are often correlated (e.g., transaction amount and balance), making this assumption
invalid.
• K-Nearest Neighbours
KNN is a non-parametric, instance-based learning algorithm that classifies a data point based on the majority class of
its nearest neighbors.
• Random Forest
Random Forest is an ensemble learning method that constructs multiple decision trees and combines their outputs to
make a final prediction. It reduces overfitting and improves generalization. Random Forest is the best-performing
model for this dataset, achieving near-perfect accuracy and excelling in all key metrics. Its ensemble approach makes it
highly robust and reliable for detecting fraud in imbalanced datasets.
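
A comparison of the five models could look roughly like the sketch below. It uses a synthetic imbalanced dataset from make_classification and mostly default hyperparameters; these are assumptions for illustration, not the settings used in this project, and the scores it prints do not reproduce the project's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with roughly 5% positives (hypothetical stand-in).
X, y = make_classification(
    n_samples=2000, n_features=6, n_informative=4,
    weights=[0.95, 0.05], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)               # hard class labels
    proba = model.predict_proba(X_test)[:, 1]  # fraud probability for AUC
    results[name] = {
        "f1": f1_score(y_test, pred, zero_division=0),
        "roc_auc": roc_auc_score(y_test, proba),
    }

for name, scores in results.items():
    print(name, scores)
```

Keeping every model on the same split and the same metrics is what makes the comparison fair.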

Evaluation & Metrics
• Best performing Model
Random Forest is the best-performing model for this
dataset, achieving near-perfect accuracy and
excelling in all key metrics. Its ensemble approach
makes it highly robust and reliable for detecting
fraud in imbalanced datasets.
• Poor performing Model
Naive Bayes struggled with the interdependence of
features in this dataset. While it had a decent recall,
its precision was very low, leading to many false
positives.
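
How decent recall can coexist with very low precision is easier to see with numbers. The sketch below computes the standard metrics from confusion-matrix counts; the counts are invented for illustration and are not the project's results.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical case: 100 frauds among 10,000 transactions.
# The model catches 90 of them but also flags 910 legitimate transactions.
m = classification_metrics(tp=90, fp=910, fn=10, tn=8990)
print(m)
```

Here recall is 0.90 while precision is only 0.09: most frauds are caught, but nine out of ten alerts are false alarms.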

Evaluation & Metrics
• Best Performing Models:
Random Forest achieves high accuracy and
excellent precision, significantly reducing false
positives. It also provides high recall, ensuring
most fraud cases are detected. The F1-score is
balanced and high, and ROC-AUC is strong.
• Moderate Performing Models:
The Decision Tree model performs well, with
high accuracy and decent precision and
recall. Its F1-score is moderate, indicating
a balance between precision and recall. KNN
performed well overall: its recall is
moderate, while its precision and ROC-AUC
are balanced.
• Poor performing Models:
Naive Bayes tends to have low to moderate
accuracy for fraud detection. It struggled
because it assumes feature independence,
which rarely holds true in real-world data.
Naive Bayes is not the best choice for fraud
detection tasks due to its inherent assumptions
and inability to model feature dependencies
effectively.
[Figure panels: Random Forest, K-Nearest Neighbour, Decision Tree, Naïve Bayes]

Metrics Result
• Logistic Regression: Accuracy is moderate;
precision and recall are balanced, but the model
struggles with complex patterns. Best for:
simplicity and interpretability.
• Decision Tree: Accuracy is moderate to high
and precision is decent, but the model is prone to
overfitting. Best for: clear decision-making
and smaller datasets.
• Naive Bayes: Accuracy is low to moderate;
recall is high, but precision is low due
to false positives. Best for: fast
computations and baseline comparisons.
• K-Nearest Neighbors (KNN): Accuracy is
moderate but sensitive to data scaling;
precision and recall are better with
balanced data, and the method can be
computationally intensive. Best for: small
datasets with clear clusters.
• Random Forest: Accuracy is high, with
strong generalization; precision and
recall are excellent, and it handles imbalanced
datasets well. Best for: robust modeling of complex
datasets with high feature importance.

Conclusion
This fraud detection project aims to identify fraudulent
transactions within a dataset comprising transaction
types, amounts, timestamps, and labels indicating fraud
or non-fraud. The dataset includes both balanced and
imbalanced aspects, with a significant portion of
fraudulent activities associated with specific transaction
types. Various machine learning models were applied,
including Logistic Regression, Decision Tree, Naive
Bayes, K-Nearest Neighbors (KNN), and Random
Forest. Among these, Random Forest demonstrated
superior performance with high accuracy and balanced
precision and recall, making it the most suitable for
complex fraud patterns. This project has practical
applications in banking and financial systems, helping
institutions detect and prevent fraud in real time,
ultimately ensuring financial security and building
customer trust.

Thank You!
