CONFIDENTIAL: The information in this document belongs to Boston Institute of Analytics LLC. Any unauthorized sharing of this
material is prohibited and subject to legal action under breach of IP and confidentiality clauses.
Fraud Detection
Vaidehi Chandre
Batch: AND-JUN2024-DSAI-1
Room No: TR 2
Place: Mumbai
Agenda
1. Introduction
2. Problem Statement
3. Goal
4. Understanding the Dataset
5. Exploratory Data Analysis
6. Data Cleaning and Preprocessing
7. Feature Engineering
8. Feature Extraction and Correlation Analysis
9. Model Selection
10. Evaluation & Metrics
11. Metrics Result
12. Conclusion
Introduction
Online payments have transformed financial transactions but have also led to an increase in
fraudulent activities. Early detection of fraud can prevent significant financial and
reputational losses. Detecting fraudulent transactions promptly is critical to safeguard users
and businesses. Machine Learning (ML) offers a data-driven approach to identify fraudulent
patterns effectively.
This project leverages machine learning to
build a model for fraud detection.
Problem Statement
• Fraudulent transactions are rare but impactful,
requiring high sensitivity in detection.
• Manual review is infeasible due to the volume of
transactions.
• In the banking industry, credit card fraud
detection is not just a trend but a necessity:
banks need proactive monitoring and fraud
prevention mechanisms in place.
• In this project we detect fraudulent credit
card transactions with the help of machine
learning models, analyzing previously
collected customer-level data.
GOAL
• Build and evaluate multiple ML models
to classify transactions as fraud or not
fraud.
• Compare models using metrics like
accuracy, precision, recall, F1 score, and
AUC-ROC.
• Compare the performance of five
algorithms: Logistic Regression,
Decision Tree, Gaussian Naive Bayes,
K-Nearest Neighbors, and Random Forest.
• Select the best-performing model for
real-world deployment.
Understanding the dataset
Online Payments Fraud Detection Dataset.
Key Features:
• type: Type of transaction (e.g., CASH_OUT,
TRANSFER).
• amount: Transaction amount.
• isFraud: Target variable indicating fraud (1)
or not fraud (0).
• oldbalanceOrg and newbalanceOrig:
Balances before and after the transaction for
the origin account.
• oldbalanceDest and newbalanceDest:
Balances before and after the transaction for
the destination account.
• Dataset Shape: Rows: 246946, Columns: 11.
Exploratory Data Analysis
• Identify the presence of outliers in the dataset.
• Outliers are extreme values that may skew model
performance.
• Visualization:
• A horizontal boxplot is created for all numerical
features in the dataset.
• The whiskers of the boxplot show the range of the
non-outlier data, and points outside the whiskers are
potential outliers.
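The whisker rule behind a boxplot can also be applied numerically. The sketch below assumes the common default of 1.5 × IQR for the whisker length; the helper name `iqr_outlier_mask` and the toy amounts are hypothetical and are not taken from the project's notebook or dataset.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies beyond the boxplot whisker limits."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)


# Hypothetical toy amounts, used only to illustrate the rule.
amounts = pd.Series([10.0, 12.0, 11.0, 13.0, 12.5, 11.5, 500.0], name="amount")
mask = iqr_outlier_mask(amounts)

# Values flagged as potential outliers (here only the 500.0 entry).
print(amounts[mask])
```

Flagged values are candidates for inspection rather than automatic removal; in fraud data, unusually large amounts may be exactly the cases of interest.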
Exploratory Data Analysis
EDA of the "type" Feature
df['type'].unique()
Displays all unique transaction types in the type
column.
Visualization:
A countplot is created to show the number of
transactions for each type.
• X-axis: Transaction types (CASH_OUT,
PAYMENT, TRANSFER, etc.).
• Y-axis: Number of transactions.
• The grid makes the visualization more readable.
• Helps identify which transaction types dominate
the dataset.
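The numbers a countplot displays can be tabulated directly. This is a minimal sketch on a hypothetical six-row frame (the rows are invented for illustration); only the column names `type` and `isFraud` come from the slides.

```python
import pandas as pd

# Hypothetical toy rows; the real dataset is far larger.
df = pd.DataFrame({
    "type": ["CASH_OUT", "PAYMENT", "CASH_OUT", "TRANSFER", "PAYMENT", "CASH_OUT"],
    "isFraud": [0, 0, 1, 1, 0, 0],
})

# Number of transactions per type: the bar heights of the countplot.
type_counts = df["type"].value_counts()

# Share of each type, to see which types dominate.
type_share = df["type"].value_counts(normalize=True)

# Fraction of fraudulent transactions within each type.
fraud_rate = df.groupby("type")["isFraud"].mean()

print(type_counts)
print(type_share.round(3))
print(fraud_rate.round(3))
```

Looking at the fraud rate per type alongside the raw counts helps separate "this type is common" from "this type is risky".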
Exploratory Data Analysis
• EDA of the "isFraud" Feature
df['isFraud'].unique().astype(int)
Lists the unique values in the isFraud column
(0: Not Fraud, 1: Fraud).
• Dropping unnecessary columns
df.drop(['nameOrig', 'nameDest'], axis=1, inplace=True)
Removes columns nameOrig and nameDest.
These are identifiers and do not contribute to fraud prediction.
Dropping these reduces dimensionality and computational
overhead.
• Label Encoding for "type" Feature
# Assign the result back: replace(..., inplace=True) on a selected
# column may not update df in recent pandas versions.
df['type'] = df['type'].replace({'CASH_OUT': 0,
'PAYMENT': 1, 'CASH_IN': 2,
'TRANSFER': 3, 'DEBIT': 4})
df['type'].value_counts()
Converts the categorical values in type into
numerical labels (0–4).
Necessary for feeding data into ML models
that require numerical inputs.
The type column now contains integers instead of
strings, simplifying computation.
Exploratory Data Analysis
Visualization:
A barplot is created to visualize the counts
of isFraud.
X-axis: Fraud status (0 = Not Fraud, 1 = Fraud).
Y-axis: Number of transactions.
Uses a vibrant color palette for
differentiation.
Data Cleaning and Preprocessing
• Checked for missing values: None found.
• Removed irrelevant features: nameOrig and nameDest (do
not impact fraud detection).
• Split data into training (70%) and testing (30%) sets.
• Applied standardization using StandardScaler to put
numerical features on a common scale (zero mean, unit variance).
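The split and scaling steps can be sketched as follows. The slides state only the 70/30 ratio and the use of StandardScaler; the synthetic data, the `stratify=y` option, and the `random_state` value are assumptions added for illustration. Fitting the scaler on the training portion only is a common precaution against leaking test-set statistics into training.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic features and labels (10% positives), not the project data.
rng = np.random.default_rng(0)
X = rng.normal(loc=100.0, scale=25.0, size=(200, 3))
y = np.array([1] * 20 + [0] * 180)

# 70/30 split; stratify=y keeps the fraud share equal in both parts (assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# Learn mean and standard deviation from the training set only,
# then apply the same transformation to the test set.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```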
Feature Engineering
• Dropped low-relevance features: newbalanceOrig,
oldbalanceDest (low correlation with fraud).
• Focused on features with significant impact on isFraud, such as
type and amount.
• Processed categorical features (type) into numerical format for
modeling.
• Retained significant features to improve model performance and
reduce overfitting.
Feature Extraction and Correlation Analysis
• Correlation Analysis
df.corr()
Calculates pairwise correlations between numerical
features and the target variable (isFraud).
Understand relationships between features.
Identify highly correlated features to keep or drop.
• Visualization:
A heatmap visualizes correlations. Features with
strong positive or negative correlation to isFraud
are key for prediction.
Colors:
Dark red: High positive correlation.
Dark blue: High negative correlation.
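To read the heatmap's target column as a ranked list, the isFraud column of the correlation matrix can be sorted by absolute value. The sketch below uses a hypothetical six-row frame with invented values; only the column names come from the slides.

```python
import pandas as pd

# Hypothetical toy data, constructed so that amount tracks the label closely.
df = pd.DataFrame({
    "amount": [10.0, 20.0, 30.0, 400.0, 500.0, 600.0],
    "oldbalanceOrg": [5.0, 3.0, 6.0, 4.0, 5.0, 3.0],
    "isFraud": [0, 0, 0, 1, 1, 1],
})

# Pairwise Pearson correlations between all numeric columns.
corr = df.corr()

# Correlation of each feature with the target, strongest first.
target_corr = (
    corr["isFraud"]
    .drop("isFraud")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(target_corr)
```

Pearson correlation only captures linear association, so a low value does not by itself show that a feature is uninformative for a non-linear model such as a tree ensemble.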
Model Selection
• Logistic Regression
Logistic Regression is effective for binary classification when the relationship between features and the target is
approximately linear. The model’s probabilistic output helps in interpreting and ranking predictions for fraud detection.
• Decision Tree
Decision Trees tend to overfit, especially on datasets with imbalanced classes or noisy features.
Lack of regularization or pruning in the default setup means the model memorizes specific patterns in the training set
rather than learning generalizable rules.
• Naive Bayes
Naive Bayes assumes feature independence, which is rarely true in real-world datasets.
In fraud detection datasets, features are often correlated (e.g., transaction amount and balance), making this assumption
invalid.
• K-Nearest Neighbours
KNN is a non-parametric, instance-based learning algorithm that classifies a data point based on the majority class of
its nearest neighbors.
• Random Forest
Random Forest is an ensemble learning method that constructs multiple decision trees and combines their outputs to
make a final prediction. It reduces overfitting and improves generalization. Random Forest is the best-performing
model for this dataset, achieving near-perfect accuracy and excelling in all key metrics. Its ensemble approach makes it
highly robust and reliable for detecting fraud in imbalanced datasets.
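A side-by-side comparison of the five model families might look like the sketch below. It runs on synthetic imbalanced data from `make_classification`, not on the project dataset, and all hyperparameters (`max_iter`, `n_neighbors`, `n_estimators`, the random seeds) are illustrative assumptions rather than the settings used in the project, so its scores say nothing about the results reported in these slides.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with roughly 5% positives (assumption).
X, y = make_classification(
    n_samples=2000, n_features=6, n_informative=4,
    weights=[0.95, 0.05], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

# Scale with statistics learned from the training split only.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Gaussian Naive Bayes": GaussianNB(),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    y_pred = model.predict(X_test_s)               # hard 0/1 labels
    y_score = model.predict_proba(X_test_s)[:, 1]  # probability of class 1
    results[name] = {
        "f1": f1_score(y_test, y_pred, zero_division=0),
        "roc_auc": roc_auc_score(y_test, y_score),
    }

for name, scores in results.items():
    print(f"{name:22s} F1={scores['f1']:.3f}  ROC-AUC={scores['roc_auc']:.3f}")
```

Keeping the models in a dictionary means every candidate is trained and scored on exactly the same split, which makes the comparison fair.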
Evaluation & Metrics
• Best performing Model
Random Forest is the best-performing model for this
dataset, achieving near-perfect accuracy and
excelling in all key metrics. Its ensemble approach
makes it highly robust and reliable for detecting
fraud in imbalanced datasets.
• Poor performing Model
Naive Bayes struggled with the interdependence of
features in this dataset. While it had a decent recall,
its precision was very low, leading to many false
positives.
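The pattern of decent recall with very low precision is easiest to see from confusion-matrix counts. The sketch below uses hypothetical counts (1,000 transactions, 10 of them fraudulent), not results from this project; the function name `classification_metrics` is also illustrative.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 8 of 10 frauds caught, but 72 legitimate
# transactions are wrongly flagged as fraud.
metrics = classification_metrics(tp=8, fp=72, fn=2, tn=918)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, accuracy is 0.926 and recall is 0.8, yet precision is only 0.1: nine out of ten alerts are false alarms. This is why accuracy alone is a weak yardstick on imbalanced fraud data.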
Evaluation & Metrics
• Best Performing Models:
Random Forest achieves high accuracy and
excellent precision, significantly reducing false
positives. It also provides high recall, ensuring
most fraud cases are detected. The F1-score is
balanced and high, and ROC-AUC is strong.
• Moderate Performing Models:
The Decision Tree model performs well, with
high accuracy and decent precision and
recall. Its F1-score is moderate, indicating
a balance between precision and recall. KNN
performed well overall: its recall is
moderate, while its precision and ROC-AUC
are balanced.
• Poor Performing Models:
Naive Bayes tends to have low to moderate
accuracy for fraud detection. It struggled
because it assumes feature independence,
which rarely holds true in real-world data.
Naive Bayes is not the best choice for fraud
detection tasks due to its inherent assumptions
and inability to model feature dependencies
effectively.
[Figure: result panels for Random Forest, K-Nearest Neighbour, Decision Tree, and Naïve Bayes]
Metrics Result
• Logistic Regression: Accuracy is moderate;
precision and recall are balanced, but the model
struggles with complex patterns. Best for
simplicity and interpretability.
• Decision Tree: Accuracy is moderate to high
and precision is decent, but the model is prone to
overfitting. Best for clear decision-making
and smaller datasets.
• Naive Bayes: Accuracy is low to moderate;
recall is high, but precision is low due
to false positives. Best for fast
computations and baseline comparisons.
• K-Nearest Neighbors (KNN): Accuracy is
moderate but sensitive to data scaling;
precision and recall are better with
balanced data, and the model can be
computationally intensive. Best for small
datasets with clear clusters.
• Random Forest: Accuracy is high, with
strong generalization; precision and
recall are excellent, and the model handles
imbalanced datasets well. Best for complex
datasets where robustness and feature
importance matter.
Conclusion
This fraud detection project aims to identify fraudulent
transactions within a dataset comprising transaction
types, amounts, timestamps, and labels indicating fraud
or non-fraud. The classes are imbalanced (fraud is
rare), and a significant portion of fraudulent
activity is associated with specific transaction
types. Various machine learning models were applied,
including Logistic Regression, Decision Tree, Naive
Bayes, K-Nearest Neighbors (KNN), and Random
Forest. Among these, Random Forest demonstrated
superior performance with high accuracy and balanced
precision and recall, making it the most suitable for
complex fraud patterns. This project has practical
applications in banking and financial systems, helping
institutions detect and prevent fraud in real time,
which supports financial security and builds
customer trust.
Thank You!

Fraud Detection: Harnessing Data Science for Securing Transactions
