Predictive Analysis of Financial Fraud Detection using Azure and Spark ML

Jongwook Woo
BigDAI
HiPIC
CalStateLA
IDEAS SoCal Conf 2018
Oct 20 2018
Jongwook Woo, PhD, jwoo5@calstatela.edu
Priyanka Purushu, Niklas Melcher, Bhagyashree Bhagwat
Big Data AI Center (BigDAI / HiPIC)
California State University Los Angeles

Contents
• Myself
• Introduction To Big Data
• Introduction To Big Data Predictive Analytics
• Fraud Detection Predictive Analytics
• Summary

Myself
Experience:
• Since 2002: Professor at California State University Los Angeles
  – PhD in 2001: Computer Science and Engineering at USC
• Since 1998: R&D consulting in Hollywood
  – Warner Bros (Matrix online game), E!, citysearch.com, ARM etc
  – Information Search and Integration with FAST, Lucene/Solr, Sphinx
  – Implemented eBusiness applications using J2EE and middleware
• Since 2007: Exposed to Big Data at CitySearch.com
• 2012 - Present: Big Data Academic Partnerships
  – For Big Data research and training
    • Amazon AWS, Microsoft Azure, IBM Bluemix
    • Databricks, Hadoop vendors

Myself: Partners for Services

Experience in Big Data
• Collaboration
  – Big Data Technical Advisor of Isaac Engineering for Smart * (Factory, Farms, …) in Korea
  – Council Member of IBM Spark Technology Center
  – City of Los Angeles for DSF, OpenHub and Open Data
  – Startup Companies in Los Angeles
  – External Collaborator and Advisor in Big Data
    • IMSC of USC
    • Pennsylvania State University
    • The Big Link, Softzen, Wiken in Korea
• Grants
  – Oracle Cloud Big Data, IBM Bluemix, Microsoft Windows Azure, Amazon AWS in Research and Education Grant
• Partnership
  – Academic Education Partnership with Databricks, Tableau, Qlik, Cloudera, Hortonworks, SAS, Teradata

Myself: Public Partners

Myself: S/W Development Lead
http://www.mobygames.com/game/windows/matrix-online/credits

Data Issues
Large-scale data
– Terabyte (10^12 bytes), Petabyte (10^15 bytes)
– Because of the web
– Smart *: sensor data (IoT), bioinformatics, social computing, streaming data, smart phones, online games…
Cannot be handled with the legacy approach
– Too big
– Non-/semi-structured data
– Too expensive
Need new systems
– Inexpensive

Two Cores in Big Data
How to store Big Data
How to compute Big Data
Google
How to store Big Data
– GFS
– Distributed systems on inexpensive commodity computers
How to compute Big Data
– MapReduce
– Parallel computing with inexpensive computers
Built its own supercomputers
Published papers in 2003, 2004
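
The MapReduce idea named above can be illustrated with a word count, the usual introductory example. This is a minimal single-process sketch in Python, not Google's or Hadoop's implementation; the function names (`map_phase`, `shuffle`, `reduce_phase`) are hypothetical and only mirror the three conceptual steps.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework would do."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data", "big compute"])
```

In a real cluster the map and reduce calls run in parallel on many inexpensive machines, and the shuffle moves data between them over the network.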

What is Hadoop?
• Hadoop Founder: Doug Cutting
• Apache Committer: Lucene, Nutch, …

Super Computer vs Hadoop
Parallel vs. Distributed file systems by Michael Malak
Updated by Jongwook Woo
[Figure: super computer with a separate Cluster for Store and Cluster for Compute vs. Hadoop with a single Cluster for Compute/Store]

Hadoop Cluster: Logical Diagram
[Diagram labels: web browser of the cluster monitor (CM/Ambari), connected over HTTP(S); Cluster Monitor; Agent and Hadoop on each node; HDFS; Hive, ZooKeeper, Impala]

Hadoop Ecosystems
http://dawn.dbsdataprojects.com/tag/hadoop/

Definition: Big Data
Inexpensive frameworks that are distributed parallel systems and that can store large-scale data and process it in parallel [1, 2]
Hadoop
– Inexpensive supercomputer
– More accessible to the public than traditional supercomputers
• You can store data and run your applications
– In your university labs, small companies, research centers
Others
– NoSQL DB (Cassandra, MongoDB, Redis, HBase)
– ElasticSearch

Alternative to Hadoop MapReduce
Limitations of MapReduce
Hard to program in Java
Batch processing
– Not interactive
Disk storage for intermediate data
– Performance issue
Spark by UC Berkeley AMPLab
• In-memory storage for intermediate data
• 20 to 100 times faster than MapReduce, which relies on network and disk
Good for Machine Learning
– Iterative algorithms

Integrating Spark and Hadoop
• Spark
  – File Systems: Tachyon
  – Resource Manager: Mesos
• Dedicated Spark
  – Cassandra, Couchbase…
• Integrating Spark into Hadoop cluster
  – As Hadoop has been in the market for over 10 years
  – Cloud Computing
    • Oracle Cloud Big Data Compute, Amazon AWS, Azure HDInsight, IBM Bluemix, Google Cloud Platform
    • Object Storage, S3
  – Hadoop vendors
    • HDP, CDH
• Databricks: Spark on AWS & Azure
  – Not much of the Hadoop ecosystem

Spark
Spark SQL
Querying using SQL, HiveQL
DataFrame
Spark Streaming
DStream
– RDD in streaming
ML
Machine Learning on DataFrame, Pipelining
MLlib
– On RDD
– Sparse vector support, Decision trees, Linear/Logistic Regression, PCA, SVM, …

Big Data Analysis and Prediction Flow
Data Collection
– Batch API: Yelp, Google
– Streaming: Twitter, Apache NiFi, Kafka, StreamSets, Storm
– Open Data: Government
Data Storage
– HDFS, S3, Object Storage, NoSQL DB (Couchbase)…
Data Filtering
– Hive, Pig
Data Analysis and Science
– Hive, Pig, Spark, BI Tools (Qlik, Tableau, …)
Data Visualization
– Qlik, Excel PowerMap, Tableau, Looker, …
- Big Data Engineering
- Big Data Analysis
- Big Data Science
- Data Visualization

Terms
We know
Data Engineering
– Collect, clean, transform, filter data
Data Analysis
– Find insights from the existing data
Data Science (Predictive Analysis)
– Predict the trend or pattern from the existing data
Do we know?
Big Data Analysis and Science
– Using Big Data for Data Analysis and Science
• Hadoop, Spark, NoSQL DB, SAP HANA, ElasticSearch,..
– For Massive Data Set
• How to store and compute?

Big Data Science
• Fraud Detection:
Accepted to the APJIS journal, by Jongwook Woo et al., in 2018
– Priyanka Purushu, Niklas Melcher, Bhagyashree Bhagwat
– Indexed in SCOPUS
Goal
Analyzing transaction data for fraud detection
– For mobile money transactions
  • Based on a sample of real transactions
    – Extracted from one month of financial logs from a mobile money service
– Using Spark ML (Big Data) and Azure ML (Traditional)

Financial Data Set
• Data is always an issue
  – No publicly available datasets on financial services
    • Private nature of financial transactions
PaySim
– URL: https://www.kaggle.com/ntnu-testimon/paysim1
– Generates a synthetic dataset
  • from the private dataset
– that resembles the normal operation of transactions

Financial Data Set (Cont'd)
Size: 470 MB (=> 718 MB)
6,362,620 records
Not large-scale data compared with data sets larger than a GB
But the architecture here is applicable to much bigger data sets
– As it still adopts the Spark computing engine for Big Data
– Linearly scalable
Attributes: 11
Target Column to Predict:
‘isFraud’

Experiment Environment:
Traditional Systems and Big Data

Experiment Environment
Azure ML:
Traditional, small data set
Implement fundamental prediction models
– Using sample data: 80 MB (1/5 – 1/6 of the data set)
Select the best model among a number of classification algorithms

Experiment Environment (Cont‘d)
Spark ML
Test with Databricks CE and IBM Cloud
– 470 MB
AWS EMR
– Analyze all data
• 470 MB (=> 718MB)
– Implement and evaluate prediction model
• 3 different models
• Spark Clusters with 3 different # of nodes

Hardware Specifications: Spark
IBM DSX Lite
Python 2, Spark 2.1
File System: Object Storage
2 Spark Executors, 16GB Memory
Databricks
Python 2, Spark 2.1 (Auto-updating, Scala 2.10)
File System : Databricks File System
Single/Unlimited Cluster, Memory: 6 GB

Experiment Environment
AWS EMR
• EMR 12.1
  – Spark 2.2.1 on Hadoop 2.8.3
  – YARN with Ganglia 3.7.2 and Zeppelin 0.7.3
• m3.xlarge instance
  – Memory: 15.0 GiB
  – CPU: 4 vCPUs
  – Storage: 80 GiB (2 * 40 GiB SSD)
• File System: S3
• 3 different EMR clusters
  – Number of nodes (servers): 3, 6, 11

PySpark on Databricks

Work Flow in Azure ML
• Relatively easy to build and test
  – Drag and Drop GUI
Work Flow
1. Data Engineering
– Understanding Data
– Data preparation
– Balancing data statistically
2. Data Science: Machine Learning (ML)
– Model building and validation
• Classification algorithms
– Model evaluation
– Model interpretation
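
The slides do not say how the data was balanced statistically. One common option for a fraud data set, where fraud rows are rare, is random undersampling of the majority class. The sketch below assumes that technique; the helper name `undersample` and the toy rows are hypothetical, and only the `isFraud` column name comes from the data set.

```python
import random


def undersample(rows, label_key="isFraud", seed=42):
    """Return a class-balanced list by randomly dropping majority-class rows."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    fraud = [r for r in rows if r[label_key] == 1]
    normal = [r for r in rows if r[label_key] == 0]
    # Identify which class is smaller, then sample the larger one down to it.
    minority, majority = sorted([fraud, normal], key=len)
    sampled = rng.sample(majority, len(minority))
    balanced = minority + sampled
    rng.shuffle(balanced)
    return balanced


# Toy input (assumption): 3 fraud rows among 100 transactions.
rows = [{"isFraud": 1}] * 3 + [{"isFraud": 0}] * 97
balanced = undersample(rows)
```

Undersampling discards data, so on a small sample an alternative such as oversampling the minority class may be preferable.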

Data Understanding
• Numeric attributes:
amount, oldbalanceOrg, newbalanceOrg, oldbalanceDest,
newbalanceDest
• Categorical attributes:
step, type, isFraud, isFlaggedFraud
• String attributes:
nameOrig, nameDest
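
The attribute groups above can be written down as a small schema, which is handy when separating the target column from the candidate feature columns. This is a sketch only: the column names come from the slide, while the helper `split_features_and_label` and the sample record values are hypothetical.

```python
NUMERIC_COLS = ["amount", "oldbalanceOrg", "newbalanceOrg",
                "oldbalanceDest", "newbalanceDest"]
CATEGORICAL_COLS = ["step", "type", "isFraud", "isFlaggedFraud"]
STRING_COLS = ["nameOrig", "nameDest"]
TARGET_COL = "isFraud"

ALL_COLS = NUMERIC_COLS + CATEGORICAL_COLS + STRING_COLS


def split_features_and_label(row):
    """Separate the target value from the candidate feature columns of one record."""
    label = row[TARGET_COL]
    features = {col: row[col] for col in ALL_COLS if col != TARGET_COL}
    return features, label


# Hypothetical record: every value below is made up for illustration.
sample = {col: 0 for col in ALL_COLS}
sample.update({"type": "T", "nameOrig": "A", "nameDest": "B", "isFraud": 1})
features, label = split_features_and_label(sample)
```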

Experiment in Azure ML

Precision vs Recall
Positive: the event occurs (fraud)
Negative: the event does not occur (non-fraud)
True Positive (TP): predicted as fraud, and it is fraud
False Negative (FN): predicted as non-fraud, but it is fraud
False Positive (FP): predicted as fraud, but it is not
• Precision = TP / (TP + FP)
• Recall = TP / (TP + FN)
• Ref: https://en.wikipedia.org/wiki/Precision_and_recall
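
The two formulas can be checked with a few lines of Python. The confusion counts below are hypothetical and are not results from the experiment.

```python
def precision(tp, fp):
    """Share of predicted frauds that are real frauds: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of real frauds that were caught: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical counts: 90 frauds caught, 10 false alarms, 30 frauds missed.
tp, fp, fn = 90, 10, 30
p = precision(tp, fp)
r = recall(tp, fn)
```

With these counts precision is 0.9 and recall is 0.75: the model is usually right when it raises an alarm, but it still misses a quarter of the frauds.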

Model Evaluation
Focus more on Recall
to capture the most fraudulent transactions
Bad Recall: Fatal
– If there are many false negatives (FN)
  • the transaction is predicted as normal, not fraud
    – but it is a fraud
– Painful
  • Need to decrease FN
    – That is, to increase Recall

Experimental Results in AzureML
Model                          Accuracy  Precision  Recall
Two Class Logistic Regression
Two Class Decision Forest
Two Class Decision Jungle                0.916      0.998

Experimental Results
Accuracy
Decision Jungle
– Highest Recall 0.998
• While Precision: 0.916
– With a small sample data set: 359 KB
• takes 11 sec
Performance:
Time taken to build a model with the whole data set:
– 470 MB + data tweaking
– Over a day
Good guide
for adopting 3 similar algorithms in Spark ML
– Decision Tree, Random Forest, Logistic Regression

Experiment with Spark ML
1. Load the data source
   – 470 MB (=> 718MB)
2. Train and build the models
   – Balanced data statistically
3. Evaluate

Define the pipeline

Train the models

Pipeline Classification with Spark ML
[Pipeline diagram: Feature Transformer, ParamMap, Classification Estimator, Classification Evaluator, Validation Estimator, Model Transformer]

Feature Transformer: the feature is generated from input columns

Classification Estimator: classifiers are Decision Tree, RandomForest, LogisticRegression

ParamMap: combination of parameters such as Max Bins, Max Depth, …
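
A parameter grid is simply the Cartesian product of the candidate values of each parameter; every combination is one candidate model to train. A small sketch in plain Python, where the candidate values and the helper name `expand_grid` are assumptions chosen for illustration:

```python
from itertools import product

# Hypothetical candidate values; the slides name the parameters but not the values.
param_grid = {"maxBins": [16, 32], "maxDepth": [5, 10, 15]}


def expand_grid(grid):
    """Expand {name: [values]} into a list of {name: value} combinations."""
    names = sorted(grid)
    value_lists = (grid[name] for name in names)
    return [dict(zip(names, values)) for values in product(*value_lists)]


param_maps = expand_grid(param_grid)
```

Two values for one parameter and three for the other yield six combinations, so the training cost grows multiplicatively with each parameter added to the grid.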

Validation Estimator: validators are Cross Validator, Train Validation Split
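
The pipeline components described in these slides map onto classes in the PySpark ML API. The following is a hedged sketch, not the authors' notebook: the choice of numeric columns as features, the grid values, and the 0.8 train ratio are assumptions, and `build_validator` is a hypothetical helper name.

```python
FEATURE_COLS = ["amount", "oldbalanceOrg", "newbalanceOrg",
                "oldbalanceDest", "newbalanceDest"]  # assumption: numeric columns only
LABEL_COL = "isFraud"


def build_validator(train_ratio=0.8):
    """Wire a feature transformer, classifier, parameter grid, and validator together."""
    # Imported lazily so this file can be read without Spark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

    # Feature transformer: build one vector column from the input columns.
    assembler = VectorAssembler(inputCols=FEATURE_COLS, outputCol="features")
    # Classification estimator.
    classifier = RandomForestClassifier(labelCol=LABEL_COL, featuresCol="features")
    pipeline = Pipeline(stages=[assembler, classifier])
    # ParamMap: combinations of hypothetical Max Depth and Max Bins values.
    grid = (ParamGridBuilder()
            .addGrid(classifier.maxDepth, [5, 10])
            .addGrid(classifier.maxBins, [16, 32])
            .build())
    # Classification evaluator used to pick the best combination.
    evaluator = BinaryClassificationEvaluator(labelCol=LABEL_COL,
                                              metricName="areaUnderROC")
    # Validation estimator: a single train/validation split.
    return TrainValidationSplit(estimator=pipeline, estimatorParamMaps=grid,
                                evaluator=evaluator, trainRatio=train_ratio)
```

With an active SparkSession and DataFrames `train_df` and `test_df` (both hypothetical names), usage would look like `model = build_validator().fit(train_df)` followed by `model.transform(test_df)`. Swapping `TrainValidationSplit` for `CrossValidator` gives k-fold validation at a higher training cost.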

Results
Columns: Model, Area under ROC, Precision, Recall
DecisionTreeClassifier
RandomForestClassifier 0.909573
LogisticRegression
• 3 models with different combinations of the parameters
• Time taken (Spark cluster): 1 hour
• In theory, with linear scalability: 2 minutes with 30 Spark clusters
• The Random Forest has the best recall score compared to Decision Tree and Logistic Regression.
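
The linear-scalability estimate is plain division: if runtime shrinks in proportion to the resources added, 1 hour spread over 30 times the resources gives 2 minutes. A tiny sketch of that arithmetic, with a hypothetical helper name; real clusters rarely scale perfectly because of coordination and data-shuffling overhead.

```python
def ideal_runtime_minutes(baseline_minutes, scale_factor):
    """Runtime under perfectly linear scaling: baseline divided by the scale factor."""
    if scale_factor <= 0:
        raise ValueError("scale_factor must be positive")
    return baseline_minutes / scale_factor


estimate = ideal_runtime_minutes(60, 30)
```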

Experimental Results in AWS
Execution times
3 nodes:
– 40 to 70 min
11 nodes:
– 10 to 20 min
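
These timings can be compared against ideal linear scaling. The sketch below assumes that the lower bounds (40 min and 10 min) and the upper bounds (70 min and 20 min) belong to the same models, which the slide does not state.

```python
def speedup(slow_minutes, fast_minutes):
    """How many times faster the larger cluster finished."""
    return slow_minutes / fast_minutes


ideal = 11 / 3                    # node ratio, about 3.67x under linear scaling
observed_low = speedup(40, 10)    # pairing of the two lower bounds (assumption)
observed_high = speedup(70, 20)   # pairing of the two upper bounds (assumption)
```

Under that pairing the observed speedups are 4.0x and 3.5x, close to the ideal of about 3.67x, which is consistent with the linear-scalability claim.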

Contents
• Myself
• Introduction To Big Data
• Introduction To Big Data Predictive Analytics
• Fraud Detection Predictive Analytics
• Summary

Summary
Introduction to Big Data
Introduction to Big Data Predictive Analytics
Experimental Result of Fraud Detection
Recall:
– RandomForest in SparkML
– DecisionJungle in AzureML
Performance:
– Traditional Systems:
• not good for large scale data
– Spark ML:
• Linearly Scalable
• Fast

Questions?

References
1. Jongwook Woo and Yuhang Xu, “Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing”, The 2011 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2011), Las Vegas (July 18-21, 2011)
2. Jongwook Woo, “Market Basket Analysis Algorithms with MapReduce”, DMKD-00150, Wiley Interdisciplinary Reviews Data Mining and Knowledge Discovery, Oct 28 2013, Volume 3, Issue 6, pp445-452, ISSN 1942-4795
3. Jongwook Woo, “Big Data Trend and Open Data”, UKC 2016, Dallas, TX, Aug 12 2016
4. How to choose algorithms for Microsoft Azure Machine Learning, https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-algorithm-choice
5. Manik Katyal, Parag Chhadva, Shubhra Wahi & Jongwook Woo, “Big Data Analysis using Spark for Collision Rate Near CalStateLA”, https://globaljournals.org/GJCST_Volume16/1-Big-Data-Analysis-using-Spark.pdf
6. Spark Programming Guide: http://spark.apache.org/docs/latest/programming-guide.html
7. (Accepted in Sept 2018) Priyanka Purushu, Niklas Melcher, Bhagyashree Bhagwat, Jongwook Woo, "Predictive Analysis of Financial Fraud Detection using Azure and Spark ML", Asia Pacific Journal of Information Systems

References
8. TensorFrames: Google Tensorflow on Apache Spark, https://www.slideshare.net/databricks/tensorframes-google-tensorflow-on-apache-spark
9. Deep learning and Apache Spark, https://www.slideshare.net/QuantUniversity/deep-learning-and-apache-spark
10. Which Is Deeper - Comparison Of Deep Learning Frameworks On Spark, https://www.slideshare.net/SparkSummit/which-is-deeper-comparison-of-deep-learning-frameworks-on-spark
11. Accelerating Machine Learning and Deep Learning At Scale with Apache Spark, https://www.slideshare.net/SparkSummit/accelerating-machine-learning-and-deep-learning-at-scalewith-apache-spark-keynote-by-ziya-ma
12. Deep Learning with Apache Spark and TensorFlow, https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html
13. Tensor Flow Deep Learning Open SAP
14. Overview of Smart Factory, https://www.slideshare.net/BrendanSheppard1/overview-of-smart-factory-solutions-68137094/6
15. https://dzone.com/articles/sqoop-import-data-from-mysql-tohive
