This talk presents insights, performance results, and the architecture of financial fraud detection on mobile money transaction activity using Azure ML and Spark ML. We classify each transaction as normal or fraudulent, using a small sample and a massive data set with Azure ML and Spark ML, which represent traditional systems and Big Data respectively. I will present predictive analysis with several classification models tried in Azure ML and Spark ML. In addition, I will show the scalability of Spark ML by running the models on Spark clusters with different numbers of nodes on Amazon AWS.
Predictive Analysis of Financial Fraud Detection using Azure and Spark ML
1. Jongwook Woo
BigDAI
HiPIC
CalStateLA
IDEAS SoCal Conf 2018
Oct 20 2018
Jongwook Woo, PhD, jwoo5@calstatela.edu
Priyanka Purushu, Niklas Melcher, Bhagyashree Bhagwat
Big Data AI Center (BigDAI / HiPIC)
California State University Los Angeles
Predictive Analysis of Financial Fraud Detection using Azure and Spark ML
2. Big Data AI Center (BigDAI / HiPIC)
Jongwook Woo
CalStateLA
Contents
Myself
Introduction To Big Data
Introduction To Big Data Predictive Analytics
Fraud Detection Predictive Analytics
Summary
3. Myself
Experience:
Since 2002, Professor at California State University Los Angeles
– PhD in 2001: Computer Science and Engineering at USC
Since 1998: R&D consulting in Hollywood
– Warner Bros (Matrix online game), E!, citysearch.com, ARM etc
– Information Search and Integration with FAST, Lucene/Solr, Sphinx
– Implemented eBusiness applications using J2EE and middleware
Since 2007: Exposed to Big Data at CitySearch.com
2012 - Present : Big Data Academic Partnerships
– For Big Data research and training
• Amazon AWS, Microsoft Azure, IBM Bluemix
• Databricks, Hadoop vendors
4. Myself: Partners for Services
5. Experience in Big Data
Collaboration
Big Data Technical Advisor of Isaac Engineering for Smart * (Factory, Farms, …) in Korea
Council Member of IBM Spark Technology Center
City of Los Angeles for DSF, OpenHub and Open Data
Startup Companies in Los Angeles
External Collaborator and Advisor in Big Data
– IMSC of USC
– Pennsylvania State University
– The Big Link, Softzen, Wiken in Korea
Grants
Research and education grants from Oracle Cloud Big Data, IBM Bluemix, Microsoft Windows Azure, and Amazon AWS
Partnership
Academic Education Partnership with Databricks, Tableau, Qlik, Cloudera, Hortonworks, SAS, Teradata
6. Myself: Public Partners
7. Myself: S/W Development Lead
http://www.mobygames.com/game/windows/matrix-online/credits
8. Contents
Myself
Introduction To Big Data
Introduction To Big Data Predictive Analytics
Fraud Detection Predictive Analytics
Summary
9. Data Issues
Large-Scale data
Terabyte (10^12 bytes), Petabyte (10^15 bytes)
– Because of web
– Smart *: Sensor Data (IoT), Bioinformatics, Social Computing, Streaming data, smart phone, online game…
Cannot be handled with the legacy approach
Too big
Non-/Semi-structured data
Too expensive
Need new systems
Inexpensive
10. Two Cores in Big Data
How to store Big Data
How to compute Big Data
Google
How to store Big Data
– GFS
– Distributed Systems on inexpensive commodity computers
How to compute Big Data
– MapReduce
– Parallel Computing with inexpensive computers
Own super computers
Published papers in 2003, 2004
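The MapReduce model named above can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle, and reduce phases, not Google's or Hadoop's implementation; the function names and the sample transaction types are hypothetical.

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit a (key, 1) pair for every input record."""
    for record in records:
        yield record, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine the values of each key into one result."""
    return {key: sum(values) for key, values in groups.items()}

def count_by_key(records):
    """Run the three phases in sequence on one machine."""
    return reduce_phase(shuffle(map_phase(records)))

# Hypothetical input: transaction types seen in a log
sample = ["TRANSFER", "CASH_OUT", "TRANSFER", "PAYMENT", "TRANSFER"]
counts = count_by_key(sample)
print(counts)
```

In a real cluster the map and reduce functions run in parallel on many machines and the shuffle moves data over the network; here all three phases run in one process for clarity.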
11. What is Hadoop?
Hadoop Founder:
o Doug Cutting
Apache Committer:
Lucene, Nutch, …
12. Super Computer vs Hadoop
Parallel vs. Distributed file systems by Michael Malak
Updated by Jongwook Woo
[Figure labels: Cluster for Compute, Cluster for Store, Cluster for Compute/Store]
13. Hadoop Cluster: Logical Diagram
[Diagram: the web browser of the cluster monitor (CM/Ambari) connects over HTTP(S) to the Cluster Monitor; each node runs an Agent with Hadoop and HDFS; services shown: HIVE, ZooKeeper, Impala]
14. Hadoop Ecosystems
http://dawn.dbsdataprojects.com/tag/hadoop/
15. Definition: Big Data
Inexpensive frameworks that are distributed parallel systems and that can store large-scale data and process it in parallel [1, 2]
Hadoop
– Inexpensive supercomputer
– More accessible than traditional supercomputers
• You can store and process your applications
– In your university labs, small companies, research centers
Others
– NoSQL DB (Cassandra, MongoDB, Redis, HBase)
– ElasticSearch
16. Contents
Myself
Introduction To Big Data
Introduction To Big Data Predictive Analytics
Fraud Detection Predictive Analytics
Summary
17. Alternative to Hadoop MapReduce
Limitation in MapReduce
Hard to program in Java
Batch Processing
– Not interactive
Disk storage for intermediate data
– Performance issue
Spark by UC Berkeley AMPLab
In-Memory storage for intermediate data
20 to 100 times faster than network- and disk-based processing
– MapReduce
Good in Machine Learning
– Iterative algorithms
18. Integrating Spark and Hadoop
Spark
File Systems: Tachyon
Resource Manager: Mesos
Dedicated Spark
– Cassandra, Couchbase…
Integrating Spark into Hadoop cluster
As Hadoop has been in the market for over 10 years
Cloud Computing
– Oracle Cloud Big Data Compute, Amazon AWS, Azure HDInsight, IBM Bluemix, Google Cloud Platform
• Object Storage, S3
Hadoop vendors
– HDP, CDH
Databricks: Spark on AWS & Azure
– Few Hadoop ecosystem components
19. Spark
Spark SQL
Querying using SQL, HiveQL
Data Frame
Spark Streaming
DStream
– RDD in streaming
ML
Machine Learning on Data Frame, Pipelining
MLlib
– On RDD
– Sparse vector support, Decision trees, Linear/Logistic Regression, PCA, SVM, …
20. Contents
Myself
Introduction To Big Data
Introduction To Big Data Predictive Analytics
Fraud Detection Predictive Analytics
Summary
21. Big Data Analysis and Prediction Flow
Data Collection
– Batch API: Yelp, Google
– Streaming: Twitter, Apache NiFi, Kafka, StreamSets, Storm
– Open Data: Government
Data Storage
– HDFS, S3, Object Storage, NoSQL DB (Couchbase)…
Data Filtering
– Hive, Pig
Data Analysis and Science
– Hive, Pig, Spark, BI Tools (Qlik, Tableau, …)
Data Visualization
– Qlik, Excel PowerMap, Tableau, Looker, …
Roles: Big Data Engineering, Big Data Analysis, Big Data Science, Data Visualization
22. Terms
We know
Data Engineering
– Collect, clean, transform, filter data
Data Analysis
– Find insights from the existing data
Data Science (Predictive Analysis)
– Predict the trend or pattern from the existing data
Do we know?
Big Data Analysis and Science
– Using Big Data for Data Analysis and Science
• Hadoop, Spark, NoSQL DB, SAP HANA, ElasticSearch,..
– For Massive Data Set
• How to store and compute?
23. Big Data Science
Fraud Detection:
Accepted to APJIS journal by Jongwook Woo et al in 2018
– Priyanka Purushu, Niklas Melcher, Bhagyashree Bhagwat
– Indexed in Scopus
Goal
Analyzing Transaction data and Fraud Detection
– For Mobile Money Transaction
• based on a sample of real transactions
– extracted from one month of financial logs from a mobile money service
– using Spark ML (Big Data) and Azure ML (Traditional)
24. Financial Data Set
Data is always an issue
No publicly available datasets on financial services
– Private nature of financial transactions
PaySim
– URL: https://www.kaggle.com/ntnu-testimon/paysim1
– generates a synthetic dataset
• from the private dataset
– that resembles the normal operation of transactions
25. Financial Data Set (Cont'd)
Size: 470 MB (=> 718MB)
6,362,620 records
Not that large compared with multi-GB data sets
But the architecture here is applicable to much bigger data sets
– As it still adopts the Spark computing engine for Big Data
– Linearly scalable
Attributes: 11
Target Column to Predict:
‘isFraud’
26. Experiment Environment: Traditional Systems and Big Data
27. Experiment Environment
Azure ML:
Traditional small data set
Implement fundamental prediction models
– Using Sample data: 80MB (1/5 – 1/6 data set)
Select the best model among a number of classification algorithms
28. Experiment Environment (Cont'd)
Spark ML
Test with Databricks CE and IBM Cloud
– 470 MB
AWS EMR
– Analyze all data
• 470 MB (=> 718MB)
– Implement and evaluate prediction model
• 3 different models
• Spark Clusters with 3 different # of nodes
29. Hardware Specifications: Spark
IBM DSX Lite
Python 2, Spark 2.1
File System: Object Storage
2 Spark Executors, 16GB Memory
Databricks
Python 2, Spark 2.1 (Auto-updating, Scala 2.10)
File System : Databricks File System
Single/Unlimited Cluster, Memory: 6GB
30. Experiment Environment
AWS EMR
EMR 12.1
– Spark 2.2.1 on Hadoop 2.8.3
– YARN with Ganglia 3.7.2 and Zeppelin 0.7.3.
m3.xlarge instance
– Memory: 15.0 GiB,
– CPU: 4 vCPUs,
– Storage: 80 GiB (2 * 40 GiB SSD).
File System : S3
3 different EMR clusters
– number of nodes that are servers:
• 3, 6, 11 nodes
31. PySpark on Databricks
32. Work Flow in Azure ML
Relatively Easy to build and test
Drag and Drop GUI
Work Flow
1. Data Engineering
– Understanding Data
– Data preparation
– Balancing data statistically
2. Data Science: Machine Learning (ML)
– Model building and validation
• Classification algorithms
– Model evaluation
– Model interpretation
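The "balancing data statistically" step in the work flow above can be illustrated with a short sketch. The slides do not say which balancing method was used, so random undersampling of the majority class is an assumption here; the function name and toy rows are hypothetical, and only the `isFraud` column name comes from the slides.

```python
import random

def undersample(rows, label_key="isFraud", seed=42):
    """Randomly drop majority-class rows until both classes have equal size.

    Assumption: random undersampling; the slides do not name the method used.
    """
    fraud = [r for r in rows if r[label_key] == 1]
    normal = [r for r in rows if r[label_key] == 0]
    # The shorter list is the minority class, the longer one the majority
    minority, majority = sorted([fraud, normal], key=len)
    rng = random.Random(seed)
    sampled = rng.sample(majority, len(minority))
    balanced = minority + sampled
    rng.shuffle(balanced)
    return balanced

# Hypothetical toy data: 3 fraud rows among 100 transactions
rows = [{"amount": float(i), "isFraud": 1 if i < 3 else 0} for i in range(100)]
balanced = undersample(rows)
print(len(balanced), sum(r["isFraud"] for r in balanced))
```

Undersampling discards data, so with very few fraud rows other techniques such as oversampling or class weights may be preferable; this sketch only shows the idea of equalizing class sizes before training.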
33. Data Understanding
• Numeric attributes:
amount, oldbalanceOrg, newbalanceOrig, oldbalanceDest, newbalanceDest
• Categorical attributes:
step, type, isFraud, isFlaggedFraud
• String attributes:
nameOrig, nameDest
34. Experiment in Azure ML
35. Precision vs Recall
True Positive (TP): predicted fraud, and it is fraud
False Negative (FN): predicted not fraud, but it is fraud
False Positive (FP): predicted fraud, but it is not
Precision
TP / (TP + FP)
Recall
TP / (TP + FN)
Ref: https://en.wikipedia.org/wiki/Precision_and_recall
Positive: Event occurs (Fraud)
Negative: Event does not occur (non-Fraud)
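The two formulas on this slide can be checked with a small worked example. The sketch below counts TP, FP, and FN from two label lists where 1 means fraud (positive); the function names and the toy labels are hypothetical and are not taken from the experiment.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, FN for binary labels where 1 = fraud (positive)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, fp, fn

def precision(tp, fp):
    """TP / (TP + FP): share of predicted frauds that really are fraud."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """TP / (TP + FN): share of real frauds that were caught."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical labels for eight transactions
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 1, 0, 0, 0, 1]

tp, fp, fn = confusion_counts(actual, predicted)
print(tp, fp, fn)          # 3 1 1
print(precision(tp, fp))   # 0.75
print(recall(tp, fn))      # 0.75
```

One missed fraud (FN) lowers recall, while one false alarm (FP) lowers precision.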
36. Model Evaluation
Focus more on Recall
to capture the most fraudulent transactions
Bad Recall: Fatal
– If many false negatives (FN)
• predict the transaction as normal, not fraud
– but it is a fraud
– Painful
• Need to decrease FN
– That is, to increase Recall
37. Experimental Results in AzureML
Model                            Accuracy  Precision  Recall
Two Class Logistic Regression
Two Class Decision Forest
Two Class Decision Jungle                  0.916      0.998
38. Experimental Results
Accuracy
Decision Jungle
– Highest Recall 0.998
• While Precision: 0.916
– With small sample data set: 359KB
• takes 11 sec
Performance:
Time taken to build a model with the whole data set:
– 470MB + data tweaking
– Over a day
Good Guide
to adopt the 3 similar algorithms for Spark ML
– Decision Tree, Random Forest, Logistic Regression
39. Experiment with Spark ML
1. Load the data source
470 MB (=> 718MB)
2. Train and build the models
o Balanced data statistically
3. Evaluate
40. Define the pipeline
41. Train the models
42-46. Pipeline Classification with Spark ML
[Diagram components: Feature Transformer, ParamMap, Classification Estimator, Classification Evaluator, Validation Estimator, Model Transformer]
Feature Transformer: Feature is generated from input columns
Classification Estimator: Classifiers: Decision Tree, RandomForest, LogisticRegression
ParamMap: Combination of Parameters: Max Bins, Max Depth, …
Validation Estimator: Validators: Cross Validator, Train Validation Split
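The "Combination of Parameters" idea in the pipeline slides can be sketched in plain Python. In Spark ML this role is played by `ParamGridBuilder` in `pyspark.ml.tuning`, whose grid is then evaluated by a validator; the stand-in below only enumerates the combinations. The helper name and the candidate values are hypothetical, since the slides name the parameters (Max Bins, Max Depth) but not the values tried.

```python
from itertools import product

def build_param_grid(param_values):
    """Return every combination of the given parameter values as a dict."""
    names = list(param_values)
    return [dict(zip(names, combo)) for combo in product(*param_values.values())]

# Hypothetical candidate values; the slides name the parameters but not the values
grid = build_param_grid({"maxBins": [16, 32], "maxDepth": [3, 5, 10]})
for params in grid:
    print(params)
print(len(grid), "combinations")
```

A validator trains one model per combination and keeps the one with the best evaluator score, so the training cost grows with the product of the list lengths (here 2 x 3 = 6 models, multiplied by the number of folds under cross-validation).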
47. Results
Model   Area under ROC   Precision   Recall
DecisionTreeClassifier
RandomForestClassifier 0.909573
LogisticRegression
• 3 models with different combinations of the parameters
• Time taken (Spark Cluster): 1 hour
• In theory, with linear scalability: 2 minutes with 30 Spark clusters
• The Random Forest has the best recall score compared to Decision Tree and Logistic Regression.
48. Experimental Results in AWS
Execution times
3 nodes:
– 40 to 70 minutes
11 nodes:
– 10 to 20 minutes
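The execution times on this slide can be compared with what ideal linear scaling would predict. The sketch below assumes run time is inversely proportional to the number of nodes and that every node contributes equally, which is a simplification; the function name is hypothetical.

```python
def ideal_time(base_minutes, base_nodes, target_nodes):
    """Run time under ideal linear scaling: inversely proportional to node count."""
    return base_minutes * base_nodes / target_nodes

# Measured range on 3 nodes from the slide: 40 to 70 minutes
low, high = 40, 70
ideal_low = ideal_time(low, 3, 11)
ideal_high = ideal_time(high, 3, 11)
print(f"ideal on 11 nodes: {ideal_low:.1f} to {ideal_high:.1f} minutes")
# prints: ideal on 11 nodes: 10.9 to 19.1 minutes
```

The ideal range of about 10.9 to 19.1 minutes is close to the 10 to 20 minutes reported for 11 nodes, which is consistent with near-linear scaling on this workload.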
49. Contents
Myself
Introduction To Big Data
Introduction To Big Data Predictive Analytics
Fraud Detection Predictive Analytics
Summary
50. Summary
Introduction to Big Data
Introduction to Big Data Predictive Analytics
Experimental Result of Fraud Detection
Recall:
– RandomForest in SparkML
– DecisionJungle in AzureML
Performance:
– Traditional Systems:
• not good for large scale data
– Spark ML:
• Linearly Scalable
• Fast
51. Questions?
52. References
1. “Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing”, Jongwook Woo and Yuhang Xu, The 2011 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2011), Las Vegas (July 18-21, 2011)
2. Jongwook Woo, DMKD-00150, “Market Basket Analysis Algorithms with MapReduce”, Wiley Interdisciplinary Reviews Data Mining and Knowledge Discovery, Oct 28 2013, Volume 3, Issue 6, pp. 445-452, ISSN 1942-4795
3. Jongwook Woo, “Big Data Trend and Open Data”, UKC 2016, Dallas, TX, Aug 12 2016
4. How to choose algorithms for Microsoft Azure Machine Learning, https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-algorithm-choice
5. “Big Data Analysis using Spark for Collision Rate Near CalStateLA”, Manik Katyal, Parag Chhadva, Shubhra Wahi & Jongwook Woo, https://globaljournals.org/GJCST_Volume16/1-Big-Data-Analysis-using-Spark.pdf
6. Spark Programming Guide: http://spark.apache.org/docs/latest/programming-guide.html
7. (Accepted in Sept 2018) Priyanka Purushu, Niklas Melcher, Bhagyashree Bhagwat, Jongwook Woo, "Predictive Analysis of Financial Fraud Detection using Azure and Spark ML", Asia Pacific Journal of Information Systems
53. References
8. TensorFrames: Google Tensorflow on Apache Spark, https://www.slideshare.net/databricks/tensorframes-google-tensorflow-on-apache-spark
9. Deep learning and Apache Spark, https://www.slideshare.net/QuantUniversity/deep-learning-and-apache-spark
10. Which Is Deeper - Comparison Of Deep Learning Frameworks On Spark, https://www.slideshare.net/SparkSummit/which-is-deeper-comparison-of-deep-learning-frameworks-on-spark
11. Accelerating Machine Learning and Deep Learning At Scale with Apache Spark, https://www.slideshare.net/SparkSummit/accelerating-machine-learning-and-deep-learning-at-scalewith-apache-spark-keynote-by-ziya-ma
12. Deep Learning with Apache Spark and TensorFlow, https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html
13. Tensor Flow Deep Learning Open SAP
14. Overview of Smart Factory, https://www.slideshare.net/BrendanSheppard1/overview-of-smart-factory-solutions-68137094/6
15. https://dzone.com/articles/sqoop-import-data-from-mysql-tohive