Jongwook Woo
BigDAI
HiPIC
CalStateLA
IDEAS SoCal Conf 2018
Oct 20 2018
Jongwook Woo, PhD, jwoo5@calstatela.edu
Priyanka Purushu, Niklas Melcher, Bhagyashree Bhagwat
Big Data AI Center (BigDAI / HiPIC)
California State University Los Angeles
Predictive Analysis of Financial Fraud Detection using Azure and Spark ML
Big Data AI Center (BigDAI / HiPIC)
Jongwook Woo
CalStateLA
Contents
Myself
Introduction To Big Data
Introduction To Big Data Predictive Analytics
Fraud Detection Predictive Analytics
Summary
Myself
Experience:
Since 2002: Professor at California State University Los Angeles
– PhD in 2001: Computer Science and Engineering at USC
Since 1998: R&D consulting in Hollywood
– Warner Bros (Matrix Online game), E!, citysearch.com, ARM, etc.
– Information search and integration with FAST, Lucene/Solr, Sphinx
– Implemented eBusiness applications using J2EE and middleware
Since 2007: Exposed to Big Data at CitySearch.com
2012 - Present: Big Data academic partnerships
– For Big Data research and training
• Amazon AWS, Microsoft Azure, IBM Bluemix
• Databricks, Hadoop vendors
Myself: Partners for Services
Experience in Big Data
Collaboration
– Big Data Technical Advisor of Isaac Engineering for Smart * (Factory, Farms, …) in Korea
– Council Member of IBM Spark Technology Center
– City of Los Angeles for DSF, OpenHub and Open Data
– Startup companies in Los Angeles
– External Collaborator and Advisor in Big Data
• IMSC of USC
• Pennsylvania State University
• The Big Link, Softzen, Wiken in Korea
Grants
– Research and education grants from Oracle Cloud Big Data, IBM Bluemix, Microsoft Windows Azure, and Amazon AWS
Partnership
– Academic education partnership with Databricks, Tableau, Qlik, Cloudera, Hortonworks, SAS, Teradata
Myself: Public Partners
Myself: S/W Development Lead
http://www.mobygames.com/game/windows/matrix-online/credits
Data Issues
Large-scale data
Terabyte (10^12 bytes), Petabyte (10^15 bytes)
– Because of the web
– Smart *: sensor data (IoT), bioinformatics, social computing, streaming data, smartphones, online games, …
Cannot be handled with the legacy approach
Too big
Non-/semi-structured data
Too expensive
Need new systems
Inexpensive
Two Cores in Big Data
How to store Big Data
How to compute Big Data
Google
How to store Big Data
– GFS
– Distributed systems on inexpensive commodity computers
How to compute Big Data
– MapReduce
– Parallel computing on inexpensive computers
Built its own supercomputers
Published papers in 2003 (GFS) and 2004 (MapReduce)
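The MapReduce idea named on this slide can be sketched in plain Python on a single machine. This is only an illustration of the map, shuffle, and reduce steps, not Google's or Hadoop's implementation; the function names and the two input lines are made up for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical input: each string stands for one record of a large file.
counts = word_count(["big data on Hadoop", "big data on Spark"])
print(counts)
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network; here everything runs in one process to keep the structure visible.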
What is Hadoop?
Hadoop Founder:
– Doug Cutting
Apache Committer:
– Lucene, Nutch, …
Super Computer vs Hadoop
[Figure: "Parallel vs. Distributed file systems" by Michael Malak, updated by Jongwook Woo. A supercomputer uses one cluster for storage and a separate cluster for compute; Hadoop uses a single cluster for both compute and storage.]
Hadoop Cluster: Logical Diagram
[Figure: a web browser reaches the cluster monitor (CM/Ambari) over HTTP(S). The cluster monitor communicates with an agent on each Hadoop node; the nodes store data in HDFS and run services such as Hive, ZooKeeper, and Impala.]
Hadoop Ecosystems
http://dawn.dbsdataprojects.com/tag/hadoop/
Definition: Big Data
Inexpensive frameworks, built as distributed parallel systems, that can store large-scale data and process it in parallel [1, 2]
Hadoop
– Inexpensive supercomputer
– More accessible than traditional supercomputers
• You can store your data and run your applications
– In your university labs, small companies, research centers
Others
– NoSQL DB (Cassandra, MongoDB, Redis, HBase)
– ElasticSearch
Alternative to Hadoop MapReduce
Limitations of MapReduce
Hard to program in Java
Batch processing
– Not interactive
Disk storage for intermediate data
– Performance issue
Spark by UC Berkeley AMPLab
In-memory storage for intermediate data
20 to 100 times faster than MapReduce, which goes through network and disk
Good for machine learning
– Iterative algorithms
Integrating Spark and Hadoop
Spark
– File systems: Tachyon
– Resource manager: Mesos
– Dedicated Spark
• Cassandra, Couchbase, …
Integrating Spark into a Hadoop cluster
– Hadoop has been on the market for over 10 years
– Cloud computing
• Oracle Cloud Big Data Compute, Amazon AWS, Azure HDInsight, IBM Bluemix, Google Cloud Platform
• Object Storage, S3
– Hadoop vendors
• HDP, CDH
– Databricks: Spark on AWS & Azure
• Few Hadoop ecosystem components
Spark
Spark SQL
Querying using SQL, HiveQL
DataFrame
Spark Streaming
DStream
– RDD in streaming
ML
Machine learning on DataFrame, pipelining
MLlib
– On RDD
– Sparse vector support, decision trees, linear/logistic regression, PCA, SVM, …
Big Data Analysis and Prediction Flow
Data Collection
Batch API: Yelp, Google
Streaming: Twitter, Apache NiFi, Kafka, StreamSets, Storm
Open Data: Government
Data Storage
HDFS, S3, Object Storage, NoSQL DB (Couchbase), …
Data Filtering
Hive, Pig
Data Analysis and Science
Hive, Pig, Spark, BI tools (Qlik, Tableau, …)
Data Visualization
Qlik, Excel PowerMap, Tableau, Looker, …
- Big Data Engineering
- Big Data Analysis
- Big Data Science
- Data Visualization
Terms
We know
Data Engineering
– Collect, clean, transform, filter data
Data Analysis
– Find insights from the existing data
Data Science (Predictive Analysis)
– Predict the trend or pattern from the existing data
Do we know?
Big Data Analysis and Science
– Using Big Data for Data Analysis and Science
• Hadoop, Spark, NoSQL DB, SAP HANA, ElasticSearch, …
– For massive data sets
• How to store and compute?
Big Data Science
Fraud Detection:
Accepted to the APJIS journal in 2018, by Jongwook Woo et al.
– Priyanka Purushu, Niklas Melcher, Bhagyashree Bhagwat
– Indexed in SCOPUS
Goal
Analyzing transaction data and detecting fraud
– For mobile money transactions
• Based on a sample of real transactions
– Extracted from one month of financial logs from a mobile money service
– Using Spark ML (Big Data) and Azure ML (Traditional)
Financial Data Set
Data is always an issue
– No publicly available datasets on financial services
• Private nature of financial transactions
PaySim
– URL: https://www.kaggle.com/ntnu-testimon/paysim1
– Generates a synthetic dataset
• from the private dataset
– that resembles the normal operation of transactions
Financial Data Set (Cont'd)
Size: 470 MB (=> 718MB)
6,362,620 records
Not large-scale data compared with data sets larger than a GB
But the architecture here is applicable to much bigger data sets
– As it still adopts the Spark computing engine for Big Data
– Linearly scalable
Attributes: 11
Target column to predict:
'isFraud'
Experiment Environment:
Traditional Systems and Big Data
Experiment Environment
Azure ML:
Traditional small data set
Implement fundamental prediction models
– Using sample data: 80MB (1/5 – 1/6 of the data set)
Select the best model among a number of classifiers
Experiment Environment (Cont'd)
Spark ML
Test with Databricks CE and IBM Cloud
– 470 MB
AWS EMR
– Analyze all data
• 470 MB (=> 718MB)
– Implement and evaluate prediction models
• 3 different models
• Spark clusters with 3 different numbers of nodes
Hardware Specifications: Spark
IBM DSX Lite
Python 2, Spark 2.1
File system: Object Storage
2 Spark executors, 16GB memory
Databricks
Python 2, Spark 2.1 (auto-updating, Scala 2.10)
File system: Databricks File System
Single/Unlimited cluster, memory: 6GB
Experiment Environment
AWS EMR
EMR 12.1
– Spark 2.2.1 on Hadoop 2.8.3
– YARN with Ganglia 3.7.2 and Zeppelin 0.7.3
m3.xlarge instance
– Memory: 15.0 GiB
– CPU: 4 vCPUs
– Storage: 80 GiB (2 * 40 GiB SSD)
File system: S3
3 different EMR clusters
– Number of server nodes:
• 3, 6, 11 nodes
PySpark on Databricks
Work Flow in Azure ML
Relatively easy to build and test
Drag and Drop GUI
Work Flow
1. Data Engineering
– Understanding Data
– Data preparation
– Balancing data statistically
2. Data Science: Machine Learning (ML)
– Model building and validation
• Classification algorithms
– Model evaluation
– Model interpretation
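The "balancing data statistically" step in the workflow above can be illustrated with random undersampling of the majority class. The slides do not say which balancing method was used, so undersampling is an assumption here, and the function name, seed, and toy rows are hypothetical.

```python
import random


def undersample(rows, label_key="isFraud", seed=42):
    """Return a class-balanced copy of rows by downsampling the majority class.

    Assumption: balancing is done by random undersampling; the source
    does not name the method it used.
    """
    rng = random.Random(seed)
    fraud = [r for r in rows if r[label_key] == 1]
    normal = [r for r in rows if r[label_key] == 0]
    # The shorter list is the minority class; keep all of it.
    minority, majority = sorted([fraud, normal], key=len)
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data: 5 fraud rows among 100 transactions.
rows = [{"id": i, "isFraud": 1 if i < 5 else 0} for i in range(100)]
balanced = undersample(rows)
print(len(balanced), sum(r["isFraud"] for r in balanced))
```

Undersampling discards most normal transactions, which keeps training cheap but throws away data; oversampling the fraud class is the usual alternative.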
Data Understanding
• Numeric attributes:
amount, oldbalanceOrg, newbalanceOrg, oldbalanceDest,
newbalanceDest
• Categorical attributes:
step, type, isFraud, isFlaggedFraud
• String attributes:
nameOrig, nameDest
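The attribute groups above can be written down as a small typed parser. This is a sketch only: the column names follow this slide, the Python types chosen for each group are an assumption, and the one-row sample is made up for illustration rather than taken from the dataset.

```python
import csv
import io

# Attribute groups as listed on the slide (11 attributes in total).
NUMERIC = ["amount", "oldbalanceOrg", "newbalanceOrg",
           "oldbalanceDest", "newbalanceDest"]
CATEGORICAL = ["step", "type", "isFraud", "isFlaggedFraud"]
STRING = ["nameOrig", "nameDest"]


def parse_row(row):
    """Convert one CSV row (a dict of strings) into typed values."""
    typed = {name: float(row[name]) for name in NUMERIC}
    for name in CATEGORICAL:
        # Assumption: 'type' is a text category, the others are integer codes.
        typed[name] = row[name] if name == "type" else int(row[name])
    for name in STRING:
        typed[name] = row[name]
    return typed


# Hypothetical one-row sample, invented for this example.
SAMPLE_CSV = (
    "step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrg,"
    "nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud\n"
    "1,TRANSFER,181.0,C0000000001,181.0,0.0,C0000000002,0.0,0.0,1,0\n"
)
records = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
print(records[0])
```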
Experiment in Azure ML
Precision vs Recall
True Positive (TP): Fraud? Yes it is
False Negative (FN): No fraud? but it is
False Positive (FP): Fraud? but it is not
Precision
– TP / (TP + FP)
Recall
– TP / (TP + FN)
Ref: https://en.wikipedia.org/wiki/Precision_and_recall
Positive: Event occurs (Fraud)
Negative: Event does not occur (non-Fraud)
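The two formulas above translate directly into code. The counts below are hypothetical and chosen only to make the arithmetic easy to follow; they are not results from the experiments.

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP); 0.0 when nothing was predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Recall = TP / (TP + FN); 0.0 when there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical counts: 90 frauds caught, 10 false alarms, 30 frauds missed.
tp, fp, fn = 90, 10, 30
print(precision(tp, fp), recall(tp, fn))
```

With these counts, 90 of 100 flagged transactions are real fraud, while 90 of 120 actual frauds are caught.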
Model Evaluation
Focus more on Recall
to capture the most fraudulent transactions
Bad Recall: Fatal
– If there are many false negatives (FN)
• the transaction is predicted as normal, not fraud
– but it is a fraud
– Painful
• Need to decrease FN
– That is, to increase Recall
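One common way to trade false negatives for false positives is to lower the decision threshold of a scoring classifier. The slides do not describe this technique, so the sketch below is an illustration of the "decrease FN, increase Recall" relationship under that assumption; the labels and scores are hypothetical.

```python
def confusion_counts(labels, scores, threshold):
    """Count TP, FP, and FN when scores >= threshold are predicted as fraud."""
    tp = fp = fn = 0
    for label, score in zip(labels, scores):
        predicted_fraud = score >= threshold
        if predicted_fraud and label == 1:
            tp += 1
        elif predicted_fraud and label == 0:
            fp += 1
        elif not predicted_fraud and label == 1:
            fn += 1
    return tp, fp, fn


def recall_at(labels, scores, threshold):
    """Recall = TP / (TP + FN) at the given threshold."""
    tp, _, fn = confusion_counts(labels, scores, threshold)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical data: 1 = fraud, 0 = normal, with made-up model scores.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.2, 0.7, 0.35, 0.1, 0.05]

strict = recall_at(labels, scores, 0.5)  # fewer alerts, more missed frauds
loose = recall_at(labels, scores, 0.3)   # more alerts, fewer missed frauds
print(strict, loose)
```

Lowering the threshold from 0.5 to 0.3 removes one false negative in this toy data, at the cost of one extra false positive.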
Experimental Results in AzureML
Model                          Accuracy  Precision  Recall
Two Class Logistic Regression
Two Class Decision Forest
Two Class Decision Jungle                0.916      0.998
Experimental Results
Accuracy
Decision Jungle
– Highest Recall: 0.998
• While Precision: 0.916
– With small sample data set: 359KB
• takes 11 sec
Performance:
Time taken to build a model with the whole data set:
– 470MB + data tweaking
– Over a day
Good guide
for adopting the 3 similar algorithms in Spark ML
– Decision Tree, Random Forest, Logistic Regression
Experiment with Spark ML
1. Load the data source
– 470 MB (=> 718MB)
2. Train and build the models
– Balanced data statistically
3. Evaluate
Define the pipeline
Train the models
Pipeline Classification with Spark ML
[Diagram: Spark ML classification pipeline. Labels recovered from the figure: Feature Transformer, ParamMap, Estimator, Classification Estimator, Classification Evaluator, Validation Estimator, Model Transformer, Classification Evaluator.]
Pipeline Classification with Spark ML
[Diagram: Spark ML classification pipeline, as on the previous slide, with a callout]
Feature is generated from input columns
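Generating a feature from input columns can be sketched in plain Python. Spark ML provides `StringIndexer` and `VectorAssembler` for this kind of step; the code below is a stand-in that mimics the idea without Spark, not the authors' pipeline. The helper names and the two rows are hypothetical, and the alphabetical category ordering is a simplification.

```python
NUMERIC_COLUMNS = ["amount", "oldbalanceOrg", "newbalanceOrg",
                   "oldbalanceDest", "newbalanceDest"]


def build_type_index(rows):
    """Map each distinct transaction type to an integer (alphabetical order)."""
    return {value: i for i, value in enumerate(sorted({r["type"] for r in rows}))}


def to_feature_vector(row, type_index):
    """Concatenate the indexed 'type' with the numeric columns into one list."""
    return [float(type_index[row["type"]])] + [float(row[c]) for c in NUMERIC_COLUMNS]


# Hypothetical rows, invented for this example.
rows = [
    {"type": "TRANSFER", "amount": 181.0, "oldbalanceOrg": 181.0,
     "newbalanceOrg": 0.0, "oldbalanceDest": 0.0, "newbalanceDest": 0.0},
    {"type": "PAYMENT", "amount": 50.0, "oldbalanceOrg": 200.0,
     "newbalanceOrg": 150.0, "oldbalanceDest": 0.0, "newbalanceDest": 0.0},
]
type_index = build_type_index(rows)
features = [to_feature_vector(r, type_index) for r in rows]
print(features)
```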
Pipeline Classification with Spark ML
[Diagram: Spark ML classification pipeline, as on the previous slide, with a callout]
Classifiers: Decision Tree, RandomForest, LogisticRegression
Pipeline Classification with Spark ML
[Diagram: Spark ML classification pipeline, as on the previous slide, with a callout]
Combination of parameters: Max Bins, Max Depth, …
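A "combination of parameters" is the Cartesian product of the candidate values for each parameter. Spark ML builds such grids with `ParamGridBuilder`; the plain-Python sketch below shows the same idea with `itertools.product`. The candidate values for max depth and max bins are hypothetical, since the slides do not list the values that were tried.

```python
from itertools import product


def build_param_grid(options):
    """Return every combination of the candidate values, one dict per combination."""
    names = sorted(options)
    return [dict(zip(names, combo))
            for combo in product(*(options[name] for name in names))]


# Hypothetical candidate values for a tree-based classifier.
grid = build_param_grid({"maxDepth": [5, 10], "maxBins": [16, 32, 64]})
for params in grid:
    print(params)
```

Two depths and three bin counts give six combinations, so the validator trains six candidate models per fold or split.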
Pipeline Classification with Spark ML
[Diagram: Spark ML classification pipeline, as on the previous slide, with a callout]
Validators: Cross Validator, Train Validation Split
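The two validators differ in how they split the data: cross-validation rotates the validation set over k folds, while a train-validation split holds out one fixed part. The sketch below shows both on row indices. It uses contiguous blocks for readability, whereas Spark ML's `CrossValidator` and `TrainValidationSplit` split randomly; the function names and sizes are made up for the example.

```python
def k_fold_indices(n, k):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, validation
        start += size


def train_validation_split(n, train_ratio=0.75):
    """Return one (train, validation) pair of index lists."""
    cut = int(n * train_ratio)
    return list(range(cut)), list(range(cut, n))


folds = list(k_fold_indices(10, 3))
single = train_validation_split(8, 0.75)
print([len(v) for _, v in folds], single)
```

Cross-validation trains k models per parameter combination, so it costs roughly k times as much as a single train-validation split but gives a steadier estimate.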
Results
Columns: Model, Area under ROC, Precision, Recall
DecisionTreeClassifier
RandomForestClassifier 0.909573
LogisticRegression
• 3 models with different combinations of the parameters
• Time taken (Spark cluster): 1 hour
• In theory, with linear scalability: 2 minutes with 30 Spark clusters
• Random Forest has the best recall score compared to Decision Tree and Logistic Regression
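The 2-minute figure follows from dividing the measured time by the scale factor, assuming perfectly linear scaling. The sketch below reproduces that arithmetic; the function name is hypothetical, and real clusters rarely scale this cleanly.

```python
def ideal_minutes(base_minutes, scale_factor):
    """Ideal run time when resources grow by scale_factor under linear scaling."""
    if scale_factor <= 0:
        raise ValueError("scale_factor must be positive")
    return base_minutes / scale_factor


# 1 hour on the baseline, scaled out 30 times.
estimate = ideal_minutes(60, 30)
print(estimate)
```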
Experimental Results in AWS
Execution times
3 nodes:
– 40 to 70 min
11 nodes:
– 10 to 20 min
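These timings can be compared with the node ratio to check how close the cluster comes to linear scaling. The sketch pairs the two lower bounds and the two upper bounds of the reported ranges, which is an assumption, since the slides do not say which runs correspond to each other.

```python
def speedup(minutes_small_cluster, minutes_large_cluster):
    """How many times faster the larger cluster finished."""
    return minutes_small_cluster / minutes_large_cluster


node_ratio = 11 / 3                # ideal speedup under linear scaling
fastest_runs = speedup(40, 10)     # lower bounds of both ranges
slowest_runs = speedup(70, 20)     # upper bounds of both ranges
print(round(node_ratio, 2), fastest_runs, slowest_runs)
```

Under that pairing, the observed speedups bracket the ideal ratio of about 3.67, which is consistent with the linear-scalability claim.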
Summary
Introduction to Big Data
Introduction to Big Data Predictive Analytics
Experimental Result of Fraud Detection
Recall:
– RandomForest in SparkML
– DecisionJungle in AzureML
Performance:
– Traditional Systems:
• not good for large scale data
– Spark ML:
• Linearly Scalable
• Fast
Questions?
References
1. Jongwook Woo and Yuhang Xu, "Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing", The 2011 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2011), Las Vegas (July 18-21, 2011)
2. Jongwook Woo, "Market Basket Analysis Algorithms with MapReduce", DMKD-00150, Wiley Interdisciplinary Reviews Data Mining and Knowledge Discovery, Oct 28 2013, Volume 3, Issue 6, pp. 445-452, ISSN 1942-4795
3. Jongwook Woo, "Big Data Trend and Open Data", UKC 2016, Dallas, TX, Aug 12 2016
4. How to choose algorithms for Microsoft Azure Machine Learning, https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-algorithm-choice
5. Manik Katyal, Parag Chhadva, Shubhra Wahi and Jongwook Woo, "Big Data Analysis using Spark for Collision Rate Near CalStateLA", https://globaljournals.org/GJCST_Volume16/1-Big-Data-Analysis-using-Spark.pdf
6. Spark Programming Guide, http://spark.apache.org/docs/latest/programming-guide.html
7. Priyanka Purushu, Niklas Melcher, Bhagyashree Bhagwat, Jongwook Woo, "Predictive Analysis of Financial Fraud Detection using Azure and Spark ML", Asia Pacific Journal of Information Systems (accepted in Sept 2018)
8. TensorFrames: Google Tensorflow on Apache Spark, https://www.slideshare.net/databricks/tensorframes-google-tensorflow-on-apache-spark
9. Deep learning and Apache Spark, https://www.slideshare.net/QuantUniversity/deep-learning-and-apache-spark
10. Which Is Deeper - Comparison Of Deep Learning Frameworks On Spark, https://www.slideshare.net/SparkSummit/which-is-deeper-comparison-of-deep-learning-frameworks-on-spark
11. Accelerating Machine Learning and Deep Learning At Scale with Apache Spark, https://www.slideshare.net/SparkSummit/accelerating-machine-learning-and-deep-learning-at-scalewith-apache-spark-keynote-by-ziya-ma
12. Deep Learning with Apache Spark and TensorFlow, https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html
13. Tensor Flow Deep Learning Open SAP
14. Overview of Smart Factory, https://www.slideshare.net/BrendanSheppard1/overview-of-smart-factory-solutions-68137094/6
15. https://dzone.com/articles/sqoop-import-data-from-mysql-tohive

More Related Content

What's hot

Traffic Data Analysis and Prediction using Big Data
Traffic Data Analysis and Prediction using Big DataTraffic Data Analysis and Prediction using Big Data
Traffic Data Analysis and Prediction using Big DataJongwook Woo
 
The Importance of Open Innovation in AI era
The Importance of Open Innovation in AI eraThe Importance of Open Innovation in AI era
The Importance of Open Innovation in AI eraJongwook Woo
 
Predictive Analytics - Big Data & Artificial Intelligence
Predictive Analytics - Big Data & Artificial IntelligencePredictive Analytics - Big Data & Artificial Intelligence
Predictive Analytics - Big Data & Artificial IntelligenceManish Jain
 
The Evolution of Data Science
The Evolution of Data ScienceThe Evolution of Data Science
The Evolution of Data ScienceKenny Daniel
 
Big Data Science: Intro and Benefits
Big Data Science: Intro and BenefitsBig Data Science: Intro and Benefits
Big Data Science: Intro and BenefitsChandan Rajah
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceANOOP V S
 
Big Data and Data Science: The Technologies Shaping Our Lives
Big Data and Data Science: The Technologies Shaping Our LivesBig Data and Data Science: The Technologies Shaping Our Lives
Big Data and Data Science: The Technologies Shaping Our LivesRukshan Batuwita
 
Lecture3 business intelligence
Lecture3 business intelligenceLecture3 business intelligence
Lecture3 business intelligencehktripathy
 
Predictive analytics and big data tutorial
Predictive analytics and big data tutorial Predictive analytics and big data tutorial
Predictive analytics and big data tutorial Benjamin Taylor
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introductionhktripathy
 
Introduction to data science intro,ch(1,2,3)
Introduction to data science intro,ch(1,2,3)Introduction to data science intro,ch(1,2,3)
Introduction to data science intro,ch(1,2,3)heba_ahmad
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceEdureka!
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data scienceSampath Kumar
 
Public Data and Data Mining Competitions - What are Lessons?
Public Data and Data Mining Competitions - What are Lessons?Public Data and Data Mining Competitions - What are Lessons?
Public Data and Data Mining Competitions - What are Lessons?Gregory Piatetsky-Shapiro
 

What's hot (20)

AI on Big Data
AI on Big DataAI on Big Data
AI on Big Data
 
Traffic Data Analysis and Prediction using Big Data
Traffic Data Analysis and Prediction using Big DataTraffic Data Analysis and Prediction using Big Data
Traffic Data Analysis and Prediction using Big Data
 
The Importance of Open Innovation in AI era
The Importance of Open Innovation in AI eraThe Importance of Open Innovation in AI era
The Importance of Open Innovation in AI era
 
Analytics and Data Mining Industry Overview
Analytics and Data Mining Industry OverviewAnalytics and Data Mining Industry Overview
Analytics and Data Mining Industry Overview
 
Predictive Analytics - Big Data & Artificial Intelligence
Predictive Analytics - Big Data & Artificial IntelligencePredictive Analytics - Big Data & Artificial Intelligence
Predictive Analytics - Big Data & Artificial Intelligence
 
The Evolution of Data Science
The Evolution of Data ScienceThe Evolution of Data Science
The Evolution of Data Science
 
Big Data Science: Intro and Benefits
Big Data Science: Intro and BenefitsBig Data Science: Intro and Benefits
Big Data Science: Intro and Benefits
 
Intro to Data Science by DatalentTeam at Data Science Clinic#11
Intro to Data Science by DatalentTeam at Data Science Clinic#11Intro to Data Science by DatalentTeam at Data Science Clinic#11
Intro to Data Science by DatalentTeam at Data Science Clinic#11
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Are you ready for BIG DATA?
Are you ready for BIG DATA?Are you ready for BIG DATA?
Are you ready for BIG DATA?
 
Big Data and Data Science: The Technologies Shaping Our Lives
Big Data and Data Science: The Technologies Shaping Our LivesBig Data and Data Science: The Technologies Shaping Our Lives
Big Data and Data Science: The Technologies Shaping Our Lives
 
Data mining
Data miningData mining
Data mining
 
Data science
Data scienceData science
Data science
 
Lecture3 business intelligence
Lecture3 business intelligenceLecture3 business intelligence
Lecture3 business intelligence
 
Predictive analytics and big data tutorial
Predictive analytics and big data tutorial Predictive analytics and big data tutorial
Predictive analytics and big data tutorial
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introduction
 
Introduction to data science intro,ch(1,2,3)
Introduction to data science intro,ch(1,2,3)Introduction to data science intro,ch(1,2,3)
Introduction to data science intro,ch(1,2,3)
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
 
Public Data and Data Mining Competitions - What are Lessons?
Public Data and Data Mining Competitions - What are Lessons?Public Data and Data Mining Competitions - What are Lessons?
Public Data and Data Mining Competitions - What are Lessons?
 

Similar to Predictive Analysis of Financial Fraud Detection using Azure and Spark ML

Big Data Trend and Open Data
Big Data Trend and Open DataBig Data Trend and Open Data
Big Data Trend and Open DataJongwook Woo
 
Comparing Scalable Predictive Analysis using Spark XGBoost Platforms
Comparing Scalable Predictive Analysis using Spark XGBoost PlatformsComparing Scalable Predictive Analysis using Spark XGBoost Platforms
Comparing Scalable Predictive Analysis using Spark XGBoost PlatformsJongwook Woo
 
Big Data Platform adopting Spark and Use Cases with Open Data
Big Data  Platform adopting Spark and Use Cases with Open DataBig Data  Platform adopting Spark and Use Cases with Open Data
Big Data Platform adopting Spark and Use Cases with Open DataJongwook Woo
 
President Election of Korea in 2017
President Election of Korea in 2017President Election of Korea in 2017
President Election of Korea in 2017Jongwook Woo
 
Introduction to Spark: Data Analysis and Use Cases in Big Data
Introduction to Spark: Data Analysis and Use Cases in Big Data Introduction to Spark: Data Analysis and Use Cases in Big Data
Introduction to Spark: Data Analysis and Use Cases in Big Data Jongwook Woo
 
Big Data Trend with Open Platform
Big Data Trend with Open PlatformBig Data Trend with Open Platform
Big Data Trend with Open PlatformJongwook Woo
 
Information Security Analytics
Information Security AnalyticsInformation Security Analytics
Information Security AnalyticsAmrit Chhetri
 
Big Data and Advanced Data Intensive Computing
Big Data and Advanced Data Intensive ComputingBig Data and Advanced Data Intensive Computing
Big Data and Advanced Data Intensive ComputingJongwook Woo
 
Introduction To Big Data and Use Cases on Hadoop
Introduction To Big Data and Use Cases on HadoopIntroduction To Big Data and Use Cases on Hadoop
Introduction To Big Data and Use Cases on HadoopJongwook Woo
 
Big data4businessusers
Big data4businessusersBig data4businessusers
Big data4businessusersBob Hardaway
 
Big Data Session 1.pptx
Big Data Session 1.pptxBig Data Session 1.pptx
Big Data Session 1.pptxElsonPaul2
 
(R17A0528) BIG DATA ANALYTICS.pdf
(R17A0528) BIG DATA ANALYTICS.pdf(R17A0528) BIG DATA ANALYTICS.pdf
(R17A0528) BIG DATA ANALYTICS.pdfPoornimaShetty27
 
(R17A0528) BIG DATA ANALYTICS.pdf
(R17A0528) BIG DATA ANALYTICS.pdf(R17A0528) BIG DATA ANALYTICS.pdf
(R17A0528) BIG DATA ANALYTICS.pdfSreenivasa Harish
 
Big Data - A Real Life Revolution
Big Data - A Real Life RevolutionBig Data - A Real Life Revolution
Big Data - A Real Life RevolutionCapgemini
 
Présentation on radoop
Présentation on radoop   Présentation on radoop
Présentation on radoop siliconsudipt
 
Big Data and Data Intensive Computing on Networks
Big Data and Data Intensive Computing on NetworksBig Data and Data Intensive Computing on Networks
Big Data and Data Intensive Computing on NetworksJongwook Woo
 
Spark tutorial @ KCC 2015
Spark tutorial @ KCC 2015Spark tutorial @ KCC 2015
Spark tutorial @ KCC 2015Jongwook Woo
 
Big Data Analytics Tutorial | Big Data Analytics for Beginners | Hadoop Tutor...
Big Data Analytics Tutorial | Big Data Analytics for Beginners | Hadoop Tutor...Big Data Analytics Tutorial | Big Data Analytics for Beginners | Hadoop Tutor...
Big Data Analytics Tutorial | Big Data Analytics for Beginners | Hadoop Tutor...Edureka!
 

Similar to Predictive Analysis of Financial Fraud Detection using Azure and Spark ML (20)

Big Data Trend and Open Data
Big Data Trend and Open DataBig Data Trend and Open Data
Big Data Trend and Open Data
 
Comparing Scalable Predictive Analysis using Spark XGBoost Platforms
Comparing Scalable Predictive Analysis using Spark XGBoost PlatformsComparing Scalable Predictive Analysis using Spark XGBoost Platforms
Comparing Scalable Predictive Analysis using Spark XGBoost Platforms
 
Big Data Platform adopting Spark and Use Cases with Open Data
Big Data  Platform adopting Spark and Use Cases with Open DataBig Data  Platform adopting Spark and Use Cases with Open Data
Big Data Platform adopting Spark and Use Cases with Open Data
 
President Election of Korea in 2017
President Election of Korea in 2017President Election of Korea in 2017
President Election of Korea in 2017
 
Introduction to Spark: Data Analysis and Use Cases in Big Data
Introduction to Spark: Data Analysis and Use Cases in Big Data Introduction to Spark: Data Analysis and Use Cases in Big Data
Introduction to Spark: Data Analysis and Use Cases in Big Data
 
Big Data Trend with Open Platform
Big Data Trend with Open PlatformBig Data Trend with Open Platform
Big Data Trend with Open Platform
 
Information Security Analytics
Information Security AnalyticsInformation Security Analytics
Information Security Analytics
 
Big Data and Advanced Data Intensive Computing
Big Data and Advanced Data Intensive ComputingBig Data and Advanced Data Intensive Computing
Big Data and Advanced Data Intensive Computing
 
Introduction To Big Data and Use Cases on Hadoop
Introduction To Big Data and Use Cases on HadoopIntroduction To Big Data and Use Cases on Hadoop
Introduction To Big Data and Use Cases on Hadoop
 
Big data4businessusers
Big data4businessusersBig data4businessusers
Big data4businessusers
 
Azure HDInsight
Azure HDInsightAzure HDInsight
Azure HDInsight
 
Big Data Session 1.pptx
Big Data Session 1.pptxBig Data Session 1.pptx
Big Data Session 1.pptx
 
(R17A0528) BIG DATA ANALYTICS.pdf
(R17A0528) BIG DATA ANALYTICS.pdf(R17A0528) BIG DATA ANALYTICS.pdf
(R17A0528) BIG DATA ANALYTICS.pdf
 
(R17A0528) BIG DATA ANALYTICS.pdf
(R17A0528) BIG DATA ANALYTICS.pdf(R17A0528) BIG DATA ANALYTICS.pdf
(R17A0528) BIG DATA ANALYTICS.pdf
 
Big Data - A Real Life Revolution
Big Data - A Real Life RevolutionBig Data - A Real Life Revolution
Big Data - A Real Life Revolution
 
Analytics&IoT
Analytics&IoTAnalytics&IoT
Analytics&IoT
 
Présentation on radoop
Présentation on radoop   Présentation on radoop
Présentation on radoop
 
Big Data and Data Intensive Computing on Networks
Big Data and Data Intensive Computing on NetworksBig Data and Data Intensive Computing on Networks
Big Data and Data Intensive Computing on Networks
 
Spark tutorial @ KCC 2015
Spark tutorial @ KCC 2015Spark tutorial @ KCC 2015
Spark tutorial @ KCC 2015
 
Big Data Analytics Tutorial | Big Data Analytics for Beginners | Hadoop Tutor...
Big Data Analytics Tutorial | Big Data Analytics for Beginners | Hadoop Tutor...Big Data Analytics Tutorial | Big Data Analytics for Beginners | Hadoop Tutor...
Big Data Analytics Tutorial | Big Data Analytics for Beginners | Hadoop Tutor...
 

More from Jongwook Woo

Machine Learning in Quantum Computing
Machine Learning in Quantum ComputingMachine Learning in Quantum Computing
Machine Learning in Quantum ComputingJongwook Woo
 
Whose tombs are so called Nakrang tombs in Pyungyang? By Moon Sungjae
Whose tombs are so called Nakrang tombs in Pyungyang? By Moon SungjaeWhose tombs are so called Nakrang tombs in Pyungyang? By Moon Sungjae
Whose tombs are so called Nakrang tombs in Pyungyang? By Moon SungjaeJongwook Woo
 
Big Data Analysis in Hydrogen Station using Spark and Azure ML
Big Data Analysis in Hydrogen Station using Spark and Azure MLBig Data Analysis in Hydrogen Station using Spark and Azure ML
Big Data Analysis in Hydrogen Station using Spark and Azure MLJongwook Woo
 
Predictive Analysis of Financial Fraud Detection using Azure and Spark ML

  • 1. Jongwook Woo BigDAI HiPIC CalStateLA IDEAS SoCal Conf 2018 Oct 20 2018 Jongwook Woo, PhD, jwoo5@calstatela.edu Priyanka Purushu, Niklas Melcher, Bhagyashree Bhagwat Big Data AI Center (BigDAI / HiPIC) California State University Los Angeles Predictive Analysis of Financial Fraud Detection using Azure and Spark ML
  • 2. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA Contents  Myself  Introduction To Big Data  Introduction To Big Data Predictive Analytics  Fraud Detection Predictive Analytics  Summary
  • 3. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA Myself Experience:  Since 2002, Professor at California State University Los Angeles – PhD in 2001: Computer Science and Engineering at USC  Since 1998: R&D consulting in Hollywood – Warner Bros (Matrix online game), E!, citysearch.com, ARM etc – Information Search and Integration with FAST, Lucene/Solr, Sphinx – implements eBusiness applications using J2EE and middleware  Since 2007: Exposed to Big Data at CitySearch.com  2012 - Present : Big Data Academic Partnerships – For Big Data research and training • Amazon AWS, Microsoft Azure, IBM Bluemix • Databricks, Hadoop vendors
  • 4. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA Myself: Partners for Services
  • 5. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA Experience in Big Data  Collaboration  Big Data Technical Advisor of Isaac Engineering for Smart * (Factory, Farms, …) in Korea  Council Member of IBM Spark Technology Center  City of Los Angeles for DSF, OpenHub and Open Data  Startup Companies in Los Angeles  External Collaborator and Advisor in Big Data – IMSC of USC – Pennsylvania State University – The Big Link, Softzen, Wiken in Korea  Grants  Oracle Cloud Big Data, IBM Bluemix, Microsoft Windows Azure, Amazon AWS in Research and Education Grant  Partnership  Academic Education Partnership with Databricks, Tableau, Qlik, Cloudera, Hortonworks, SAS, Teradata
  • 6. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA Myself: Public Partners
  • 7. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA Myself: S/W Development Lead http://www.mobygames.com/game/windows/matrix-online/credits
  • 8. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA Contents  Myself  Introduction To Big Data  Introduction To Big Data Predictive Analytics  Fraud Detection Predictive Analytics  Summary
  • 9. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA Data Issues Large-Scale data Tera-Byte (10^12), Peta-byte (10^15) – Because of web – Smart *: Sensor Data (IoT), Bioinformatics, Social Computing, Streaming data, smart phone, online game… Cannot be handled with the legacy approach Too big Non-/Semi-structured data Too expensive Need new systems Non-expensive
  • 10. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA Two Cores in Big Data How to store Big Data How to compute Big Data Google How to store Big Data – GFS – Distributed Systems on non-expensive commodity computers How to compute Big Data – MapReduce – Parallel Computing with non-expensive computers Own super computers Published papers in 2003, 2004
  • 11. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA What is Hadoop? 11  Hadoop Founder: o Doug Cutting  Apache Committer: Lucene, Nutch, …
  • 12. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA Super Computer vs Hadoop Parallel vs. Distributed file systems by Michael Malak Updated by Jongwook Woo Cluster for Store Cluster for Compute/Store Cluster for Compute
  • 13. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA Hadoop Cluster: Logical Diagram Web Browser of Cluster monitor: CM/Ambari HTTP(S) Agent Hadoop Agent Hadoop Agent Hadoop Agent Hadoop Agent Hadoop Agent Hadoop Cluster Monitor . . . . . . . . . Agent Hadoop Agent Hadoop Agent Hadoop HDFS HDFS HDFS HDFS HDFS HDFS HIVE ZooKeeper Impala
  • 14. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA Hadoop Ecosystems http://dawn.dbsdataprojects.com/tag/hadoop/
  • 15. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA Definition: Big Data Non-expensive frameworks that are distributed parallel systems and that can store large-scale data and process it in parallel [1, 2] Hadoop – Non-expensive Super Computer – More public than the traditional super computers • You can store and process your applications – In your university labs, small companies, research centers Others – NoSQL DB (Cassandra, MongoDB, Redis, HBase) – ElasticSearch
  • 16. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA Contents  Myself  Introduction To Big Data  Introduction To Big Data Predictive Analytics  Fraud Detection Predictive Analytics  Summary
  • 17. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA Alternative to Hadoop MapReduce Limitation in MapReduce Hard to program in Java Batch Processing – Not interactive Disk storage for intermediate data – Performance issue Spark by UC Berkeley AMP Lab  In-Memory storage for intermediate data  20 ~ 100 times faster than N/W and Disk – MapReduce Good in Machine Learning – Iterative algorithms
  • 18. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA Integrating Spark and Hadoop  Spark  File Systems: Tachyon  Resource Manager: Mesos  Dedicated Spark – Cassandra, Couchbase…  Integrating Spark into Hadoop cluster  As Hadoop has been in the market for over 10 years  Cloud Computing – Oracle Cloud Big Data Compute, Amazon AWS, Azure HDInsight, IBM Bluemix, Google Cloud Platform • Object Storage, S3  Hadoop vendors – HDP, CDH  Databricks: Spark on AWS & Azure – Not much Hadoop ecosystems
  • 19. Big Data AI Center (BDAIC / HiPIC) Jongwook Woo CalStateLA Spark Spark SQL Querying using SQL, HiveQL Data Frame Spark Streaming DStream – RDD in streaming ML Machine Learning on Data Frame, Pipelining MLlib – On RDD – Sparse vector support, Decision trees, Linear/Logistic Regression, PCA, SVM, …
  • 20. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA Contents  Myself  Introduction To Big Data  Introduction To Big Data Predictive Analytics  Fraud Detection Predictive Analytics  Summary
  • 21. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA Big Data Analysis and Prediction Flow Data Collection Batch API: Yelp, Google Streaming: Twitter, Apache NiFi, Kafka, StreamSets, Storm Open Data: Government Data Storage HDFS, S3, Object Storage, NoSQL DB (Couchbase)… Data Filtering Hive, Pig Data Analysis and Science Hive, Pig, Spark, BI Tools (Qlik, Tableau, …) Data Visualization Qlik, Excel PowerMap, Tableau, Looker, … - Big Data Engineering - Big Data Analysis - Big Data Science - Data Visualization
  • 22. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA Terms We know Data Engineering – Collect, clean, transform, filter data Data Analysis – Find insights from the existing data Data Science (Predictive Analysis) – Predict the trend or pattern from the existing data Do we know? Big Data Analysis and Science – Using Big Data for Data Analysis and Science • Hadoop, Spark, NoSQL DB, SAP HANA, ElasticSearch,.. – For Massive Data Set • How to store and compute?
  • 23. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA Big Data Science  Fraud Detection: Accepted to APJIS journal by Jongwook Woo et al in 2018 – Priyanka Purushu, Niklas Melcher, Bhagyashree Bhagwat – Indexed SCOPUS Goal Analyzing Transaction data and Fraud Detection – For Mobile Money Transaction • based on a sample of real transactions – extracted from one month of financial logs from a mobile money service – using Spark ML (Big Data) and Azure ML (Traditional)
  • 24. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA Financial Data Set  Data is always an issue  No publicly available datasets on financial services – Private nature of financial transactions PaySim – URL: https://www.kaggle.com/ntnu-testimon/paysim1 – generates a synthetic dataset • from the private dataset – that resembles the normal operation of transactions
  • 25. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA Financial Data Set (Cont‘d) Size: 470 MB (=> 718MB) 6,362,620 records Not that large in scale compared to data sets larger than a GB But its architecture here can be applicable to a much bigger data set – As it still adopts the Spark Computing Engine in Big Data – Linearly scalable Attributes: 11 Target Column to Predict: ‘isFraud’
  • 26. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA Experiment Environment: Traditional Systems and Big Data
  • 27. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA Experiment Environment Azure ML: Traditional small data set Implement fundamental prediction models – Using Sample data: 80MB (1/5 – 1/6 data set) Select the best model among number of classifications
  • 28. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA Experiment Environment (Cont‘d) Spark ML Test with Databricks CE and IBM Cloud – 470 MB AWS EMR – Analyze all data • 470 MB (=> 718MB) – Implement and evaluate prediction model • 3 different models • Spark Clusters with 3 different # of nodes
  • 29. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA Hardware Specifications: Spark IBM DSX Lite Python 2, Spark 2.1 File System: Object Storage 2 Spark Executors, 16GB Memory Databricks Python 2, Spark 2.1 (Auto-updating, Scala 2.10) File System : Databricks File System Single/Unlimited Cluster, Memory : 6GB Memory
  • 30. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA Experiment Environment AWS EMR EMR 12.1 – Spark 2.2.1 on Hadoop 2.8.3 – YARN with Ganglia 3.7.2 and Zeppelin 0.7.3.  m3.xlarge instance – Memory: 15.0 GiB, – CPU: 4 vCPUs, – Storage: 80 GiB (2 * 40 GiB SSD).  File System : S3 3 different EMR clusters – number of nodes that are servers: • 3, 6, 11 nodes
  • 31. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA PySpark on Databricks
  • 32. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA Work Flow in Azure ML  Relatively Easy to build and test Drag and Drop GUI Work Flow 1. Data Engineering – Understanding Data – Data preparation – Balancing data statistically 2. Data Science: Machine Learning (ML) – Model building and validation • Classification algorithms – Model evaluation – Model interpretation
  • 33. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA Data Understanding • Numeric attributes: amount, oldbalanceOrg, newbalanceOrg, oldbalanceDest, newbalanceDest • Categorical attributes: step, type, isFraud, isFlaggedFraud • String attributes: nameOrig, nameDest
  • 34. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA Experiment in Azure ML
  • 35. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA Precision vs Recall True Positive (TP): Fraud? Yes it is False Negative (FN): No fraud? but it is False Positive (FP): Fraud? but it is not  Precision  TP / (TP + FP)  Recall  TP / (TP + FN)  Ref: https://en.wikipedia.org/wiki/Precision_and_recall Positive: Event occurs (Fraud) Negative: Event does not Occur (non Fraud)
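The precision and recall definitions on this slide can be written out as a minimal Python sketch. The function name and the counts below are hypothetical and are not taken from the experiment; they only illustrate the two formulas.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from confusion-matrix counts.

    tp: frauds predicted as fraud (true positives)
    fp: normal transactions predicted as fraud (false positives)
    fn: frauds predicted as normal (false negatives)
    """
    # Guard against division by zero when a denominator is empty.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts for illustration only.
precision, recall = precision_recall(tp=90, fp=10, fn=30)
print(precision, recall)  # 0.9 0.75
```

With these made-up counts, 30 frauds slip through as false negatives, which is why recall (0.75) is lower than precision (0.9).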
  • 36. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA Model Evaluation More into Recall to capture the most fraudulent transactions Bad Recall: Fatal –If many false negative (FN) • predict the transaction as normal not fraud – but it is a fraud –Painful • Need to decrease FN – That is to increase Recall
  • 37. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA Experimental Results in AzureML (columns: Model, Accuracy, Precision, Recall) Two Class Logistic Regression; Two Class Decision Forest; Two Class Decision Jungle: Precision 0.916, Recall 0.998
  • 38. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA Experimental Results Accuracy Decision Jungle – Highest Recall 0.998 • While Precision: 0.916 – With small sample data set: 359KB • takes 11 sec Performance: Times taken to build a model with whole data set: – 470MB + data tweaking – Over a day Good Guide to adopt the 3 similar algorithms for Spark ML – Decision Tree, Random Forest, Logistic Regression
  • 39. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA Experiment with Spark ML 1. Load the data source  470 MB (=> 718MB) 2. Train and build the models o Balanced data statistically 3. Evaluate
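The slides say the data was balanced statistically but do not name the method. As one common option, the sketch below undersamples the majority class so that both classes have the same number of rows. This is an assumption for illustration, not the authors' procedure; the `undersample` helper and the toy rows are hypothetical, and only the `isFraud` column name comes from the data set.

```python
import random


def undersample(rows, label_key="isFraud", seed=42):
    """Balance a binary-labelled list of dicts by undersampling the majority class."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    fraud = [r for r in rows if r[label_key] == 1]
    normal = [r for r in rows if r[label_key] == 0]
    # Pick whichever class is smaller as the minority.
    if len(fraud) <= len(normal):
        minority, majority = fraud, normal
    else:
        minority, majority = normal, fraud
    # Draw as many majority rows as there are minority rows, without replacement.
    sampled = rng.sample(majority, len(minority))
    balanced = minority + sampled
    rng.shuffle(balanced)
    return balanced


# Toy data: 10 fraud rows out of 100 (hypothetical values).
rows = [{"amount": float(i), "isFraud": 1 if i % 10 == 0 else 0} for i in range(100)]
balanced = undersample(rows)
print(len(balanced), sum(r["isFraud"] for r in balanced))  # 20 10
```

Undersampling discards majority rows, so on a heavily imbalanced data set it trades training volume for a balanced target column.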
  • 40. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA Define the pipeline
  • 41. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA Train the models
  • 42. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA Pipeline Classification with Spark ML Feature Transformer ParamMap Estimator Classification Estimator Classification Evaluator Validation Estimator Model Transformer Classification Evaluator
  • 43. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA Pipeline Classification with Spark ML Feature Transformer ParamMap Estimator Classification Estimator Classification Evaluator Model Transformer Classification Evaluator Feature is generated from input columns Validation Estimator
  • 44. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA Pipeline Classification with Spark ML Feature Transformer ParamMap Estimator Classification Estimator Classification Evaluator Model Transformer Classification Evaluator Classifiers: Decision Tree, RandomForest, LogisticRegression Validation Estimator
  • 45. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA Pipeline Classification with Spark ML Feature Transformer ParamMap Estimator Classification Estimator Classification Evaluator Model Transformer Classification Evaluator Combination of Parameters: Max Bins, Max Depth,… Validation Estimator
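The "combination of parameters" step can be pictured as a Cartesian product over candidate values. The pure-Python sketch below is not the Spark ML API (Spark ML provides `ParamGridBuilder` for this purpose); the `build_param_grid` helper and the candidate values are hypothetical, and only the parameter names Max Bins and Max Depth come from the slide.

```python
from itertools import product


def build_param_grid(param_values):
    """Expand {name: [candidates]} into a list of {name: value} combinations."""
    names = sorted(param_values)  # fixed order so the output is deterministic
    candidate_lists = [param_values[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*candidate_lists)]


# Hypothetical candidate values for a tree-based classifier.
grid = build_param_grid({"maxDepth": [5, 10, 20], "maxBins": [16, 32]})
print(len(grid))  # 6
for params in grid:
    print(params)
```

A validator such as a cross validator or train-validation split then fits one model per combination and keeps the one with the best evaluator score, so the grid size multiplies the training cost.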
  • 46. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA Pipeline Classification with Spark ML Feature Transformer ParamMap Estimator Classification Estimator Classification Evaluator Model Transformer Classification Evaluator Validators: Cross Validator, Train Validation Split Validation Estimator
  • 47. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA Results Model Area under ROC Precision Recall DecisionTreeClassifier RandomForestClassifier 0.909573 LogisticRegression • 3 models with different combinations of the parameters • Times taken (Spark Cluster): 1 hour • In theory of Linear Scalability: 2 minutes with 30 Spark clusters • The Random Forest has the best recall score • compared to Decision Tree and Logistic Regression.
  • 48. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA Experimental Results in AWS Execution times 3 nodes: –40min – 70mins 11 nodes –10min – 20mins
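The linear-scalability estimate (1 hour on one Spark cluster, 2 minutes with 30) and the AWS timings above can be checked with simple arithmetic. The helper below is a hypothetical sketch that assumes perfectly linear scaling, which real clusters only approximate.

```python
def ideal_runtime_minutes(base_minutes, base_units, target_units):
    """Runtime under perfectly linear scaling: work is split evenly across units."""
    return base_minutes * base_units / target_units


# 1 hour on one cluster, spread over 30 clusters.
print(ideal_runtime_minutes(60, 1, 30))  # 2.0

# Projecting the 3-node AWS range (40 to 70 minutes) onto 11 nodes.
low = ideal_runtime_minutes(40, 3, 11)
high = ideal_runtime_minutes(70, 3, 11)
print(round(low, 1), round(high, 1))
```

Under this assumption the 3-node range projects to roughly 11 to 19 minutes on 11 nodes, which is in line with the 10 to 20 minutes reported for the 11-node cluster.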
  • 49. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA Contents  Myself  Introduction To Big Data  Smart Factory with Big Data  Summary
  • 50. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA Summary Introduction to Big Data Introduction to Big Data Predictive Analytics Experimental Result of Fraud Detection Recall: – RandomForest in SparkML – DecisionJungle in AzureML Performance: – Traditional Systems: • not good for large scale data – Spark ML: • Linearly Scalable • Fast
  • 51. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA Questions?
  • 52. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA References 1. “Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing”, Jongwook Woo and Yuhang Xu, The 2011 international Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2011), Las Vegas (July 18-21, 2011) 2. Jongwook Woo, DMKD-00150, “Market Basket Analysis Algorithms with MapReduce”, Wiley Interdisciplinary Reviews Data Mining and Knowledge Discovery, Oct 28 2013, Volume 3, Issue 6, pp445-452, ISSN 1942-4795 3. Jongwook Woo, “Big Data Trend and Open Data”, UKC 2016, Dallas, TX, Aug 12 2016 4. How to choose algorithms for Microsoft Azure Machine Learning, https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-algorithm-choice 5. “Big Data Analysis using Spark for Collision Rate Near CalStateLA”, Manik Katyal, Parag Chhadva, Shubhra Wahi & Jongwook Woo, https://globaljournals.org/GJCST_Volume16/1-Big-Data-Analysis-using-Spark.pdf 6. Spark Programming Guide: http://spark.apache.org/docs/latest/programming-guide.html 7. (Accepted in Sept 2018) Priyanka Purushu, Niklas Melcher, Bhagyashree Bhagwat, Jongwook Woo, "Predictive Analysis of Financial Fraud Detection using Azure and Spark ML", Asia Pacific Journal of Information Systems
  • 53. Big Data AI Center (BigDAI / HiPIC) Jongwook Woo CalStateLA References 8. TensorFrames: Google Tensorflow on Apache Spark, https://www.slideshare.net/databricks/tensorframes-google-tensorflow-on-apache-spark 9. Deep learning and Apache Spark, https://www.slideshare.net/QuantUniversity/deep-learning-and-apache-spark 10. Which Is Deeper - Comparison Of Deep Learning Frameworks On Spark, https://www.slideshare.net/SparkSummit/which-is-deeper-comparison-of-deep-learning-frameworks-on-spark 11. Accelerating Machine Learning and Deep Learning At Scale with Apache Spark, https://www.slideshare.net/SparkSummit/accelerating-machine-learning-and-deep-learning-at-scalewith-apache-spark-keynote-by-ziya-ma 12. Deep Learning with Apache Spark and TensorFlow, https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html 13. Tensor Flow Deep Learning Open SAP 14. Overview of Smart Factory, https://www.slideshare.net/BrendanSheppard1/overview-of-smart-factory-solutions-68137094/6 15. https://dzone.com/articles/sqoop-import-data-from-mysql-tohive