Jongwook Woo
HiPIC
CalStateLA
APIC-IST 2019
June 24 2019
Neha Gupta, Hai Anh Le,
Maria Boldina , Jongwook Woo
Big Data AI Center (BigDAI)
California State University Los Angeles
Predicting fraud of AD click
using
Traditional and Spark ML
Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Contents
 Introduction
 Data Set
 Data Fields Details
 Experiment Environment: Traditional and Big Data Systems
Work Flow in Azure ML
Data Bricks : Data Engineering
Algorithms
Appendix
References
Introduction
Click fraud: a person, automated script, or computer program imitates a legitimate user
clicking on an ad without any actual interest in the target of the ad's link,
resulting in misleading click data and wasted money
Companies suffer from huge volumes of fraudulent traffic
– Especially in the mobile market worldwide
Goal
Predict who will download the apps
Using Classification model
Traditional and Big Data approach
Introduction(Cont’d)
TalkingData
 China’s largest independent big data service platform
– covers over 70% of active mobile devices nationwide
 handles 3 billion clicks per day
– 90% of which are potentially fraudulent
 Goal of the Predictive Analysis
 Predict whether a user will download an app
– after clicking on a mobile app advertisement
 To better target the audience,
– to avoid fraudulent practices
– and save money
Data Set
 Dataset: TalkingData AdTracking Fraud Detection
https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/data
Dataset Property:
Original dataset size: 7GB
– contains 200 million clicks over a 4-day period
Dataset format: csv
Fields: 8
– Target Column to Predict: ‘is_attributed’
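As a rough illustration of the file layout, the sketch below reads a few rows of the csv format and reports the share of rows with `is_attributed` equal to 1. The column names are those published with the Kaggle training file; the sample rows are hypothetical and only stand in for the real data.

```python
import csv
import io

# Column names as published for the Kaggle training file (8 fields).
FIELDS = ["ip", "app", "device", "os", "channel",
          "click_time", "attributed_time", "is_attributed"]

# Hypothetical rows for illustration only; real values come from the dataset.
SAMPLE_CSV = """ip,app,device,os,channel,click_time,attributed_time,is_attributed
1001,3,1,13,379,2017-11-06 14:32:21,,0
1002,3,1,19,379,2017-11-06 14:33:34,,0
1003,35,1,13,21,2017-11-06 14:34:12,2017-11-06 15:02:10,1
1004,9,1,22,258,2017-11-06 14:34:52,,0
"""

def positive_share(csv_file, target="is_attributed"):
    """Return the fraction of rows whose target column equals '1'."""
    total = 0
    positives = 0
    for row in csv.DictReader(csv_file):
        total += 1
        if row[target] == "1":
            positives += 1
    return positives / total if total else 0.0

share = positive_share(io.StringIO(SAMPLE_CSV))
print(f"downloads: {share:.2%} of clicks")
```

Because the function streams rows one at a time, the same call should also work on an open file handle for a much larger csv file.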
Data Fields Details
Experiment Environment:
Traditional and Big Data Systems
Experiment Environment: Traditional
Azure ML Studio:
Traditional for small data set
Free Workspace
10GB storage
Single node
Implement fundamental prediction models
– Using Sample data: 80MB (1.1% of the original data set)
Select the best model among a number of classifiers
Experiment Environment: Spark
Spark ML:
Data Filtering:
– 1 GB from 8 GB
• Implemented Python code to reduce size to 1GB (15%)
– We have experimental result with 8GB as well
• For another publication
Databricks Subscription
– Cluster 4.0 (includes Apache Spark 2.3.0, Scala 2.11)
• 2 Spark Workers with total of 16 GB Memory and 4 Cores
• Python 2.7
• File System : Databricks File System
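The slides do not show the Python code used to shrink the data set, so the following is only a minimal sketch of one way to do it: keep the header and a seeded random 15% of the data lines. The function name `sample_lines` and the demo rows are hypothetical.

```python
import random

def sample_lines(lines, fraction, seed=42):
    """Yield the header line plus roughly `fraction` of the data lines."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    iterator = iter(lines)
    header = next(iterator, None)
    if header is None:
        return
    yield header
    for line in iterator:
        if rng.random() < fraction:
            yield line

# Demo on synthetic lines; a real run would iterate over the open csv file.
rows = ["ip,app,is_attributed"] + [f"{i},{i % 5},0" for i in range(10000)]
kept = list(sample_lines(rows, 0.15))
print(len(kept) - 1, "of", len(rows) - 1, "data rows kept")
```

Sampling line by line keeps memory use flat, which matters when the input is several GB.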
Experiment Environment: Spark (Cont’d)
Oracle Big Data Spark Cluster
 Oracle BDCE
Python 2.7.x, Spark 2.1.x
 10 nodes,
– 20 OCPUs, 300GB Memory, 1,154GB Storage
Work Flow in Azure ML
 Relatively Easy to build and test
Drag and Drop GUI
Work Flow
1. Data Engineering
– Understanding Data
– Data preparation
– Balancing data statistically
2. Data Science: Machine Learning (ML)
– Model building and validation
• Classification algorithms
– Model evaluation
– Model interpretation
Data Engineering
Unbalanced dataset
– 1: 0.19% App downloaded
– 0: 99.81% App not downloaded
1GB filtered dataset still too large for the traditional system: Azure ML Studio
– More sampling needed for Azure ML
Data Engineering
 SMOTE (Synthetic Minority Over-sampling Technique) takes a subset of data from the minority class and creates new, similar synthetic instances
– Helps balance data & avoid overfitting
– Increased percent of minority class (1) from 0.19% to 11%
 Stratified Split ensures that the output dataset contains a representative sample of the values in the selected column
– Ensures that the random sample does not contain all rows with just 0s
– 8% sample used = 80 MB
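The two ideas on this slide can be sketched in plain Python. This is not the Azure ML Studio implementation: `smote` is a simplified version of the technique (interpolating between a minority point and one of its nearest minority neighbours), and `stratified_sample` simply samples each class separately. All names, parameters, and demo values are assumptions for illustration.

```python
import random

def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each on the segment between a
    minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` inside the minority class
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic

def stratified_sample(rows, label_of, fraction, seed=0):
    """Sample `fraction` of the rows from each class separately,
    keeping at least one row per class."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    sample = []
    for members in by_class.values():
        n = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, n))
    return sample

# Hypothetical demo data: four minority points, then 1,000 labelled rows.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=8)

rows = [(i, 1 if i < 10 else 0) for i in range(1000)]
subset = stratified_sample(rows, label_of=lambda r: r[1], fraction=0.08)
print(len(new_points), "synthetic points;", len(subset), "sampled rows")
```

Sampling per class is what guarantees that the rare class (1) still appears in a small sample, which a plain random sample of a 0.19% class would not ensure.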
Algorithms in Azure ML Studio
 Two-Class Classification:
 classify the elements of a given set into two groups
– either downloaded, is_attributed (1)
– or not downloaded, is_attributed (0)
Decision trees
 often perform well on imbalanced datasets
– as their hierarchical structure allows them to learn signals from both classes.
Tree ensembles almost always outperform singular decision trees
– Algorithm #1: Two-class Decision Jungle
– Algorithm #2: Two-class Decision Forest
Selecting Performance Metrics
False Positives indicate the model predicted an app was downloaded when in fact it wasn’t
 Goal: minimize the FP => To save $$$
AZURE ML MODEL #1: TWO-CLASS DECISION JUNGLE
• 8% Sample
• SMOTE 5000%
• 70:30 Train/Test Split
• Cross-Validation
• Tune Model Hyperparameters
• Features used: all 7
AZURE ML MODEL #1: Tune Model Hyperparameters
With vs. without Tune Model Hyperparameters: AUC = 0.905 vs 0.606
With Tune Model Hyperparameters: Precision = 1.0 (TP = 35, FP = 0)
AZURE ML MODEL #2: TWO-CLASS DECISION FOREST
• 8% Sample
• SMOTE 5000%
• 70:30 Train/Test Split
• Cross-Validation
• Tune Model Hyperparameters
• Permutation Feature Importance
AZURE ML MODEL #2: Improving Precision
By increasing threshold from 0.5 to 0.8:
– Precision increased to 0.992
– FP decreased from 1,659 to 377
– FN increased from 1,834 to 5,142
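The trade-off behind this slide can be reproduced on a toy example. The scores and labels below are hypothetical, not the experiment's data; the sketch only shows why raising the decision threshold tends to remove false positives at the cost of more false negatives.

```python
def confusion_at(scores, labels, threshold):
    """Count (TP, FP, TN, FN) when scores >= threshold are predicted as 1."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

# Hypothetical model scores with their true labels (1 = downloaded).
scores = [0.95, 0.85, 0.75, 0.65, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]

for threshold in (0.5, 0.8):
    tp, fp, tn, fn = confusion_at(scores, labels, threshold)
    print(f"threshold={threshold}: TP={tp} FP={fp} TN={tn} FN={fn}")
```

On this toy data the false positives drop from 2 to 0 while the false negatives rise from 1 to 3, the same direction of change as reported on the slide.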
Experimental Results in Azure ML Studio
Performance:
Execution time with sample data set: 1GB
Decision Forest
– takes 2.5 hours
Decision Jungle
– takes 3 hours 19 min
The Azure ML Studio models serve as a good guide
 to adopt the 2 similar algorithms for Spark ML
– Decision Tree
– Random Forest
Experimental Results in AzureML
Two-class Decision Forest is the best model!
Experiment with Spark ML in Databricks
1. Load the data source
 1.03 GB
 Same filtered data set as Azure ML
2. Train and build the models
o Balanced data statistically
3. Evaluate
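The three steps above could look roughly like the following in PySpark. This is a sketch, not the authors' notebook: the feature list, `max_depth`, and `num_trees` are assumptions, and the PySpark imports are done inside the functions so the module can be loaded where Spark is not installed.

```python
LABEL_COL = "is_attributed"
# Assumed numeric input columns; the engineered features would be added here.
FEATURE_COLS = ["ip", "app", "device", "os", "channel"]

def build_pipelines(max_depth=5, num_trees=20):
    """Return one Spark ML Pipeline per classifier named in the slides."""
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import (DecisionTreeClassifier,
                                           RandomForestClassifier)
    from pyspark.ml.feature import VectorAssembler

    # Spark ML classifiers expect all inputs in a single vector column.
    assembler = VectorAssembler(inputCols=FEATURE_COLS, outputCol="features")
    tree = DecisionTreeClassifier(labelCol=LABEL_COL, featuresCol="features",
                                  maxDepth=max_depth)
    forest = RandomForestClassifier(labelCol=LABEL_COL, featuresCol="features",
                                    maxDepth=max_depth, numTrees=num_trees)
    return {
        "decision_tree": Pipeline(stages=[assembler, tree]),
        "random_forest": Pipeline(stages=[assembler, forest]),
    }

def evaluate_auc(fitted_model, test_df):
    """Area under the ROC curve of a fitted pipeline on a test DataFrame."""
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    evaluator = BinaryClassificationEvaluator(labelCol=LABEL_COL,
                                              rawPredictionCol="rawPrediction",
                                              metricName="areaUnderROC")
    return evaluator.evaluate(fitted_model.transform(test_df))
```

With a DataFrame `df` loaded from the csv file, a typical use would be `train_df, test_df = df.randomSplit([0.7, 0.3], seed=42)`, then `model = build_pipelines()["decision_tree"].fit(train_df)` and `evaluate_auc(model, test_df)`.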
Data Engineering
Generate features
Feature 1: extract day of the week and hour of the day from the click time
Feature 2: group clicks by combination of
– (Ip, Day_of_week_number and Hour)
Feature 3: group clicks by combination of
– (Ip, App, Operating System, Day_of_week_number and Hour)
Feature 4: group clicks by combination of
– (App, Day_of_week_number and Hour)
Feature 5: group clicks by combination of
– (Ip, App, Device and Operating System)
Feature 6: group clicks by combination of
– (Ip, Device and Operating System)
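A small plain-Python sketch of these features is given below. It assumes that "group clicks by combination" means attaching the number of clicks that share the same key combination; that reading, the helper names, and the sample clicks are assumptions rather than the authors' code.

```python
from collections import Counter
from datetime import datetime

def add_time_parts(click):
    """Feature 1: derive day of week and hour from click_time."""
    ts = datetime.strptime(click["click_time"], "%Y-%m-%d %H:%M:%S")
    enriched = dict(click)
    enriched["day_of_week"] = ts.weekday()  # Monday = 0 ... Sunday = 6
    enriched["hour"] = ts.hour
    return enriched

def add_group_count(rows, keys, name):
    """Features 2-6: attach the number of clicks sharing the same keys."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    for row in rows:
        row[name] = counts[tuple(row[k] for k in keys)]
    return rows

# Hypothetical clicks for illustration only.
clicks = [
    {"ip": 1001, "app": 3, "device": 1, "os": 13, "click_time": "2017-11-06 14:32:21"},
    {"ip": 1001, "app": 3, "device": 1, "os": 13, "click_time": "2017-11-06 14:40:05"},
    {"ip": 1001, "app": 9, "device": 1, "os": 13, "click_time": "2017-11-06 15:01:00"},
    {"ip": 1002, "app": 3, "device": 1, "os": 19, "click_time": "2017-11-06 14:59:59"},
]

rows = [add_time_parts(c) for c in clicks]
add_group_count(rows, ("ip", "day_of_week", "hour"), "clicks_ip_day_hour")
add_group_count(rows, ("app", "day_of_week", "hour"), "clicks_app_day_hour")
add_group_count(rows, ("ip", "device", "os"), "clicks_ip_device_os")

for row in rows:
    print(row["ip"], row["app"], row["hour"],
          row["clicks_ip_day_hour"], row["clicks_app_day_hour"],
          row["clicks_ip_device_os"])
```

In Spark the same counts would normally be computed with a group-by and a join, but the logic per feature is the same: one key tuple, one count.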
Spark ML MODEL #1: Decision Tree Classifier
Confusion Matrix
Spark ML MODEL #2: Random Forest Classifier
Confusion Matrix
Spark ML Result Comparison
Decision Tree Classifier is relatively the better model!
           Decision Tree Classifier  Random Forest Classifier
AUC        0.815                     0.746
PRECISION  0.822                     0.878
RECALL     0.633                     0.495
TP         86,683                    67,726
FP         18,727                    9,408
TN         7,112,961                 7,122,280
FN         50,074                    69,031
RMSE       0.0972                    0.1038
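The RMSE row can be cross-checked against the confusion-matrix counts. The sketch assumes RMSE was computed on hard 0/1 predictions, in which case every FP or FN contributes a squared error of 1 and every correct prediction contributes 0; the slides do not state how the value was obtained.

```python
from math import sqrt

def rmse_from_confusion(tp, fp, tn, fn):
    """RMSE of hard 0/1 predictions: sqrt(share of misclassified rows)."""
    total = tp + fp + tn + fn
    return sqrt((fp + fn) / total)

# Counts taken from the Spark ML comparison on this slide.
decision_tree = rmse_from_confusion(tp=86_683, fp=18_727, tn=7_112_961, fn=50_074)
random_forest = rmse_from_confusion(tp=67_726, fp=9_408, tn=7_122_280, fn=69_031)

print(f"Decision Tree RMSE: {decision_tree:.4f}")
print(f"Random Forest RMSE: {random_forest:.4f}")
```

Under that assumption both results land within about 0.0001 of the reported 0.0972 and 0.1038, so the counts and the RMSE row are consistent with each other.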
Experiment in Oracle Cluster
Oracle Big Data Spark Cluster
 10 nodes, 20 OCPUs, 300GB Memory, 1,154GB Storage
1. Load the data source
 1.03 GB
2. Sample the balanced data based on Downloaded
 116 MB
3. Train and build the models
o Balanced data statistically
4. Evaluate
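Step 2 can be sketched as undersampling the majority class. The slides do not give the sampling ratio; the value of four negatives per positive used here is an assumption, chosen because the Oracle confusion-matrix counts reported later show roughly that proportion. The helper name and demo rows are hypothetical.

```python
import random

def undersample(rows, label_of, negatives_per_positive, seed=0):
    """Keep every positive row and a random subset of the negative rows."""
    rng = random.Random(seed)
    positives = [r for r in rows if label_of(r) == 1]
    negatives = [r for r in rows if label_of(r) == 0]
    n_negatives = min(len(negatives),
                      int(len(positives) * negatives_per_positive))
    return positives + rng.sample(negatives, n_negatives)

# Hypothetical demo: 20 downloads among 1,000 clicks.
rows = [(i, 1 if i < 20 else 0) for i in range(1000)]
balanced = undersample(rows, label_of=lambda r: r[1], negatives_per_positive=4)

share = sum(r[1] for r in balanced) / len(balanced)
print(len(balanced), "rows kept,", f"{share:.0%} positive")
```

Keeping all positives and discarding most negatives shrinks the data (here from 1,000 to 100 rows) while raising the share of the rare class, which matches the drop from 1.03 GB to 116 MB described above.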
Azure ML Studio and Spark ML Result Comparison
           Two-Class         Two-Class         Decision Tree     Random Forest     Decision Tree     Random Forest
           Decision Jungle   Decision Forest   Classifier        Classifier        Classifier        Classifier
           (AzureML)         (AzureML)         (Databricks)      (Databricks)      (Balanced Sample  (Balanced Sample
                                                                                   Data, Oracle)     Data, Oracle)
AUC        0.905             0.997             0.815             0.746             0.896             0.893
PRECISION  1.0               0.992             0.822             0.878             0.935             0.934
RECALL     0.001             0.902             0.633             0.495             0.807             0.800
TP         35                47,199            86,683            67,726            111,187           110,220
FP         0                 377               18,727            9,408             7,712             7,791
TN         406,605           406,228           7,112,961         7,122,280         545,302           545,223
FN         52,306            5,142             50,074            69,031            26,604            27,571
Run Time   2 hrs             2-3 hrs           22 mins           50 mins           24 sec            2 mins
• Azure ML Two-class Decision Forest is the best model!
• Spark ML code needs to be updated for better accuracy
• Balanced Sampling based on the fraud in Oracle:
• Decision Tree has 0.935 in Precision
• Execution Time: 24 secs
Questions?
Appendix
Data Set Details (Cont‘d)
Precision vs Recall
True Positive (TP): Fraud? Yes it is
False Negative (FN): No fraud? but it is
False Positive (FP): Fraud? but it is not
 Precision
 TP / (TP + FP)
 Recall
 TP / (TP + FN)
 Ref: https://en.wikipedia.org/wiki/Precision_and_recall
Positive: Event occurs (Fraud)
Negative: Event does not occur (non-Fraud)
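As a worked check of the two formulas, the sketch below plugs in the counts reported for the Azure ML Two-Class Decision Forest (TP = 47,199, FP = 377, FN = 5,142). The function names are illustrative only.

```python
def precision(tp, fp):
    """TP / (TP + FP): share of predicted positives that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """TP / (TP + FN): share of actual positives that are found."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Counts reported for the Azure ML Two-Class Decision Forest.
forest_precision = precision(tp=47_199, fp=377)
forest_recall = recall(tp=47_199, fn=5_142)

print(f"precision = {forest_precision:.3f}, recall = {forest_recall:.3f}")
```

Rounded to three decimals these give 0.992 and 0.902, the values listed for that model in the result comparison.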
References
1. Priyanka Purushu, Niklas Melcher, Bhagyashree Bhagwat, Jongwook Woo, "Predictive Analysis of Financial Fraud Detection using Azure and Spark ML", Asia Pacific Journal of Information Systems (APJIS), Vol. 28, No. 4, December 2018, pp. 308-319
2. Jongwook Woo, DMKD-00150, "Market Basket Analysis Algorithms with MapReduce", Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, Oct 28 2013, Volume 3, Issue 6, pp. 445-452, ISSN 1942-4795
3. Jongwook Woo, "Big Data Trend and Open Data", UKC 2016, Dallas, TX, Aug 12 2016
4. How to choose algorithms for Microsoft Azure Machine Learning, https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-algorithm-choice
5. Manik Katyal, Parag Chhadva, Shubhra Wahi, Jongwook Woo, "Big Data Analysis using Spark for Collision Rate Near CalStateLA", https://globaljournals.org/GJCST_Volume16/1-Big-Data-Analysis-using-Spark.pdf
6. Spark Programming Guide, http://spark.apache.org/docs/latest/programming-guide.html
7. TensorFrames: Google Tensorflow on Apache Spark, https://www.slideshare.net/databricks/tensorframes-google-tensorflow-on-apache-spark
8. Deep learning and Apache Spark, https://www.slideshare.net/QuantUniversity/deep-learning-and-apache-spark
9. Which Is Deeper - Comparison Of Deep Learning Frameworks On Spark, https://www.slideshare.net/SparkSummit/which-is-deeper-comparison-of-deep-learning-frameworks-on-spark
10. Accelerating Machine Learning and Deep Learning At Scale with Apache Spark, https://www.slideshare.net/SparkSummit/accelerating-machine-learning-and-deep-learning-at-scalewith-apache-spark-keynote-by-ziya-ma
11. Deep Learning with Apache Spark and TensorFlow, https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html
12. Tensor Flow Deep Learning Open SAP
13. Overview of Smart Factory, https://www.slideshare.net/BrendanSheppard1/overview-of-smart-factory-solutions-68137094/6
14. https://dzone.com/articles/sqoop-import-data-from-mysql-tohive
15. https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/data
16. https://blogs.msdn.microsoft.com/andreasderuiter/2015/02/09/performance-measures-in-azure-ml-accuracy-precision-recall-and-f1-score/

More Related Content

What's hot

Towards Automatic Composition of Multicomponent Predictive Systems
Towards Automatic Composition of Multicomponent Predictive SystemsTowards Automatic Composition of Multicomponent Predictive Systems
Towards Automatic Composition of Multicomponent Predictive Systems
Manuel Martín
 
(2016)application of parallel glowworm swarm optimization algorithm for data ...
(2016)application of parallel glowworm swarm optimization algorithm for data ...(2016)application of parallel glowworm swarm optimization algorithm for data ...
(2016)application of parallel glowworm swarm optimization algorithm for data ...
Akram Pasha
 
Leveraging NLP and Deep Learning for Document Recommendations in the Cloud
Leveraging NLP and Deep Learning for Document Recommendations in the CloudLeveraging NLP and Deep Learning for Document Recommendations in the Cloud
Leveraging NLP and Deep Learning for Document Recommendations in the Cloud
Databricks
 
Adversarial Analytics - 2013 Strata & Hadoop World Talk
Adversarial Analytics - 2013 Strata & Hadoop World TalkAdversarial Analytics - 2013 Strata & Hadoop World Talk
Adversarial Analytics - 2013 Strata & Hadoop World Talk
Robert Grossman
 
What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care? What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care?
Robert Grossman
 
IRJET- Improved Model for Big Data Analytics using Dynamic Multi-Swarm Op...
IRJET-  	  Improved Model for Big Data Analytics using Dynamic Multi-Swarm Op...IRJET-  	  Improved Model for Big Data Analytics using Dynamic Multi-Swarm Op...
IRJET- Improved Model for Big Data Analytics using Dynamic Multi-Swarm Op...
IRJET Journal
 
Quick presentation for the OpenML workshop in Eindhoven 2014
Quick presentation for the OpenML workshop in Eindhoven 2014Quick presentation for the OpenML workshop in Eindhoven 2014
Quick presentation for the OpenML workshop in Eindhoven 2014
Manuel Martín
 
Concept Drift Identification using Classifier Ensemble Approach
Concept Drift Identification using Classifier Ensemble Approach  Concept Drift Identification using Classifier Ensemble Approach
Concept Drift Identification using Classifier Ensemble Approach
IJECEIAES
 
Big Data, The Community and The Commons (May 12, 2014)
Big Data, The Community and The Commons (May 12, 2014)Big Data, The Community and The Commons (May 12, 2014)
Big Data, The Community and The Commons (May 12, 2014)
Robert Grossman
 
A cyber physical stream algorithm for intelligent software defined storage
A cyber physical stream algorithm for intelligent software defined storageA cyber physical stream algorithm for intelligent software defined storage
A cyber physical stream algorithm for intelligent software defined storage
Made Artha
 
EXECUTION OF ASSOCIATION RULE MINING WITH DATA GRIDS IN WEKA 3.8
EXECUTION OF ASSOCIATION RULE MINING WITH DATA GRIDS IN WEKA 3.8EXECUTION OF ASSOCIATION RULE MINING WITH DATA GRIDS IN WEKA 3.8
EXECUTION OF ASSOCIATION RULE MINING WITH DATA GRIDS IN WEKA 3.8
International Educational Applied Scientific Research Journal (IEASRJ)
 
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...
IJDKP
 
[215]streetwise machine learning for painless parking
[215]streetwise machine learning for painless parking[215]streetwise machine learning for painless parking
[215]streetwise machine learning for painless parking
NAVER D2
 
As 7
As 7As 7
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Robert Grossman
 
Machine Learning for Data Extraction
Machine Learning for Data ExtractionMachine Learning for Data Extraction
Machine Learning for Data Extraction
Dasha Herrmannova
 
Introduction to Biological Network Analysis and Visualization with Cytoscape ...
Introduction to Biological Network Analysis and Visualization with Cytoscape ...Introduction to Biological Network Analysis and Visualization with Cytoscape ...
Introduction to Biological Network Analysis and Visualization with Cytoscape ...
Keiichiro Ono
 
20170428 - Look to Precision Agriculture to Bootstrap Precision Medicine - Cu...
20170428 - Look to Precision Agriculture to Bootstrap Precision Medicine - Cu...20170428 - Look to Precision Agriculture to Bootstrap Precision Medicine - Cu...
20170428 - Look to Precision Agriculture to Bootstrap Precision Medicine - Cu...
Allen Day, PhD
 
rerngvit_phd_seminar
rerngvit_phd_seminarrerngvit_phd_seminar
rerngvit_phd_seminar
rerngvit yanggratoke
 
Borys Rybak “Azure Machine Learning Studio & Azure Workbench & R + Python”
Borys Rybak “Azure Machine Learning Studio & Azure Workbench & R + Python” Borys Rybak “Azure Machine Learning Studio & Azure Workbench & R + Python”
Borys Rybak “Azure Machine Learning Studio & Azure Workbench & R + Python”
Lviv Startup Club
 

What's hot (20)

Towards Automatic Composition of Multicomponent Predictive Systems
Towards Automatic Composition of Multicomponent Predictive SystemsTowards Automatic Composition of Multicomponent Predictive Systems
Towards Automatic Composition of Multicomponent Predictive Systems
 
(2016)application of parallel glowworm swarm optimization algorithm for data ...
(2016)application of parallel glowworm swarm optimization algorithm for data ...(2016)application of parallel glowworm swarm optimization algorithm for data ...
(2016)application of parallel glowworm swarm optimization algorithm for data ...
 
Leveraging NLP and Deep Learning for Document Recommendations in the Cloud
Leveraging NLP and Deep Learning for Document Recommendations in the CloudLeveraging NLP and Deep Learning for Document Recommendations in the Cloud
Leveraging NLP and Deep Learning for Document Recommendations in the Cloud
 
Adversarial Analytics - 2013 Strata & Hadoop World Talk
Adversarial Analytics - 2013 Strata & Hadoop World TalkAdversarial Analytics - 2013 Strata & Hadoop World Talk
Adversarial Analytics - 2013 Strata & Hadoop World Talk
 
What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care? What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care?
 
IRJET- Improved Model for Big Data Analytics using Dynamic Multi-Swarm Op...
IRJET-  	  Improved Model for Big Data Analytics using Dynamic Multi-Swarm Op...IRJET-  	  Improved Model for Big Data Analytics using Dynamic Multi-Swarm Op...
IRJET- Improved Model for Big Data Analytics using Dynamic Multi-Swarm Op...
 
Quick presentation for the OpenML workshop in Eindhoven 2014
Quick presentation for the OpenML workshop in Eindhoven 2014Quick presentation for the OpenML workshop in Eindhoven 2014
Quick presentation for the OpenML workshop in Eindhoven 2014
 
Concept Drift Identification using Classifier Ensemble Approach
Concept Drift Identification using Classifier Ensemble Approach  Concept Drift Identification using Classifier Ensemble Approach
Concept Drift Identification using Classifier Ensemble Approach
 
Big Data, The Community and The Commons (May 12, 2014)
Big Data, The Community and The Commons (May 12, 2014)Big Data, The Community and The Commons (May 12, 2014)
Big Data, The Community and The Commons (May 12, 2014)
 
A cyber physical stream algorithm for intelligent software defined storage
A cyber physical stream algorithm for intelligent software defined storageA cyber physical stream algorithm for intelligent software defined storage
A cyber physical stream algorithm for intelligent software defined storage
 
EXECUTION OF ASSOCIATION RULE MINING WITH DATA GRIDS IN WEKA 3.8
EXECUTION OF ASSOCIATION RULE MINING WITH DATA GRIDS IN WEKA 3.8EXECUTION OF ASSOCIATION RULE MINING WITH DATA GRIDS IN WEKA 3.8
EXECUTION OF ASSOCIATION RULE MINING WITH DATA GRIDS IN WEKA 3.8
 
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...
 
[215]streetwise machine learning for painless parking
[215]streetwise machine learning for painless parking[215]streetwise machine learning for painless parking
[215]streetwise machine learning for painless parking
 
As 7
As 7As 7
As 7
 
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
 
Machine Learning for Data Extraction
Machine Learning for Data ExtractionMachine Learning for Data Extraction
Machine Learning for Data Extraction
 
Introduction to Biological Network Analysis and Visualization with Cytoscape ...
Introduction to Biological Network Analysis and Visualization with Cytoscape ...Introduction to Biological Network Analysis and Visualization with Cytoscape ...
Introduction to Biological Network Analysis and Visualization with Cytoscape ...
 
20170428 - Look to Precision Agriculture to Bootstrap Precision Medicine - Cu...
20170428 - Look to Precision Agriculture to Bootstrap Precision Medicine - Cu...20170428 - Look to Precision Agriculture to Bootstrap Precision Medicine - Cu...
20170428 - Look to Precision Agriculture to Bootstrap Precision Medicine - Cu...
 
rerngvit_phd_seminar
rerngvit_phd_seminarrerngvit_phd_seminar
rerngvit_phd_seminar
 
Borys Rybak “Azure Machine Learning Studio & Azure Workbench & R + Python”
Borys Rybak “Azure Machine Learning Studio & Azure Workbench & R + Python” Borys Rybak “Azure Machine Learning Studio & Azure Workbench & R + Python”
Borys Rybak “Azure Machine Learning Studio & Azure Workbench & R + Python”
 

Similar to AdClickFraud_Bigdata-Apic-Ist-2019

Predictive Analysis for Airbnb Listing Rating using Scalable Big Data Platform
Predictive Analysis for Airbnb Listing Rating using Scalable Big Data PlatformPredictive Analysis for Airbnb Listing Rating using Scalable Big Data Platform
Predictive Analysis for Airbnb Listing Rating using Scalable Big Data Platform
Savita Yadav
 
The Importance of Open Innovation in AI era
The Importance of Open Innovation in AI eraThe Importance of Open Innovation in AI era
The Importance of Open Innovation in AI era
Jongwook Woo
 
Comparing Scalable Predictive Analysis using Spark XGBoost Platforms
Comparing Scalable Predictive Analysis using Spark XGBoost PlatformsComparing Scalable Predictive Analysis using Spark XGBoost Platforms
Comparing Scalable Predictive Analysis using Spark XGBoost Platforms
Jongwook Woo
 
Rating Prediction using Deep Learning and Spark
Rating Prediction using Deep Learning and SparkRating Prediction using Deep Learning and Spark
Rating Prediction using Deep Learning and Spark
Jongwook Woo
 
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Alluxio, Inc.
 
Introduction to Big Data and AI for Business Analytics and Prediction
Introduction to Big Data and AI for Business Analytics and PredictionIntroduction to Big Data and AI for Business Analytics and Prediction
Introduction to Big Data and AI for Business Analytics and Prediction
Jongwook Woo
 
History and Trend of Big Data and Deep Learning
History and Trend of Big Data and Deep LearningHistory and Trend of Big Data and Deep Learning
History and Trend of Big Data and Deep Learning
Jongwook Woo
 
Big Data and Predictive Analysis
Big Data and Predictive AnalysisBig Data and Predictive Analysis
Big Data and Predictive Analysis
Jongwook Woo
 
Big Data Lessons from the Cloud
Big Data Lessons from the CloudBig Data Lessons from the Cloud
Big Data Lessons from the Cloud
MapR Technologies
 
Predictive Analysis of Financial Fraud Detection using Azure and Spark ML
Predictive Analysis of Financial Fraud Detection using Azure and Spark MLPredictive Analysis of Financial Fraud Detection using Azure and Spark ML
Predictive Analysis of Financial Fraud Detection using Azure and Spark ML
Jongwook Woo
 
Simplified Machine Learning, Text, and Graph Analytics with Pivotal Greenplum
Simplified Machine Learning, Text, and Graph Analytics with Pivotal GreenplumSimplified Machine Learning, Text, and Graph Analytics with Pivotal Greenplum
Simplified Machine Learning, Text, and Graph Analytics with Pivotal Greenplum
VMware Tanzu
 
Scalable Predictive Analysis and The Trend with Big Data & AI
Scalable Predictive Analysis and The Trend with Big Data & AIScalable Predictive Analysis and The Trend with Big Data & AI
Scalable Predictive Analysis and The Trend with Big Data & AI
Jongwook Woo
 
Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ...
Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ...Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ...
Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ...
Databricks
 
Introduction to Big Data and its Trends
Introduction to Big Data and its TrendsIntroduction to Big Data and its Trends
Introduction to Big Data and its Trends
Jongwook Woo
 
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio
Alluxio, Inc.
 
AI for All: Biology is eating the world & AI is eating Biology
AI for All: Biology is eating the world & AI is eating Biology AI for All: Biology is eating the world & AI is eating Biology
AI for All: Biology is eating the world & AI is eating Biology
Intel® Software
 
BUILDING BETTER PREDICTIVE MODELS WITH COGNITIVE ASSISTANCE IN A DATA SCIENCE...
BUILDING BETTER PREDICTIVE MODELS WITH COGNITIVE ASSISTANCE IN A DATA SCIENCE...BUILDING BETTER PREDICTIVE MODELS WITH COGNITIVE ASSISTANCE IN A DATA SCIENCE...
BUILDING BETTER PREDICTIVE MODELS WITH COGNITIVE ASSISTANCE IN A DATA SCIENCE...
Alex Liu
 
Data Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data ScienceData Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data Science
Pouria Amirian
 
Data Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data ScienceData Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data Science
Pouria Amirian
 
Database@Home : The Future is Data Driven
Database@Home : The Future is Data DrivenDatabase@Home : The Future is Data Driven
Database@Home : The Future is Data Driven
Tammy Bednar
 

Similar to AdClickFraud_Bigdata-Apic-Ist-2019 (20)

Predictive Analysis for Airbnb Listing Rating using Scalable Big Data Platform
Predictive Analysis for Airbnb Listing Rating using Scalable Big Data PlatformPredictive Analysis for Airbnb Listing Rating using Scalable Big Data Platform
Predictive Analysis for Airbnb Listing Rating using Scalable Big Data Platform
 
The Importance of Open Innovation in AI era
The Importance of Open Innovation in AI eraThe Importance of Open Innovation in AI era
The Importance of Open Innovation in AI era
 
Comparing Scalable Predictive Analysis using Spark XGBoost Platforms
Comparing Scalable Predictive Analysis using Spark XGBoost PlatformsComparing Scalable Predictive Analysis using Spark XGBoost Platforms
Comparing Scalable Predictive Analysis using Spark XGBoost Platforms
 
Rating Prediction using Deep Learning and Spark
Rating Prediction using Deep Learning and SparkRating Prediction using Deep Learning and Spark
Rating Prediction using Deep Learning and Spark
 
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
AdClickFraud_Bigdata-Apic-Ist-2019

  • 1. Predicting fraud of AD click using Traditional and Spark ML
      Neha Gupta, Hai Anh Le, Maria Boldina, Jongwook Woo
      Big Data AI Center (BigDAI), California State University Los Angeles
      Jongwook Woo, HiPIC, CalStateLA
      APIC-IST 2019, June 24 2019
  • 2. Big Data Artificial Intelligence Center (BigDAI), Jongwook Woo, CalStateLA
      Contents
      - Introduction
      - Data Set
      - Data Fields Details
      - Experiment Environment: Traditional and Big Data Systems
      - Work Flow in Azure ML
      - Databricks: Data Engineering
      - Algorithms
      - Appendix
      - References
  • 3. Introduction
      - Click fraud: a person, automated script, or computer program imitates a legitimate user by clicking on an ad without any actual interest in the target of the ad's link
        - The result is misleading click data and wasted money
      - Companies suffer from huge volumes of fraudulent traffic, especially in the mobile market worldwide
      - Goal
        - Predict who will download the apps
        - Using classification models
        - With both traditional and big data approaches
  • 4. Introduction (Cont'd)
      - TalkingData
        - China's largest independent big data service platform; covers over 70% of active mobile devices nationwide
        - Handles 3 billion clicks per day, 90% of which are potentially fraudulent
      - Goal of the predictive analysis
        - Predict whether a user will download an app after clicking on a mobile app advertisement
        - To better target the audience, avoid fraudulent practices, and save money
  • 5. Data Set
      - Dataset: TalkingData AdTracking Fraud Detection, https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/data
      - Dataset properties
        - Original dataset size: 7 GB, containing 200 million clicks over a 4-day period
        - Dataset format: CSV
        - Fields: 8
        - Target column to predict: 'is_attributed'
  • 6. Data Fields Details
  • 7. Experiment Environment: Traditional and Big Data Systems
  • 8. Experiment Environment: Traditional
      - Azure ML Studio: traditional system for small data sets
        - Free Workspace
        - 10 GB storage
        - Single node
      - Implement fundamental prediction models
        - Using sample data: 80 MB (1.1% of the original data set)
      - Select the best model among a number of classification models
  • 9. Experiment Environment: Spark
      - Spark ML
        - Data filtering: 1 GB from 8 GB
          - Implemented Python code to reduce the size to 1 GB (15%)
        - We have experimental results with the full 8 GB as well, for another publication
      - Databricks subscription
        - Cluster 4.0 (includes Apache Spark 2.3.0, Scala 2.11)
          - 2 Spark workers with a total of 16 GB memory and 4 cores
          - Python 2.7
          - File system: Databricks File System
  • 10. Experiment Environment: Spark (Cont'd)
      - Oracle Big Data Spark Cluster
        - Oracle BDCE, Python 2.7.x, Spark 2.1.x
        - 10 nodes: 20 OCPUs, 300 GB memory, 1,154 GB storage
  • 11. Work Flow in Azure ML
      - Relatively easy to build and test: drag-and-drop GUI
      - Work flow
        1. Data Engineering
           - Understanding data
           - Data preparation
           - Balancing data statistically
        2. Data Science: Machine Learning (ML)
           - Model building and validation (classification algorithms)
           - Model evaluation
           - Model interpretation
  • 12. Data Engineering
      - Unbalanced dataset
        - 1: 0.19% (app downloaded)
        - 0: 99.81% (app not downloaded)
      - The 1 GB filtered dataset is still too large for a traditional system such as Azure ML Studio
      - More sampling is needed for Azure ML
  • 13. Data Engineering
      - SMOTE (Synthetic Minority Over-sampling Technique)
        - Takes a subset of data from the minority class and creates new, similar synthetic instances
        - Helps balance the data and avoid overfitting
        - Increased the share of the minority class (1) from 0.19% to 11%
      - Stratified split
        - Ensures that the output dataset contains a representative sample of the values in the selected column
        - Ensures that the random sample does not contain only rows with 0s
        - 8% sample used = 80 MB
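The SMOTE idea on this slide can be illustrated with a small pure-Python sketch: each synthetic row is placed at a random position on the segment between a minority-class row and one of its nearest minority-class neighbours. This is not the Azure ML Studio SMOTE module used in the experiment; the function name `smote_sketch`, its parameters, and the toy points are assumptions made for illustration only.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority points.

    minority: list of equal-length numeric tuples (minority-class rows only).
    k: number of nearest minority neighbours considered for each base point.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority points to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by Euclidean distance to the base point.
        others = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )
        neighbour = rng.choice(others[:k])
        # Pick a random position on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority-class rows (hypothetical two-feature points, not experiment data).
minority_rows = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_rows = smote_sketch(minority_rows, n_new=8)
```

Because every synthetic row lies between two existing minority rows, the new rows stay inside the region already occupied by the minority class rather than being exact copies of it.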
  • 14. Algorithms in Azure ML Studio
      - Two-class classification
        - Classifies the elements of a given set into two groups: downloaded (is_attributed = 1) or not downloaded (is_attributed = 0)
      - Decision trees
        - Often perform well on imbalanced datasets, as their hierarchical structure allows them to learn signals from both classes
        - Tree ensembles almost always outperform single decision trees
          - Algorithm #1: Two-Class Decision Jungle
          - Algorithm #2: Two-Class Decision Forest
  • 15. Selecting Performance Metrics
      - A False Positive (FP) means the model predicted that an app was downloaded when in fact it was not
      - Goal: minimize FP to save $$$
  • 16. AZURE ML MODEL #1: TWO-CLASS DECISION JUNGLE
      - 8% sample
      - SMOTE 5000%
      - 70:30 train/test split
      - Cross-validation
      - Tune Model Hyperparameters
      - Features used: all 7
  • 17. AZURE ML MODEL #1: Tune Model Hyperparameters
      - Panels: Without Tune Hyperparameters, With Tune Hyperparameters
      - AUC = 0.905 vs 0.606
      - Precision = 1.0
      - TP = 35, FP = 0
  • 18. AZURE ML MODEL #2: TWO-CLASS DECISION FOREST
      - 8% sample
      - SMOTE 5000%
      - 70:30 train/test split
      - Cross-validation
      - Tune Model Hyperparameters
      - Permutation Feature Importance
  • 19. AZURE ML MODEL #2: Improving Precision
      - Increasing the threshold from 0.5 to 0.8:
        - Precision increased to 0.992
        - FP decreased from 1,659 to 377
        - FN increased from 1,834 to 5,142
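The trade-off on this slide (fewer false positives, more false negatives as the decision threshold rises) can be reproduced on toy data. The sketch below is a minimal illustration; the labels and scored probabilities are hypothetical and are not taken from the experiment, and `confusion_counts` is a helper name introduced here.

```python
def confusion_counts(labels, scores, threshold):
    """Count TP, FP, TN, FN when predicting 1 for scores >= threshold."""
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


# Hypothetical labels (1 = downloaded) and scored probabilities.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.95, 0.85, 0.70, 0.55, 0.75, 0.60, 0.30, 0.10]

at_05 = confusion_counts(labels, scores, 0.5)  # (TP, FP, TN, FN) at threshold 0.5
at_08 = confusion_counts(labels, scores, 0.8)  # (TP, FP, TN, FN) at threshold 0.8
```

On this toy data, moving the threshold from 0.5 to 0.8 removes both false positives but turns two true positives into false negatives, which is the same direction of change the slide reports.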
  • 20. Experimental Results in Azure ML Studio
      - Performance: execution time with the sample data set (1 GB)
        - Decision Forest: 2.5 hours
        - Decision Jungle: 3 hours 19 min
      - The Azure ML Studio models are a good guide for adopting the 2 similar algorithms in Spark ML
        - Decision Tree
        - Random Forest
  • 21. Experimental Results in Azure ML
      - Two-Class Decision Forest is the best model!
  • 22. Experiment with Spark ML in Databricks
      1. Load the data source
         - 1.03 GB
         - Same filtered data set as Azure ML
      2. Train and build the models
         - Balanced data statistically
      3. Evaluate
  • 23. Data Engineering
      - Generate features
        - Feature 1: extract the day of the week and hour of the day from the click time
        - Feature 2: group clicks by the combination (IP, Day_of_week_number, Hour)
        - Feature 3: group clicks by the combination (IP, App, Operating System, Day_of_week_number, Hour)
        - Feature 4: group clicks by the combination (App, Day_of_week_number, Hour)
        - Feature 5: group clicks by the combination (IP, App, Device, Operating System)
        - Feature 6: group clicks by the combination (IP, Device, Operating System)
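A rough sketch of how such features might be derived is shown below in plain Python rather than Spark, so it can run anywhere. It assumes that "group clicks by combination" means counting the clicks that share the same key combination; that reading, the helper `add_group_count`, the new column names, and the sample rows are all assumptions. The input column names (ip, app, device, os, click_time) follow the Kaggle dataset.

```python
from collections import Counter
from datetime import datetime

# Hypothetical click records, not rows from the real dataset.
clicks = [
    {"ip": 1, "app": 3, "device": 1, "os": 13, "click_time": "2017-11-06 14:32:21"},
    {"ip": 1, "app": 3, "device": 1, "os": 13, "click_time": "2017-11-06 14:45:02"},
    {"ip": 1, "app": 9, "device": 1, "os": 13, "click_time": "2017-11-06 15:01:10"},
    {"ip": 2, "app": 3, "device": 2, "os": 19, "click_time": "2017-11-07 14:05:44"},
]

# Feature 1: derive day of week and hour of day from the click time.
for row in clicks:
    ts = datetime.strptime(row["click_time"], "%Y-%m-%d %H:%M:%S")
    row["day_of_week"] = ts.weekday()  # Monday = 0
    row["hour"] = ts.hour


def add_group_count(rows, keys, name):
    """Attach to each row the number of rows sharing its values for `keys`."""
    counts = Counter(tuple(r[k] for k in keys) for r in rows)
    for r in rows:
        r[name] = counts[tuple(r[k] for k in keys)]


# Features 2, 4 and 6 from the slide; the others follow the same pattern.
add_group_count(clicks, ("ip", "day_of_week", "hour"), "clicks_ip_day_hour")
add_group_count(clicks, ("app", "day_of_week", "hour"), "clicks_app_day_hour")
add_group_count(clicks, ("ip", "device", "os"), "clicks_ip_device_os")
```

In Spark the same counts would typically come from a group-by aggregation joined back to the click table; the plain-Python version only shows the shape of the resulting columns.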
  • 24. Spark ML MODEL #1: Decision Tree Classifier
      - Confusion Matrix
  • 25. Spark ML MODEL #2: Random Forest Classifier
      - Confusion Matrix
  • 26. Spark ML Result Comparison
      - Decision Tree Classifier is the relatively better model!

      | Metric    | Decision Tree Classifier | Random Forest Classifier |
      |-----------|--------------------------|--------------------------|
      | AUC       | 0.815                    | 0.746                    |
      | PRECISION | 0.822                    | 0.878                    |
      | RECALL    | 0.633                    | 0.495                    |
      | TP        | 86,683                   | 67,726                   |
      | FP        | 18,727                   | 9,408                    |
      | TN        | 7,112,961                | 7,122,280                |
      | FN        | 50,074                   | 69,031                   |
      | RMSE      | 0.0972                   | 0.1038                   |
  • 27. Experiment in Oracle Cluster
      - Oracle Big Data Spark Cluster: 10 nodes, 20 OCPUs, 300 GB memory, 1,154 GB storage
      1. Load the data source: 1.03 GB
      2. Sample the balanced data based on Downloaded: 116 MB
      3. Train and build the models
         - Balanced data statistically
      4. Evaluate
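One common way to draw a more balanced sample based on the label is to keep every positive row and only a random subset of the negatives. The sketch below illustrates that idea in plain Python; it is not the authors' Spark code, and the function name `undersample_majority`, the `ratio` parameter, and the toy rows are assumptions for illustration.

```python
import random


def undersample_majority(rows, label_key="is_attributed", ratio=1.0, seed=0):
    """Keep all positive rows and a random subset of the negative rows.

    ratio: number of negatives kept per positive (an assumed parameter).
    """
    rng = random.Random(seed)
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    n_keep = min(len(negatives), int(len(positives) * ratio))
    sampled = positives + rng.sample(negatives, n_keep)
    rng.shuffle(sampled)  # avoid leaving all positives at the front
    return sampled


# Hypothetical toy data: 5 downloads among 1,000 clicks.
rows = [{"click_id": i, "is_attributed": 1 if i < 5 else 0} for i in range(1000)]
balanced = undersample_majority(rows, ratio=4.0)
```

With `ratio=4.0` the toy sample keeps 5 positives and 20 negatives, so the positive share rises from 0.5% to 20% while the data volume shrinks sharply.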
  • 28. Azure ML Studio and Spark ML Result Comparison

      | Metric    | Two-Class Decision Jungle (Azure ML) | Two-Class Decision Forest (Azure ML) | Decision Tree Classifier (Databricks) | Random Forest Classifier (Databricks) | Decision Tree Classifier (Balanced Sample Data, Oracle) | Random Forest Classifier (Balanced Sample Data, Oracle) |
      |-----------|--------|---------|-----------|-----------|---------|---------|
      | AUC       | 0.905  | 0.997   | 0.815     | 0.746     | 0.896   | 0.893   |
      | PRECISION | 1.0    | 0.992   | 0.822     | 0.878     | 0.935   | 0.934   |
      | RECALL    | 0.001  | 0.902   | 0.633     | 0.495     | 0.807   | 0.800   |
      | TP        | 35     | 47,199  | 86,683    | 67,726    | 111,187 | 110,220 |
      | FP        | 0      | 377     | 18,727    | 9,408     | 7,712   | 7,791   |
      | TN        | 52,306 | 406,228 | 7,112,961 | 7,122,280 | 545,302 | 545,223 |
      | FN        | 406,605 | 5,142  | 50,074    | 69,031    | 26,604  | 27,571  |
      | Run Time  | 2 hrs  | 2-3 hrs | 22 mins   | 50 mins   | 24 sec  | 2 mins  |
  • 29. Azure ML Studio and Spark ML Result Comparison (same results table as slide 28)
      - Azure ML Two-Class Decision Forest is the best model!
      - The Spark ML code needs to be updated for better accuracy
      - Balanced sampling based on the fraud in Oracle:
        - Decision Tree has 0.935 in precision
        - Execution time: 24 secs
  • 30. Questions?
  • 31. Appendix: Data Set Details (Cont'd)
  • 32. Precision vs Recall
      - Positive: event occurs (fraud); Negative: event does not occur (non-fraud)
      - True Positive (TP): Fraud? Yes, it is
      - False Negative (FN): No fraud? But it is
      - False Positive (FP): Fraud? But it is not
      - Precision = TP / (TP + FP)
      - Recall = TP / (TP + FN)
      - Ref: https://en.wikipedia.org/wiki/Precision_and_recall
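The two formulas can be checked against the counts reported in the result comparison. The short sketch below applies them to the Azure ML Two-Class Decision Forest column (TP = 47,199, FP = 377, FN = 5,142); the function names and the zero-denominator guard are choices made here for illustration.

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP); returns 0.0 when nothing was predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Recall = TP / (TP + FN); returns 0.0 when there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Counts reported for the Azure ML Two-Class Decision Forest.
tp, fp, fn = 47199, 377, 5142
p = round(precision(tp, fp), 3)
r = round(recall(tp, fn), 3)
print(p, r)  # 0.992 0.902
```

Both values match the PRECISION and RECALL rows reported for that model.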
  • 33. References
      1. Priyanka Purushu, Niklas Melcher, Bhagyashree Bhagwat, Jongwook Woo, "Predictive Analysis of Financial Fraud Detection using Azure and Spark ML", Asia Pacific Journal of Information Systems (APJIS), Vol. 28, No. 4, December 2018, pp. 308-319
      2. Jongwook Woo, "Market Basket Analysis Algorithms with MapReduce", DMKD-00150, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, Oct 28 2013, Volume 3, Issue 6, pp. 445-452, ISSN 1942-4795
      3. Jongwook Woo, "Big Data Trend and Open Data", UKC 2016, Dallas, TX, Aug 12 2016
      4. How to choose algorithms for Microsoft Azure Machine Learning, https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-algorithm-choice
      5. Manik Katyal, Parag Chhadva, Shubhra Wahi, Jongwook Woo, "Big Data Analysis using Spark for Collision Rate Near CalStateLA", https://globaljournals.org/GJCST_Volume16/1-Big-Data-Analysis-using-Spark.pdf
      6. Spark Programming Guide, http://spark.apache.org/docs/latest/programming-guide.html
      7. TensorFrames: Google Tensorflow on Apache Spark, https://www.slideshare.net/databricks/tensorframes-google-tensorflow-on-apache-spark
      8. Deep learning and Apache Spark, https://www.slideshare.net/QuantUniversity/deep-learning-and-apache-spark
  • 34. References
      9. Which Is Deeper - Comparison Of Deep Learning Frameworks On Spark, https://www.slideshare.net/SparkSummit/which-is-deeper-comparison-of-deep-learning-frameworks-on-spark
      10. Accelerating Machine Learning and Deep Learning At Scale with Apache Spark, https://www.slideshare.net/SparkSummit/accelerating-machine-learning-and-deep-learning-at-scalewith-apache-spark-keynote-by-ziya-ma
      11. Deep Learning with Apache Spark and TensorFlow, https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html
      12. Tensor Flow Deep Learning Open SAP
      13. Overview of Smart Factory, https://www.slideshare.net/BrendanSheppard1/overview-of-smart-factory-solutions-68137094/6
      14. https://dzone.com/articles/sqoop-import-data-from-mysql-tohive
      15. https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/data
      16. https://blogs.msdn.microsoft.com/andreasderuiter/2015/02/09/performance-measures-in-azure-ml-accuracy-precision-recall-and-f1-score/