AdClickFraud_Bigdata-Apic-Ist-2019

Jongwook Woo
HiPIC
CalStateLA
APIC-IST 2019
June 24 2019
Neha Gupta, Hai Anh Le,
Maria Boldina , Jongwook Woo
Big Data AI Center (BigDAI)
California State University Los Angeles
Predicting fraud of AD click
using
Traditional and Spark ML

Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Contents
 Introduction
 Data Set
 Data Fields Details
 Experiment Environment: Traditional and Big Data Systems
Work Flow in Azure ML
Data Bricks : Data Engineering
Algorithms
Appendix
References

Jongwook Woo
CalStateLA
Introduction
A person, automated script or computer program imitates a
legitimate user
clicking on an ad without having an actual interest in the target of the ad's
link
resulting in misleading click data and wasted money
Companies suffers from huge volumes of fraudulent traffic
Especially, in mobile market in the world
Goal
Predict who will download the apps
Using Classification model
Traditional and Big Data approach

Jongwook Woo
CalStateLA
Introduction(Cont’d)
TalkingData
 China’s largest independent big data service platform
– covers over 70% of active mobile devices nationwide
 handles 3 billion clicks per day
– 90% of which are potentially fraudulent
 Goal of the Predictive Analysis
 Predict whether a user will download an app
– after clicking on a mobile app advertisement
 To better target the audience,
– to avoid fraudulent practices
– and save money

Jongwook Woo
CalStateLA
Data Set
 Dataset: TalkingData AdTracking Fraud Detection
https://www.kaggle.com/c/talkingdata-adtracking-fraud-
detection/data
Dataset Property:
Original dataset size: 7GB
– contains 200 million clicks over 4 day period
Dataset format: csv
Fields: 8
– Target Column to Predict: ‘is_attributed’

Jongwook Woo
CalStateLA
Data Fields Details

Jongwook Woo
CalStateLA
Experiment Environment:
Traditional and Big Data Systems

Jongwook Woo
CalStateLA
Experiment Environment: Traditional
Azure ML Studio:
Traditional for small data set
Free Workspace
10GB storage
Single node
Implement fundamental prediction models
– Using Sample data: 80MB (1.1% of the original data set)
Select the best model among number of classifications

Jongwook Woo
CalStateLA
Experiment Environment: Spark
Spark ML:
Data Filtering:
– 1 GB from 8 GB
• Implemented Python code to reduce size to 1GB (15%)
– We have experimental result with 8GB as well
• For another publication
Databricks Subscription
– Cluster 4.0 (includes Apache Spark 2.3.0, Scala 2.11)
• 2 Spark Workers with total of 16 GB Memory and 4 Cores
• Python 2.7
• File System : Databricks File System

Jongwook Woo
CalStateLA
Experiment Environment: Spark (Cont’d)
Oracle Big Data Spark Cluster
 Oracle BDCE
Python 2.7.x, Spark 2.1.x
 10 nodes,
– 20 OCPUs, 300GB Memory, 1,154GB Storage

Jongwook Woo
CalStateLA
Work Flow in Azure ML
 Relatively Easy to build and test
Drag and Drop GUI
Work Flow
1. Data Engineering
– Understanding Data
– Data preparation
– Balancing data statistically
2. Data Science: Machine Learning (ML)
– Model building and validation
• Classification algorithms
– Model evaluation
– Model interpretation

Jongwook Woo
CalStateLA
Data Engineering
Unbalanced dataset
1: 0.19% App downloaded
0: 99.81% App not
downloaded
1GB filtered dataset
still too large for the
traditional systems: Azure
ML Studio
More sampling needed for
Azure ML

Jongwook Woo
CalStateLA
Data Engineering
 SMOTE: Synthetic Minority Over
Sampling Technique takes a subset of
data from the minority class and creates
new synthetic similar instances
 Helps balance data & avoid overfitting
 Increased percent of minority class (1) from
0.19% to 11%
 Stratified Split ensures that the output
dataset contains a representative
sample of the values in the selected
column
 Ensures that the random sample does not contain
all rows with just 0s
 8% sample used = 80 MB

Jongwook Woo
CalStateLA
Algorithms in Azure ML Studio
 Two-Class Classification:
 classify the elements of a given set into two groups
– either downloaded, is_attributed (1)
– or not downloaded, is_attributed (0)
Decision trees
 often perform well on imbalanced datasets
– as their hierarchical structure allows them to learn signals from both classes.
Tree ensembles almost always outperform singular decision trees
– Algorithm #1: Two-class Decision Jungle
– Algorithm #2: Two-class Decision Forest

Jongwook Woo
CalStateLA
Selecting Performance Metrics
False Positives indicate
the model predicted an app was downloaded when in fact it wasn’t
 Goal: minimize the FP => To save $$$

Jongwook Woo
CalStateLA
AZURE ML MODEL #1: TWO-CLASS DECISION JUNGLE
• 8% Sample
• SMOTE 5000%
• 70:30 Split
Train/Test
• Cross-Validation
• Tune Model
Hyperparameters
• Features used: all 7

Jongwook Woo
CalStateLA
AZURE ML MODEL #1: Tune Model Hyperparameters
Without Tune
Hyperparameters
With Tune
Hyperparameters
AUC = 0.905 vs 0.606
Precision = 1.0
TP = 35, FP = 0

Jongwook Woo
CalStateLA
AZURE ML MODEL #2: TWO-CLASS DECISION FOREST
• 8% Sample
• SMOTE 5000%
• 70:30 Split
Train/Test
• Cross-Validation
• Tune Model
Hyperparameters
• Permutation Feature
Importance

Jongwook Woo
CalStateLA
AZURE ML MODEL #2: Improving Precision
Precision
increased to 0.992
FP decreased from
1,659 to 377
FN increased from
1,834 to 5,142 By increasing
threshold from 0.5
to 0.8

Jongwook Woo
CalStateLA
Experimental Results in Azure ML Studio
Performance:
Execution time with sample data set: 1GB
Decision Forrest
– takes 2.5 hours
Decision Jungle
– takes 3 hours 19 min
Good Guide from the models of Azure ML Studio
 to adopt the 2 similar algorithms for Spark ML
– Decision Tree
– Random Forest

Jongwook Woo
CalStateLA
Experimental Results in AzureML
Two-class Decision Forest is the best model!

Jongwook Woo
CalStateLA
Experiment with Spark ML in Databricks
1. Load the data source
 1.03 GB
 Same filtered data set as Azure ML
2. Train and build the models
o Balanced data statistically
3. Evaluate

Jongwook Woo
CalStateLA
Data Engineering
Generate features
Feature 1: extract day of the week and hour of the day from the click time
Feature 2: group clicks by combination of
– (Ip, Day_of_week_number and Hour)
– (Ip, App, Operating System, Day_of_week_number and Hour)
– (App, Day_of_week_number and Hour)
– (Ip, App, Device and Operating System)
– (Ip, Device and Operating System)

24
Jongwook Woo
CalStateLA
Spark ML MODEL #1: Decision Tree Classifier
Confusion Matrix

25
Jongwook Woo
CalStateLA
Spark ML MODEL #1: Random Forrest Classifier
Confusion Matrix

2626
Jongwook Woo
CalStateLA
Spark ML Result Comparison
Decision Tree Classifier is relatively the better model!
Decision Tree
Classifier
Random Forest
Classifier
AUC 0.815 0.746
PRECISION 0.822 0.878
RECALL 0.633 0.495
TP 86,683 67,726
FP 18,727 9,408
TN 7,112,961 7,122,280
FN 50,074 69,031
RMSE 0.0972 0.1038

Jongwook Woo
CalStateLA
Experiment in Oracle Cluster
Oracle Big Data Spark Cluster
 10 nodes, 20 OCPUs, 300GB Memory, 1,154GB Storage
1. Load the data source
 1.03 GB
2. Sample the balanced data based on Downloaded
 116 MB
3. Train and build the models
o Balanced data statistically
4. Evaluate

2828
Jongwook Woo
CalStateLA
Azure ML Studio and Spark ML Result Comparison
TWO-CLASS
DECISION
JUNGLE
(AzureML)
TWO-CLASS
DECISION
FOREST
(AzureML)
DECISION
TREE
CLASSIFIER
(Databricks
)
RANDOM
FOREST
CLASSIFIER
(Databricks
)
DECISION
TREE
CLASSIFIER
(Balanced
Sample Data,
Oracle)
RANDOM
FOREST
CLASSIFIER
(Balanced
Sample Data,
Oracle)
AUC 0.905 0.997 0.815 0.746 0.896 0.893
PRECISION 1.0 0.992 0.822 0.878 0.935 0.934
RECALL 0.001 0.902 0.633 0.495 0.807 0.800
TP 35 47,199 86,683 67,726 111,187 110,220
FP 0 377 18,727 9,408 7,712 7,791
TN 52,306 406,228 7,112,961 7,122,280 545,302 545,223
FN 406,605 5,142 50,074 69,031 26,604 27,571
Run Time 2 hrs 2-3 hrs 22 mins 50 mins 24 sec 2 mins

2929
Jongwook Woo
CalStateLA
Azure ML Studio and Spark ML Result Comparison
TWO-CLASS
DECISION
JUNGLE
(AzureML)
TWO-CLASS
DECISION
FOREST
(AzureML)
DECISION
TREE
CLASSIFIER
(Databricks
)
RANDOM
FOREST
CLASSIFIER
(Databricks
)
DECISION
TREE
CLASSIFIER
(Balanced
Sample Data,
Oracle)
RANDOM
FOREST
CLASSIFIER
(Balanced
Sample Data,
Oracle)
AUC 0.905 0.997 0.815 0.746 0.896 0.893
PRECISION 1.0 0.992 0.822 0.878 0.935 0.934
RECALL 0.001 0.902 0.633 0.495 0.807 0.800
TP 35 47,199 86,683 67,726 111,187 110,220
FP 0 377 18,727 9,408 7,712 7,791
TN 52,306 406,228 7,112,961 7,122,280 545,302 545,223
FN 406,605 5,142 50,074 69,031 26,604 27,571
Run Time 2 hrs 2-3 hrs 22 mins 50 mins 24 sec 2 mins
• Azure ML Two-class Decision Forest is the best model!
• Spark ML code need to be updated for the better accuracy
• Balanced Sampling based on the fraud in Oracle:
• Decision Tree has 0.935 in Precision
• Execution Time: 24 secs

Jongwook Woo
CalStateLA
Questions?

Jongwook Woo
CalStateLA
Appendix
Data Set Details (Cont‘d)

Jongwook Woo
CalStateLA
Precision vs Recall
True Positive (TP): Fraud? Yes it is
False Negative (FN): No fraud? but it is
False Positive (FP): Fraud? but it is not
 Precision
 TP / (TP + FP)
 Recall
 TP / (TP + FN)
 Ref: https://en.wikipedia.org/wiki/Precision_and_recall
Positive:
Event occurs
(Fraud)
Negative: Event
does not
Occur (non
Fraud)

Jongwook Woo
CalStateLA
References
1. Priyanka Purushu, Niklas Melcher, Bhagyashree Bhagwat, Jongwook Woo, "Predictive Analysis of Financial
Fraud Detection using Azure and Spark ML", Asia Pacific Journal of Information Systems (APJIS),
VOL.28│NO.4│December 2018, pp308~319
2. Jongwook Woo, DMKD-00150, “Market Basket Analysis Algorithms with MapReduce”, Wiley
Interdisciplinary Reviews Data Mining and Knowledge Discovery, Oct 28 2013, Volume 3, Issue 6, pp445-
452, ISSN 1942-4795
3. Jongwook Woo, “Big Data Trend and Open Data”, UKC 2016, Dallas, TX, Aug 12 2016
4. How to choose algorithms for Microsoft Azure Machine Learning, https://docs.microsoft.com/en-
us/azure/machine-learning/machine-learning-algorithm-choice
5. “Big Data Analysis using Spark for Collision Rate Near CalStateLA” , Manik Katyal, Parag Chhadva, Shubhra
Wahi & Jongwook Woo, https://globaljournals.org/GJCST_Volume16/1-Big-Data-Analysis-using-Spark.pdf
6. Spark Programming Guide: http://spark.apache.org/docs/latest/programming-guide.html
7. TensorFrames: Google Tensorflow on Apache Spark, https://www.slideshare.net/databricks/tensorframes-
google-tensorflow-on-apache-spark
8. Deep learning and Apache Spark, https://www.slideshare.net/QuantUniversity/deep-learning-and-apache-
spark

Jongwook Woo
CalStateLA
References
9. Which Is Deeper - Comparison Of Deep Learning Frameworks On Spark,
https://www.slideshare.net/SparkSummit/which-is-deeper-comparison-of-deep-learning-frameworks-on-
spark
10. Accelerating Machine Learning and Deep Learning At Scale with Apache Spark,
https://www.slideshare.net/SparkSummit/accelerating-machine-learning-and-deep-learning-at-scalewith-
apache-spark-keynote-by-ziya-ma
11. Deep Learning with Apache Spark and TensorFlow, https://databricks.com/blog/2016/01/25/deep-
learning-with-apache-spark-and-tensorflow.html
12. Tensor Flow Deep Learning Open SAP
13. Overview of Smart Factory, https://www.slideshare.net/BrendanSheppard1/overview-of-smart-factory-
solutions-68137094/6
14. https://dzone.com/articles/sqoop-import-data-from-mysql-tohive
15. https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/data
16. https://blogs.msdn.microsoft.com/andreasderuiter/2015/02/09/performance-measures-in-azure-ml-
accuracy-precision-recall-and-f1-score/

AdClickFraud_Bigdata-Apic-Ist-2019

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to AdClickFraud_Bigdata-Apic-Ist-2019

Similar to AdClickFraud_Bigdata-Apic-Ist-2019 (20)

Recently uploaded

Recently uploaded (20)

AdClickFraud_Bigdata-Apic-Ist-2019