Big Data and Predictive Analysis

Jongwook Woo
HiPIC
CalStateLA
IDEAS Live Webinar 2019
May 4 2019
Jongwook Woo, PhD, jwoo5@calstatela.edu
Big Data AI Center (BigDAI)
California State University Los Angeles

Contents
▪ Myself
▪ Introduction to Big Data
▪ Big Data Predictive Analysis
▪ Summary

Myself
Experience:
Since 2002, Professor at California State University Los Angeles
– PhD in 2001: Computer Science and Engineering at USC

Universities in Los Angeles
West
North

California State University
Los Angeles

Myself: S/W Development Lead
http://www.mobygames.com/game/windows/matrix-online/credits

Myself: Isaac Engineering, HDP, CDH, Oracle
using Hadoop Big Data
https://www.cloudera.com/more/customers/csula.html

Myself: Partners for Services

Myself: Collaborations

Data Issues
Large-scale data
Terabyte (10^12 bytes), Petabyte (10^15 bytes)
– Because of the web
– Sensor data (IoT), bioinformatics, social computing, streaming data,
smartphones, online games…
Legacy approach
▪ Can do
– Improve the speed of the CPU
– Increase the storage size
▪ Only problem
– Too expensive

Data Handling: Traditional Way

Becomes too Expensive

Data Issues
Cannot be handled with the legacy approach
Too big
Non-/semi-structured data
▪ 3 Vs, 4 Vs, …
– Velocity, Volume, Variety
Traditional systems can handle them
– But again, too expensive
Need new systems
Inexpensive

Two Cores in Big Data
How to store Big Data
How to compute Big Data
Google
How to store Big Data
– GFS (Google File System)
– Distributed systems on inexpensive commodity computers
How to compute Big Data
– MapReduce
– Parallel computing with inexpensive computers
Its own supercomputers
Published the papers in 2003 (GFS) and 2004 (MapReduce)
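
To make the "how to compute" half concrete, here is a minimal single-process sketch of the MapReduce idea using word count. It only imitates the map, shuffle, and reduce phases in plain Python; the function names and the sample input are assumptions made for illustration, not Google's or Hadoop's API.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # On a real cluster each map call would run on the node storing the split.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


splits = ["big data big compute", "data store"]  # hypothetical input splits
counts = word_count(splits)
print(counts)
```

Because each map call touches only its own split and each reduce call only its own key, both phases can be spread over many inexpensive machines.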

Data Handling: Another Way
Not Expensive
Image: from the 2017 Korean blockbuster movie “The Fortress” (남한산성)

But works well with a crazy massive data set
Image: Battle of Nagashino, 1575, Japan

Need Resource Management

What is Hadoop?
▪ Apache Hadoop project split from Nutch in January 2006
▪ Hadoop founder:
o Doug Cutting
▪ Apache committer:
Lucene, Nutch, …

Supercomputer vs Hadoop vs Cloud
Figure: Parallel vs. distributed file systems, by Michael Malak, updated by Jongwook Woo
Figure labels: Cluster for Store; Cluster for Compute; Cluster for Compute/Store; “Cloud Computing adopts this architecture: with High Speed N/W”

Definition: Big Data
An inexpensive platform of distributed parallel systems
that can store large-scale data and process it in parallel [1, 2]
Hadoop
– Inexpensive supercomputer
– More accessible than traditional supercomputers
• You can store your data and run your applications
– In your university labs, small companies, research centers
Others with storage and computing services
– Spark
• normally integrated into Hadoop by the Hadoop community
– NoSQL DB (Cassandra, MongoDB, Redis, HBase, …)
– Elasticsearch

Big Data Analysis & Visualization
Sentiment Map of AlphaGo (legend: Positive, Negative)

K-Election 2017
(April 29 – May 9)

Popular businesses within 5 miles of CalStateLA, USC, and UCLA

Jams and other traffic incidents reported
by users in Dec 2017 – Jan 2018:
(Dalyapraz Dauletbak)

Big Data Analysis and Prediction
Big Data Analysis
Hadoop, Spark, NoSQL DB, SAP HANA, Elasticsearch, …
Big Data for Data Analysis
– How to store, compute, and analyze a massive dataset?
Big Data Science
How to predict future trends and patterns from a massive
dataset? => Machine Learning

Spark
▪ Limitations of MapReduce
▪ Hard to program in Java
▪ Batch processing
– Not interactive
▪ Disk storage for intermediate data
– Performance issue
▪ Spark by UC Berkeley AMPLab
▪ Started by Matei Zaharia in 2009,
– and open sourced in 2010
In-memory storage for intermediate data
▪ 20 ~ 100 times faster than MapReduce
Good for Machine Learning => Big Data Science
– Iterative algorithms

Spark (Cont’d)
Spark ML
Provides Machine Learning libraries
Processes massive data sets to build prediction models

Deep Learning
▪ Machine Learning
▪ Has been popular since Google TensorFlow
▪ Multiple cores in a GPU
– Even with multiple GPUs and CPUs
▪ Parallel computing
▪ GPU (Nvidia GTX 1660 Ti)
▪ 1280 CUDA cores
▪ Deep Learning libraries
▪ TensorFlow
▪ PyTorch
▪ Keras
▪ Caffe, Caffe2
▪ Microsoft Cognitive Toolkit (previously CNTK)
▪ Apache MXNet
▪ Deeplearning4j
▪ …

From Neural Networks to Deep Learning
Deep learning – Different types of architectures
Neural Networks (NN)
Convolutional Neural Networks (CNN)
Recurrent Neural Networks (RNN) & Long Short-Term Memory (LSTM)
Generative Adversarial Networks (GAN)
Ref: SAP Enterprise Deep Learning with TensorFlow

Deep Learning
CNN
– Image recognition
– Video analysis
– NLP for classification and prediction
RNN
– Time series prediction
– Speech recognition/synthesis
– Image/video captioning
– Text analysis
• Conversation Q&A
GAN
– Media generation
• Photorealistic images
– Human image synthesis: fake faces

Deep Learning with Spark
What if we combine Deep Learning and Spark?

Deep Learning with Spark
Deep Learning Pipelines for Apache Spark
– Databricks
TensorFlowOnSpark
– Yahoo! Inc
BigDL (Distributed Deep Learning Library for Apache Spark)
– Intel
DL4J (Deeplearning4j on Spark)
– Skymind
Distributed Deep Learning with Keras & Spark
– Elephas

Contents
▪ Myself
▪ Introduction to Big Data
▪ Big Data Predictive Analysis: Use Case
▪ Summary

Use Case in Spark
▪ “Predicting AD click fraud using Azure and Spark ML”,
accepted at the 14th Asia Pacific International Conference on Information
Science and Technology (APIC-IST 2019), June 23-26 2019, Beijing, China
– By Neha Gupta, Hai Anh Le, Maria Boldina
▪ Machine Learning
Distributed parallel computing
– using Spark with Hadoop and cloud computing
Not Deep Learning

Ad Click Fraud
A person, automated script, or computer program imitates a
legitimate user,
clicking on an ad without any actual interest in the target of the ad's
link,
resulting in misleading click data and wasted money
Companies suffer from huge volumes of fraudulent traffic
Especially in the mobile market worldwide
Goal
Predict who will download the apps
Using a classification model
With traditional and Big Data approaches

Ad Click Fraud (Cont’d)
TalkingData
▪ China’s largest independent big data service platform
– covers over 70% of active mobile devices nationwide
▪ handles 3 billion clicks per day
– 90% of which are potentially fraudulent
▪ Goal of the predictive analysis
▪ Predict whether a user will download an app
– after clicking on a mobile app advertisement
▪ To better target the audience,
– to avoid fraudulent practices
– and save money

Data Set
▪ Dataset: TalkingData AdTracking Fraud Detection
https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/data
Dataset properties:
Original dataset size: 7 GB
– contains 200 million clicks over a 4-day period
Dataset format: CSV
Fields: 8
– Target column to predict: ‘is_attributed’
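
As a rough illustration of this layout, the sketch below parses a few click records and measures how rare the target value 1 is. The eight column names are the ones listed on the Kaggle data page and should be treated as an assumption here; the three rows are made up, and `load_clicks` and `download_rate` are hypothetical helper names.

```python
import csv
import io

# Three made-up rows in the layout of the Kaggle file (illustrative values only).
SAMPLE_CSV = """ip,app,device,os,channel,click_time,attributed_time,is_attributed
1001,3,1,13,379,2017-11-06 14:32:21,,0
1002,9,1,19,134,2017-11-06 14:33:34,2017-11-06 15:01:10,1
1001,3,1,13,379,2017-11-06 14:36:26,,0
"""


def load_clicks(text_stream):
    """Parse click records and convert the target column to an int."""
    rows = []
    for row in csv.DictReader(text_stream):
        row["is_attributed"] = int(row["is_attributed"])
        rows.append(row)
    return rows


def download_rate(rows):
    """Fraction of clicks that led to an app download (target = 1)."""
    return sum(r["is_attributed"] for r in rows) / len(rows)


clicks = load_clicks(io.StringIO(SAMPLE_CSV))
fields = list(clicks[0].keys())
print(len(fields), "fields:", fields)
print("download rate:", download_rate(clicks))
```

On the real file the same ratio is what exposes the class imbalance discussed later in the deck.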

Data Set Details

Experiment Environment:
Traditional and Big Data Systems

Experiment Environment: Traditional
Azure ML Studio:
Traditional, for small data sets
Free Workspace
10 GB storage
Single node
Implement fundamental prediction models
– Using sample data: 80 MB (1.1% of the original data set)
Select the best model among a number of classification algorithms

Experiment Environment: Spark
Spark ML:
Data filtering:
– 1 GB from 8 GB
• Implemented Python code to reduce the size to 1 GB (15%)
– We have experimental results with 8 GB as well
• For another publication
Databricks subscription
– Cluster 4.0 (includes Apache Spark 2.3.0, Scala 2.11)
• 2 Spark workers with a total of 16 GB memory and 4 cores
• Python 2.7
• File system: Databricks File System
Experiment Environment: Spark (Cont’d)
Oracle Big Data Spark Cluster
▪ Oracle BDCE
Python 2.7.x, Spark 2.1.x
▪ 10 nodes,
– 20 OCPUs, 300 GB memory, 1,154 GB storage

Workflow in Azure ML
▪ Relatively easy to build and test
Drag-and-drop GUI
Workflow
1. Data Engineering
– Understanding data
– Data preparation
– Balancing data statistically
2. Data Science: Machine Learning (ML)
– Model building and validation
• Classification algorithms
– Model evaluation
– Model interpretation

Data Engineering
Unbalanced dataset
1: 0.19% app downloaded
0: 99.81% app not downloaded
The 1 GB filtered dataset is still too large for the traditional system, Azure ML Studio
More sampling is needed for Azure ML

Data Engineering
▪ SMOTE (Synthetic Minority Over-sampling Technique) takes a subset of
data from the minority class and creates new, similar synthetic instances
▪ Helps balance data & avoid overfitting
▪ Increased the share of the minority class (1) from 0.19% to 11%
▪ Stratified split ensures that the output dataset contains a representative
sample of the values in the selected column
▪ Ensures that the random sample does not contain only rows with 0s
▪ 8% sample used = 80 MB

Algorithms in Azure ML Studio
▪ Two-class classification:
▪ classifies the elements of a given set into two groups
– either downloaded, is_attributed (1)
– or not downloaded, is_attributed (0)
Decision trees
▪ often perform well on imbalanced datasets
– as their hierarchical structure allows them to learn signals from both classes.
Tree ensembles almost always outperform single decision trees
– Algorithm #1: Two-Class Decision Jungle
– Algorithm #2: Two-Class Decision Forest

Selecting Performance Metrics
False Positives (FP) indicate that
the model predicted an app was downloaded when in fact it wasn’t
▪ Goal: minimize the FP => to save $$$

AZURE ML MODEL #1: TWO-CLASS DECISION JUNGLE
• 8% Sample
• SMOTE 5000%
• 70:30 Split Train/Test
• Cross-Validation
• Tune Model Hyperparameters
• Features used: all 7
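
The SMOTE and 70:30 split steps listed above are built-in Azure ML Studio modules, so the experiment needed no code for them. Purely to show the idea, here is a simplified hand-written stand-in: `smote` interpolates between a minority point and one of its nearest minority neighbors, and `stratified_split` splits each class separately. The function names, parameters, and toy data are assumptions, not the modules' actual implementation.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points between minority samples and their neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point (squared Euclidean distance).
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # position on the segment from base to neighbor
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbor)))
    return synthetic


def stratified_split(rows, labels, train_fraction=0.7, seed=0):
    """Split each class separately so both parts keep the class proportions."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        members = [(r, l) for r, l in zip(rows, labels) if l == cls]
        rng.shuffle(members)
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test


# Toy data: 90 majority points (label 0) and 4 minority points (label 1).
majority = [(float(i), float(i % 7)) for i in range(90)]
minority = [(100.0, 1.0), (101.0, 2.0), (102.0, 1.5), (103.0, 3.0)]
minority_all = minority + smote(minority, n_new=6)

rows = majority + minority_all
labels = [0] * len(majority) + [1] * len(minority_all)
train, test = stratified_split(rows, labels)
print(len(train), len(test), sum(l for _, l in test))
```

With these toy numbers the minority share grows from 4 of 94 rows to 10 of 100, and the 70:30 split keeps that 10% share in both parts.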

AZURE ML MODEL #1: Tune Model Hyperparameters
Panels: Without Tune Model Hyperparameters; With Tune Model Hyperparameters
AUC = 0.905 vs 0.606
Precision = 1.0
TP = 35, FP = 0

AZURE ML MODEL #2: TWO-CLASS DECISION FOREST
• 8% Sample
• SMOTE 5000%
• 70:30 Split Train/Test
• Cross-Validation
• Tune Model Hyperparameters
• Permutation Feature Importance

AZURE ML MODEL #2: Improving Precision
By increasing the threshold from 0.5 to 0.8:
Precision increased to 0.992
FP decreased from 1,659 to 377
FN increased from 1,834 to 5,142
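
The trade-off described here (fewer false positives, more false negatives) follows directly from how a score threshold works. The sketch below shows it on a handful of hypothetical scores and labels; none of the numbers come from the experiment, and `confusion` is an illustrative helper name.

```python
def confusion(scores, labels, threshold):
    """Count TP, FP, FN when predicting 1 for score >= threshold."""
    tp = fp = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
    return tp, fp, fn


# Hypothetical model scores and true labels, not taken from the experiment.
scores = [0.95, 0.90, 0.85, 0.70, 0.65, 0.60, 0.55, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 1, 0, 1, 0, 0, 1, 0]

for threshold in (0.5, 0.8):
    tp, fp, fn = confusion(scores, labels, threshold)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    print(f"threshold={threshold}: TP={tp} FP={fp} FN={fn} "
          f"precision={precision:.2f} recall={recall:.2f}")
```

Raising the threshold from 0.5 to 0.8 in this toy case removes both false positives (precision goes from 0.71 to 1.00) at the cost of two more false negatives (recall goes from 0.83 to 0.50).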

Experimental Results in Azure ML Studio
Performance:
Execution time with sample data set: 1 GB
Decision Forest
– takes 2.5 hours
Decision Jungle
– takes 3 hours 19 min
The Azure ML Studio models are a good guide
▪ for adopting the 2 similar algorithms in Spark ML
– Decision Tree
– Random Forest

Experimental Results in AzureML
Two-class Decision Forest is the best model!

Experiment with Spark ML in Databricks
1. Load the data source
▪ 1.03 GB
▪ Same filtered data set as Azure ML
2. Train and build the models
o Statistically balanced data
3. Evaluate

Data Engineering
Generate features
Feature 1: extract the day of the week and hour of the day from the click time
Feature 2: group clicks by combinations of
– (IP, Day_of_week_number and Hour)
– (IP, App, Operating System, Day_of_week_number and Hour)
– (App, Day_of_week_number and Hour)
– (IP, App, Device and Operating System)
– (IP, Device and Operating System)
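
The two feature types above can be sketched in a few lines. This is a plain-Python illustration on hypothetical records, not the Spark code used in the experiment; the field names, the timestamp format, and the helpers `add_time_features` and `count_by` are assumptions.

```python
from collections import Counter
from datetime import datetime

# Hypothetical click records; field names mirror the grouping keys above.
clicks = [
    {"ip": 1001, "app": 3, "click_time": "2017-11-06 14:32:21"},
    {"ip": 1001, "app": 3, "click_time": "2017-11-06 14:36:26"},
    {"ip": 1001, "app": 9, "click_time": "2017-11-07 09:05:00"},
    {"ip": 1002, "app": 3, "click_time": "2017-11-06 14:40:00"},
]


def add_time_features(click):
    """Feature 1: day of week (Monday = 0) and hour of day from the click time."""
    ts = datetime.strptime(click["click_time"], "%Y-%m-%d %H:%M:%S")
    return {**click, "day_of_week": ts.weekday(), "hour": ts.hour}


def count_by(rows, keys):
    """Feature 2: number of clicks that share the same combination of keys."""
    counts = Counter(tuple(r[k] for k in keys) for r in rows)
    return [counts[tuple(r[k] for k in keys)] for r in rows]


enriched = [add_time_features(c) for c in clicks]
ip_day_hour = count_by(enriched, ("ip", "day_of_week", "hour"))
app_day_hour = count_by(enriched, ("app", "day_of_week", "hour"))
print(ip_day_hour, app_day_hour)
```

Each count becomes one extra numeric column per row, so an IP that clicks many times within the same hour stands out to the classifier.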

Spark ML MODEL #1: Decision Tree Classifier
Confusion Matrix

Spark ML MODEL #2: Random Forest Classifier
Confusion Matrix

Spark ML Result Comparison
The Decision Tree Classifier is relatively the better model!

Metric      Decision Tree Classifier   Random Forest Classifier
AUC         0.815                      0.746
PRECISION   0.822                      0.878
RECALL      0.633                      0.495
TP          86,683                     67,726
FP          18,727                     9,408
TN          7,112,961                  7,122,280
FN          50,074                     69,031
RMSE        0.0972                     0.1038

Experiment in Oracle Cluster
Oracle Big Data Spark Cluster
▪ 10 nodes, 20 OCPUs, 300 GB memory, 1,154 GB storage
1. Load the data source
▪ 1.03 GB
2. Sample balanced data based on Downloaded
▪ 116 MB
3. Train and build the models
o Statistically balanced data
4. Evaluate
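
The slides do not say how the balanced sample in step 2 was drawn. One common way to do it is random undersampling of the majority class, sketched below under that assumption. The helper name `undersample`, the ratio parameter, and the toy data are all hypothetical.

```python
import random


def undersample(rows, label_key, negatives_per_positive=1.0, seed=42):
    """Keep every positive row and a random subset of the negative rows."""
    rng = random.Random(seed)
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    keep = min(len(negatives), round(len(positives) * negatives_per_positive))
    sample = positives + rng.sample(negatives, keep)
    rng.shuffle(sample)
    return sample


# Hypothetical data: 5 downloads among 1,000 clicks.
rows = [{"id": i, "is_attributed": 1 if i < 5 else 0} for i in range(1000)]
balanced = undersample(rows, "is_attributed", negatives_per_positive=4.0)
positives = sum(r["is_attributed"] for r in balanced)
print(len(balanced), positives)
```

Because most negative rows are dropped, the sample is far smaller than the input, which is consistent in spirit with the size reduction from 1.03 GB to 116 MB reported above.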

Azure ML Studio and Spark ML Result Comparison
Models:
(1) Two-Class Decision Jungle (Azure ML)
(2) Two-Class Decision Forest (Azure ML)
(3) Decision Tree Classifier (Databricks)
(4) Random Forest Classifier (Databricks)
(5) Decision Tree Classifier (Balanced Sample Data, Oracle)
(6) Random Forest Classifier (Balanced Sample Data, Oracle)

Metric      (1)       (2)       (3)         (4)         (5)       (6)
AUC         0.905     0.997     0.815       0.746       0.896     0.893
PRECISION   1.0       0.992     0.822       0.878       0.935     0.934
RECALL      0.001     0.902     0.633       0.495       0.807     0.800
TP          35        47,199    86,683      67,726      111,187   110,220
FP          0         377       18,727      9,408       7,712     7,791
TN          52,306    406,228   7,112,961   7,122,280   545,302   545,223
FN          406,605   5,142     50,074      69,031      26,604    27,571
Run Time    2 hrs     2-3 hrs   22 mins     50 mins     24 sec    2 mins

Azure ML Studio and Spark ML Result Comparison
• Azure ML Two-Class Decision Forest is the best model!
• Spark ML code needs to be updated for better accuracy
• Balanced sampling based on fraud in Oracle:
• Decision Tree has a precision of 0.935
• Execution time: 24 secs

Summary
Introduction to Big Data
Ad click prediction models in traditional and Big Data
systems
Azure ML Studio shows the best accuracy with the Two-Class Decision
Forest model
Spark ML is 3.5 – 7 times faster than Azure ML
Studio with the 1 GB data set, but less accurate
▪ With a 2-node Spark cluster
Balanced sample data in Oracle comes close to the accuracy of the
traditional system while running 300 times faster
▪ With a 10-node Spark cluster

Questions?

Data Set Details (Cont’d)

Precision vs Recall
Positive: event occurs (fraud)
Negative: event does not occur (non-fraud)
True Positive (TP): predicted fraud, and it is fraud
False Negative (FN): predicted no fraud, but it is fraud
False Positive (FP): predicted fraud, but it is not
▪ Precision
▪ TP / (TP + FP)
▪ Recall
▪ TP / (TP + FN)
▪ Ref: https://en.wikipedia.org/wiki/Precision_and_recall
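
As a quick check of these two formulas, the sketch below recomputes precision and recall from the confusion-matrix counts reported earlier for two of the models. The counts are taken from the result comparison table; the helper names and the dictionary layout are just for illustration.

```python
def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of actual positives that the model found."""
    return tp / (tp + fn)


# TP, FP, FN counts as reported in the result comparison table.
models = {
    "Two-Class Decision Forest (Azure ML)": dict(tp=47_199, fp=377, fn=5_142),
    "Decision Tree Classifier (Oracle)": dict(tp=111_187, fp=7_712, fn=26_604),
}

for name, c in models.items():
    p = precision(c["tp"], c["fp"])
    r = recall(c["tp"], c["fn"])
    print(f"{name}: precision={p:.3f} recall={r:.3f}")
```

Both rows reproduce the reported values: 0.992 and 0.902 for the Decision Forest, 0.935 and 0.807 for the Decision Tree on the Oracle cluster.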

References
1. Priyanka Purushu, Niklas Melcher, Bhagyashree Bhagwat, Jongwook Woo, “Predictive Analysis of Financial Fraud Detection using Azure and Spark ML”, Asia Pacific Journal of Information Systems (APJIS), Vol. 28, No. 4, December 2018, pp. 308-319
2. Jongwook Woo, DMKD-00150, “Market Basket Analysis Algorithms with MapReduce”, Wiley Interdisciplinary Reviews Data Mining and Knowledge Discovery, Oct 28 2013, Volume 3, Issue 6, pp. 445-452, ISSN 1942-4795
3. Jongwook Woo, “Big Data Trend and Open Data”, UKC 2016, Dallas, TX, Aug 12 2016
4. How to choose algorithms for Microsoft Azure Machine Learning, https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-algorithm-choice
5. Manik Katyal, Parag Chhadva, Shubhra Wahi & Jongwook Woo, “Big Data Analysis using Spark for Collision Rate Near CalStateLA”, https://globaljournals.org/GJCST_Volume16/1-Big-Data-Analysis-using-Spark.pdf
6. Spark Programming Guide, http://spark.apache.org/docs/latest/programming-guide.html
7. TensorFrames: Google Tensorflow on Apache Spark, https://www.slideshare.net/databricks/tensorframes-google-tensorflow-on-apache-spark
8. Deep learning and Apache Spark, https://www.slideshare.net/QuantUniversity/deep-learning-and-apache-spark

9. Which Is Deeper - Comparison Of Deep Learning Frameworks On Spark, https://www.slideshare.net/SparkSummit/which-is-deeper-comparison-of-deep-learning-frameworks-on-spark
10. Accelerating Machine Learning and Deep Learning At Scale with Apache Spark, https://www.slideshare.net/SparkSummit/accelerating-machine-learning-and-deep-learning-at-scalewith-apache-spark-keynote-by-ziya-ma
11. Deep Learning with Apache Spark and TensorFlow, https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html
12. Tensor Flow Deep Learning Open SAP
13. Overview of Smart Factory, https://www.slideshare.net/BrendanSheppard1/overview-of-smart-factory-solutions-68137094/6
14. https://dzone.com/articles/sqoop-import-data-from-mysql-tohive
15. https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/data
16. https://blogs.msdn.microsoft.com/andreasderuiter/2015/02/09/performance-measures-in-azure-ml-accuracy-precision-recall-and-f1-score/
