Jongwook Woo
HiPIC
CalStateLA
IDEAS Live Webinar 2019
May 4 2019
Jongwook Woo, PhD, jwoo5@calstatela.edu
Big Data AI Center (BigDAI)
California State University Los Angeles
Big Data and Predictive Analysis
Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Contents
➢ Myself
➢ Introduction To Big Data
➢ Big Data Predictive Analysis
➢ Summary
Myself
Experience:
Since 2002, Professor at California State University Los Angeles
– PhD in 2001: Computer Science and Engineering at USC
Universities in Los Angeles
California State University Los Angeles
Myself: S/W Development Lead
http://www.mobygames.com/game/windows/matrix-online/credits
Myself: Isaac Engineering, HDP, CDH, Oracle
using Hadoop Big Data
https://www.cloudera.com/more/customers/csula.html
Myself: Partners for Services
Myself: Collaborations
Data Issues
Large-scale data
Terabytes (10^12 bytes), petabytes (10^15 bytes)
– Because of the web
– Sensor data (IoT), bioinformatics, social computing, streaming data, smartphones, online games…
Legacy approach
▪ Can do
– Improve the speed of the CPU
– Increase the storage size
▪ Only problem
– Too expensive
Data Handling: Traditional Way
Becomes too expensive
Data Issues
Cannot be handled with the legacy approach
Too big
Non-/semi-structured data
✓ 3 Vs, 4 Vs, …
– Velocity, Volume, Variety
Traditional systems can handle them
– But again, too expensive
New systems are needed
Inexpensive
Two Cores in Big Data
How to store Big Data
How to compute Big Data
Google
How to store Big Data
– GFS
– Distributed systems on inexpensive commodity computers
How to compute Big Data
– MapReduce
– Parallel computing with inexpensive computers
Its own supercomputers
Published the GFS and MapReduce papers in 2003 and 2004
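The MapReduce idea named above can be sketched in a few lines of plain Python. This is only an illustration of the map, shuffle, and reduce steps on a single machine; the function names and the word-count task are assumptions for the example, not Google's or Hadoop's API. On a real cluster each chunk would be mapped on a different node in parallel.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle: group all emitted values by key across every chunk."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into one result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks):
    # Each chunk is mapped independently; a cluster would do this in parallel.
    mapped = [map_phase(chunk) for chunk in chunks]
    return reduce_phase(shuffle(mapped))


# Two hypothetical chunks standing in for blocks of a distributed file.
chunks = ["big data big cluster", "data node data"]
counts = word_count(chunks)
print(counts)
```

Because the map step touches each chunk independently, adding more inexpensive machines lets more chunks be processed at the same time.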
Data Handling: Another Way
Not expensive
(Image from the 2017 Korean blockbuster movie “The Fortress” (남한산성))
Data Handling: Another Way
But it works well with a crazy massive data set
(Image: Battle of Nagashino, 1575, Japan)
Data Handling: Another Way
Need Resource Management
What is Hadoop?
➢ Apache Hadoop project split from Nutch in January 2006
➢ Hadoop founder:
o Doug Cutting
➢ Apache committer:
o Lucene, Nutch, …
Super Computer vs Hadoop vs Cloud
(Figure: “Parallel vs. Distributed file systems” by Michael Malak, updated by Jongwook Woo)
Cluster for Store and Cluster for Compute
Cluster for Compute/Store
Cloud Computing adopts this architecture: with High Speed N/W
Definition: Big Data
An inexpensive platform of distributed parallel systems that can store large-scale data and process it in parallel [1, 2]
Hadoop
– Inexpensive supercomputer
– More accessible to the public than traditional supercomputers
• You can store your data and run your applications
– In your university labs, small companies, research centers
Others with storage and computing services
– Spark
• Normally integrated into Hadoop by the Hadoop community
– NoSQL DB (Cassandra, MongoDB, Redis, HBase, …)
– Elasticsearch
Big Data Analysis & Visualization
Sentiment Map of AlphaGo
(Legend: Positive, Negative)
K-Election 2017
(April 29 – May 9)
Popular businesses within 5 miles of CalStateLA, USC, and UCLA
Jams and other traffic incidents reported by users in Dec 2017 – Jan 2018 (Dalyapraz Dauletbak)
Big Data Analysis and Prediction
Big Data Analysis
Hadoop, Spark, NoSQL DB, SAP HANA, Elasticsearch, …
Big Data for Data Analysis
– How to store, compute, and analyze a massive dataset?
Big Data Science
How to predict future trends and patterns with a massive dataset? => Machine Learning
Spark
➢ Limitations of MapReduce
▪ Hard to program in Java
▪ Batch processing
– Not interactive
▪ Disk storage for intermediate data
– Performance issue
➢ Spark by UC Berkeley AMPLab
▪ Started by Matei Zaharia in 2009 and open sourced in 2010
▪ In-memory storage for intermediate data
– 20 to 100 times faster than MapReduce
▪ Good for Machine Learning => Big Data Science
– Iterative algorithms
Spark (Cont’d)
Spark ML
Supports Machine Learning libraries
Processes massive data sets to build prediction models
Deep Learning
➢ Machine Learning
▪ Has been popular since Google TensorFlow
➢ Multiple cores in GPU
– Even with multiple GPUs and CPUs
▪ Parallel computing
➢ GPU (Nvidia GTX 1660 Ti)
▪ 1280 CUDA cores
➢ Deep Learning libraries
▪ TensorFlow
▪ PyTorch
▪ Keras
▪ Caffe, Caffe2
▪ Microsoft Cognitive Toolkit (previously CNTK)
▪ Apache MXNet
▪ Deeplearning4j
▪ …
From Neural Networks to Deep Learning
Deep learning – Different types of architectures
Neural Networks (NN)
Convolutional Neural Networks (CNN)
Recurrent Neural Networks (RNN) & Long Short-Term Memory (LSTM)
Generative Adversarial Networks (GAN)
Ref: SAP Enterprise Deep Learning with TensorFlow
Deep Learning
CNN
Image Recognition
Video Analysis
▪ NLP for classification, prediction
RNN
Time Series Prediction
Speech Recognition/Synthesis
Image/Video Captioning
Text Analysis
– Conversation Q&A
GAN
▪ Media Generation
– Photo Realistic Images
Human Image Synthesis: Fake faces
Deep Learning with Spark
What if we combine Deep Learning and Spark?
Deep Learning with Spark
Deep Learning Pipelines for Apache Spark
Databricks
TensorFlowOnSpark
Yahoo! Inc
BigDL (Distributed Deep Learning Library for Apache Spark)
Intel
DL4J (Deeplearning4j On Spark)
Skymind
Distributed Deep Learning with Keras & Spark
Elephas
Contents
➢ Myself
➢ Introduction To Big Data
➢ Big Data Predictive Analysis: Use Case
➢ Summary
Use Case in Spark
➢ “Predicting AD click fraud using Azure and Spark ML”, accepted at the 14th Asia Pacific International Conference on Information Science and Technology (APIC-IST 2019), June 23-26, 2019, Beijing, China
– By Neha Gupta, Hai Anh Le, Maria Boldina
➢ Machine Learning
Distributed Parallel Computing
– using Spark with Hadoop and Cloud Computing
Not Deep Learning
Ad Click Fraud
A person, automated script, or computer program imitates a legitimate user
clicking on an ad without having an actual interest in the target of the ad's link
resulting in misleading click data and wasted money
Companies suffer from huge volumes of fraudulent traffic
Especially in the mobile market worldwide
Goal
Predict who will download the apps
Using a classification model
Traditional and Big Data approaches
Ad Click Fraud (Cont’d)
TalkingData
▪ China’s largest independent big data service platform
– covers over 70% of active mobile devices nationwide
▪ handles 3 billion clicks per day
– 90% of which are potentially fraudulent
➢ Goal of the Predictive Analysis
▪ Predict whether a user will download an app
– after clicking on a mobile app advertisement
▪ To better target the audience,
– to avoid fraudulent practices
– and save money
Data Set
➢ Dataset: TalkingData AdTracking Fraud Detection
https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/data
Dataset Property:
Original dataset size: 7GB
– contains 200 million clicks over a 4-day period
Dataset format: CSV
Fields: 8
– Target column to predict: ‘is_attributed’
Data Set Details
Experiment Environment: Traditional and Big Data Systems
Experiment Environment: Traditional
Azure ML Studio:
Traditional for small data set
Free Workspace
10GB storage
Single node
Implement fundamental prediction models
– Using Sample data: 80MB (1.1% of the original data set)
Select the best model among a number of classification algorithms
Experiment Environment: Spark
Spark ML:
Data Filtering:
– 1 GB from 8 GB
• Implemented Python code to reduce the size to 1 GB (15%)
– We also have experimental results with the 8 GB data set
• For another publication
Databricks Subscription
– Cluster 4.0 (includes Apache Spark 2.3.0, Scala 2.11)
• 2 Spark Workers with total of 16 GB Memory and 4 Cores
• Python 2.7
• File System : Databricks File System
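The slides do not show the Python code used for data filtering, so the following is only a minimal sketch of one way to shrink a large CSV to roughly 15% of its rows without loading it into memory. The function name `sample_csv`, the random row-sampling strategy, and the tiny in-memory data are all assumptions for illustration.

```python
import csv
import io
import random


def sample_csv(src, dst, fraction, seed=42):
    """Copy the header and roughly `fraction` of the data rows from src to dst."""
    rng = random.Random(seed)
    reader = csv.reader(src)
    writer = csv.writer(dst)
    writer.writerow(next(reader))  # always keep the header row
    kept = 0
    for row in reader:  # rows are streamed one at a time
        if rng.random() < fraction:
            writer.writerow(row)
            kept += 1
    return kept


# Hypothetical stand-in for the click log; real files would be opened with open().
source = io.StringIO(
    "ip,app,is_attributed\n" + "".join(f"{i},{i % 5},0\n" for i in range(1000))
)
target = io.StringIO()
kept = sample_csv(source, target, fraction=0.15)
print(kept, "rows kept out of 1000")
```

Streaming row by row keeps memory use constant, which matters when the input is several gigabytes.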
Experiment Environment: Spark (Cont’d)
Oracle Big Data Spark Cluster
▪ Oracle BDCE
Python 2.7.x, Spark 2.1.x
▪ 10 nodes,
– 20 OCPUs, 300GB Memory, 1,154GB Storage
Work Flow in Azure ML
➢ Relatively easy to build and test
Drag and Drop GUI
Work Flow
1. Data Engineering
– Understanding Data
– Data preparation
– Balancing data statistically
2. Data Science: Machine Learning (ML)
– Model building and validation
• Classification algorithms
– Model evaluation
– Model interpretation
Data Engineering
Unbalanced dataset
1: 0.19% App downloaded
0: 99.81% App not downloaded
1GB filtered dataset still too large for the traditional system: Azure ML Studio
More sampling needed for Azure ML
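A short sketch of how such a class imbalance can be measured, and why it matters. The label list below is a hypothetical sample built to mirror the 0.19% / 99.81% split; the function name is an assumption for the example.

```python
from collections import Counter


def class_balance(labels):
    """Return the percentage of rows per class label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


# Hypothetical sample: 19 downloads (1) among 10,000 clicks.
labels = [1] * 19 + [0] * 9981
balance = class_balance(labels)

# A model that always predicts the majority class already reaches this accuracy,
# which is why accuracy alone is misleading on unbalanced data.
baseline_accuracy = max(balance.values())
print(balance, baseline_accuracy)
```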
Data Engineering
➢ SMOTE: Synthetic Minority Over-sampling Technique takes a subset of data from the minority class and creates new synthetic similar instances
▪ Helps balance data & avoid overfitting
▪ Increased percent of minority class (1) from 0.19% to 11%
➢ Stratified Split ensures that the output dataset contains a representative sample of the values in the selected column
▪ Ensures that the random sample does not contain all rows with just 0s
▪ 8% sample used = 80 MB
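The core idea of SMOTE can be sketched as interpolation between a minority-class point and one of its nearest minority neighbours. This is a simplified illustration with numeric features only, not the Azure ML Studio SMOTE module; the function name, the parameter `k`, and the sample points are assumptions.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Hypothetical minority-class points with two numeric features.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=8)
print(new_points)
```

Every synthetic point lies on a line segment between two real minority points, so the new instances stay inside the region the minority class already occupies.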
Algorithms in Azure ML Studio
➢ Two-Class Classification:
▪ classify the elements of a given set into two groups
– either downloaded, is_attributed (1)
– or not downloaded, is_attributed (0)
Decision trees
▪ often perform well on imbalanced datasets
– as their hierarchical structure allows them to learn signals from both classes.
Tree ensembles almost always outperform single decision trees
– Algorithm #1: Two-class Decision Jungle
– Algorithm #2: Two-class Decision Forest
Selecting Performance Metrics
False Positives indicate that the model predicted an app was downloaded when in fact it wasn’t
➢ Goal: minimize the FP => To save $$$
AZURE ML MODEL #1: TWO-CLASS DECISION JUNGLE
• 8% Sample
• SMOTE 5000%
• 70:30 Split Train/Test
• Cross-Validation
• Tune Model Hyperparameters
• Features used: all 7
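A stratified 70:30 split like the one listed above can be sketched as follows: rows are grouped by label and each group is split separately, so both sets keep the same class proportions. This is a plain-Python illustration, not the Azure ML Studio module; the function name and sample rows are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_of, train_fraction=0.7, seed=0):
    """Split rows into train/test while preserving the share of each label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)
    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test


# Hypothetical rows of (id, is_attributed): 10 positives and 90 negatives.
rows = [(i, 1) for i in range(10)] + [(i, 0) for i in range(10, 100)]
train, test = stratified_split(rows, label_of=lambda r: r[1])
print(len(train), len(test))
```

Without stratification, a random 30% test set drawn from highly unbalanced data could end up with very few positive rows, or none at all.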
AZURE ML MODEL #1: Tune Model Hyperparameters
(Panels: Without Tune Model Hyperparameters; With Tune Model Hyperparameters)
AUC = 0.905 vs 0.606
Precision = 1.0
TP = 35, FP = 0
AZURE ML MODEL #2: TWO-CLASS DECISION FOREST
• 8% Sample
• SMOTE 5000%
• 70:30 Split Train/Test
• Cross-Validation
• Tune Model Hyperparameters
• Permutation Feature Importance
AZURE ML MODEL #2: Improving Precision
By increasing the threshold from 0.5 to 0.8:
Precision increased to 0.992
FP decreased from 1,659 to 377
FN increased from 1,834 to 5,142
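The trade-off described above can be reproduced on a toy example: raising the decision threshold removes false positives but turns some true positives into false negatives. The scores and labels below are hypothetical, and the helper names are assumptions for the sketch.

```python
def confusion_counts(scores, labels, threshold):
    """Count TP, FP, TN, FN when predicting 1 for scores at or above threshold."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def precision(tp, fp):
    """Precision = TP / (TP + FP); defined as 0.0 when nothing is predicted positive."""
    return tp / (tp + fp) if tp + fp else 0.0


# Hypothetical model scores and true labels.
scores = [0.95, 0.85, 0.75, 0.65, 0.60, 0.55, 0.30, 0.10]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

low = confusion_counts(scores, labels, threshold=0.5)
high = confusion_counts(scores, labels, threshold=0.8)
print("threshold 0.5:", low, precision(low[0], low[1]))
print("threshold 0.8:", high, precision(high[0], high[1]))
```

In this toy data the higher threshold removes both false positives and adds two false negatives, the same direction of change as on the slide.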
Experimental Results in Azure ML Studio
Performance:
Execution time with sample data set: 1GB
Decision Forest
– takes 2.5 hours
Decision Jungle
– takes 3 hours 19 min
The Azure ML Studio models are a good guide
➢ to adopt the 2 similar algorithms for Spark ML
– Decision Tree
– Random Forest
Experimental Results in AzureML
Two-class Decision Forest is the best model!
Experiment with Spark ML in Databricks
1. Load the data source
▪ 1.03 GB
▪ Same filtered data set as Azure ML
2. Train and build the models
o Balanced data statistically
3. Evaluate
Data Engineering
Generate features
Feature 1: extract day of the week and hour of the day from the click time
Feature 2: group clicks by combination of
– (Ip, Day_of_week_number and Hour)
Feature 3: group clicks by combination of
– (Ip, App, Operating System, Day_of_week_number and Hour)
Feature 4: group clicks by combination of
– (App, Day_of_week_number and Hour)
Feature 5: group clicks by combination of
– (Ip, App, Device and Operating System)
Feature 6: group clicks by combination of
– (Ip, Device and Operating System)
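The features listed above can be sketched in plain Python. The sketch assumes that "group clicks by combination" means counting the clicks that share the same values of the listed columns, and it assumes lowercase column names (`ip`, `app`, `click_time`); the sample rows and the generated feature names are hypothetical. Only Features 1, 2, and 4 are shown; the others follow the same pattern with different key tuples.

```python
from collections import Counter
from datetime import datetime


def add_time_features(click):
    """Feature 1: derive day-of-week number and hour from the click time."""
    ts = datetime.strptime(click["click_time"], "%Y-%m-%d %H:%M:%S")
    return {**click, "day_of_week": ts.weekday(), "hour": ts.hour}


def add_group_count(rows, keys, name):
    """Attach to every row the number of clicks sharing its values for `keys`."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return [{**row, name: counts[tuple(row[k] for k in keys)]} for row in rows]


# Hypothetical clicks.
clicks = [
    {"ip": 1, "app": 3, "click_time": "2017-11-07 09:30:38"},
    {"ip": 1, "app": 3, "click_time": "2017-11-07 09:45:02"},
    {"ip": 1, "app": 9, "click_time": "2017-11-07 10:01:00"},
    {"ip": 2, "app": 3, "click_time": "2017-11-07 09:59:59"},
]

rows = [add_time_features(c) for c in clicks]
# Feature 2: clicks per (ip, day of week, hour)
rows = add_group_count(rows, ("ip", "day_of_week", "hour"), "ip_day_hour_clicks")
# Feature 4: clicks per (app, day of week, hour)
rows = add_group_count(rows, ("app", "day_of_week", "hour"), "app_day_hour_clicks")
for row in rows:
    print(row)
```

In Spark the same counts would typically come from a group-by aggregation joined back to the click rows.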
Spark ML MODEL #1: Decision Tree Classifier
Confusion Matrix
Spark ML MODEL #2: Random Forest Classifier
Confusion Matrix
Spark ML Result Comparison
Decision Tree Classifier is relatively the better model!
Metric     Decision Tree Classifier   Random Forest Classifier
AUC        0.815                      0.746
PRECISION  0.822                      0.878
RECALL     0.633                      0.495
TP         86,683                     67,726
FP         18,727                     9,408
TN         7,112,961                  7,122,280
FN         50,074                     69,031
RMSE       0.0972                     0.1038
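The precision, recall, and RMSE rows can be cross-checked from the confusion-matrix counts. The sketch below assumes RMSE is computed over hard 0/1 predictions, in which case it equals the square root of the misclassification rate; that assumption matches the table values to within rounding. The function name is illustrative.

```python
import math


def metrics(tp, fp, tn, fn):
    """Derive precision, recall, and RMSE of 0/1 predictions from the counts."""
    total = tp + fp + tn + fn
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        # Each wrong 0/1 prediction contributes a squared error of 1.
        "rmse": math.sqrt((fp + fn) / total),
    }


decision_tree = metrics(tp=86_683, fp=18_727, tn=7_112_961, fn=50_074)
random_forest = metrics(tp=67_726, fp=9_408, tn=7_122_280, fn=69_031)
print(decision_tree)
print(random_forest)
```

The Random Forest has fewer false positives and therefore higher precision, while the Decision Tree finds more of the actual downloads and therefore has higher recall.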
Experiment in Oracle Cluster
Oracle Big Data Spark Cluster
▪ 10 nodes, 20 OCPUs, 300GB Memory, 1,154GB Storage
1. Load the data source
▪ 1.03 GB
2. Sample the balanced data based on Downloaded
▪ 116 MB
3. Train and build the models
o Balanced data statistically
4. Evaluate
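Step 2 is not shown in detail on the slides. One common way to build a balanced sample is to keep every positive (downloaded) row and randomly undersample the negatives; the sketch below illustrates that approach. The function name and the ratio parameter `negatives_per_positive` are assumptions, not the authors' settings.

```python
import random


def undersample(rows, label_of, negatives_per_positive, seed=0):
    """Keep all positive rows and a random subset of the negative rows."""
    rng = random.Random(seed)
    positives = [r for r in rows if label_of(r) == 1]
    negatives = [r for r in rows if label_of(r) == 0]
    keep = min(len(negatives), negatives_per_positive * len(positives))
    return positives + rng.sample(negatives, keep)


# Hypothetical rows of (id, is_attributed): 5 downloads among 1,000 clicks.
rows = [(i, 1) for i in range(5)] + [(i, 0) for i in range(5, 1000)]
balanced = undersample(rows, label_of=lambda r: r[1], negatives_per_positive=4)
print(len(balanced), "rows after balancing")
```

Undersampling shrinks the training data considerably, which is consistent with the much smaller sample size listed in step 2.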
Azure ML Studio and Spark ML Result Comparison
Metric     Two-Class          Two-Class          Decision Tree      Random Forest      Decision Tree      Random Forest
           Decision Jungle    Decision Forest    Classifier         Classifier         Classifier         Classifier
           (AzureML)          (AzureML)          (Databricks)       (Databricks)       (Balanced Sample   (Balanced Sample
                                                                                       Data, Oracle)      Data, Oracle)
AUC        0.905              0.997              0.815              0.746              0.896              0.893
PRECISION  1.0                0.992              0.822              0.878              0.935              0.934
RECALL     0.001              0.902              0.633              0.495              0.807              0.800
TP         35                 47,199             86,683             67,726             111,187            110,220
FP         0                  377                18,727             9,408              7,712              7,791
TN         406,605            406,228            7,112,961          7,122,280          545,302            545,223
FN         52,306             5,142              50,074             69,031             26,604             27,571
Run Time   2 hrs              2-3 hrs            22 mins            50 mins            24 sec             2 mins
• Azure ML Two-class Decision Forest is the best model!
• Spark ML code needs to be updated for better accuracy
• Balanced sampling based on the fraud in Oracle:
– Decision Tree has 0.935 in Precision
– Execution time: 24 secs
Summary
Introduction to Big Data
Ad click prediction models in traditional and Big Data systems
Azure ML Studio shows the best accuracy with the Two-Class Decision Forest model
Spark ML is 3.5 – 7 times faster than Azure ML Studio with the 1 GB data set, but less accurate
▪ With a 2-node Spark cluster
Balanced sample data in Oracle achieves accuracy close to the traditional system while running 300 times faster
▪ With a 10-node Spark cluster
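The speedup figures can be checked against the run times in the comparison table. The sketch below uses the Decision Jungle time on Azure ML Studio ("2 hrs") as the baseline, which is an assumption about which run times were compared; the variable names are illustrative.

```python
def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


azure_jungle = 2 * 60 * 60   # "2 hrs" on Azure ML Studio
databricks_tree = 22 * 60    # "22 mins" on the 2-node Databricks cluster
oracle_tree = 24             # "24 sec" on the 10-node Oracle cluster

oracle_speedup = speedup(azure_jungle, oracle_tree)
databricks_speedup = speedup(azure_jungle, databricks_tree)
print(oracle_speedup, round(databricks_speedup, 1))
```

With this baseline the Oracle run comes out 300 times faster, and the Databricks Decision Tree falls inside the 3.5 to 7 times range quoted above.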
Questions?
Data Set Details (Cont’d)
Precision vs Recall
Positive: Event occurs (Fraud)
Negative: Event does not occur (non-Fraud)
True Positive (TP): Fraud? Yes it is
False Negative (FN): No fraud? but it is
False Positive (FP): Fraud? but it is not
➢ Precision
▪ TP / (TP + FP)
➢ Recall
▪ TP / (TP + FN)
➢ Ref: https://en.wikipedia.org/wiki/Precision_and_recall
References
1. Priyanka Purushu, Niklas Melcher, Bhagyashree Bhagwat, Jongwook Woo, "Predictive Analysis of Financial Fraud Detection using Azure and Spark ML", Asia Pacific Journal of Information Systems (APJIS), Vol. 28, No. 4, December 2018, pp. 308-319
2. Jongwook Woo, DMKD-00150, “Market Basket Analysis Algorithms with MapReduce”, Wiley Interdisciplinary Reviews Data Mining and Knowledge Discovery, Oct 28 2013, Volume 3, Issue 6, pp. 445-452, ISSN 1942-4795
3. Jongwook Woo, “Big Data Trend and Open Data”, UKC 2016, Dallas, TX, Aug 12 2016
4. How to choose algorithms for Microsoft Azure Machine Learning, https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-algorithm-choice
5. “Big Data Analysis using Spark for Collision Rate Near CalStateLA”, Manik Katyal, Parag Chhadva, Shubhra Wahi & Jongwook Woo, https://globaljournals.org/GJCST_Volume16/1-Big-Data-Analysis-using-Spark.pdf
6. Spark Programming Guide: http://spark.apache.org/docs/latest/programming-guide.html
7. TensorFrames: Google Tensorflow on Apache Spark, https://www.slideshare.net/databricks/tensorframes-google-tensorflow-on-apache-spark
8. Deep learning and Apache Spark, https://www.slideshare.net/QuantUniversity/deep-learning-and-apache-spark
9. Which Is Deeper - Comparison Of Deep Learning Frameworks On Spark, https://www.slideshare.net/SparkSummit/which-is-deeper-comparison-of-deep-learning-frameworks-on-spark
10. Accelerating Machine Learning and Deep Learning At Scale with Apache Spark, https://www.slideshare.net/SparkSummit/accelerating-machine-learning-and-deep-learning-at-scalewith-apache-spark-keynote-by-ziya-ma
11. Deep Learning with Apache Spark and TensorFlow, https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html
12. Tensor Flow Deep Learning Open SAP
13. Overview of Smart Factory, https://www.slideshare.net/BrendanSheppard1/overview-of-smart-factory-solutions-68137094/6
14. https://dzone.com/articles/sqoop-import-data-from-mysql-tohive
15. https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/data
16. https://blogs.msdn.microsoft.com/andreasderuiter/2015/02/09/performance-measures-in-azure-ml-accuracy-precision-recall-and-f1-score/

More Related Content

What's hot

Scalable Predictive Analysis and The Trend with Big Data & AI
Scalable Predictive Analysis and The Trend with Big Data & AIScalable Predictive Analysis and The Trend with Big Data & AI
Scalable Predictive Analysis and The Trend with Big Data & AIJongwook Woo
 
The Importance of Open Innovation in AI era
The Importance of Open Innovation in AI eraThe Importance of Open Innovation in AI era
The Importance of Open Innovation in AI eraJongwook Woo
 
Predictive Analysis for Airbnb Listing Rating using Scalable Big Data Platform
Predictive Analysis for Airbnb Listing Rating using Scalable Big Data PlatformPredictive Analysis for Airbnb Listing Rating using Scalable Big Data Platform
Predictive Analysis for Airbnb Listing Rating using Scalable Big Data PlatformSavita Yadav
 
The Evolution of Data Science
The Evolution of Data ScienceThe Evolution of Data Science
The Evolution of Data ScienceKenny Daniel
 
Big Data and Data Science: The Technologies Shaping Our Lives
Big Data and Data Science: The Technologies Shaping Our LivesBig Data and Data Science: The Technologies Shaping Our Lives
Big Data and Data Science: The Technologies Shaping Our LivesRukshan Batuwita
 
Big Data Science: Intro and Benefits
Big Data Science: Intro and BenefitsBig Data Science: Intro and Benefits
Big Data Science: Intro and BenefitsChandan Rajah
 
Predictive Analytics - Big Data & Artificial Intelligence
Predictive Analytics - Big Data & Artificial IntelligencePredictive Analytics - Big Data & Artificial Intelligence
Predictive Analytics - Big Data & Artificial IntelligenceManish Jain
 
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Data Science London
 
Big Data and the Art of Data Science
Big Data and the Art of Data ScienceBig Data and the Art of Data Science
Big Data and the Art of Data ScienceAndrew Gardner
 
Introduction of Data Science
Introduction of Data ScienceIntroduction of Data Science
Introduction of Data ScienceJason Geng
 
Intro to Data Science Big Data
Intro to Data Science Big DataIntro to Data Science Big Data
Intro to Data Science Big DataIndu Khemchandani
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceANOOP V S
 
Introduction on Data Science
Introduction on Data ScienceIntroduction on Data Science
Introduction on Data ScienceEdureka!
 
Data science e machine learning
Data science e machine learningData science e machine learning
Data science e machine learningGiuseppe Manco
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceEdureka!
 

What's hot (20)

Scalable Predictive Analysis and The Trend with Big Data & AI
Scalable Predictive Analysis and The Trend with Big Data & AIScalable Predictive Analysis and The Trend with Big Data & AI
Scalable Predictive Analysis and The Trend with Big Data & AI
 
The Importance of Open Innovation in AI era
The Importance of Open Innovation in AI eraThe Importance of Open Innovation in AI era
The Importance of Open Innovation in AI era
 
Predictive Analysis for Airbnb Listing Rating using Scalable Big Data Platform
Predictive Analysis for Airbnb Listing Rating using Scalable Big Data PlatformPredictive Analysis for Airbnb Listing Rating using Scalable Big Data Platform
Predictive Analysis for Airbnb Listing Rating using Scalable Big Data Platform
 
Analytics and Data Mining Industry Overview
Analytics and Data Mining Industry OverviewAnalytics and Data Mining Industry Overview
Analytics and Data Mining Industry Overview
 
The Evolution of Data Science
The Evolution of Data ScienceThe Evolution of Data Science
The Evolution of Data Science
 
Big Data and Data Science: The Technologies Shaping Our Lives
Big Data and Data Science: The Technologies Shaping Our LivesBig Data and Data Science: The Technologies Shaping Our Lives
Big Data and Data Science: The Technologies Shaping Our Lives
 
Big Data Science: Intro and Benefits
Big Data Science: Intro and BenefitsBig Data Science: Intro and Benefits
Big Data Science: Intro and Benefits
 
Predictive Analytics - Big Data & Artificial Intelligence
Predictive Analytics - Big Data & Artificial IntelligencePredictive Analytics - Big Data & Artificial Intelligence
Predictive Analytics - Big Data & Artificial Intelligence
 
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
 
Big Data and the Art of Data Science
Big Data and the Art of Data ScienceBig Data and the Art of Data Science
Big Data and the Art of Data Science
 
Introduction of Data Science
Introduction of Data ScienceIntroduction of Data Science
Introduction of Data Science
 
Intro to Data Science Big Data
Intro to Data Science Big DataIntro to Data Science Big Data
Intro to Data Science Big Data
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
#BigDataCanarias: "Big Data & Career Paths"
#BigDataCanarias: "Big Data & Career Paths"#BigDataCanarias: "Big Data & Career Paths"
#BigDataCanarias: "Big Data & Career Paths"
 
Introduction on Data Science
Introduction on Data ScienceIntroduction on Data Science
Introduction on Data Science
 
Data science
Data scienceData science
Data science
 
Data science e machine learning
Data science e machine learningData science e machine learning
Data science e machine learning
 
Are you ready for BIG DATA?
Are you ready for BIG DATA?Are you ready for BIG DATA?
Are you ready for BIG DATA?
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Analytics Education in the era of Big Data
Analytics Education in the era of Big DataAnalytics Education in the era of Big Data
Analytics Education in the era of Big Data
 

Similar to Big Data and Predictive Analysis

Comparing Scalable Predictive Analysis using Spark XGBoost Platforms
Comparing Scalable Predictive Analysis using Spark XGBoost PlatformsComparing Scalable Predictive Analysis using Spark XGBoost Platforms
Comparing Scalable Predictive Analysis using Spark XGBoost PlatformsJongwook Woo
 
Big Data and Data Intensive Computing on Networks
Big Data and Data Intensive Computing on NetworksBig Data and Data Intensive Computing on Networks
Big Data and Data Intensive Computing on NetworksJongwook Woo
 
Benefiting from Semantic AI along the data life cycle
Benefiting from Semantic AI along the data life cycleBenefiting from Semantic AI along the data life cycle
Benefiting from Semantic AI along the data life cycleMartin Kaltenböck
 
Ai open powermeetupmarch25th
Ai open powermeetupmarch25thAi open powermeetupmarch25th
Ai open powermeetupmarch25thIBM
 
A New Paradigm on Analytic-Driven Information and Automation V2.pdf
A New Paradigm on Analytic-Driven Information and Automation V2.pdfA New Paradigm on Analytic-Driven Information and Automation V2.pdf
A New Paradigm on Analytic-Driven Information and Automation V2.pdfArmyTrilidiaDevegaSK
 
Ai open powermeetupmarch25th
Ai open powermeetupmarch25thAi open powermeetupmarch25th
Ai open powermeetupmarch25thIBM
 
Ai open powermeetupmarch25th
Ai open powermeetupmarch25thAi open powermeetupmarch25th
Ai open powermeetupmarch25thIBM
 
Understanding the New World of Cognitive Computing
Understanding the New World of Cognitive ComputingUnderstanding the New World of Cognitive Computing
Understanding the New World of Cognitive ComputingDATAVERSITY
 
Big Data and Data Intensive Computing: Use Cases
Big Data and Data Intensive Computing: Use CasesBig Data and Data Intensive Computing: Use Cases
Big Data and Data Intensive Computing: Use CasesJongwook Woo
 
Hadoop for beginners free course ppt
Hadoop for beginners   free course pptHadoop for beginners   free course ppt
Hadoop for beginners free course pptNjain85
 
Big Data Trend and Open Data
Big Data Trend and Open DataBig Data Trend and Open Data
Big Data Trend and Open DataJongwook Woo
 
Big Data - A Real Life Revolution
Big Data - A Real Life RevolutionBig Data - A Real Life Revolution
Big Data - A Real Life RevolutionCapgemini
 
Big data analytics 1
Big data analytics 1Big data analytics 1
Big data analytics 1gauravsc36
 
Smart Data Module 6 d drive the future
Smart Data Module 6 d drive the futureSmart Data Module 6 d drive the future
Smart Data Module 6 d drive the futurecaniceconsulting
 
The book of elephant tattoo
The book of elephant tattooThe book of elephant tattoo
The book of elephant tattooMohamed Magdy
 
Platform for Big Data Analytics and Visual Analytics: CSIRO use cases. Februa...
Platform for Big Data Analytics and Visual Analytics: CSIRO use cases. Februa...Platform for Big Data Analytics and Visual Analytics: CSIRO use cases. Februa...
Platform for Big Data Analytics and Visual Analytics: CSIRO use cases. Februa...Tomasz Bednarz
 
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & ImpactData Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & ImpactDr. Sunil Kr. Pandey
 
AI is moving from its academic roots to the forefront of business and industry
AI is moving from its academic roots to the forefront of business and industryAI is moving from its academic roots to the forefront of business and industry
AI is moving from its academic roots to the forefront of business and industryDigital Transformation EXPO Event Series
 

Big Data and Predictive Analysis

  • 1. Jongwook Woo HiPIC CalStateLA IDEAS Live Webinar 2019 May 4 2019 Jongwook Woo, PhD, jwoo5@calstatela.edu Big Data AI Center (BigDAI) California State University Los Angeles Big Data and Predictive Analysis
  • 2. Big Data Artificial Intelligence Center (BigDAI) Jongwook Woo CalStateLA Contents: Myself; Introduction To Big Data; Big Data Predictive Analysis; Summary
  • 3. Myself. Experience: Since 2002, Professor at California State University Los Angeles – PhD in 2001: Computer Science and Engineering at USC
  • 4. Universities in Los Angeles (map labels: West, North)
  • 5. Universities in Los Angeles
  • 6. California State University Los Angeles
  • 7. Myself: S/W Development Lead http://www.mobygames.com/game/windows/matrix-online/credits
  • 8. Myself: Isaac Engineering, HDP, CDH, Oracle using Hadoop Big Data https://www.cloudera.com/more/customers/csula.html
  • 9. Myself: Partners for Services
  • 10. Myself: Collaborations
  • 11. Contents: Myself; Introduction To Big Data; Big Data Predictive Analysis; Summary
  • 12. Data Issues. Large-scale data: terabytes (10^12 bytes), petabytes (10^15 bytes) – because of the web – sensor data (IoT), bioinformatics, social computing, streaming data, smartphones, online games… Legacy approach: can do – improve the speed of the CPU, increase the storage size. Only problem – too expensive
  • 13. Data Handling: Traditional Way
  • 14. Data Handling: Traditional Way Becomes Too Expensive
  • 15. Data Issues. Cannot be handled with the legacy approach: too big; non-/semi-structured data; 3 Vs, 4 Vs,… – Velocity, Volume, Variety. Traditional systems can handle them – but again, too expensive. Need new systems: inexpensive
  • 16. Two Cores in Big Data: how to store Big Data; how to compute Big Data. Google: how to store Big Data – GFS – distributed systems on inexpensive commodity computers; how to compute Big Data – MapReduce – parallel computing with inexpensive computers. Own supercomputers. Published papers in 2003, 2004
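The MapReduce idea named on slide 16 can be sketched in a few lines. The following is a minimal single-process illustration in plain Python, not Google's or Hadoop's implementation: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are hypothetical, and a real framework would run the map and reduce steps on many machines in parallel.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce sequentially over the input lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: each string stands in for one block of a distributed file.
sample_lines = ["big data big compute", "data store"]
counts = word_count(sample_lines)
```

Because each call to `map_phase` depends only on its own line, and each call to `reduce_phase` only on its own key, both steps can be spread over inexpensive machines, which is the point the slide makes.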
  • 17. Data Handling: Another Way. Not expensive. From the 2017 Korean blockbuster movie “The Fortress” (남한산성)
  • 18. Data Handling: Another Way. But works well with crazily massive data sets. Battle of Nagashino, 1575, Japan
  • 19. Data Handling: Another Way. Needs resource management
  • 20. What is Hadoop? Apache Hadoop project split from Nutch in Jan 2006. Hadoop founder: Doug Cutting, Apache committer: Lucene, Nutch, …
  • 21. Super Computer vs Hadoop vs Cloud. “Parallel vs. Distributed file systems” by Michael Malak, updated by Jongwook Woo. Cluster for Store; Cluster for Compute/Store; Cluster for Compute. Cloud Computing adopts this architecture: with High Speed N/W
  • 22. Definition: Big Data. Inexpensive platform of distributed parallel systems that can store large-scale data and process it in parallel [1, 2]. Hadoop – inexpensive supercomputer – more accessible to the public than traditional supercomputers • You can store and process your applications – in your university labs, small companies, research centers. Others with storage and computing services – Spark • normally integrated into Hadoop with the Hadoop community – NoSQL DB (Cassandra, MongoDB, Redis, HBase,…) – ElasticSearch
  • 23. Big Data: Data Analysis & Visualization. Sentiment Map of AlphaGo: Positive, Negative
  • 24. K-Election 2017 (April 29 – May 9)
  • 25. Businesses popular within 5 miles of CalStateLA, USC, UCLA
  • 26. Jams and other traffic incidents reported by users in Dec 2017 – Jan 2018 (Dalyapraz Dauletbak)
  • 27. Contents: Myself; Introduction To Big Data; Big Data Predictive Analysis; Summary
  • 28. Big Data Analysis and Prediction. Big Data Analysis: Hadoop, Spark, NoSQL DB, SAP HANA, ElasticSearch,… Big Data for Data Analysis – how to store, compute, and analyze massive data sets? Big Data Science: how to predict future trends and patterns with massive data sets? => Machine Learning
  • 29. Spark. Limitations of MapReduce: hard to program in Java; batch processing – not interactive; disk storage for intermediate data – performance issue. Spark by UC Berkeley AMPLab: started by Matei Zaharia in 2009 and open sourced in 2010; in-memory storage for intermediate data; 20 ~ 100 times faster than MapReduce. Good for Machine Learning => Big Data Science – iterative algorithms
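To see why iterative algorithms benefit from in-memory data, consider a rough sketch of a typical one: logistic regression trained by gradient descent. This is plain Python, not Spark code, and the data, learning rate, and function names are hypothetical. The point is only that every iteration scans the same data set again, so a system that re-reads intermediate data from disk on each pass pays that cost hundreds of times.

```python
import math


def sigmoid(z):
    """Logistic function mapping a score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(points, labels, iterations=200, lr=0.5):
    """Fit a one-feature logistic regression by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(points)
    for _ in range(iterations):
        grad_w = grad_b = 0.0
        # Each iteration makes a full pass over the same in-memory data.
        for x, y in zip(points, labels):
            err = sigmoid(w * x + b) - y
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


# Hypothetical toy data: negative inputs are class 0, positive inputs class 1.
points = [-2.0, -1.0, 1.0, 2.0]
labels = [0, 0, 1, 1]
w, b = train_logistic(points, labels)
```

With 200 iterations the loop reads the data 200 times; keeping it in memory between passes is the design choice the slide credits for Spark's speed on machine learning workloads.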
  • 30. Spark (Cont’d). Spark ML: supports Machine Learning libraries; processes massive data sets to build prediction models
  • 31. Deep Learning. Machine Learning: has been popular since Google TensorFlow; multiple cores in a GPU – even with multiple GPUs and CPUs; parallel computing. GPU (Nvidia GTX 1660 Ti): 1280 CUDA cores. Deep Learning libraries: TensorFlow, PyTorch, Keras, Caffe, Caffe2, Microsoft Cognitive Toolkit (previously CNTK), Apache MXNet, Deeplearning4j, …
  • 32. From Neural Networks to Deep Learning. Deep learning – different types of architectures: Generative Adversarial Networks (GAN); Convolutional Neural Networks (CNN); Neural Networks (NN); Recurrent Neural Networks (RNN) & Long Short-Term Memory (LSTM). Ref: SAP Enterprise Deep Learning with TensorFlow (© 2017 SAP SE or an SAP affiliate company. All rights reserved. PUBLIC)
  • 33. Deep Learning. CNN: image recognition; video analysis; NLP for classification and prediction. RNN: time series prediction; speech recognition/synthesis; image/video captioning; text analysis – conversation, Q&A. GAN: media generation – photo-realistic images; human image synthesis: fake faces
  • 34. Deep Learning with Spark. What if we combine Deep Learning and Spark?
  • 35. Deep Learning with Spark. Deep Learning Pipelines for Apache Spark: Databricks. TensorFlowOnSpark: Yahoo! Inc. BigDL (Distributed Deep Learning Library for Apache Spark): Intel. DL4J (Deeplearning4j on Spark): Skymind. Distributed Deep Learning with Keras & Spark: Elephas
  • 36. Contents: Myself; Introduction To Big Data; Big Data Predictive Analysis: Use Case; Summary
  • 37. Big Data Artificial Intelligence Center (BigDAI) Jongwook Woo CalStateLA Use Case in Spark  “Predicting AD click fraud using Azure and Spark ML”, Accepted at The 14th Asia Pacific International Conference on Information Science and Technology (APIC-IST 2019), June 23-26 2019, Beijing, China – By Neha Gupta, Hai Anh Le, Maria Boldina  Machine Learning Distributed Parallel Computing – using Spark with Hadoop and Cloud Computing Not Deep Learning
  • 38. Big Data Artificial Intelligence Center (BigDAI) Jongwook Woo CalStateLA Ad Click Fraud A person, automated script or computer program imitates a legitimate user clicking on an ad without having an actual interest in the target of the ad's link resulting in misleading click data and wasted money Companies suffers from huge volumes of fraudulent traffic Especially, in mobile market in the world Goal Predict who will download the apps Using Classification model Traditional and Big Data approach
  • 39. Big Data Artificial Intelligence Center (BigDAI) Jongwook Woo CalStateLA Ad Click Fraud (Cont’d) TalkingData  China’s largest independent big data service platform – covers over 70% of active mobile devices nationwide  handles 3 billion clicks per day – 90% of which are potentially fraudulent  Goal of the Predictive Analysis  Predict whether a user will download an app – after clicking on a mobile app advertisement  To better target the audience, – to avoid fraudulent practices – and save money
  • 40. Data Set
      - Dataset: TalkingData AdTracking Fraud Detection
          - https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/data
      - Dataset properties
          - Original dataset size: 7GB, containing 200 million clicks over a 4-day period
          - Dataset format: csv
          - Fields: 8
          - Target column to predict: 'is_attributed'
  • 41. Data Set Details
  • 42. Experiment Environment: Traditional and Big Data Systems
  • 43. Experiment Environment: Traditional
      - Azure ML Studio: traditional system for small data sets
          - Free Workspace
          - 10GB storage
          - Single node
      - Implement fundamental prediction models
          - Using sample data: 80MB (1.1% of the original data set)
      - Select the best model among a number of classification algorithms
  • 44. Experiment Environment: Spark
      - Spark ML
          - Data filtering: 1 GB from 8 GB
              - Implemented Python code to reduce the size to 1GB (15%)
          - We have experimental results with 8GB as well, for another publication
      - Databricks Subscription
          - Cluster 4.0 (includes Apache Spark 2.3.0, Scala 2.11)
              - 2 Spark workers with a total of 16 GB memory and 4 cores
              - Python 2.7
              - File system: Databricks File System
  • 45. Experiment Environment: Spark (Cont'd)
      - Oracle Big Data Spark Cluster
          - Oracle BDCE: Python 2.7.x, Spark 2.1.x
          - 10 nodes: 20 OCPUs, 300GB memory, 1,154GB storage
  • 46. Work Flow in Azure ML
      - Relatively easy to build and test
          - Drag-and-drop GUI
      - Work flow
          1. Data Engineering
              - Understanding data
              - Data preparation
              - Balancing data statistically
          2. Data Science: Machine Learning (ML)
              - Model building and validation (classification algorithms)
              - Model evaluation
              - Model interpretation
  • 47. Data Engineering
      - Unbalanced dataset
          - 1: 0.19% (app downloaded)
          - 0: 99.81% (app not downloaded)
      - The 1GB filtered dataset is still too large for a traditional system such as Azure ML Studio
          - More sampling is needed for Azure ML
  • 48. Data Engineering
      - SMOTE: Synthetic Minority Over-sampling Technique
          - Takes a subset of data from the minority class and creates new, similar synthetic instances
          - Helps balance the data and avoid overfitting
          - Increased the share of the minority class (1) from 0.19% to 11%
      - Stratified Split
          - Ensures that the output dataset contains a representative sample of the values in the selected column
          - Ensures that the random sample does not contain only rows with 0s
          - 8% sample used = 80 MB
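The SMOTE idea on this slide can be illustrated with a minimal pure-Python sketch. It is a simplified illustration, not the Azure ML Studio SMOTE module: the function name `smote`, the parameters `n_new` and `k`, and the toy points are all assumptions made for this example.

```python
import random

def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority-class
    point and one of its k nearest minority-class neighbours (simplified SMOTE)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        neighbours = sorted(
            others, key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base))
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # random position on the segment base -> other
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic

# Hypothetical minority-class points with two numeric features.
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote(minority, n_new=8)
print(len(new_points), "synthetic minority rows")
```

Each synthetic row lies on a line segment between two real minority rows, so the new rows resemble the existing minority class rather than duplicating it.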
  • 49. Algorithms in Azure ML Studio
      - Two-Class Classification
          - Classifies the elements of a given set into two groups: downloaded, is_attributed (1), or not downloaded, is_attributed (0)
      - Decision trees
          - Often perform well on imbalanced datasets, as their hierarchical structure allows them to learn signals from both classes
      - Tree ensembles almost always outperform single decision trees
          - Algorithm #1: Two-class Decision Jungle
          - Algorithm #2: Two-class Decision Forest
  • 50. Selecting Performance Metrics
      - False Positives (FP) indicate that the model predicted an app was downloaded when in fact it was not
      - Goal: minimize FP => to save $$$
  • 51. AZURE ML MODEL #1: TWO-CLASS DECISION JUNGLE
      - 8% Sample
      - SMOTE 5000%
      - 70:30 Split Train/Test
      - Cross-Validation
      - Tune Model Hyperparameters
      - Features used: all 7
  • 52. AZURE ML MODEL #1: Tune Model Hyperparameters
      - Panels: without Tune Model Hyperparameters; with Tune Model Hyperparameters
      - AUC = 0.905 vs 0.606
      - Precision = 1.0
      - TP = 35, FP = 0
  • 53. AZURE ML MODEL #2: TWO-CLASS DECISION FOREST
      - 8% Sample
      - SMOTE 5000%
      - 70:30 Split Train/Test
      - Cross-Validation
      - Tune Model Hyperparameters
      - Permutation Feature Importance
  • 54. AZURE ML MODEL #2: Improving Precision
      - By increasing the threshold from 0.5 to 0.8:
          - Precision increased to 0.992
          - FP decreased from 1,659 to 377
          - FN increased from 1,834 to 5,142
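The trade-off on this slide (fewer false positives, more false negatives as the threshold rises) can be reproduced on a toy example. The scores and labels below are made up for illustration and are not taken from the experiment; `confusion_counts` is a hypothetical helper.

```python
def confusion_counts(scores, labels, threshold):
    """Count TP, FP, TN, FN when predicting 1 for scores >= threshold."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

# Hypothetical model scores and true labels (1 = downloaded).
scores = [0.95, 0.85, 0.70, 0.60, 0.55, 0.40, 0.30, 0.10]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

for threshold in (0.5, 0.8):
    tp, fp, tn, fn = confusion_counts(scores, labels, threshold)
    print(f"threshold={threshold}: TP={tp} FP={fp} TN={tn} FN={fn} "
          f"precision={tp / (tp + fp):.2f} recall={tp / (tp + fn):.2f}")
```

On this toy data, moving the threshold from 0.5 to 0.8 removes the single false positive but turns two true positives into false negatives, which is the same direction of change the slide reports.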
  • 55. Experimental Results in Azure ML Studio
      - Performance: execution time with the 1GB sample data set
          - Decision Forest: 2.5 hours
          - Decision Jungle: 3 hours 19 min
      - The Azure ML Studio models are a good guide for adopting the 2 similar algorithms in Spark ML
          - Decision Tree
          - Random Forest
  • 56. Experimental Results in Azure ML
      - Two-class Decision Forest is the best model!
  • 57. Experiment with Spark ML in Databricks
      1. Load the data source
          - 1.03 GB
          - Same filtered data set as Azure ML
      2. Train and build the models
          - Balanced data statistically
      3. Evaluate
  • 58. Data Engineering
      - Generate features
          - Feature 1: extract day of the week and hour of the day from the click time
          - Feature 2: group clicks by combination of (Ip, Day_of_week_number and Hour)
          - Feature 3: group clicks by combination of (Ip, App, Operating System, Day_of_week_number and Hour)
          - Feature 4: group clicks by combination of (App, Day_of_week_number and Hour)
          - Feature 5: group clicks by combination of (Ip, App, Device and Operating System)
          - Feature 6: group clicks by combination of (Ip, Device and Operating System)
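The grouped click-count features listed above can be sketched in plain Python. This is only an illustration of the idea, not the authors' Spark code: the helper names, the sample rows, and the assumed `click_time` format (`YYYY-MM-DD HH:MM:SS`) are assumptions for this example.

```python
from collections import Counter
from datetime import datetime

# Hypothetical click rows using a subset of the dataset's field names.
clicks = [
    {"ip": 1, "app": 3, "device": 1, "os": 13, "click_time": "2017-11-07 09:30:38"},
    {"ip": 1, "app": 3, "device": 1, "os": 13, "click_time": "2017-11-07 09:45:02"},
    {"ip": 1, "app": 9, "device": 1, "os": 13, "click_time": "2017-11-07 10:01:10"},
    {"ip": 2, "app": 3, "device": 2, "os": 19, "click_time": "2017-11-07 09:12:55"},
]

def add_time_features(click):
    """Feature 1: day of week (Monday = 0) and hour of day from the click time."""
    ts = datetime.strptime(click["click_time"], "%Y-%m-%d %H:%M:%S")
    return {**click, "day_of_week": ts.weekday(), "hour": ts.hour}

def add_group_count(rows, keys, name):
    """Attach to every row the number of clicks sharing the same values for keys."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return [{**row, name: counts[tuple(row[k] for k in keys)]} for row in rows]

rows = [add_time_features(c) for c in clicks]
# Feature 2: clicks per (ip, day of week, hour).
rows = add_group_count(rows, ("ip", "day_of_week", "hour"), "clicks_ip_day_hour")
# Feature 4: clicks per (app, day of week, hour).
rows = add_group_count(rows, ("app", "day_of_week", "hour"), "clicks_app_day_hour")

for row in rows:
    print(row["ip"], row["app"], row["hour"],
          row["clicks_ip_day_hour"], row["clicks_app_day_hour"])
```

The remaining features follow the same pattern with a different key tuple; in Spark the equivalent would be a group-by count joined back onto the click rows.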
  • 59. Spark ML MODEL #1: Decision Tree Classifier
      - Confusion Matrix
  • 60. Spark ML MODEL #2: Random Forest Classifier
      - Confusion Matrix
  • 61. Spark ML Result Comparison
      - Decision Tree Classifier is relatively the better model!
      Metric      Decision Tree Classifier   Random Forest Classifier
      AUC         0.815                      0.746
      PRECISION   0.822                      0.878
      RECALL      0.633                      0.495
      TP          86,683                     67,726
      FP          18,727                     9,408
      TN          7,112,961                  7,122,280
      FN          50,074                     69,031
      RMSE        0.0972                     0.1038
  • 62. Experiment in Oracle Cluster
      - Oracle Big Data Spark Cluster: 10 nodes, 20 OCPUs, 300GB memory, 1,154GB storage
      1. Load the data source
          - 1.03 GB
      2. Sample the balanced data based on Downloaded
          - 116 MB
      3. Train and build the models
          - Balanced data statistically
      4. Evaluate
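One common way to obtain a balanced sample like the one in step 2 is to keep every positive row and randomly downsample the negatives. The slides do not state the exact method or class ratio used, so the sketch below is an assumption: the function name, the `ratio` parameter, and the toy data are hypothetical.

```python
import random

def balance_by_downsampling(rows, label_key="is_attributed", ratio=1.0, seed=42):
    """Keep all positive rows and a random sample of negatives.

    ratio is the number of negatives kept per positive (an assumed knob,
    not a value taken from the slides).
    """
    rng = random.Random(seed)
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    n_keep = min(len(negatives), int(len(positives) * ratio))
    return positives + rng.sample(negatives, n_keep)

# Toy data: 10,000 clicks with 0.2% downloads, roughly as skewed as the real set.
toy_rows = [{"click_id": i, "is_attributed": 1 if i % 500 == 0 else 0}
            for i in range(10_000)]
balanced = balance_by_downsampling(toy_rows, ratio=4)
downloads = sum(r["is_attributed"] for r in balanced)
print(len(balanced), "rows kept,", downloads, "of them downloads")
```

Downsampling shrinks the training data considerably, which is consistent with the drop from 1.03 GB to 116 MB on this slide, though the slide does not confirm that this is how the sample was drawn.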
  • 63. Azure ML Studio and Spark ML Result Comparison
      Metric     Decision Jungle  Decision Forest  Decision Tree  Random Forest  Decision Tree       Random Forest
                 (Azure ML)       (Azure ML)       (Databricks)   (Databricks)   (Oracle, balanced)  (Oracle, balanced)
      AUC        0.905            0.997            0.815          0.746          0.896               0.893
      PRECISION  1.0              0.992            0.822          0.878          0.935               0.934
      RECALL     0.001            0.902            0.633          0.495          0.807               0.800
      TP         35               47,199           86,683         67,726         111,187             110,220
      FP         0                377              18,727         9,408          7,712               7,791
      TN         406,605          406,228          7,112,961      7,122,280      545,302             545,223
      FN         52,306           5,142            50,074         69,031         26,604              27,571
      Run Time   2 hrs            2-3 hrs          22 mins        50 mins        24 sec              2 mins
  • 64. Azure ML Studio and Spark ML Result Comparison (same table as slide 63)
      - Azure ML Two-class Decision Forest is the best model!
      - Spark ML code needs to be updated for better accuracy
      - Balanced sampling based on the fraud in Oracle:
          - Decision Tree has a precision of 0.935
          - Execution time: 24 secs
  • 65. Contents
      - Myself
      - Introduction To Big Data
      - Big Data Predictive Analysis
      - Summary
  • 66. Summary
      - Introduction to Big Data
      - Ad click prediction models in traditional and Big Data systems
      - Azure ML Studio shows the best accuracy with the Two-Class Decision Forest model
      - Spark ML is 3.5 to 7 times faster than Azure ML Studio with the 1 GB data set, but less accurate
          - With a 2-node Spark cluster
      - Balanced sample data in Oracle comes close in accuracy to the traditional system while running 300 times faster
          - With a 10-node Spark cluster
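The speed-up figures in the summary are ratios of run times. As a worked example, the sketch below computes such ratios from the run times reported in the result comparison table (2 to 3 hours for Azure ML Studio, 22 and 50 minutes on Databricks, 24 seconds and 2 minutes on Oracle). Which pairs of runs the authors actually compared is not stated, so the pairings here are assumptions.

```python
def speedup(baseline_seconds, new_seconds):
    """How many times faster the new run is than the baseline run."""
    return baseline_seconds / new_seconds

HOUR, MINUTE = 3600, 60

# Run times as reported in the comparison table (Azure values are approximate).
azure_runs = {"2 hrs": 2 * HOUR, "3 hrs": 3 * HOUR}
spark_runs = {
    "Decision Tree, Databricks (22 mins)": 22 * MINUTE,
    "Random Forest, Databricks (50 mins)": 50 * MINUTE,
    "Decision Tree, Oracle balanced (24 sec)": 24,
    "Random Forest, Oracle balanced (2 mins)": 2 * MINUTE,
}

for azure_label, azure_seconds in azure_runs.items():
    for spark_label, spark_seconds in spark_runs.items():
        ratio = speedup(azure_seconds, spark_seconds)
        print(f"Azure {azure_label} vs {spark_label}: {ratio:.1f}x")
```

For instance, a 2-hour Azure ML Studio run against the 24-second Oracle Decision Tree run gives 7200 / 24 = 300, matching the "300 times faster" figure.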
  • 67. Questions?
  • 68. Data Set Details (Cont'd)
  • 69. Precision vs Recall
      - Positive: event occurs (fraud); Negative: event does not occur (non-fraud)
      - True Positive (TP): Fraud? Yes, it is
      - False Negative (FN): No fraud? But it is
      - False Positive (FP): Fraud? But it is not
      - Precision = TP / (TP + FP)
      - Recall = TP / (TP + FN)
      - Ref: https://en.wikipedia.org/wiki/Precision_and_recall
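The two formulas above can be checked against the confusion-matrix counts reported for the Spark ML Decision Tree Classifier on Databricks (TP = 86,683, FP = 18,727, FN = 50,074). The function names are chosen for this sketch.

```python
def precision(tp, fp):
    """Share of predicted positives that are truly positive: TP / (TP + FP)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Share of actual positives that the model found: TP / (TP + FN)."""
    return tp / (tp + fn)

# Counts reported for the Decision Tree Classifier on Databricks.
tp, fp, fn = 86_683, 18_727, 50_074

dt_precision = precision(tp, fp)
dt_recall = recall(tp, fn)
print(f"precision = {dt_precision:.4f}")
print(f"recall    = {dt_recall:.4f}")
```

The computed values agree with the reported 0.822 precision and 0.633 recall to within rounding in the third decimal place.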
  • 70. References
      1. Priyanka Purushu, Niklas Melcher, Bhagyashree Bhagwat, Jongwook Woo, "Predictive Analysis of Financial Fraud Detection using Azure and Spark ML", Asia Pacific Journal of Information Systems (APJIS), Vol. 28, No. 4, December 2018, pp. 308-319
      2. Jongwook Woo, DMKD-00150, "Market Basket Analysis Algorithms with MapReduce", Wiley Interdisciplinary Reviews Data Mining and Knowledge Discovery, Oct 28 2013, Volume 3, Issue 6, pp. 445-452, ISSN 1942-4795
      3. Jongwook Woo, "Big Data Trend and Open Data", UKC 2016, Dallas, TX, Aug 12 2016
      4. How to choose algorithms for Microsoft Azure Machine Learning, https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-algorithm-choice
      5. Manik Katyal, Parag Chhadva, Shubhra Wahi & Jongwook Woo, "Big Data Analysis using Spark for Collision Rate Near CalStateLA", https://globaljournals.org/GJCST_Volume16/1-Big-Data-Analysis-using-Spark.pdf
      6. Spark Programming Guide, http://spark.apache.org/docs/latest/programming-guide.html
      7. TensorFrames: Google Tensorflow on Apache Spark, https://www.slideshare.net/databricks/tensorframes-google-tensorflow-on-apache-spark
      8. Deep learning and Apache Spark, https://www.slideshare.net/QuantUniversity/deep-learning-and-apache-spark
  • 71. References
      9. Which Is Deeper - Comparison Of Deep Learning Frameworks On Spark, https://www.slideshare.net/SparkSummit/which-is-deeper-comparison-of-deep-learning-frameworks-on-spark
      10. Accelerating Machine Learning and Deep Learning At Scale with Apache Spark, https://www.slideshare.net/SparkSummit/accelerating-machine-learning-and-deep-learning-at-scalewith-apache-spark-keynote-by-ziya-ma
      11. Deep Learning with Apache Spark and TensorFlow, https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html
      12. Tensor Flow Deep Learning Open SAP
      13. Overview of Smart Factory, https://www.slideshare.net/BrendanSheppard1/overview-of-smart-factory-solutions-68137094/6
      14. https://dzone.com/articles/sqoop-import-data-from-mysql-tohive
      15. https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/data
      16. https://blogs.msdn.microsoft.com/andreasderuiter/2015/02/09/performance-measures-in-azure-ml-accuracy-precision-recall-and-f1-score/