ECHELON
ASIA SUMMIT 2017
STARTUP ACADEMY
[WORKSHOP]
INTRODUCTION TO
DATA SCIENCE
29th June 2017
Garrett Teoh Hor Keong
OPENING
PROGRAM FLOW
1. Data Science Fundamentals (10 min)
2. Exploratory Data Analysis (25 min)
3. Building Machine Learning & AI (10 min)
4. Evaluating Algorithms & Models (20 min)
5. Visualizing Data & Storytelling (20 min)
6. Questions & Answers (5 min)
DATA SCIENCE
FUNDAMENTALS
STAGES OF DATA SCIENCE
What has happened?
What will happen?
What should happen?
Data Collection, Machine Learning, Cognitive
Actionable Insights! Visualizations / Storytelling
Exploratory Data Analysis
Classifications
CROSS INDUSTRY STANDARD PROCESS – DATA MINING
1. Business Understanding
2. Collect & Understand Data
3. Data Prep & Cleansing
4. Build AI & Models
5. Evaluate Models
6. Deploy & Productionalize
Data Lake
Local vs Cloud?
What has happened?
What will happen?
What should happen?
DOMAINS OF DATA SCIENCE
Supervised Learning
- Species Classifications
- HR Churn
- Sales Conversion
- Performance Ranking
Unsupervised Learning
- Credit Card Fraud
- Procurement Fraud
- Preventive Maintenance
Imaging & Recognition
- Facial Recognition
- Product Categories
- Healthcare Imaging
Operations Research
- Optimizing Costs vs Revenue (HR Planning)
- Optimizing Costs for Machines, Pipes to Gas Stations (Revenue)
Recommend Engine
- Collaborative Filtering
- Cross-Sell Products
TOOLS FOR DATA SCIENCE
DRIVING TOWARDS DIGITAL TRANSFORMATION
• Data Scientists (Building Models, Evaluation)
• Data Analysts (Visualizations, Reports, EDA)
• Data Engineers (Data Lake, Deployment, ETL)
• IT Developers (Deployment, Data Collection)
• Internal (Employees, Accounts, Audit Logs, Marketing)
• External (Sales, Customer Behaviours, Measurements)
• Public (Census, Info sites, Facebook, Twitter, News & Media)
• Data Aggregator Companies
• Data Storage
• Data Processing & ETLs
• Data Access & Governance
• Computational Resources
• Real Time Processing
• Visualization Tools
• Data Modelling Tools
• Deployment Tools
EXPLORATORY
DATA ANALYSIS
ADULT CENSUS INCOME DATASET – BACKGROUND
This data was extracted from the 1994 Census bureau database by Ronny Kohavi and Barry Becker (Data Mining
and Visualization, Silicon Graphics). The prediction task is to determine whether a person makes over $50K a year.
Link to data: https://goo.gl/qE7TPf (adult.csv.zip)
ADULT CENSUS INCOME DATASET – UNDERSTANDING
Link to data description: https://www.kaggle.com/uciml/adult-census-income
Response (Binary)
Features or Predictors (14)
Data Types: Integer, Continuous, Binary, Date/Time, Ordinal, Categorical, Text
PREPARING & CLEANING UP THE DATASET
Explore how to use an Excel workbook (.xlsx) to prepare and clean up the Adult Census Income dataset.
Step 1 • Convert the raw data from .csv format to .xlsx format using “Save As…”.
Step 2 • Click on “Sort & Filter” to examine data types and categories.
Step 3 • Identify blanks, missing data, or irrelevant data.
Step 4 • Alternatively, use “pivot tables” and “charts” to identify distributions and categorical counts. Select all data using Ctrl+Shift+arrow keys -> click on insert pivot tables -> new worksheet.
Step 5 • Create a derived binary response (using the “IF” function to return 0 or 1).
Step 6 • Use “VLOOKUP” to replace blanks, missing, or irrelevant data.
Step 7 • Insert a “combo clustered” 2-D chart using the data in the pivot table to examine how the response relates to each feature.
Step 8 • Remove features with a high % of missing data.
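For readers who prefer code to a spreadsheet, Steps 3, 5, 6 and 8 can be sketched in plain Python. This is only an illustration under stated assumptions: the sample rows are made up, the "?" missing-value marker and the "income" column with ">50K" labels are assumed from the public dataset, and the 50% threshold for dropping a feature is a hypothetical choice.

```python
from collections import Counter

# Hypothetical sample rows; the real file has 14 features plus the response.
rows = [
    {"age": "39", "workclass": "Private", "occupation": "?", "income": "<=50K"},
    {"age": "50", "workclass": "?", "occupation": "?", "income": ">50K"},
    {"age": "38", "workclass": "Private", "occupation": "Sales", "income": "<=50K"},
    {"age": "53", "workclass": "Private", "occupation": "?", "income": ">50K"},
]
MISSING = "?"  # assumed missing-value marker


def missing_share(rows, column):
    """Step 3: fraction of rows whose value in `column` is missing."""
    return sum(r[column] == MISSING for r in rows) / len(rows)


def clean(rows, max_missing=0.5):
    """Apply Steps 8, 6 and 5 and return new row dictionaries."""
    columns = list(rows[0])
    # Step 8: keep only features whose missing share is acceptable.
    keep = [c for c in columns if missing_share(rows, c) <= max_missing]
    # Step 6: fill remaining gaps with the most common value of the column.
    fill = {}
    for c in keep:
        counts = Counter(r[c] for r in rows if r[c] != MISSING)
        fill[c] = counts.most_common(1)[0][0]
    cleaned = []
    for r in rows:
        new = {c: (fill[c] if r[c] == MISSING else r[c]) for c in keep}
        # Step 5: derived binary response, like =IF(income=">50K",1,0).
        new["target"] = 1 if r["income"] == ">50K" else 0
        cleaned.append(new)
    return cleaned


cleaned = clean(rows)
for row in cleaned:
    print(row)
```

Filling with the most common category is one simple choice; a lookup table of replacement values, as VLOOKUP would use, works equally well.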
NUMERICAL FEATURES DISTRIBUTION & RECODING
Some numerical (continuous or integer) features might be only slightly correlated with the response, so it is important to identify the trends in these features and recode them as necessary.
Step 1 • Examine the correlation of the continuous feature with the response, or use parametric (Student’s t-test) / non-parametric (Wilcoxon rank-sum) tests.
Step 2 • Observe the histogram of the continuous feature against the response by making a “combo clustered” chart or a “scatter plot”.
Step 3 • Identify highly correlated segments and recode the feature.
Mean Age (target=0): 37
Mean Age (target=1): 44
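A rough sketch of Steps 1 and 3 in Python: compare the mean age of the two response groups with Welch's t statistic (a variant of the t-test that does not assume equal variances), then recode age into groups. The sample ages and the bin edges are hypothetical, chosen only so the recoded feature has six groups (0 to 5).

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical (age, target) pairs standing in for the real dataset.
samples = [(25, 0), (31, 0), (38, 0), (44, 1), (52, 1),
           (29, 0), (47, 1), (61, 0), (36, 1), (23, 0)]


def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def age_group(age, edges=(25, 35, 45, 55, 65)):
    """Recode age into a group index: the number of edges it has reached."""
    return sum(age >= e for e in edges)


high = [age for age, target in samples if target == 1]
low = [age for age, target in samples if target == 0]
print("mean age (target=1):", mean(high))
print("mean age (target=0):", mean(low))
print("Welch t statistic:", round(welch_t(high, low), 3))
print("age groups:", [age_group(age) for age, _ in samples])
```

A large positive statistic suggests the high-income group is older on average; turning it into a p-value needs a t distribution, which a statistics package would provide.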
ADULT CENSUS INCOME DATASET – EDA PRACTICE
The cleaned data can be downloaded from https://goo.gl/qE7TPf (cleaned-adult.zip)
EXPLORATORY DATA ANALYSIS – CORRELATION PLOT
Relationship      Female    Male      Grand Total
Husband            0.01%    99.99%    100.00%
Wife              99.87%     0.13%    100.00%
Not-in-family     46.66%    53.34%    100.00%
Other-relative    43.83%    56.17%    100.00%
Own-child         44.30%    55.70%    100.00%
Unmarried         77.02%    22.98%    100.00%
Grand Total       33.08%    66.92%    100.00%
[Chart: Relationship vs Gender; series: Female (%), Male (%); categories: Husband, Wife, Not-in-family, Other-relative, Own-child, Unmarried]
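The row percentages in a pivot table like the one above can be computed with a small helper. This is a minimal sketch: the function name and the sample pairs are hypothetical and do not reproduce the census counts.

```python
from collections import Counter


def row_percentages(pairs):
    """Return {row: {column: percent of that row}} for (row, column) pairs."""
    counts = Counter(pairs)
    row_totals = Counter()
    for (row, _col), n in counts.items():
        row_totals[row] += n
    table = {}
    for (row, col), n in counts.items():
        table.setdefault(row, {})[col] = 100.0 * n / row_totals[row]
    return table


# Hypothetical (relationship, gender) pairs.
pairs = ([("Husband", "Male")] * 3
         + [("Wife", "Female")] * 2
         + [("Unmarried", "Female")] * 3
         + [("Unmarried", "Male")])
table = row_percentages(pairs)
for relationship, shares in table.items():
    print(relationship, shares)
```

When one category is almost fully determined by another, as Husband and Wife are by gender, the two features carry overlapping information.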
EXPLORATORY DATA ANALYSIS SUMMARY
Executive Summary (What has happened?)
[Chart: Age Group vs High Income; x-axis: age group (0 to 5); series: counts, high income (%)]
[Chart: Education num vs High Income; x-axis: education num (1 to 16); series: count, high income (%)]
[Chart: Relationships vs High Income; series: count, high income (%)]
Overall High Income 24%
EXPLORATORY DATA ANALYSIS SUMMARY
Executive Summary (What has happened?)
Higher-income earners tend to have significantly higher capital gains/losses.
Both of these features might improve prediction modelling performance.
BUILDING
MACHINE
LEARNING &
AI
MACHINE LEARNING ALGORITHMS – UNSUPERVISED
• You do not know what you don’t know.
• All data is unlabelled and the algorithms learn the inherent structure from the input data.
• You only have input data (X) and no corresponding output variables.
Telco Fraud
• How does a fraudster do it?
• When will it happen?
• How to differentiate them?
• Where are the anomalies?
People Management
• Who is the top performer?
• What are the metrics?
• Who to award a promotion?
• Where do they stand out?
Product Cross/Up Sell
• Who will need those products?
• What is inside their shopping carts?
• Which products to market?
• How to package products?
MACHINE LEARNING ALGORITHMS – UNSUPERVISED
CLUSTERING
- Hierarchical Clustering
- K-Means
- Kernel Density
- Discriminant Analysis
- Isolation Forest
- One-Class SVM
ASSOCIATIONS
- Apriori
- Eclat
- FP-Growth
- Context Based
MACHINE LEARNING ALGORITHMS – SUPERVISED
• You know what you do not know.
• All data is labelled and the algorithms learn to predict the output from the input data
• You have input variables (X) and an output variable (Y).
Leads Conversion
• How will a lead convert?
• What features or properties are important?
• How to deal with leads with marginal probability?
Financing
• Who is a good borrower?
• Who will default on a loan?
• Rules or patterns to differentiate them?
• How to interpret probabilities of default?
Property Sales
• What is the best price?
• What features affect sale price?
• Does price affect sale probability?
• Optimizing time, price, and ability to close a sale?
MACHINE LEARNING ALGORITHMS – SUPERVISED
CLASSIFICATIONS
- Decision Tree, Random Forest
- eXtreme Gradient BOOSTing (XGBOOST)
- Gradient Boosted Trees
- Generalised Linear Model
- Logistic Regression
- Neural Networks
- Support Vector Machine (SVM)
- K Nearest Neighbour (KNN), K Means
REGRESSIONS
- eXtreme Gradient BOOSTing (XGBOOST)
- Linear Gradient Boosted
- Generalised Linear Model
- Lasso, Ridge Regression
- Elastic Net
- Least Angle Regression (LARS)
- Neural Networks
TOOLS & RESOURCES CONSIDERATIONS
• Near real-time updates and monitoring (e.g. Pricing Analysis, Recommendation Engine, Threat/Fraud Detection, Preventive Maintenance).
• Periodic updates (People Analysis, Marketing Response Prediction, Sales Forecast, Cancer/Disease Risk).
• Predict-on-demand (Credit Risk/Scoring, Leads Conversion).
• Storage:
  • Hadoop Distributed File System (HDFS), traditional RDBMS, AWS Redshift, AWS RDS/S3 instance, HBase.
• Architecture:
  • Apache Spark (Near Real Time Analytics) e.g. SparkR, PySpark, H2O.
  • HDInsight, Hortonworks, Spring XD
• Computational:
  • Computational power: number of CPU cores, GPUs, RAM
ADULT CENSUS INCOME PREDICTIONS
70% of the data is used for training a model.
The remaining 30% is used as ‘hold-out’ samples for the trained model’s predictions.
Predictions are generated by the XGBoost algorithm, using Gradient Boosted Trees.
Training time: < 10 seconds on an Acer Inspire v 15 notebook, Intel Core i7, 12GB RAM
1000 iterations
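The 70/30 hold-out split can be sketched as below. This covers only the split, not the XGBoost training itself; the function name, the fixed seed, and the use of integers as stand-in rows are assumptions for illustration.

```python
import random


def train_test_split(rows, train_share=0.7, seed=42):
    """Shuffle a copy of the rows and cut it into train and hold-out parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(round(len(shuffled) * train_share))
    return shuffled[:cut], shuffled[cut:]


rows = list(range(100))  # stand-in for 100 dataset rows
train, holdout = train_test_split(rows)
print(len(train), "training rows,", len(holdout), "hold-out rows")
```

Shuffling before cutting matters when the file is sorted by some column; otherwise the hold-out sample would not resemble the training data.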
EVALUATING
ALGORITHMS &
MODELS
TYPES OF ML MODEL EVALUATION METRICS
• Validating the prediction model against known outcomes/labels.
• For “unsupervised” methods, the model is evaluated only by the distance from the “known” cluster centroids.
• RMSE (Root Mean Square Error)
• RMSLE (Root Mean Squared Logarithmic Error)
• MAE (Mean Absolute Error)
  • Measure how close the forecasts or predictions are to the eventual outcomes.
  • More suited to regression models.
  • Range (0 - ∞)
• LogLoss (Logarithmic Loss)
• MAP@n (Mean Average Precision @n Classes)
• MLogLoss (Multi Class Logarithmic Loss)
• Hamming Loss
  • More suited to classification models.
  • Range (0 - ∞)
• AUC (Area Under ROC Curve)
  • Most commonly used evaluation for binary classification prediction models.
  • Range: 0.5 (no better than random) ~ 1.0 (perfect ranking)
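To make a few of these metrics concrete, here is a minimal pure-Python sketch of RMSE, MAE, LogLoss and AUC. The function names and sample values are illustrative assumptions; AUC is computed by its pairwise definition (the share of positive/negative pairs ranked correctly), which is simple but slow on large data.

```python
from math import log, sqrt


def rmse(actual, predicted):
    """Root mean square error."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def log_loss(labels, probs, eps=1e-15):
    """Binary logarithmic loss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)
        total += y * log(p) + (1 - y) * log(1 - p)
    return -total / len(labels)


def auc(labels, scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print("RMSE:", rmse([3.0, 5.0], [2.0, 7.0]))
print("MAE:", mae([3.0, 5.0], [2.0, 7.0]))
print("LogLoss:", log_loss(labels, scores))
print("AUC:", auc(labels, scores))
```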
BINARY CLASSIFICATION MODEL EVALUATION
• Gini Lift and Decile Charts
• Rank the predictions and examine how much ‘lift’ the model provides over the NULL model.
• Kolmogorov Smirnov Chart
• Examine how well the model differentiates between the 2 classes.
• Confusion Matrix
• Commonly used in the medical domain to assess the sensitivity vs specificity of tests.
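As one illustration of the Kolmogorov Smirnov idea, the sketch below ranks predictions from highest to lowest score and records the largest gap between the cumulative share of positives and of negatives. The function name and sample data are hypothetical; tied scores are grouped so the gap is measured only at distinct score values.

```python
def ks_statistic(labels, scores):
    """Largest gap between cumulative positive and negative shares."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), reverse=True)
    cum_pos = cum_neg = 0
    best = 0.0
    i = 0
    while i < len(ranked):
        j = i
        # Consume every row sharing this score before measuring the gap.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            cum_pos += ranked[j][1]
            cum_neg += 1 - ranked[j][1]
            j += 1
        best = max(best, abs(cum_pos / n_pos - cum_neg / n_neg))
        i = j
    return best


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print("KS:", ks_statistic(labels, scores))
```

A value near 1 means the scores separate the two classes almost completely; a value near 0 means the model ranks them no better than chance.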
AREA UNDER ROC CURVE
Probability >= 0.5: predict response as positive; else, negative.
Confusion Matrix
                  Target Positive   Target Negative
Model Positive    1539              368               Positive Pred Rate 0.8070
Model Negative    839               7022              Negative Pred Rate 0.8933
Sensitivity 0.6472                  Specificity 0.9502
Accuracy 87.643%
Sensitivity = 64.7%
1-Specificity = 5.0%
Sensitivity = True Positive Rate
1-Specificity = False Positive Rate
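The rates on this slide follow directly from the four cells of the confusion matrix. A short sketch, using the slide's own counts (the function and key names are illustrative):

```python
def confusion_metrics(tp, fp, fn, tn):
    """Standard rates derived from a 2x2 confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),                # true positive rate
        "specificity": tn / (tn + fp),                # true negative rate
        "positive_pred_rate": tp / (tp + fp),         # share of predicted positives that are right
        "negative_pred_rate": tn / (tn + fn),         # share of predicted negatives that are right
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Counts from the slide: rows are model predictions, columns are true targets.
metrics = confusion_metrics(tp=1539, fp=368, fn=839, tn=7022)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Moving the 0.5 probability cut-off trades sensitivity against specificity; plotting that trade-off at every cut-off gives the ROC curve.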
VISUALIZING
DATA &
STORYTELLING
THE BIG PICTURE – PUTTING IT TOGETHER
[Chart: Age vs Income; x-axis: age (17 to 87) for target = 0 and target = 1; y-axis: total count (0 to 1000)]
Mean Age (target=0): 37
Mean Age (target=1): 44
USING COMBINATION OF CHARTS
[Chart: Capital Gain vs High Income (%); x-axis: capital gain (114 to 34095); series: Count, High Income (%)]
MAXIMIZING ROI ON MARKETING RESPONSE
• Assumptions:
1. Average loan amount $10,000
2. Interest return at 10%
3. Default rate at 5%
4. Marketing costs 20% of average revenue
5. Simple mechanics of how financing works
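One possible reading of these assumptions as arithmetic is sketched below. The mechanics are an assumption, not taken from the slides: a repaid loan earns the full interest, a defaulted loan loses the whole principal, and the marketing cost is charged once per funded loan. The function and parameter names are hypothetical.

```python
def expected_profit_per_loan(loan=10_000, interest=0.10,
                             default_rate=0.05, marketing_share=0.20):
    """Expected profit on one funded loan under the assumed mechanics."""
    revenue = loan * interest                   # interest earned if the loan is repaid
    marketing_cost = marketing_share * revenue  # assumption 4, charged per funded loan
    # Assumed: a default loses the full principal and earns no interest.
    expected_return = (1 - default_rate) * revenue - default_rate * loan
    return expected_return - marketing_cost


print("expected profit per loan:", expected_profit_per_loan())
print("at 2% default rate:", expected_profit_per_loan(default_rate=0.02))
```

Under these assumed mechanics the default rate dominates the result, which is why a model that screens out likely defaulters can raise the return on marketing spend.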
QUESTIONS
& ANSWERS
THANK YOU
ECHELON ASIA SUMMIT 2017
Garrett Teoh Hor Keong
Chief Data Officer, Renotalk Pte Ltd
LinkedIn: garrettteoh
Email: rtgteoh@renotalk.com

 
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
 

Echelon Asia Summit 2017 Startup Academy Workshop

  • 1. ECHELON ASIA SUMMIT 2017 STARTUP ACADEMY [WORKSHOP] INTRODUCTION TO DATA SCIENCE 29th June 2017 Garrett Teoh Hor Keong
  • 3. PROGRAM FLOW 1. Data Science Fundamentals (10 min) 2. Exploratory Data Analysis (25 min) 3. Building Machine Learning & AI (10 min) 4. Evaluating Algorithms & Models (20 min) 5. Visualizing Data & Storytelling (20 min) 6. Questions & Answers (5 min)
  • 5. STAGES OF DATA SCIENCE What has happened? What will happen? What should happen? Data Collection; Machine Learning; Cognitive; Actionable Insights!; Visualizations / Storytelling; Exploratory Data Analysis; Classifications
  • 6. CROSS INDUSTRY STANDARD PROCESS – DATA MINING 1. Business Understanding 2. Collect & Understand Data 3. Data Prep & Cleansing 4. Build AI & Models 5. Evaluate Models 6. Deploy & Productionalize. Data Lake: Local vs Cloud? What has happened? What will happen? What should happen?
  • 7. DOMAINS OF DATA SCIENCE Supervised Learning: Species Classifications, HR Churn, Sales Conversion, Performance Ranking. Unsupervised Learning: Credit Card Fraud, Procurement Fraud, Preventive Maintenance. Imaging & Recognition: Facial Recognition, Product Categories, Healthcare Imaging. Operations Research: Optimizing Costs vs Revenue (HR Planning), Optimizing Costs for Machines, Pipes to Gas Stations (Revenue). Recommend Engine: Collaborative Filtering, Cross-Sell Products.
  • 8. TOOLS FOR DATA SCIENCE
  • 9. DRIVING TOWARDS DIGITAL TRANSFORMATION • Data Scientists (Building Models, Evaluation) • Data Analysts (Visualizations, Reports, EDA) • Data Engineers (Data Lake, Deployment, ETL) • IT Developers (Deployment, Data Collection) • Internal (Employees, Accounts, Audit Logs, Marketing) • External (Sales, Customer Behaviours, Measurements) • Public (Census, Info sites, Facebook, Twitter, News & Media) • Data Aggregator Companies • Data Storage • Data Processing & ETLs • Data Access & Governance • Computational Resource • Real Time Processing • Visualization Tools • Data Modelling Tools • Deployment Tools
  • 11. ADULT CENSUS INCOME DATASET – BACKGROUND This data was extracted from the 1994 Census bureau database by Ronny Kohavi and Barry Becker (Data Mining and Visualization, Silicon Graphics). The prediction task is to determine whether a person makes over $50K a year. Link to data: https://goo.gl/qE7TPf (adult.csv.zip)
  • 12. ADULT CENSUS INCOME DATASET – UNDERSTANDING Link to data description: https://www.kaggle.com/uciml/adult-census-income Response (Binary). Features or Predictors (14). Data Types: Integer, Continuous, Binary, Date/Time, Ordinal, Categorical, Text.
  • 13. PREPARING & CLEANING UP THE DATASET Explore how to use an Excel sheet (xlsx) to prepare and clean up the Adult Census Income dataset. Step 1 • Convert raw data from .csv format to .xlsx format (“Save as…”). Step 2 • Click on “Sort & Filter” to examine data types and categories. Step 3 • Identify blanks, missing data, or irrelevant data. Step 4 • Alternatively, use “pivot tables” and “charts” to identify distributions and categorical counts. Select all data using ctrl+shift+arrow keys -> click on insert pivot tables -> new worksheet. Step 5 • Create a derived binary response (using the “IF” function to return 0 or 1). Step 6 • Use “VLOOKUP” to replace blanks, missing or irrelevant data. Step 7 • Insert a “combo clustered” 2-D chart using the data in the pivot table to examine the correlation between the response and each feature. Step 8 • Remove features with a high % of missing data.
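The spreadsheet steps above (spot missing values, replace them, derive a 0/1 response) can also be sketched in code. This is a minimal illustration using only the Python standard library, not the workshop's own method. The three sample rows are made up, and the column names (`workclass`, `income`), the `?` missing-value marker, and the `>50K` label are assumptions about how adult.csv is laid out.

```python
import csv
import io
from collections import Counter

# Hypothetical three-row sample in the assumed shape of adult.csv.
SAMPLE = """age,workclass,occupation,income
39,Private,Sales,<=50K
52,?,?,>50K
28,Private,Tech-support,>50K
"""

MISSING = "?"  # assumed marker for missing values


def load_rows(text):
    """Parse CSV text into a list of dicts, one per record."""
    return list(csv.DictReader(io.StringIO(text)))


def missing_share(rows, column):
    """Fraction of rows whose value in `column` is blank or the missing marker."""
    flagged = sum(1 for r in rows if r[column].strip() in ("", MISSING))
    return flagged / len(rows)


def clean(rows, fill="Unknown"):
    """Replace missing values with `fill` and add a derived 0/1 `target` column."""
    cleaned = []
    for r in rows:
        out = {k: (fill if v.strip() in ("", MISSING) else v.strip())
               for k, v in r.items()}
        out["target"] = 1 if out["income"] == ">50K" else 0  # like Excel's IF()
        cleaned.append(out)
    return cleaned


rows = load_rows(SAMPLE)
cleaned = clean(rows)
# Category counts, comparable to what a pivot table would show.
counts = Counter(r["workclass"] for r in cleaned)
print(missing_share(rows, "workclass"), counts)
```

A column whose `missing_share` is high would be a candidate for removal under Step 8; the cut-off is a judgment call that the slides do not specify.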
  • 14. NUMERICAL FEATURES DISTRIBUTION & RECODING Some numerical (continuous or integer) features might be slightly correlated with the response, so it is important to identify the trends of these features and recode them as necessary. Step 1 • Examine correlations of the continuous feature with the response, or use parametric (Student’s t-test) / non-parametric (Wilcoxon rank-sum) tests. Step 2 • Observe the histogram plot of the continuous feature against the response by making a “combo clustered” chart or a “scatter plot”. Step 3 • Identify highly correlated segments and recode the feature. Mean Age (target=0): 37 Mean Age (target=1): 44
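As a rough sketch of Steps 1 and 3, the snippet below compares group means with Welch's t-statistic (a variant of Student's t-test that does not assume equal variances, swapped in here for simplicity) and recodes age into bins. The ages are made-up values, and the bin edges are hypothetical; the slides do not state which edges were used for the age groups.

```python
from bisect import bisect_right
from math import sqrt
from statistics import mean, variance

# Made-up ages for illustration only (not taken from the dataset).
ages_low_income = [25, 30, 37, 41, 52]    # target = 0
ages_high_income = [38, 44, 45, 50, 43]   # target = 1


def welch_t(a, b):
    """Welch's t-statistic for the difference in means of two samples."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


def age_group(age, edges=(25, 35, 45, 55, 65)):
    """Recode a continuous age into an ordinal group 0..len(edges).

    The edges are hypothetical; an age equal to an edge falls in the upper bin.
    """
    return bisect_right(edges, age)


print(mean(ages_low_income), mean(ages_high_income))
print(welch_t(ages_low_income, ages_high_income))
print([age_group(a) for a in ages_high_income])
```

A large absolute t-statistic suggests the feature separates the two response groups; turning it into a p-value requires the t distribution, which is omitted here.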
  • 15. ADULT CENSUS INCOME DATASET – EDA PRACTICE The cleaned data can be downloaded from https://goo.gl/qE7TPf (cleaned-adult.zip)
  • 16. EXPLORATORY DATA ANALYSIS – CORRELATION PLOT Relationship vs Gender (row percentages):
      Relationship      Female     Male       Grand Total
      Husband            0.01%     99.99%     100.00%
      Wife              99.87%      0.13%     100.00%
      Not-in-family     46.66%     53.34%     100.00%
      Other-relative    43.83%     56.17%     100.00%
      Own-child         44.30%     55.70%     100.00%
      Unmarried         77.02%     22.98%     100.00%
      Grand Total       33.08%     66.92%     100.00%
    [Chart: Relationship vs Gender, Female (%) and Male (%) per relationship category]
  • 17. EXPLORATORY DATA ANALYSIS SUMMARY Executive Summary (What has happened?) [Charts: Age Group vs High Income; Education num vs High Income; Relationships vs High Income; each shows counts and high income (%)] Overall High Income: 24%
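A pivot table of row percentages like the one above can be approximated in a few lines. This is a sketch under stated assumptions: the input is a list of (relationship, gender) pairs, and the sample pairs are made up rather than drawn from the dataset.

```python
from collections import Counter, defaultdict


def row_percentages(pairs):
    """Cross-tabulate (row, column) pairs and return each row as percentages."""
    counts = defaultdict(Counter)
    for row_key, col_key in pairs:
        counts[row_key][col_key] += 1
    table = {}
    for row_key, counter in counts.items():
        total = sum(counter.values())
        table[row_key] = {col: 100.0 * n / total for col, n in counter.items()}
    return table


# Made-up sample: 3 unmarried women, 1 unmarried man, 2 husbands.
sample = ([("Unmarried", "Female")] * 3
          + [("Unmarried", "Male")]
          + [("Husband", "Male")] * 2)

table = row_percentages(sample)
for relationship, shares in table.items():
    print(relationship, shares)
```

Rows where one column dominates (as Husband and Wife do in the slide) indicate that the two features carry largely overlapping information.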
  • 18. EXPLORATORY DATA ANALYSIS SUMMARY Executive Summary (What has happened?) Higher-income earners tend to have significantly higher capital gains/losses. Both of these features might improve prediction modelling performance.
  • 20. MACHINE LEARNING ALGORITHMS – UNSUPERVISED • You do not know what you do not know. • All data is unlabelled and the algorithms learn the inherent structure from the input data. • You only have input data (X) and no corresponding output variables. Telco Fraud: How does a fraudster do it? When will it happen? How to differentiate them? Where are the anomalies? People Management: Who is the top performer? What are the metrics? Who to award a promotion? Where do they stand out? Product Cross/Up Sell: Who will need those products? What is inside their shopping carts? Which products to market? How to package products?
  • 21. MACHINE LEARNING ALGORITHMS – UNSUPERVISED CLUSTERING: Hierarchical Clustering, K-Means, Kernel Density, Discriminant Analysis, Isolation Forest, One-Class SVM. ASSOCIATIONS: Apriori, Eclat, FP-Growth, Context Based.
  • 22. MACHINE LEARNING ALGORITHMS – SUPERVISED • You do not know what you knew • All data is labelled and the algorithms learn to predict the output from the input data. • You have input variables (X) and an output variable (Y). Leads Conversion: How will a lead convert? What features or properties are important? How to deal with leads with marginal probability? Financing: Who is a good borrower? Who will default on a loan? Rules or patterns to differentiate them? How to interpret probabilities of default? Property Sales: What is the best price? What features affect sale price? Does price affect sale probability? Optimizing time, price, ability to close a sale?
  • 23. MACHINE LEARNING ALGORITHMS – SUPERVISED CLASSIFICATIONS: Decision Tree, Random Forest; eXtreme Gradient BOOSTing (XGBOOST); Gradient Boosted Trees; Generalised Linear Model; Logistic Regression; Neural Networks; Support Vector Machine (SVM); K Nearest Neighbour (KNN), K Means. REGRESSIONS: eXtreme Gradient BOOSTing (XGBOOST); Linear Gradient Boosted; Generalised Linear Model; Lasso, Ridge Regression; Elastic Net; Least Angle Regression (LARS); Neural Networks.
  • 24. TOOLS & RESOURCES CONSIDERATIONS • Near real time updates and monitoring (e.g. Pricing Analysis, Recommendation Engine, Threat/Fraud Detection, Preventive Maintenance). • Periodic updates (People Analysis, Marketing Response Prediction, Sales Forecast, Cancer/Disease Risk). • Predict-On-Demand (Credit Risk/Scoring, Leads Conversion). • Storage: Hadoop Distributed File System (HDFS), Traditional RDBMS, AWS Redshift, AWS RDS/S3 instance, HBase. • Architecture: Apache Spark (Near Real Time Analytics) e.g. SparkR, PySpark, H2O; HDInsight, Hortonworks, Spring XD. • Computational: Computational power (number of CPU cores, GPUs, RAM).
  • 25. ADULT CENSUS INCOME PREDICTIONS 70% of the data is used for training a model. The remaining 30% is used as ‘hold-out’ samples for the trained model’s predictions. Predictions are generated by the XGBoost algorithm, using Gradient Boosted Trees. Training time: < 10 seconds on an Acer Inspire v 15 notebook, Intel Core i7, 12GB RAM, 1000 iterations.
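The 70/30 hold-out split described above can be sketched as follows. Only the split is shown; the XGBoost training step is not reproduced. The function name `holdout_split` and the fixed seed are illustrative choices, not details from the workshop.

```python
import random


def holdout_split(records, train_fraction=0.7, seed=42):
    """Shuffle a copy of `records` and split it into (train, holdout) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Stand-in records; in practice these would be the cleaned dataset rows.
records = list(range(100))
train, holdout = holdout_split(records)
print(len(train), len(holdout))
```

The model is fitted on `train` only, and the evaluation metrics are computed on `holdout`, so the reported performance reflects data the model has not seen.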
  • 27. TYPES OF ML MODEL EVALUATION METRICS • Validating the prediction model against known outcomes/labels. • For “unsupervised” methods, the model is evaluated only by the distance from the “known” cluster centroids. • RMSE (Root Mean Square Error), RMSLE (Root Mean Square Logarithmic Error), MAE (Mean Absolute Error): measure how close the forecasts or predictions are to the eventual outcomes. More suited to regression models. Range (0 - ∞). • LogLoss (Logarithmic Loss), MAP@n (Mean Average Precision @n Classes), MLogLoss (Multi Class Logarithmic Loss), Hamming Loss: more suited to classification models. Range (0 - ∞). • AUC (Area Under ROC Curve): the most commonly used evaluation for binary classification prediction models. Range: 0.5 ~ 1.0, where 0.5 is no better than random guessing.
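Several of the metrics listed above have short standard definitions, sketched below in plain Python. The function names are illustrative, and the AUC here is computed by the pairwise-ranking definition (the share of positive/negative pairs ranked correctly, ties counted as half), which is simple but slow on large data.

```python
from math import log, sqrt


def rmse(y_true, y_pred):
    """Root mean square error."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def log_loss(y_true, p_pred, eps=1e-15):
    """Binary logarithmic loss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total -= t * log(p) + (1 - t) * log(1 - p)
    return total / len(y_true)


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


print(rmse([1, 2, 3], [1, 2, 5]), mae([1, 2, 3], [1, 2, 5]))
print(log_loss([1, 0], [0.5, 0.5]))
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```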
  • 28. BINARY CLASSIFICATION MODEL EVALUATION • Gini Lift and Decile Charts: rank the predictions and examine how much ‘lift’ the model provides over the NULL model. • Kolmogorov Smirnov Chart: examine how well the model differentiates between the 2 classes. • Confusion Matrix: commonly used in the medical domain to assess sensitivity vs specificity of tests.
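Two of the evaluations above reduce to simple computations on ranked predictions. The sketch below computes the Kolmogorov-Smirnov statistic (the largest gap between the cumulative capture rates of the two classes) and the lift in the top-scoring fraction of records. The function names and the sample labels and scores are made up for illustration.

```python
def ks_statistic(y_true, scores):
    """Largest gap between cumulative positive and negative capture rates."""
    pairs = sorted(zip(scores, y_true), reverse=True)
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    cum_pos = cum_neg = 0
    best = 0.0
    for i, (score, label) in enumerate(pairs):
        if label == 1:
            cum_pos += 1
        else:
            cum_neg += 1
        # Only compare once all records tied at this score are counted.
        last_of_tie = i + 1 == len(pairs) or pairs[i + 1][0] != score
        if last_of_tie:
            best = max(best, abs(cum_pos / n_pos - cum_neg / n_neg))
    return best


def top_fraction_lift(y_true, scores, fraction=0.1):
    """Response rate in the top-scoring fraction divided by the overall rate."""
    pairs = sorted(zip(scores, y_true), reverse=True)
    n_top = max(1, int(round(fraction * len(pairs))))
    top_rate = sum(label for _, label in pairs[:n_top]) / n_top
    overall_rate = sum(y_true) / len(y_true)
    return top_rate / overall_rate


labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
print(ks_statistic(labels, scores))
print(top_fraction_lift(labels, scores, fraction=0.5))
```

A lift of 1.0 means the model ranks no better than random selection (the NULL model); a decile chart repeats the same calculation for each tenth of the ranked list.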
  • 29. AREA UNDER ROC CURVE If probability >= 0.5, predict the response as positive, else negative. Confusion Matrix:
                         Target Positive    Target Negative
      Model Positive     1539               368                Positive Pred Rate 0.8070
      Model Negative     839                7022               Negative Pred Rate 0.8933
      Sensitivity 0.6472    Specificity 0.9502    Accuracy 87.643%
    Sensitivity = True Positive Rate = 64.72%; 1-Specificity = False Positive Rate = 4.98%
  • 31. THE BIG PICTURE – PUTTING IT TOGETHER [Chart: Age vs Income, record counts by age (17 to 87) for target 0 and target 1, with total] Mean Age (target=0): 37 Mean Age (target=1): 44
  • 32. USING COMBINATION OF CHARTS [Chart: Capital Gain vs High Income (%), record count and high income (%) by capital gain value]
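The rates on this slide follow directly from the four confusion-matrix counts. The sketch below recomputes them from the slide's own numbers; the function name `confusion_metrics` is illustrative.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Standard rates derived from binary confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),          # true positive rate
        "specificity": tn / (tn + fp),          # true negative rate
        "positive_pred_rate": tp / (tp + fp),   # precision
        "negative_pred_rate": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Counts from the slide: model-positive row (1539, 368),
# model-negative row (839, 7022).
metrics = confusion_metrics(tp=1539, fp=368, fn=839, tn=7022)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Each point on the ROC curve is one (1 - specificity, sensitivity) pair obtained at a particular probability cut-off; the 0.5 cut-off used here gives a single such point.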
  • 33. MAXIMIZING ROI ON MARKETING RESPONSE • Assumptions: 1. Average loan amount $10,000 2. Interest return at 10% 3. Default rate at 5% 4. Marketing costs 20% of average revenue 5. Simple mechanics of how financing works
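The arithmetic behind these assumptions can be sketched as below. The slide does not spell out the "simple mechanics", so the following are assumptions made purely for illustration: revenue per loan is the loan amount times the interest rate, a defaulted loan loses the full principal and earns no interest, and the marketing cost is charged once per funded loan.

```python
def expected_profit_per_loan(loan_amount=10_000.0, interest_rate=0.10,
                             default_rate=0.05, marketing_share=0.20):
    """Expected profit per funded loan under the assumed mechanics above."""
    revenue = loan_amount * interest_rate               # interest if repaid
    marketing_cost = marketing_share * revenue          # 20% of average revenue
    expected_interest = (1 - default_rate) * revenue    # earned on good loans
    expected_loss = default_rate * loan_amount          # principal lost (assumed)
    profit = expected_interest - expected_loss - marketing_cost
    return {
        "revenue": revenue,
        "marketing_cost": marketing_cost,
        "expected_profit": profit,
        "roi_on_marketing": profit / marketing_cost,
    }


result = expected_profit_per_loan()
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions a lower default rate among the targeted leads raises the expected profit directly, which is where a response or default model would feed into the marketing decision.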
  • 35. THANK YOU ECHELON ASIA SUMMIT 2017 Garrett Teoh Hor Keong Chief Data Officer, Renotalk Pte Ltd LinkedIn: garrettteoh Email: rtgteoh@renotalk.com