Echelon Asia Summit 2017 Startup Academy Workshop

ECHELON ASIA SUMMIT 2017
STARTUP ACADEMY [WORKSHOP]
INTRODUCTION TO DATA SCIENCE
29th June 2017
Garrett Teoh Hor Keong

PROGRAM FLOW
1. Data Science Fundamentals (10 min)
2. Exploratory Data Analysis (25 min)
3. Building Machine Learning & AI (10 min)
4. Evaluating Algorithms & Models (20 min)
5. Visualizing Data & Storytelling (20 min)
6. Questions & Answers (5 min)

STAGES OF DATA SCIENCE
What has happened? What will happen? What should happen?
Data Collection, Machine Learning, Cognitive
Exploratory Data Analysis
Classifications
Visualizations / Storytelling
Actionable Insights!

CROSS INDUSTRY STANDARD PROCESS – DATA MINING
1. Business Understanding
2. Collect & Understand Data
3. Data Prep & Cleansing
4. Build AI & Models
5. Evaluate Models
6. Deploy & Productionalize
Data Lake: Local vs Cloud?
What has happened? What will happen? What should happen?

DOMAINS OF DATA SCIENCE
Supervised Learning
- Species Classifications
- HR Churn
- Sales Conversion
- Performance Ranking
Unsupervised Learning
- Credit Card Fraud
- Procurement Fraud
- Preventive Maintenance
Imaging & Recognition
- Facial Recognition
- Product Categories
- Healthcare Imaging
Operations Research
- Optimizing Costs vs Revenue (HR Planning)
- Optimizing Costs for Machines, Pipes to Gas Stations (Revenue)
Recommendation Engine
- Collaborative Filtering
- Cross-Sell Products

DRIVING TOWARDS DIGITAL TRANSFORMATION
• Data Scientists (Building Models, Evaluation)
• Data Analysts (Visualizations, Reports, EDA)
• Data Engineers (Data Lake, Deployment, ETL)
• IT Developers (Deployment, Data Collection)
• Internal (Employees, Accounts, Audit Logs, Marketing)
• External (Sales, Customer Behaviours, Measurements)
• Public (Census, Info sites, Facebook, Twitter, News & Media)
• Data Aggregator Companies
• Data Storage
• Data Processing & ETLs
• Data Access & Governance
• Computational Resource
• Real Time Processing
• Visualization Tools
• Data Modelling Tools
• Deployment Tools

ADULT CENSUS INCOME DATASET – BACKGROUND
This data was extracted from the 1994 Census Bureau database by Ronny Kohavi and Barry Becker (Data Mining and Visualization, Silicon Graphics). The prediction task is to determine whether a person makes over $50K a year.
Link to data: https://goo.gl/qE7TPf (adult.csv.zip)

ADULT CENSUS INCOME DATASET – UNDERSTANDING
Link to data description: https://www.kaggle.com/uciml/adult-census-income
Response (Binary)
Features or Predictors (14)
Data Types: Integer, Continuous, Binary, Date/Time, Ordinal, Categorical, Text

PREPARING & CLEANING UP THE DATASET
Explore how to use an Excel sheet (xlsx) to prepare and clean up the Adult Census Income dataset.
Step 1 • Convert raw data from .csv format to .xlsx format using “save as…”.
Step 2 • Click on “sort & filter” to examine data types and categories.
Step 3 • Identify blanks, missing data, or irrelevant data.
Step 4 • Alternatively, use “pivot tables” and “charts” to identify distributions and categorical counts. Select all data using ctrl+shift+arrow keys -> click on insert pivot tables -> new worksheet.
Step 5 • Create a derived binary response (using the “IF” function to return 0 or 1).
Step 6 • Use “VLOOKUP” to replace blanks, missing or irrelevant data.
Step 7 • Insert a “combo clustered” 2-D chart using the data in the pivot table to examine the correlation between the response and each feature.
Step 8 • Remove features with a high % of missing data.
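For readers who prefer code to a spreadsheet, Steps 3, 5, 6 and 8 above can be sketched in plain Python. This is only an illustration, not the workshop's method: the sample rows are made up, the “?” missing-value marker and the column names follow the Kaggle file layout, and the 40% cut-off and the “Unknown” fill value are assumptions.

```python
import csv
import io

# Made-up sample rows in the Kaggle column layout; "?" marks a missing value.
SAMPLE_CSV = """age,workclass,occupation,sex,income
39,Private,Sales,Male,<=50K
52,?,?,Female,>50K
28,Private,?,Female,<=50K
45,Self-emp-inc,Exec-managerial,Male,>50K
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def missing_share(rows, column, marker="?"):
    """Step 3: share of rows where the column is blank or holds the marker."""
    return sum(1 for r in rows if r[column].strip() in ("", marker)) / len(rows)


def clean(rows, max_missing=0.4, marker="?", fill="Unknown"):
    """Steps 5, 6 and 8: derive a 0/1 response, fill gaps, drop sparse features."""
    features = [c for c in rows[0] if c != "income"]
    # Step 8: keep only features whose missing share is at or below the cut-off.
    kept = [c for c in features if missing_share(rows, c, marker) <= max_missing]
    cleaned = []
    for r in rows:
        # Step 6: replace blanks and markers with a fill value.
        out = {c: (fill if r[c].strip() in ("", marker) else r[c].strip()) for c in kept}
        # Step 5: binary response, the equivalent of an Excel IF returning 0 or 1.
        out["high_income"] = 1 if r["income"].strip() == ">50K" else 0
        cleaned.append(out)
    return cleaned


rows = load_rows(SAMPLE_CSV)
cleaned = clean(rows)
for row in cleaned:
    print(row)
```

In this toy sample the occupation column is missing in half of the rows, so it is dropped, while the single gap in workclass is filled.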

NUMERICAL FEATURES DISTRIBUTION & RECODING
Some numerical (continuous or integer) features might be slightly correlated with the response, so it is important to identify the trends of these features and recode them as necessary.
Step 1 • Examine the correlation of the continuous feature with the response, or use parametric (Student’s t-test) / non-parametric (Wilcoxon rank-sum) tests.
Step 2 • Observe the histogram plot of the continuous feature against the response by making a “combo clustered” chart or a “scatter plot”.
Step 3 • Identify highly correlated segments and recode the feature.
Mean Age (target=0): 37
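The recoding in Step 3 can be sketched as follows: a continuous age is mapped to a small number of ordered groups, and the high-income rate is then compared across groups. The bin edges and the records below are hypothetical; the deck does not state which cut points were used for its age groups.

```python
from bisect import bisect_right

# Hypothetical bin edges: group 0 is under 25, group 1 is 25-34, ..., group 5 is 65+.
AGE_EDGES = [25, 35, 45, 55, 65]


def age_group(age, edges=AGE_EDGES):
    """Recode a continuous age into an ordinal group index."""
    return bisect_right(edges, age)


def rate_by_group(records):
    """Map each group to (count, high-income rate) from (age, label) pairs."""
    buckets = {}
    for age, label in records:
        buckets.setdefault(age_group(age), []).append(label)
    return {g: (len(v), sum(v) / len(v)) for g, v in sorted(buckets.items())}


# Made-up (age, high_income) records for illustration only.
records = [(19, 0), (23, 0), (31, 0), (38, 1), (41, 1), (47, 0), (58, 1), (70, 0)]
summary = rate_by_group(records)
for group, (count, rate) in summary.items():
    print(f"group {group}: n={count}, high income={rate:.0%}")
```

Groups whose rate differs clearly from the overall rate are the “highly correlated segments” worth keeping as separate categories.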

ADULT CENSUS INCOME DATASET – EDA PRACTICE
The cleaned data can be downloaded from https://goo.gl/qE7TPf (cleaned-adult.zip)

EXPLORATORY DATA ANALYSIS – CORRELATION PLOT
Relationship      Female     Male       Grand Total
Husband           0.01%      99.99%     100.00%
Wife              99.87%     0.13%      100.00%
Not-in-family     46.66%     53.34%     100.00%
Other-relative    43.83%     56.17%     100.00%
Own-child         44.30%     55.70%     100.00%
Unmarried         77.02%     22.98%     100.00%
Grand Total       33.08%     66.92%     100.00%
[Chart: Relationship vs Gender; Female (%) and Male (%) by relationship]
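A pivot table with row percentages, like the one above, can be reproduced in a few lines. The sketch below uses made-up (relationship, sex) pairs rather than the real dataset, and the function name is illustrative.

```python
from collections import Counter


def row_percentages(pairs):
    """Cross-tabulate (row, column) pairs and express each row as percentages."""
    counts = Counter(pairs)
    row_totals = Counter()
    for (row, _col), n in counts.items():
        row_totals[row] += n
    columns = sorted({col for _row, col in counts})
    return {
        row: {col: 100.0 * counts[(row, col)] / total for col in columns}
        for row, total in row_totals.items()
    }


# Made-up sample, not the census data.
pairs = (
    [("Husband", "Male")] * 3
    + [("Wife", "Female")] * 2
    + [("Unmarried", "Female")] * 3
    + [("Unmarried", "Male")]
)
table = row_percentages(pairs)
for relationship, shares in table.items():
    print(relationship, shares)
```

When two features line up almost one-to-one, as Husband and Wife do with gender in the table above, one of them adds little extra information to a model.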

EXPLORATORY DATA ANALYSIS SUMMARY
Executive Summary (What has happened?)
[Chart: Age Group vs High Income; counts and high income (%) by age group, 0 to 5]
[Chart: Education num vs High Income; count and high income (%) by education num, 1 to 16]
[Chart: Relationships vs High Income; count and high income (%) by relationship]
Overall High Income 24%

EXPLORATORY DATA ANALYSIS SUMMARY
Executive Summary (What has happened?)
Higher-income earners tend to have significantly higher capital gain/loss.
Both of these features might improve prediction modelling performance.

BUILDING MACHINE LEARNING & AI

MACHINE LEARNING ALGORITHMS – UNSUPERVISED
• You do not know what you do not know
• All data is unlabelled and the algorithms learn the inherent structure of the input data.
• You only have input data (X) and no corresponding output variables.
Telco Fraud
- How does a fraudster do it?
- When will it happen?
- How to differentiate them?
- Where are the anomalies?
People Management
- Who is the top performer?
- What are the metrics?
- Who to award a promotion?
- Where do they stand out?
Product Cross/Up Sell
- Who will need those products?
- What is inside their shopping carts?
- Which products to market?
- How to package products?

MACHINE LEARNING ALGORITHMS – UNSUPERVISED
CLUSTERING
- Hierarchical Clustering
- K-Means
- Kernel Density
- Discriminant Analysis
- Isolation Forest
- One-Class SVM
ASSOCIATIONS
- Apriori
- Eclat
- FP-Growth
- Context Based
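To make “learning structure from unlabelled input” concrete, here is a minimal sketch of K-Means (Lloyd's algorithm) in plain Python. The points and the starting centroids are made up, and the fixed iteration count is an assumption; library implementations add smarter initialisation and convergence checks.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to centroids and re-centring them."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # New centroid is the mean of its members; an empty cluster keeps its old one.
        centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


# Made-up 2-D points forming two visible groups; no labels are supplied.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, centroids=[points[0], points[3]])
print(labels)
print(centroids)
```

The same distance-to-centroid idea underlies simple anomaly checks: a point far from every centroid is a candidate outlier.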

MACHINE LEARNING ALGORITHMS – SUPERVISED
• You do not know what you knew
• All data is labelled and the algorithms learn to predict the output from the input data.
• You have input variables (X) and an output variable (Y).
Leads Conversion
- How will a lead convert?
- What features or properties are important?
- How to deal with leads with marginal probability?
Financing
- Who is a good borrower?
- Who will default on a loan?
- Rules or patterns to differentiate them?
- How to interpret probabilities of default?
Property Sales
- What is the best price?
- What features affect sale price?
- Does price affect sale probability?
- Optimizing time, price, ability to close a sale?

MACHINE LEARNING ALGORITHMS – SUPERVISED
CLASSIFICATIONS
- Decision Tree, Random Forest
- eXtreme Gradient BOOSTing (XGBOOST)
- Gradient Boosted Trees
- Generalised Linear Model
- Logistic Regression
- Neural Networks
- Support Vector Machine (SVM)
- K Nearest Neighbour (KNN), K Means
REGRESSIONS
- eXtreme Gradient BOOSTing (XGBOOST)
- Linear Gradient Boosted
- Generalised Linear Model
- Lasso, Ridge Regression
- Elastic Net
- Least Angle Regression (LARS)
- Neural Networks
TOOLS & RESOURCES CONSIDERATIONS
• Near real time updates and monitoring (e.g. Pricing Analysis, Recommendation Engine, Threat/Fraud Detection, Preventive Maintenance).
• Periodic updates (e.g. People Analysis, Marketing Response Prediction, Sales Forecast, Cancer/Disease Risk).
• Predict-On-Demand (e.g. Credit Risk/Scoring, Leads Conversion).
• Storage:
  - Hadoop Distributed File System (HDFS), traditional RDBMS, AWS Redshift, AWS RDS/S3 instance, HBase.
• Architecture:
  - Apache Spark (Near Real Time Analytics) e.g. SparkR, PySpark, H2O.
  - HDInsight, Hortonworks, Spring XD
• Computational:
  - Computational power: number of CPU cores, GPUs, RAM

ADULT CENSUS INCOME PREDICTIONS
70% of the data is used for training a model
Remaining 30% is used as ‘hold-out’ samples for the trained model’s predictions
Predictions are generated from the XGBoost algorithm, using Gradient Boosted Trees
Training time: < 10 seconds on an Acer Inspire v 15 notebook, Intel Core i7, 12GB RAM
1000 iterations
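The 70/30 hold-out split described above can be sketched in plain Python as follows. The seed and the helper name are assumptions, and the XGBoost training step itself is not shown.

```python
import random


def holdout_split(rows, train_share=0.7, seed=42):
    """Shuffle a copy of the rows and split it into training and hold-out parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = int(round(len(shuffled) * train_share))
    return shuffled[:cut], shuffled[cut:]


# Stand-in row ids; in practice these would be the cleaned dataset rows.
rows = list(range(100))
train, holdout = holdout_split(rows)
print(len(train), len(holdout))
```

The model is fitted on the training part only, and every evaluation metric is then computed on the hold-out part, which the model has never seen.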

EVALUATING ALGORITHMS & MODELS

TYPES OF ML MODEL EVALUATION METRICS
• Validating the prediction model against known outcomes/labels.
• For “unsupervised” methods, the model is evaluated only by the distance from the “known” cluster centroids.
• More suited to regression models; measure how close the forecasts or predictions are to the eventual outcomes. Range (0 - ∞):
  - RMSE (Root Mean Square Error)
  - RMSLE (Root Mean Square Logarithmic Error)
  - MAE (Mean Absolute Error)
• More suited to classification models. Range (0 - ∞) for LogLoss and MLogLoss; MAP@n and Hamming Loss lie between 0 and 1:
  - LogLoss (Logarithmic Loss)
  - MAP@n (Mean Average Precision @n Classes)
  - MLogLoss (Multi Class Logarithmic Loss)
  - Hamming Loss
• AUC (Area Under ROC Curve)
  - Most commonly used evaluation for binary classification prediction models
  - Range: 0.5 (no better than random) ~ 1.0 (perfect ranking)
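Several of these metrics are short enough to write out directly. The sketch below gives plain-Python versions of RMSE, MAE, LogLoss and AUC; the function names and the toy inputs are illustrative, and the AUC uses the pairwise-ranking definition, which is fine for small samples but slow on large ones.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between actual and predicted values."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def log_loss(labels, probs, eps=1e-15):
    """Binary logarithmic loss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(labels)


def auc(labels, scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy inputs for illustration only.
labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.4, 0.6]
print("AUC:", auc(labels, probs))
print("LogLoss:", round(log_loss(labels, probs), 4))
print("RMSE:", round(rmse([3.0, 5.0], [1.0, 5.0]), 4))
print("MAE:", mae([3.0, 5.0], [1.0, 5.0]))
```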

BINARY CLASSIFICATION MODEL EVALUATION
• Gini Lift and Decile Charts
  - Rank the predictions and examine how much ‘lift’ the model provides over the NULL model.
• Kolmogorov Smirnov Chart
  - Examine how well the model differentiates between the 2 classes.
• Confusion Matrix
  - Commonly used in the medical domain to assess sensitivity vs specificity of tests
AREA UNDER ROC CURVE
If probability >= 0.5, predict response as positive; else, negative.
Confusion Matrix
                  Target Positive    Target Negative
Model Positive    1539               368                Positive Pred Rate 0.8070
Model Negative    839                7022               Negative Pred Rate 0.8933
Sensitivity 0.6472, Specificity 0.9502, Accuracy 87.643%
Sensitivity = 64%
1-Specificity = 5%
Sensitivity = True Positive Rate
1-Specificity = False Positive Rate
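The rates on this slide follow directly from the four cells of the confusion matrix. A short sketch, using the slide's own counts, with rows read as the model's prediction and columns as the true target; the function and key names are illustrative:

```python
def confusion_metrics(tp, fp, fn, tn):
    """Derive the usual rates from the four cells of a binary confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),         # true positive rate
        "specificity": tn / (tn + fp),         # true negative rate
        "positive_pred_rate": tp / (tp + fp),  # share of positive calls that are right
        "negative_pred_rate": tn / (tn + fn),  # share of negative calls that are right
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Counts taken from the confusion matrix on this slide.
m = confusion_metrics(tp=1539, fp=368, fn=839, tn=7022)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
print(f"false positive rate: {1 - m['specificity']:.4f}")
```

Each threshold on the predicted probability gives one such matrix, and therefore one (1-Specificity, Sensitivity) point; sweeping the threshold from 1 down to 0 traces the ROC curve.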

VISUALIZING DATA & STORYTELLING

THE BIG PICTURE – PUTTING IT TOGETHER
[Chart: Age vs Income; total count (0 to 1000) by age (17 to 87), shown separately for income class 0 and income class 1]

USING COMBINATION OF CHARTS
[Chart: Capital Gain vs High Income (%); count and high income (%) by capital gain value, from 114 to 34095]

MAXIMIZING ROI ON MARKETING RESPONSE
• Assumptions:
1. Average loan amount $10,000
2. Interest return at 10%
3. Default rate at 5%
4. Marketing costs 20% of average revenue
5. Simple mechanics of how financing works
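One way to turn these assumptions into a per-loan figure is sketched below. The deck does not spell out the financing mechanics, so the sketch adds two assumptions of its own: a defaulted loan loses the full principal and earns no interest, and the marketing cost is charged once per funded loan.

```python
def expected_profit_per_loan(loan=10_000, interest=0.10, default_rate=0.05,
                             marketing_share=0.20):
    """Expected profit of one funded loan under the stated assumptions."""
    revenue = loan * interest                       # interest earned if repaid
    marketing_cost = marketing_share * revenue      # 20% of average revenue
    expected_interest = (1 - default_rate) * revenue
    expected_loss = default_rate * loan             # assumed: principal lost on default
    return expected_interest - expected_loss - marketing_cost


profit = expected_profit_per_loan()
print(f"expected profit per loan: ${profit:,.2f}")

# Sensitivity check: how the figure moves as the default rate changes.
for rate in (0.03, 0.05, 0.07):
    print(rate, round(expected_profit_per_loan(default_rate=rate), 2))
```

Under these assumptions the margin is thin relative to the loan size, which is why targeting marketing at leads with a high predicted response, and a low predicted default risk, matters for ROI.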

THANK YOU
ECHELON ASIA SUMMIT 2017
Garrett Teoh Hor Keong
Chief Data Officer, Renotalk Pte Ltd
LinkedIn: garrettteoh
Email: rtgteoh@renotalk.com
