1. AGENDA - DATA SCIENCE IN PRACTICE
• The "compact" version of data science activities
• Data science process breakdown step by step
• Vote to see deep / reinforcement learning demo
4. SIMPLIFICATION OF THE DATA SCIENCE PROCESS
(1) Business Understanding + (2) Data Understanding
(3) Data Processing + (4) Feature Engineering
(5) Model Selection + (6) Performance Evaluation
(7) Deployment + (8) Consumption
5. BUSINESS UNDERSTANDING & DATA UNDERSTANDING
6. Data Science is an iterative process, and EVERY decision a data scientist makes along the way is a trade-off.
14. EXPLAINING CORRELATION WITH A METAPHOR (CONTINUED)
Setup: we observe the two cars over an interval of distance, from point A to point B, both heading to the right.
• Highly correlated (0.75–1): the Tesla and the Volvo move at almost the same speed and toward the same direction.
• Positively correlated (0.5–0.75): the Tesla moves a bit faster than the Volvo, but both still head in the same direction.
• Negatively correlated (< 0): the Tesla and the Volvo move toward different directions.
The slide turns correlation into a distance between features: distance = 1 − corr (bands shown: ≈ 0, 0.25–0.5, 0.5–1).
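The slide's distance-between-features idea can be sketched numerically. The position readings below are invented for illustration, and the Pearson correlation is computed by hand:

```python
# Correlation between two "car position" series, and the slide's
# distance = 1 - corr metric. All numbers below are made up.
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

tesla = [0, 10, 21, 33, 44, 55]  # positions along the road from A to B
volvo = [0, 9, 20, 31, 43, 54]   # moves almost in lock-step with the Tesla

corr = pearson(tesla, volvo)
distance = 1 - corr  # highly correlated features end up close to 0
print(round(corr, 4), round(distance, 4))
```

Two features this close together carry nearly the same information, which is exactly why one of them can often be dropped during feature engineering.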
16. MODEL SELECTION & PERFORMANCE EVALUATION
17. Q: DO YOU KNOW THE ANSWER TO THE QUESTION YOU ASKED?
• Yes → supervised learning: regressions, classes.
• No → unsupervised learning: deep learning, clustering, association analysis.
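The slide's branching question can be written as a toy dispatcher; the function name and boolean parameters are made up, while the category strings follow the slide:

```python
# Toy model-family picker mirroring the slide's decision:
# (a) do you know the answer (labels)? (b) is the answer a number or a class?
def pick_family(has_labels: bool, numeric_target: bool = False) -> str:
    if has_labels:
        # Known answers -> supervised learning
        return "supervised: regression" if numeric_target else "supervised: classification"
    # No known answers -> unsupervised learning
    return "unsupervised: clustering / association analysis"

print(pick_family(True, numeric_target=True))   # e.g. predicting ICU counts
print(pick_family(False))                       # e.g. grouping patients
```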
19. GENERIC SUPERVISED MACHINE LEARNING FLOW
21. MODEL PERFORMANCE EVALUATION (USING SUPERVISED LEARNING AS AN EXAMPLE)
22. NAIVE WAY TO LOOK AT IT – ACCURACY!
Raw accuracy only tells you whether the model is just guessing or better than guessing.
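A few lines illustrate why accuracy alone can amount to "just guessing": on an imbalanced label set (the numbers below are invented), a model that always predicts the majority class already scores high:

```python
# Why raw accuracy can be misleading: with 5% positives, a "model" that
# never predicts the positive class still reaches 95% accuracy.
labels = [0] * 95 + [1] * 5          # 5% positive class (e.g. recurrence)
majority_pred = [0] * len(labels)    # always predict the majority class

accuracy = sum(p == y for p, y in zip(majority_pred, labels)) / len(labels)
print(accuracy)  # high accuracy, yet it finds none of the positive cases
```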
23. ZOOM IN: WHAT IS MORE IMPORTANT?
Confusion matrix for breast-cancer recurrence (rows = true label, columns = predicted label):

True \ Predicted       Recurrent (=1)                  Not recurrent (=0)
Recurrent (=1)         True Positive                   Type II error (False Negative)
Not recurrent (=0)     Type I error (False Positive)   True Negative

The False Negative is the cell to worry about: the prediction says you don't have breast cancer, but you actually DO!
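Counting the four cells directly makes the Type I/II distinction concrete; the labels below are hypothetical (1 = recurrent, 0 = not recurrent):

```python
# Tally the confusion-matrix cells for a tiny, invented label set.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # True Positive
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # Type II error
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # Type I error
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # True Negative

recall = tp / (tp + fn)  # share of true recurrences the model caught
print(tp, fn, fp, tn, recall)
```

Recall (sensitivity) is the number to watch when a False Negative is the costly mistake, as in the breast-cancer example.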
34. Supervised Learning
Regressions:
Linear Regression
Step-wise Regression
Piecewise Polynomials and Splines
Smoothing Splines
Logistic Regression
Multivariate Adaptive Regression Splines
Least Absolute Shrinkage and Selection Operator (LASSO)
Ridge Regression
Linear Discriminant Analysis (LDA)
Trees:
Decision Trees
Gradient Boosted Regression Trees
Adaptive Boosting Trees (AdaBoost)
Conditional Inference Trees (CI trees)
Bootstrap Aggregation (Bagging) Trees
Gradient Boosted Machines (GBM)
Random Forest (RF)
Support Vector Machines (SVM):
Support Vector Classifier (two-class)
Support Vector Classifier (multiclass)
Kernels and Support Vector Machines
Unsupervised Learning
Dimensionality Reduction:
Principal Component Analysis (PCA)
Singular Value Decomposition (SVD)
MinHash
Locality-Sensitive Hashing (LSH)
t-Distributed Stochastic Neighbor Embedding (t-SNE)
Clustering:
K-means Clustering
Hierarchical Clustering
Bradley-Fayyad-Reina (BFR) Clustering
Clustering Using REpresentatives (CURE)
Bayesian Networks
Topic Modelling
Market Basket:
Apriori (association rules)
Park-Chen-Yu (PCY) algorithm
Savasere-Omiecinski-Navathe (SON) algorithm
Toivonen's algorithm
Stream Analysis:
Bloom Filters
Flajolet-Martin algorithm
Alon-Matias-Szegedy algorithm
Datar-Gionis-Indyk-Motwani algorithm
Neural Network Families / Deep Learning:
Perceptrons
Simple Neural Networks (fully connected)
Deep Boltzmann Machines
Convolutional Neural Networks
Recurrent Neural Networks
Recommender Systems:
Content-based Recommender
User-User Recommender
Item-Item Recommender
Hybrid Recommender
Others:
Genetic Algorithm (chromosome)
Multi-armed Bandit
K-Nearest Neighbors (KNN)
Latent Dirichlet Allocation
35. WHAT KIND OF PROBLEM EACH FAMILY ADDRESSES
(The same taxonomy as slide 34, annotated with an example question per family:)
• Regressions: "How many patients will enter the ICU in a given time?"
• Trees: "Does patient X have breast cancer, yes/no?"
• Support Vector Machines (SVM): "Does patient X have breast cancer, yes/no?"
• Dimensionality Reduction: "I have 1,200 features (age, gender, income, diagnostic codes, hospital visits, etc.) in my data about the patients; is there a simpler way to group those features?"
• Market Basket: "Diet preference: if I love eating strawberries, will I also like raspberries?"
• Stream Analysis: "How do I count, from a live-feed camera (= stream), the number of patients passing this check-point?"
• Neural Networks / Deep Learning: "(1) Object detection: people & cars. (2) NLP: respond to a sentence."
• Recommender Systems: "Predict the rating of hospitals."
36. (Figure: clusters of related words, labelled: frequently used English letters; countries/cities?; relative positions?; frequently used nouns in a sentence?; frequently used past-tense verbs; geographic regions & nationalities; education facilities; concept of good/just?)
Editor's Notes
Microsoft : https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/overview
Why/when do we use PCA? When lots of columns are highly correlated, which is usually a bad practice for regression models such as LDA.
So PCA does two things: (1) reduce dimension, (2) remove collinearity.
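Both effects can be demonstrated on a small synthetic dataset, computing PCA via NumPy's SVD (the data and seed below are arbitrary; the second column is deliberately almost collinear with the first):

```python
# Sketch of the two PCA effects on correlated columns, using SVD.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([
    x,                                          # feature 1
    2 * x + rng.normal(scale=0.01, size=200),   # nearly collinear with feature 1
    rng.normal(size=200),                       # independent feature
])

Xc = X - X.mean(axis=0)                 # center before PCA
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T                      # principal-component scores

# (1) dimension reduction: the first two components carry ~all the variance
explained = (S ** 2) / (S ** 2).sum()
# (2) collinearity removal: component scores are mutually uncorrelated
pc_corr = np.corrcoef(scores, rowvar=False)
print(explained.round(4), abs(pc_corr[0, 1]))
```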
We have two cars here, a Tesla and a Volvo.
Over the interval of this distance (from point A to point B), both cars move toward the right at almost the same speed.
Observing them from A to B, we see that they arrive at approximately the same place and move along the path almost synchronized.
This could be because a husband and wife (each owning a car) were driving home together, or because the two cars were on a racing track.
It could also be completely coincidental: two strangers who just happened to share this road, in the same direction, within the observed path from A to B.
Since we are not given enough information, we have no idea which scenario it is. The only valid conclusion we can draw is this:
observing the Tesla and the Volvo, we know the two cars move together almost synchronized in speed and time (which means the distance they cover is quite similar as well).
So if we know we will eventually see the Tesla at point B, we know the Volvo will be there as well.
We then only need to track one of the cars (either the Tesla or the Volvo) at point B to determine how much distance both covered, since they arrive at B at almost the same time; we can just pick one.
This means the two cars are positively correlated, and their correlation is quite strong, approaching 1, since they move in the same direction almost simultaneously.
Now remember that we do not know whether the two cars happened to move in the same direction simultaneously by accident, or whether some scenario behind the scenes is yet to be discovered. In other words, correlation (positive or negative) does not mean causation.
So why is it important for feature engineering to know this?
Say we want to model fuel-consumption efficiency of cars: we should NOT take the Tesla into consideration, since the Tesla does not even use fuel.
It would just confuse the model we build; the model could not possibly know why the Tesla has only zeros, through and through, for fuel consumption.
Hence it is actually harmful not to select your features carefully.
Notes on this slide:
Source : http://scott.fortmann-roe.com/docs/BiasVariance.html
For forecasting models
Input: datetime + counts (example below: daily counts for DeviceNumber 41274)
Date        Count
2015-01-01  520
2015-01-02  319
2015-01-03  389
2015-01-04  355
2015-01-05  437
2015-01-06  333
Output & performance: also a time series, but with lower and upper bounds:
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
2018.01 159.8044 23.64662 295.9622 -48.43096 368.0398
2018.02 299.7230 143.70186 455.7441 61.10926 538.3367
2018.03 356.6332 198.41676 514.8496 114.66204 598.6043
2018.04 345.0193 186.61808 503.4206 102.76551 587.2732
2018.05 308.1619 149.19870 467.1251 65.04866 551.2751
2018.06 279.4213 118.15266 440.6899 32.78220 526.0604
Output (visual): [chart omitted]
Note: it should be evident that this type of forecasting model cannot "predict" the top 5 destinations, since it accepts no input other than date + counts.
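As a minimal stand-in for that forecast output, the sketch below predicts the next count as the historical mean with a normal-approximation 95% interval, using the example counts from this note. A real forecaster (ARIMA, exponential smoothing, etc.) would model trend and seasonality instead:

```python
# Naive "forecast with bounds": next-day point estimate = historical mean,
# 95% interval from the sample standard deviation (normal approximation).
from statistics import mean, stdev

counts = [520, 319, 389, 355, 437, 333]  # daily counts from the note
point = mean(counts)
half_width = 1.96 * stdev(counts)
lo95, hi95 = point - half_width, point + half_width
print(round(point, 1), round(lo95, 1), round(hi95, 1))
```

The output mirrors the shape of the table above: a point forecast flanked by lower and upper bounds.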
For prediction models:
Input: features of choice + label (= the correct answer to the question). Example below; note that for this type of model, one needs to feed BOTH the features AND the "answer" for every record:
DeviceNumber  PublicHoliday  year  month  hour  date2hour     DestinationZoneName
41274         0              2014  6      3     2014-06-01 3  KASTRUP/KÖPENHAMN (F+L)
41274         0              2014  6      3     2014-06-01 3  MALMÖ
41274         0              2014  6      4     2014-06-01 4  KASTRUP/KÖPENHAMN (F+L)
Output & performance: performance is usually reported as how "accurately" the model predicts each correct class, given the features fed to it; i.e., with 18 destinations to predict, accuracy is measured by how well the model answers correctly across all 18 classes!
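Per-class accuracy as described can be tallied in a few lines; the destination labels and predictions below are invented stand-ins:

```python
# Per-class and overall accuracy for a destination classifier.
from collections import defaultdict

y_true = ["KASTRUP", "MALMO", "KASTRUP", "LUND", "MALMO", "KASTRUP"]
y_pred = ["KASTRUP", "MALMO", "MALMO", "LUND", "MALMO", "KASTRUP"]

hits, totals = defaultdict(int), defaultdict(int)
for t, p in zip(y_true, y_pred):
    totals[t] += 1
    hits[t] += (t == p)          # count correct predictions per true class

per_class = {c: hits[c] / totals[c] for c in totals}
overall = sum(hits.values()) / len(y_true)
print(per_class, round(overall, 3))
```

Reporting accuracy per class, not just overall, shows whether the model is good on all destinations or only on the frequent ones.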