AGENDA - DATA SCIENCE IN PRACTICE
• The "compact" version of data science activities
• Data science process breakdown step by step
• Vote to see deep / reinforcement learning demo
Data Science is an iterative process...
SIMPLIFICATION OF DATA SCIENCE PROCESS
(1) Business Understanding + (2) Data Understanding
(3) Data Processing + (4) Feature Engineering
(5) Model Selection + (6) Performance Evaluation
(7) Deployment + (8) Consumption
BUSINESS UNDERSTANDING & DATA UNDERSTANDING
Data Science is an iterative process, and EVERY decision a Data Scientist makes in the process is a trade-off.
Business + domain expertise is everything!
Exhibit A

DATA PROCESSING & FEATURE ENGINEERING
DATA PROCESSING = WRANGLING / TRANSFORMATION
- MISSING DATA
- REDUCE DIMENSION (SOM, PCA, SVD, ETC.)
Deal with Missing Data
Remove / keep?
Exhibit B
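The remove / keep choice for missing data can be sketched in a few lines. This is a minimal illustration, not the method shown in Exhibit B: the patient records, the column names, and the helper functions `drop_incomplete` and `impute_mean` are all hypothetical, and mean imputation is only one of several ways to "keep" a row.

```python
from statistics import mean

# Hypothetical patient records; None marks a missing value.
rows = [
    {"age": 54, "tumor_size": 2.1},
    {"age": None, "tumor_size": 3.4},
    {"age": 61, "tumor_size": None},
    {"age": 47, "tumor_size": 1.8},
]

def drop_incomplete(rows):
    """Option 1 (remove): keep only rows that have no missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def impute_mean(rows):
    """Option 2 (keep): fill each missing value with the mean of its column."""
    columns = list(rows[0].keys())
    means = {c: mean(r[c] for r in rows if r[c] is not None) for c in columns}
    return [
        {c: (r[c] if r[c] is not None else means[c]) for c in columns}
        for r in rows
    ]

dropped = drop_incomplete(rows)
imputed = impute_mean(rows)
print(len(dropped), "of", len(rows), "rows survive removal")
print("imputed age for row 1:", imputed[1]["age"])
```

Removing rows loses half of this tiny dataset, while imputing keeps every row at the cost of inventing values. That is the trade-off the slide asks about.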

DIMENSION REDUCTION – PCA
10 dimensions → 2 dimensions
Ockham's razor: more things should not be used than are necessary!
Exhibit C
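Going from 10 dimensions to 2 can be illustrated with a small PCA sketch built on NumPy's SVD. The data here is synthetic and the function name `pca` is an assumption made for illustration; it is not the dataset or code behind Exhibit C.

```python
import numpy as np

def pca(X, n_components):
    """Project X (n_samples x n_features) onto its top principal components."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)
    return X_centered @ Vt[:n_components].T, explained[:n_components]

rng = np.random.default_rng(0)
# Synthetic data: 10 observed features driven by 2 hidden factors plus small noise.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

X_2d, ratio = pca(X, n_components=2)
print(X_2d.shape)    # 200 samples, now described by 2 columns instead of 10
print(ratio.sum())   # share of the total variance kept by the 2 components
```

Because the 10 columns were generated from only 2 underlying factors, two components retain almost all of the variance. Real data is rarely this clean, so the retained share should be checked before dropping dimensions.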
FEATURE ENGINEERING
WHY FEATURE ENGINEERING?
EXPLAIN CORRELATION WITH A METAPHOR
[Diagram: cars A and B; labels: "Interval of distance", "Direction to the right"]

EXPLAIN CORRELATION WITH A METAPHOR (CONTINUED)
[Diagram: observation of cars A and B; labels: "Interval of distance", "Direction to the right"]
Highly correlated (0.75 ~ 1): the Tesla and the Volvo move at almost the same speed and in the same direction.
Negatively correlated (< 0): the Tesla and the Volvo move in different directions.
Positively correlated (0.5 ~ 0.75): the Tesla moves a bit faster than the Volvo, but both are still heading in the same direction.
Distance = 1 - corr → 0
Distance = 1 - corr → 0.25 - 0.5
Distance = 1 - corr → 0.5 - 1
Exhibit D
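The "Distance = 1 - corr" idea can be tried out on the car metaphor. This is a minimal sketch: the position lists are made-up numbers, and `pearson` and `correlation_distance` are hypothetical helper names, not code from Exhibit D.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def correlation_distance(x, y):
    """Distance = 1 - corr: 0 for perfectly correlated, 2 for perfectly opposite."""
    return 1 - pearson(x, y)

# Hypothetical positions of each car at five observation times.
tesla = [0, 10, 20, 30, 40]
volvo_same_direction = [0, 9, 18, 27, 36]      # slightly slower, same direction
volvo_opposite_direction = [40, 30, 20, 10, 0]

print(correlation_distance(tesla, volvo_same_direction))      # close to 0
print(correlation_distance(tesla, volvo_opposite_direction))  # close to 2
```

Since the correlation ranges from -1 to 1, this distance ranges from 0 to 2, and any negative correlation gives a distance above 1.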
MODEL SELECTION & PERFORMANCE EVALUATION
Q: Do you know the answer to the question you asked?
Yes → Supervised learning: Regressions, Classes
No → Unsupervised learning: Clustering, Association analysis
Deep learning
GENERIC SUPERVISED MACHINE LEARNING FLOW
Train
Test
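The train / test step of the supervised flow can be sketched with the standard library alone. The function name `train_test_split`, the 25% test share, and the integer "samples" are assumptions for illustration; the slide does not state the split it used.

```python
import random

def train_test_split(samples, test_fraction=0.25, seed=42):
    """Shuffle a copy of the samples and hold out a fraction for testing."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

samples = list(range(20))  # stand-ins for 20 labelled records
train, test = train_test_split(samples)
print(len(train), "training samples,", len(test), "test samples")
```

The model is fitted on `train` only, and `test` is kept aside so that the evaluation reflects data the model has not seen.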
MODEL PERFORMANCE EVALUATION
USE SUPERVISED LEARNING AS AN EXAMPLE
NAIVE WAY TO LOOK AT IT – ACCURACY!
Just guessing
Better than guessing
ZOOM-IN, WHAT IS MORE IMPORTANT?

True label \ Predicted label        Breast cancer recurrent (=1)     Breast cancer not recurrent (=0)
Breast cancer recurrent (=1)        True Positive                    Type II error (False Negative)
Breast cancer not recurrent (=0)    Type I error (False Positive)    True Negative

False Negative: the prediction says that you don't have breast cancer, but you actually DO!!!
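The four cells of the confusion matrix can be counted directly from true and predicted labels. This is a minimal sketch on made-up labels (1 = recurrent, 0 = not recurrent); the function name `confusion_counts` is hypothetical and the numbers are not the ones from the slide.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FN (Type II error), FP (Type I error) and TN for binary labels."""
    pairs = list(zip(y_true, y_pred))
    return {
        "TP": sum(1 for t, p in pairs if t == 1 and p == 1),
        "FN": sum(1 for t, p in pairs if t == 1 and p == 0),
        "FP": sum(1 for t, p in pairs if t == 0 and p == 1),
        "TN": sum(1 for t, p in pairs if t == 0 and p == 0),
    }

# Hypothetical labels for 10 patients.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
accuracy = (counts["TP"] + counts["TN"]) / len(y_true)
recall = counts["TP"] / (counts["TP"] + counts["FN"])
print(counts)
print("accuracy:", accuracy, "recall:", recall)
```

Here accuracy is 0.7, yet half of the truly recurrent cases are missed (recall 0.5). That gap is why accuracy alone is a naive way to judge this kind of model.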
TWEAK ...
False Negatives drop from 6 → 1
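The slide does not say which tweak was applied. One common option, shown here purely as an assumption, is lowering the decision threshold on the model's predicted probability. The scores and labels below are invented for illustration and do not reproduce the slide's 6 → 1 result.

```python
def false_negatives(y_true, scores, threshold):
    """Truly positive cases whose score falls below the threshold."""
    return sum(1 for t, s in zip(y_true, scores) if t == 1 and s < threshold)

def false_positives(y_true, scores, threshold):
    """Truly negative cases whose score reaches the threshold."""
    return sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)

# Hypothetical predicted probabilities of recurrence for 10 patients.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.45, 0.4, 0.35, 0.3, 0.1, 0.6, 0.32, 0.2, 0.05]

for threshold in (0.5, 0.25):
    fn = false_negatives(y_true, scores, threshold)
    fp = false_positives(y_true, scores, threshold)
    print(f"threshold={threshold}: FN={fn}, FP={fp}")
```

In this toy data, moving the threshold from 0.5 to 0.25 cuts false negatives from 5 to 1 while false positives rise from 1 to 2: fewer missed cases are bought with more false alarms.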
OTHER IMPORTANT CRITERIA TO CONSIDER
Model performance
Practical stuff
MODEL DEPLOYMENT & CONSUMPTION CHOICES
CONSUME MODELS
Breast cancer scenario
DEEP LEARNING & REINFORCEMENT LEARNING DEMO
- Detect people & cars (computer vision)
- Self-navigate (Q-learning)
- OpenAI Gym MountainCar-v0
- Crowd simulation (A* algorithm)
Environments: env-opencv+ImageAI, env-py36_kivy, env-unity, env-py36_kivy
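The Q-learning idea behind the self-navigating demo can be sketched on a toy problem. Everything below is an assumption for illustration, not the demo's actual code: the environment is a made-up six-cell corridor (not the Kivy or Gym environments listed above), and the learning rate, discount, and exploration rate are arbitrary choices.

```python
import random

N_STATES = 6                 # corridor cells 0..5; reaching cell 5 ends the episode
ACTIONS = (-1, 1)            # step left or step right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2

def step(state, action):
    """Move within the corridor; reward 1.0 only when the goal cell is reached."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def train(episodes=500, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # q[state][action index]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy choice; ties are broken randomly.
            explore = rng.random() < EPSILON or q[state][0] == q[state][1]
            a = rng.randrange(2) if explore else (0 if q[state][0] > q[state][1] else 1)
            nxt, reward, done = step(state, ACTIONS[a])
            # Q-learning update: move the estimate toward reward + discounted best next value.
            target = reward + (0.0 if done else GAMMA * max(q[nxt]))
            q[state][a] += ALPHA * (target - q[state][a])
            state = nxt
    return q

q_table = train()
policy = [ACTIONS[0] if left > right else ACTIONS[1] for left, right in q_table[:-1]]
print(policy)  # learned action per non-goal cell: +1 means "step right"
```

The agent is never told where the goal is. It only sees rewards, and the table of action values gradually comes to favour stepping toward the rewarding cell.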
GAN – GENERATIVE ADVERSARIAL NETWORK
PACKING SPACE
Supervised Learning

Regressions:
  Linear Regression
  Stepwise Regression
  Piecewise Polynomials and Splines
  Smoothing Splines
  Logistic Regression
  Multivariate Adaptive Regression Splines
  Least Absolute Shrinkage and Selection Operator (LASSO)
  Ridge Regression
  Linear Discriminant Analysis (LDA)

Trees:
  Decision trees
  Gradient Boosted Regression trees
  Adaptive Boosting trees (AdaBoost)
  Conditional Inference trees (CI trees)
  Bootstrap Aggregation (Bagging) trees
  Gradient Boosted Machines (GBM)
  Random Forest (RF)

Support Vector Machines (SVM):
  Support vector classifier (two class)
  Support vector classifier (multiclass)
  Kernels and support vector machines

Unsupervised Learning

Dimensionality reduction:
  Principal Component Analysis (PCA)
  Singular Value Decomposition (SVD)
  MinHash
  Locality Sensitive Hashing (LSH)
  t-Distributed Stochastic Neighbor Embedding (t-SNE)

Clustering:
  K-means Clustering
  Hierarchical Clustering
  Bradley-Fayyad-Reina (BFR) clustering
  Clustering Using REpresentatives (CURE) clustering
  Bayesian networks
  Topic modelling

Market Basket:
  Apriori (association rules)
  Park-Chen-Yu algorithm (PCY)
  Savasere, Omiecinski and Navathe (SON)
  Toivonen's algorithm

Stream Analysis:
  Bloom filters
  Flajolet-Martin algorithm
  Alon-Matias-Szegedy algorithm
  Datar-Gionis-Indyk-Motwani algorithm

NeuralNetwork families / Deep Learning:
  Perceptrons
  Simple Neural Networks (fully connected)
  Deep Boltzmann machines
  Convolutional neural networks
  Recurrent neural networks

Recommender Systems:
  Content-based recommender
  User-User recommender
  Item-Item recommender
  Hybrid recommender

Others:
  Genetic algorithm (chromosome)
  Multi-armed bandit
  K-Nearest Neighbors (KNN)
  Latent Dirichlet Allocation
WHAT KIND OF PROBLEM DOES EACH FAMILY ADDRESS?
Regressions: How many patients will enter the ICU in a given time?
Trees: Does patient number X have breast cancer, yes/no?
Support Vector Machines (SVM): Does patient number X have breast cancer, yes/no?
Dimensionality reduction: I have 1200 features (age, gender, income, diagnostic codes, hospital visits, etc.) in my data about the patients; is there a simpler way to group those features?
Market Basket: Diet preference: if I love eating strawberries, will I also like raspberries?
Stream Analysis: How do I count the number of patients passing by this check-point on a live-feed camera (= stream)?
NeuralNetwork families / Deep Learning: (1) Object detection of people & cars; (2) NLP, responding to a sentence
Recommender Systems: Predict rating of hospitals
Frequently used English letters
Countries / cities?
Relative positions?
Frequently used nouns in a sentence?
Frequently used past-tense verbs
Geographic regions & nationalities
Education facilities
Concept of good / just?

Aiday (slide deck by Zenodia Charpy)

  • 1. AGENDA - DATA SCIENCE IN PRACTICE • The ”compact” version of data science activities • Data science process breakdown step by step • Vote to see deep / reinforcement learning demo
  • 2. Data Science is an iterative process...
  • 3.
  • 4. SIMPLIFICATION OF DATA SCIENCE PROCESS (1) Business Understanding + (2) Data understanding (3) Data processing + (4) Feature Engineering (5) Model selection + (6) Performance Evaluation (7) Deployment + (8) Consumption
  • 5. BUSINESS UNDERSTANDING & DATA UNDERSTANDING (1) Business Understanding + (2) Data understanding (3) Data processing + (4) Feature Engineering (5) Model selection + (6) Performance Evaluation (7) Deployment + (8) Consumption
  • 6. Data Science is an iterative process, and EVERY decision a Data Scientist made in the process is a Trade-off
  • 8. DATA PROCESSING & FEATURE ENGINEERING (1) Business Understanding + (2) Data understanding (3) Data processing + (4) Feature Engineering (5) Model selection + (6) Performance Evaluation (7) Deployment + (8) Consumption
  • 9. DATA PROCESSING = WRANGLING / TRANSFORMATION - MISSING DATA - REDUCE DIMENSION ( SOM, PCA, SVD...ETC) (1) Business Understanding + (2) Data understanding (3) Data processing + (4) Feature Engineering (5) Model selection + (6) Performance Evaluation (7) Deployment + (8) Consumption
  • 10. Deal with Missing Data Remove/keep ?Exhibit B 
  • 11. DIMENSION REDUCTION – PCA 10 dimensions 2 dimensions Ockham's razor - More things should not be used than are necessary ! Exhibit C
  • 12. FEATURE ENGINEERING WHY FEATURE ENGINEERING ? (1) Business Understanding + (2) Data understanding (3) Data processing + (4) Feature Engineering (5) Model selection + (6) Performance Evaluation (7) Deployment + (8) Consumption
  • 13. explain Correlation with a metaphor Interval of distance Direction to the right A B
  • 14. Observation Interval of distance Direction to the right A B Highly correlated(0.75~1) : Tesla car and Volvo car moving almost at the same speed and toward the same direction Negatively correlated(<0) : Tesla car and Volvo car moving toward different directions Positively correlated (0.5 ~0.75) : Tesla car move a bit faster than Volvo car but they are still both heading at the same direction explain Correlation with a metaphor continued Distance=1-corr  0 Distance=1-corr  0.25- 0.5 Distance=1-corr  0. 5- 1
  • 16. MODEL SELECTION & PERFORMANCE EVALUATION (1) Business Understanding + (2) Data understanding (3) Data processing + (4) Feature Engineering (5) Model selection + (6) Performance Evaluation (7) Deployment + (8) Consumption
  • 17. Q : do you know the answer to the question you asked ? Supervise d learning Regressio ns Classes Unsupervis ed learning Deep learning Clusterin g Associati on analysis Ye s No
  • 18.
  • 19. GENERIC SUPERVISED MACHINE LEARINNG FLOW (1) Business Understanding + (2) Data understanding (3) Data processing + (4) Feature Engineering (5) Model selection + (6) Performance Evaluation (7) Deployment + (8) Consumption
  • 21. MODEL PERFORMANCE EVALUATION USE SUPERVISED LEARNING AS AN EXAMPLE (1) Business Understanding + (2) Data understanding (3) Data processing + (4) Feature Engineering (5) Model selection + (6) Performance Evaluation (7) Deployment + (8) Consumption
  • 22. NAIVE WAY TO LOOK AT IT – ACCURACY ! Just guessing Better than guessing
  • 23. ZOOM-IN, WHAT IS MORE IMPORTANT ? Breast cancer Recurrent (=1) Breat cancer Not recurrent(=0) Breast Cancer Recurrent(=1) True Positive Type II error (False Negative) Breast Cancer Not recurrent(=0) Type I error (False Positive) True Negative PredictedLabel True Label Prediction says that you dont have BreastCancer but you acutlaly DO !!!
  • 24. TWEAK ... False negatives drop from 6 to 1
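The slides do not say which tweak was applied. One common option, assumed here purely for illustration, is lowering the decision threshold of a classifier that outputs a score. The sketch below uses made-up labels and scores (not the breast cancer data from the deck) to show how accuracy can stay the same while false negatives are traded for false positives.

```python
def confusion_counts(y_true, scores, threshold):
    """Return (TP, FN, FP, TN), predicting class 1 when score >= threshold."""
    tp = fn = fp = tn = 0
    for truth, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if truth == 1 and predicted == 1:
            tp += 1
        elif truth == 1 and predicted == 0:
            fn += 1  # Type II error: a real positive was missed
        elif truth == 0 and predicted == 1:
            fp += 1  # Type I error: a false alarm
        else:
            tn += 1
    return tp, fn, fp, tn

def accuracy(tp, fn, fp, tn):
    return (tp + tn) / (tp + fn + fp + tn)

def recall(tp, fn, fp, tn):
    """Share of real positives that the model catches."""
    return tp / (tp + fn)

# Hypothetical true labels (1 = recurrent) and model scores.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.35, 0.45, 0.3, 0.2, 0.2, 0.1, 0.05]

for threshold in (0.5, 0.3):
    counts = confusion_counts(y_true, scores, threshold)
    print(threshold, counts, accuracy(*counts), recall(*counts))
```

In this toy data both thresholds give the same accuracy, yet the lower threshold misses no recurrent case. That is why accuracy alone is a naive way to judge the model when a false negative is the costly error.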
  • 25. OTHER IMPORTANT CRITERIA TO CONSIDER: model performance; practical stuff
  • 26. MODEL DEPLOYMENT & CONSUMPTION CHOICES
  • 28. DEEP LEARNING & REINFORCEMENT LEARNING DEMO
  • 29. Detect people & cars: computer vision; self-navigate: Q-learning; OpenAI gym Mountain Car -0; env: opencv+ImageAI; env: py36_kivy; env: unity; crowd simulation (A* algorithm); env: py36_kivy
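For readers who have not seen Q-learning, here is a minimal tabular sketch. It is not the demo code from the talk and does not use the Mountain Car environment; it assumes a toy corridor of five cells in which the agent starts at cell 0 and is rewarded only on reaching cell 4. The hyperparameters are arbitrary illustrative choices.

```python
import random

N_STATES = 5      # corridor cells 0..4; cell 4 is the goal
ACTIONS = (0, 1)  # 0 = move left, 1 = move right

def step(state, action):
    """Move one cell (walls at both ends); reward 1 only on reaching the goal."""
    if action == 0:
        next_state = max(0, state - 1)
    else:
        next_state = min(N_STATES - 1, state + 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap on steps per episode
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)  # explore
            else:
                best = max(q[state])          # exploit, breaking ties at random
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            next_state, reward, done = step(state, action)
            # Q-learning target: reward plus discounted best value of the next state.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q

q_table = train()
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(greedy_policy)
```

After training, the greedy policy chooses "right" in every non-goal cell, because the discounted reward makes each step towards the goal worth more than a step away from it.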
  • 30. GAN – GENERATIVE ADVERSARIAL NETWORK
  • 32.
  • 33. (1) Business Understanding + (2) Data understanding (3) Data processing + (4) Feature Engineering (5) Model selection + (6) Performance Evaluation (7) Deployment + (8) Consumption
  • 34. Supervised Learning
      Regressions: Linear Regression; Stepwise Regression; Piecewise Polynomials and Splines; Smoothing Splines; Logistic Regression; Multivariate Adaptive Regression Splines; Least Absolute Shrinkage and Selection Operator (LASSO); Ridge Regression; Linear Discriminant Analysis (LDA)
      Trees: Decision Trees; Gradient Boosted Regression Trees; Adaptive Boosting Trees (AdaBoost); Conditional Inference Trees (CI trees); Bootstrap Aggregation (Bagging) Trees; Gradient Boosted Machines (GBM); Random Forest (RF)
      Support Vector Machines (SVM): Support vector classifier (two class); Support vector classifier (multiclass); Kernels and support vector machines
    Unsupervised Learning
      Dimensionality reduction: Principal Component Analysis (PCA); Singular Value Decomposition (SVD); MinHash; Locality Sensitive Hashing (LSH); t-Distributed Stochastic Neighbor Embedding (t-SNE)
      Clustering: K-means Clustering; Hierarchical Clustering; Bradley-Fayyad-Reina (BFR) clustering; Clustering Using REpresentatives (CURE) clustering; Bayesian networks; Topic modelling
      Market Basket: Apriori (association rules); Park, Chen and Yu algorithm (PCY); Savasere, Omiecinski and Navathe (SON); Toivonen's algorithm
      Stream Analysis: Bloom filters; Flajolet-Martin algorithm; Alon-Matias-Szegedy; Datar-Gionis-Indyk-Motwani algorithm
    Neural Network families (Deep Learning): Perceptrons; Simple Neural Networks (fully connected); Deep Boltzmann machines; Convolutional neural networks; Recurrent neural networks
    Recommender Systems: Content-based recommender; User-user recommender; Item-item recommender; Hybrid recommender
    Others: Genetic algorithm (chromosome); Multi-armed bandit; K-Nearest Neighbors (KNN); Latent Dirichlet Allocation
  • 35. The same algorithm map as slide 34, annotated with the kind of problem each family addresses:
      Regressions: How many patients will enter the ICU in a given time?
      Trees: Does patient number X have breast cancer, yes/no?
      Support Vector Machines (SVM): Does patient number X have breast cancer, yes/no?
      Dimensionality reduction: I have 1200 features (age, gender, income, diagnostic codes, hospital visits, etc.) in my data about the patients. Is there a simpler way to group those features?
      Market Basket: Diet preference. If I love eating strawberries, will I also like raspberries?
      Stream Analysis: How do I count the number of patients passing this checkpoint on a live camera feed (= stream)?
      Deep Learning: (1) Object detection: people & cars; (2) NLP: respond to a sentence
      Recommender Systems: Predict the rating of hospitals
  • 36. Frequently used English letters; Countries/cities?; Relative positions?; Frequently used nouns in a sentence?; Frequently used past-tense verbs; Geographic regions & nationalities; Education facilities; Concept of good/just?

Editor's Notes

  1. Microsoft : https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/overview
  2. Microsoft : https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/overview
  3. Microsoft : https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/overview
  4. Align the two blocks
  5. Why/when do we use PCA? When lots of columns are highly correlated, which is usually bad practice for regression models such as LDA. So PCA does two things: (1) it reduces dimension, and (2) it removes collinearity.
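A minimal sketch of both effects, assuming only two columns so that the eigen-decomposition of the covariance matrix can be written in closed form. The data values and function names are hypothetical; a real project would normally use a library implementation.

```python
import math

def covariance_2d(xs, ys):
    """Sample variances and covariance of two columns: (var_x, cov_xy, var_y)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    return var_x, cov_xy, var_y

def pca_2d(xs, ys):
    """Eigenvalues (largest first) and first principal direction of the 2x2 covariance."""
    a, b, c = covariance_2d(xs, ys)
    centre = (a + c) / 2
    radius = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    lam1, lam2 = centre + radius, centre - radius
    # (b, lam1 - a) is an eigenvector for lam1 whenever the covariance b is non-zero.
    vx, vy = b, lam1 - a
    norm = math.hypot(vx, vy)
    return (lam1, lam2), (vx / norm, vy / norm)

# Two hypothetical, highly correlated columns (y is roughly 2 * x).
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

(lam1, lam2), direction = pca_2d(x, y)
explained = lam1 / (lam1 + lam2)

# Project each centred row onto the first component: two columns become one.
mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
scores = [(xi - mean_x) * direction[0] + (yi - mean_y) * direction[1]
          for xi, yi in zip(x, y)]
print(round(explained, 4), [round(s, 2) for s in scores])
```

Because the two columns carry almost the same information, the first component alone explains nearly all of the variance, and the single projected column replaces the collinear pair.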
  6. We have two cars here, a Tesla and a Volvo. Over this interval (from point A to point B), both cars are moving to the right at almost the same speed. Observing them from A to B, we see that they arrive at approximately the same place and move along the path almost in sync.

     This could be because a husband and a wife, each with their own car, are driving home together. It could be that the two cars are on a race track. It could be completely coincidental: two strangers who just happen to be on this road, heading in the same direction over the observed path from A to B. Since we have not been given enough information, we have no idea which of these scenarios it is.

     The only valid conclusion we can draw is that the Tesla and the Volvo move together, almost synchronized in speed and time (which means the distances they cover are quite similar as well). So if we know that the Tesla will eventually show up at point B, we know that the Volvo will be there as well. We only need to observe one of the cars (either the Tesla or the Volvo) at point B to determine how much distance both cars have covered, since they arrive at B at almost the same time. So we can actually just pick one. This means that the two cars are positively correlated, and the correlation is quite strong, approaching 1.

     Now remember that we do not know whether the two cars move in the same direction at the same time by accident, or whether there is some scenario behind the scenes that is yet to be discovered. In other words, correlation (positive or negative) does not mean causation.

     So why is it important for feature engineering to know this? Let's say we want to know the fuel consumption efficiency of cars. We should then NOT take the Tesla into consideration, since a Tesla does not even use fuel. It would just confuse the model I build: the model cannot possibly know why the Tesla has only zeros as values when it comes to fuel consumption. Hence it is actually harmful not to select your features carefully.
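The two lessons in this note (pick one of two strongly correlated features, and leave out a feature that carries no information) can be sketched as a simple filter. The column names, values and the 0.95 threshold are hypothetical choices for illustration, not something prescribed by the talk.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length, non-constant sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def select_features(columns, threshold=0.95):
    """Drop constant columns, then drop any column that is almost a copy
    (|corr| > threshold) of a column that has already been kept."""
    kept = []
    for name, values in columns.items():
        if len(set(values)) == 1:
            continue  # constant column: nothing for a model to learn from
        if any(abs(pearson(values, columns[k])) > threshold for k in kept):
            continue  # redundant with a kept column: just pick one of the pair
        kept.append(name)
    return kept

# Hypothetical trip data.
columns = {
    "distance_km": [10, 20, 30, 40, 50],
    "distance_miles": [6.2, 12.4, 18.6, 24.9, 31.1],  # same information as km
    "fuel_used_electric_car": [0, 0, 0, 0, 0],        # always zero
    "passengers": [1, 4, 2, 1, 3],
}

print(select_features(columns))
```

A filter like this only looks at the data. Deciding whether a feature makes sense for the question at all, as in the fuel example, still needs domain knowledge.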
  7. Same text as note 6.
  8. https://findcareerinfo.com/cheat-sheets-machine-learning-deep-learning-ai-data-science-maths-sql/
  9. http://www.rachaelmartino.com/2016/10/azure-cloud-based-analytics_99.html
  10. Notes on this slide- Source : http://scott.fortmann-roe.com/docs/BiasVariance.html
  11. For forecasting models
      Input: datetime + counts (see the example below of DeviceNumber 41274's daily count)

      Date          counts
      2015-01-01    520
      2015-01-02    319
      2015-01-03    389
      2015-01-04    355
      2015-01-05    437
      2015-01-06    333

      Output & performance: also a time series, but with lower and upper bounds

                 Point Forecast   Lo 80        Hi 80       Lo 95        Hi 95
      2018.01    159.8044         23.64662     295.9622    -48.43096    368.0398
      2018.02    299.7230         143.70186    455.7441    61.10926     538.3367
      2018.03    356.6332         198.41676    514.8496    114.66204    598.6043
      2018.04    345.0193         186.61808    503.4206    102.76551    587.2732
      2018.05    308.1619         149.19870    467.1251    65.04866     551.2751
      2018.06    279.4213         118.15266    440.6899    32.78220     526.0604

      Output (visual)

      Note: it should be evident that this type of forecasting model cannot "predict" the top 5 destinations, since it does not accept any inputs other than date + counts.

      For prediction models
      Input: features of choice + label (= the correct answer to the question); see the example below. Note that for this type of model, one needs to feed BOTH the features and the "answer" for each record.

      DeviceNumber   PublicHoliday   year   month   hour   date2hour      DestinationZoneName
      41274          0               2014   6       3      2014-06-01 3   KASTRUP/KO0OPENHAMN(F+L)
      41274          0               2014   6       3      2014-06-01 3   MALMO0O
      41274          0               2014   6       4      2014-06-01 4   KASTRUP/KO0OPENHAMN(F+L)

      Output & performance: performance is usually represented by how "accurate" the model is at predicting each class correctly (given the features fed to it), i.e. if you have 18 destinations to predict, then its accuracy is measured by how well it answers correctly on all 18 classes!
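One way to read "how well it answers on all classes" is accuracy computed separately for each true class. The sketch below assumes that reading and uses three made-up zone labels rather than the 18 real destinations.

```python
from collections import Counter

def per_class_accuracy(y_true, y_pred):
    """For each true class, the fraction of its records predicted correctly."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / totals[label] for label in totals}

def overall_accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical true and predicted destination zones.
y_true = ["zone_a", "zone_a", "zone_a", "zone_a", "zone_b", "zone_b", "zone_c", "zone_c"]
y_pred = ["zone_a", "zone_a", "zone_a", "zone_b", "zone_b", "zone_a", "zone_c", "zone_c"]

print(overall_accuracy(y_true, y_pred))
print(per_class_accuracy(y_true, y_pred))
```

Looking at each class separately matters because a single overall figure can hide a destination that the model gets wrong half of the time.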