A Guideline for Statistical and Machine Learning
Alexandre Alves, June 12, 2014
Define your Goal
Are you interested in prediction or in inference from your data?
Prediction is a black-box method: given values for the features X1, …, Xp, it
predicts the value of the response Y.
Inference is a white-box method: it explains how the response Y is affected as the
features X1, …, Xp change.
Define your Goal
People tend to think they need to predict, but more often than not inference will give
them more insight:
In an advertisement campaign, which media contributed most to sales?
Analyzing a business process failure, which attribute of the process contributes the
most to a negative outcome?
Given an increase in height, what is the expected increase in weight?
You must have a goal in mind in the form of a Question to be answered by means of
analyzing the Observations in your data.
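The difference between the two goals can be sketched with the height and weight question above. This is a minimal illustration, not part of the original slides: the data is synthetic and made up for the example, and the helper name fit_line is hypothetical. The fitted slope answers the inference question (how much does weight change per unit of height?), while plugging a new height into the fitted line answers the prediction question.

```python
# Synthetic, made-up observations for illustration only (not real data).
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 66, 72, 80]


def fit_line(xs, ys):
    """Ordinary least squares for the model y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(heights_cm, weights_kg)

# Inference: read the model itself. The slope is the expected increase
# in weight (kg) for each additional cm of height.
print(f"expected increase in weight per cm: {slope:.2f} kg")

# Prediction: use the model as a black box for a new observation.
predicted_weight = intercept + slope * 175
print(f"predicted weight at 175 cm: {predicted_weight:.1f} kg")
```

The same fitted model serves both goals here only because a straight line is easy to read; with a more flexible model the prediction step still works, but the inference step gets harder.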
Define the Model
Looking at the Observations, is the Response present in the data?
In a history of fraudulent transactions, the outcome of fraud or not fraud is specified in the
transactions themselves.
If so, then you are looking at a Supervised model, and there is a Response variable.
Or is the Response not in the data?
In a financial market Exchange, which stocks are hot? The trade transactions do not include
a variable specifying if the stock is hot or not hot!
In this case, you are looking at an Unsupervised model.
Supervised Models
Is the Response variable quantitative?
What’s the weight? What’s the price? What’s the income?
You are dealing with a Regression problem.
Or is the Response variable qualitative (categorical)?
Is it fraud? What’s the gender - male or female? What’s the brand - A, B, C?
You are looking into a Classification problem.
Regression Problems
Is there a somewhat linear relationship between the features and the response?
Example: gas consumption as a function of horsepower.
Fit a Linear model to your Observations.
Is there no clear relationship or form between the features and the response?
Example: gas consumption as a function of the model year of the car.
Prefer a more flexible method, such as Regression Splines or Generalized Additive
Models.
Classification Problems
Is the Response made of only two categories (e.g. yes/no)?
Fit a Logistic regression model to your Observations.
Is there a somewhat linear boundary between the categories of the Response?
Use Linear Discriminant Analysis.
Is there no clear form for the boundary between the categories, but the probability distribution of the features within each category is known?
Use a Naive Bayes Classifier.
Otherwise, if there is no clear boundary and the distribution is not known:
Use K-Nearest Neighbors.
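K-Nearest Neighbors makes no assumption about the boundary or the distribution: a new observation gets the majority label of the k closest training observations. The following is a minimal sketch of that idea. The function name knn_predict, the two-feature points and the fraud labels are assumptions made up for illustration; they do not come from the slides.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Synthetic observations: two made-up features per transaction.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (6.5, 8.0)]
labels = ["not fraud", "not fraud", "not fraud", "fraud", "fraud", "fraud"]

label_a = knn_predict(points, labels, (1.2, 1.5))
label_b = knn_predict(points, labels, (6.8, 7.0))
print(label_a, "/", label_b)
```

An odd k avoids ties when there are only two categories. In practice the features usually need to be on comparable scales, since the distance treats every feature equally.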
Unsupervised Models
Unsupervised learning is a relatively new field.
Is there a desired number of groups or categories?
Example: Hot stocks (financial derivatives) and Not-so-Hot stocks.
Use K-Means Clustering.
Otherwise, if the number of groups is not known:
Example: stocks A and B trend together, stocks C and D trend together, stocks E and F…
Use Hierarchical Clustering.
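When the number of groups is chosen up front, K-Means alternates between assigning each observation to its nearest centroid and moving each centroid to the mean of its group. Below is a minimal sketch of that loop, covering K-Means only. The function name k_means, the synthetic two-feature "stocks" and the naive initialization (the first k points become the starting centroids) are assumptions for illustration; real implementations typically use random or smarter initialization and a convergence check.

```python
import math


def k_means(points, k, iterations=10):
    """Cluster points into k groups; returns (centroids, assignments)."""
    # Assumption: naive deterministic start, the first k points are the centroids.
    centroids = [points[i] for i in range(k)]
    assignments = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(coords) / len(members) for coords in zip(*members)
                )
    return centroids, assignments


# Synthetic observations: two made-up features per stock.
stocks = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.5), (9.0, 10.0)]
centroids, assignments = k_means(stocks, k=2)
print(assignments)
print(centroids)
```

The result is a group index per observation, not a label such as "hot": naming the groups is left to whoever reads the clusters.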
Train (and Re-train) the Model
Assessing the Model
The model is created by fitting the Observations.
The Accuracy of the model must be assessed:
If a regression problem, then measure the mean squared error.
If a classification problem, then measure the error rate.
Once we can measure, we can use resampling methods to estimate how well the model performs on unseen data and to compare alternatives:
Hold out k observations as test data and Cross-Validate.
Bootstrap by resampling.
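The two metrics and the two resampling ideas above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not a reference implementation: all function names are hypothetical, the data is synthetic, and the folds are formed by taking every k-th observation (shuffling first is common in practice). The cross-validation helper takes any fit and predict pair, so a straight-line model can be compared with a baseline that always predicts the training mean.

```python
import random


def mean_squared_error(actual, predicted):
    """Regression metric: average squared difference."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def error_rate(actual, predicted):
    """Classification metric: fraction of labels predicted incorrectly."""
    return sum(a != p for a, p in zip(actual, predicted)) / len(actual)


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def predict_line(model, x):
    intercept, slope = model
    return intercept + slope * x


def k_fold_mse(xs, ys, k, fit, predict):
    """Average held-out MSE; fold i holds out observations i, i+k, i+2k, ..."""
    n = len(xs)
    scores = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))
        train_x = [xs[i] for i in range(n) if i not in test_idx]
        train_y = [ys[i] for i in range(n) if i not in test_idx]
        test_x = [xs[i] for i in sorted(test_idx)]
        test_y = [ys[i] for i in sorted(test_idx)]
        model = fit(train_x, train_y)
        predictions = [predict(model, x) for x in test_x]
        scores.append(mean_squared_error(test_y, predictions))
    return sum(scores) / k


def bootstrap_means(values, n_resamples=1000, seed=0):
    """Means of resamples drawn with replacement, same size as the data."""
    rng = random.Random(seed)
    n = len(values)
    return [sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)]


# Synthetic, perfectly linear data for illustration only.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 6, 8, 10, 12]

cv_line = k_fold_mse(xs, ys, 3, fit_line, predict_line)
cv_mean = k_fold_mse(xs, ys, 3, lambda tx, ty: sum(ty) / len(ty), lambda m, x: m)
boot = bootstrap_means(ys, n_resamples=200)
print(f"cross-validated MSE, line: {cv_line:.3f}, mean baseline: {cv_mean:.3f}")
print(f"spread of bootstrap means: {min(boot):.2f} to {max(boot):.2f}")
```

The spread of the bootstrap means gives a rough sense of how much an estimate would move if the observations had been sampled differently.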
Improving the Model
The findings may lead you to:
Change the features used in the Model:
Car color has no correlation with gas consumption, so remove it from the Model.
Change the form of the features (transformations or interactions):
Horsepower to gas consumption is not strictly linear, so square the horsepower variable.
Change the model:
Low accuracy is a good indication that the selected Model is wrong.
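The effect of squaring a feature can be checked by fitting the same linear model twice and comparing the mean squared error. This is a minimal sketch on synthetic data constructed to be exactly quadratic, so the numbers only illustrate the mechanism. The helper names are hypothetical, horsepower is in arbitrary scaled units, and the squared term replaces the original feature here for simplicity, whereas in practice it is often added alongside it.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def training_mse(xs, ys):
    """Fit a line to (xs, ys) and return its mean squared error on the same data."""
    intercept, slope = fit_line(xs, ys)
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    return sum(r ** 2 for r in residuals) / len(residuals)


# Synthetic data: consumption = 2 + 0.5 * horsepower^2 (made up, scaled units).
horsepower = [1, 2, 3, 4, 5, 6]
consumption = [2 + 0.5 * hp ** 2 for hp in horsepower]

mse_linear = training_mse(horsepower, consumption)

# Transform the feature, then fit the same linear model on the new feature.
horsepower_squared = [hp ** 2 for hp in horsepower]
mse_squared = training_mse(horsepower_squared, consumption)

print(f"MSE with horsepower:   {mse_linear:.3f}")
print(f"MSE with horsepower^2: {mse_squared:.3f}")
```

The model is still linear in its coefficients; only the feature changed. On real data the comparison should use held-out error rather than training error.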
Trade-offs
Models that tend to have high accuracy are often hard to interpret and therefore less appropriate for inference:
Linear regression is easy to interpret, but can have low accuracy when the true relationship is not linear.
Support Vector Machines are very flexible, but cannot be easily interpreted.
Models that tend to be flexible are less biased, but are more sensitive to changes in the training data (higher variance):
Linear regression is biased towards a linear form, but changes little as the training data changes.
k-NN with a small k has low bias, but high variance as the training data changes.
Flexibility versus Interpretability, Bias versus Variance
“In God we trust, all others bring data.”
–W. Edwards Deming
“All models are wrong, some are useful.”
–George Box
“We are drowning in information and starving for knowledge.”
–Rutherford D. Rogers
