A Guideline for Statistical and Machine Learning
Alexandre Alves, June/12/2014
Define your Goal
Are you interested in prediction or in inference?
Prediction is a black-box method: given values for the features X1, …, Xp, it
predicts the value of the response Y.
Inference is a white-box method: it explains how the response Y is affected as the
features X1, …, Xp change.
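As a rough illustration of the two goals, the sketch below fits one straight line and then uses it both ways. The height and weight numbers are made up for the example, and `fit_line` is a hypothetical helper, not something from the slides.

```python
# Minimal sketch: one fitted line, used for inference and for prediction.
# All data below is invented for illustration.

def fit_line(xs, ys):
    """Ordinary least squares for a single feature; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

heights_cm = [150, 160, 170, 180, 190]   # hypothetical feature X
weights_kg = [50, 58, 66, 74, 82]        # hypothetical response Y

intercept, slope = fit_line(heights_cm, weights_kg)

# Inference: how does Y change as X changes?
print(f"each extra cm of height adds about {slope:.2f} kg")

# Prediction: what is Y for a new value of X?
predicted = intercept + slope * 175
print(f"predicted weight at 175 cm: {predicted:.1f} kg")
```

The same fitted model serves both purposes here only because a straight line is easy to read; with a black-box method the prediction step would still work, but the slope would have no direct counterpart.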
Define your Goal
People tend to think they need to predict, but more often than not inference will give
them more insight:
In an advertisement campaign, which media contributed most to sales?
Analyzing a business process failure, which attribute of the process contributes the
most to a negative outcome?
Given an increase in height, what is the expected increase in weight?
You must have a goal in mind in the form of a Question to be answered by means of
analyzing the Observations in your data.
Define the Model
Looking at the Observations, is the Response present in the data?
In a history of fraudulent transactions, the outcome of fraud or not fraud is specified in the
transactions themselves.
If so, then you are looking at a Supervised model, and there is a Response variable.
Or is the Response not in the data?
In a financial market Exchange, which stocks are hot? The trade transactions do not include
a variable specifying if the stock is hot or not hot!
In this case, you are looking at an Unsupervised model.
Supervised Models
Is the Response variable quantitative?
What’s the weight? What’s the price? What’s the income?
You are dealing with a Regression problem.
Or is the Response variable qualitative (categorical)?
Is it fraud? What’s the gender - male or female? What’s the brand - A, B, C?
You are looking into a Classification problem.
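The questions above can be read as a small decision procedure. The sketch below is one possible encoding of it; `problem_type` is a hypothetical helper, and it assumes the Response is passed as a plain list of values (or `None` when it is not recorded).

```python
from numbers import Number

def problem_type(response):
    """Map the questions above to a problem type.

    `response` is the list of Response values, or None when the
    Response is not present in the Observations.
    """
    if response is None:
        return "unsupervised"
    # bool is a subclass of int in Python, so treat it as categorical.
    quantitative = all(
        isinstance(v, Number) and not isinstance(v, bool) for v in response
    )
    return "regression" if quantitative else "classification"

print(problem_type([70.5, 82.0, 64.3]))       # weights
print(problem_type(["fraud", "not fraud"]))   # labels
print(problem_type(None))                     # no Response recorded
```

Note that categories stored as numeric codes (for example brand 1, 2, 3) would be misread as quantitative by this check, so the final call still depends on knowing what the variable means.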
Regression Problems
Is there a somewhat linear relationship between the features and the response?
Gas consumption as a function of horsepower.
Fit a Linear model to your Observations.
Is there no clear relationship or form between the features and the response?
Gas consumption as a function of the model year of the car.
Prefer a non-parametric method, such as Regression Splines or Generalized Additive
Models.
Classification Problems
Is the Response made of only two categories (e.g. yes/no)?
Fit a Logistic regression model to your Observations.
Is there a somewhat linear boundary between the categories of the Response?
Use Linear Discriminant Analysis.
Is there no clear boundary form between the categories, but is the probability distribution of the categories known?
Use a Naive Bayes Classifier.
Otherwise, if there is no clear boundary and the distribution is not known:
Use K-Nearest Neighbors.
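Of the classifiers listed above, K-Nearest Neighbors is the simplest to sketch from scratch. The version below is a minimal illustration using majority vote and Euclidean distance; the transaction data and the `knn_predict` name are assumptions made for the example.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: dist(pair[0], query),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical transactions: (amount, hour of day) -> label
points = [(10, 9), (12, 10), (11, 14), (950, 3), (990, 2), (900, 4)]
labels = ["ok", "ok", "ok", "fraud", "fraud", "fraud"]

print(knn_predict(points, labels, (15, 11)))
print(knn_predict(points, labels, (940, 3)))
```

In practice the features would usually be rescaled first, since here the amount dominates the distance simply because its numbers are larger.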
Unsupervised Models
Unsupervised learning is a relatively new field.
Is there a desired number of groups or categories?
Hot stocks (financial derivatives) and Not-so-Hot.
Use K-Means Clustering.
Otherwise, if the number of groups is not known:
Stocks A and B trend together, stocks C and D trend together, stocks E and F…
Use Hierarchical Clustering.
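For the case where the number of groups is chosen up front, K-Means can be sketched in a few lines. The version below is a simplified take on Lloyd's algorithm: it seeds the centroids with the first k points (an assumption made to keep the run deterministic; real implementations usually pick seeds more carefully), and the two-dimensional points are invented.

```python
from math import dist

def k_means(points, k, iterations=100):
    """Lloyd's algorithm; the first k points seed the centroids (a simplification)."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = None
    for _ in range(iterations):
        # Assign every point to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster: converged
        assignment = new_assignment
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return assignment, centroids

# Hypothetical (volume change, price change) pairs for six stocks
stocks = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 9.0)]
groups, centers = k_means(stocks, k=2)
print(groups)
print(centers)
```

Hierarchical clustering is not shown; it differs in that it builds a tree of merges and lets the number of groups be read off afterwards.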
Train (and Re-train) the Model
Assessing the Model
The model is created by fitting the Observations.
The Accuracy of the model must be assessed:
If a regression problem, then measure the mean squared error.
If a classification problem, then measure the error rate.
Now that we can measure, we can try different methods to estimate the error on unseen data and improve the model:
Leave k observations out as test data and Cross-Validate.
Bootstrap by resampling.
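The measures and resampling schemes named above can be sketched as follows. The helper names are hypothetical, and the fold split is the simplest possible one (every k-th observation), assuming the data has already been shuffled.

```python
import random

def mean_squared_error(actual, predicted):
    """Average squared difference; used for regression problems."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def error_rate(actual, predicted):
    """Fraction of misclassified observations; used for classification problems."""
    return sum(a != p for a, p in zip(actual, predicted)) / len(actual)

def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for k roughly equal folds."""
    indices = list(range(n))
    for fold in range(k):
        test = indices[fold::k]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test

def bootstrap_sample(data, rng):
    """Draw len(data) observations with replacement."""
    return [rng.choice(data) for _ in data]

print(mean_squared_error([3.0, 5.0], [2.0, 7.0]))
print(error_rate(["a", "b", "b", "a"], ["a", "b", "a", "a"]))
for train, test in k_fold_indices(6, 3):
    print(train, test)
print(bootstrap_sample([1, 2, 3, 4, 5], random.Random(0)))
```

A typical use would be to fit the model on each `train` split, score it on the matching `test` split with one of the two measures, and average the scores across folds.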
Improving the Model
The possible findings are:
Change the features used in the Model:
Car color has no correlation to gas consumption, thus remove it from the Model.
Change the form of the features, or the interaction between them:
The relationship between horsepower and gas consumption is not strictly linear, thus square the horsepower variable.
Change the model:
Low accuracy is a good indication that the selected Model is wrong.
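To make the horsepower example concrete, the sketch below fits the same simple linear model twice, once on horsepower and once on horsepower squared, and compares the mean squared error. The consumption figures are synthetic and deliberately generated from a quadratic, so the outcome is illustrative rather than evidence about real cars.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single feature; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def mse_of_fit(xs, ys):
    """Fit a line to (xs, ys) and return its mean squared error on the same data."""
    intercept, slope = fit_line(xs, ys)
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

horsepower = [50, 100, 150, 200, 250]
# Hypothetical consumption that grows with the square of horsepower.
consumption = [4.0 + 0.0002 * hp ** 2 for hp in horsepower]

mse_linear = mse_of_fit(horsepower, consumption)
mse_squared = mse_of_fit([hp ** 2 for hp in horsepower], consumption)

print("MSE with horsepower:        ", mse_linear)
print("MSE with horsepower squared:", mse_squared)
```

Both errors here are measured on the training data; on real data the comparison would be made on held-out observations.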
Trade-offs
Models that tend to have high accuracy are often hard to interpret, and therefore less suitable for inference.
Linear regressions are easy to interpret, but tend to be less accurate when the true relationship is not linear.
Support-Vector-Machines are very flexible, but can’t be easily interpreted.
Models that tend to be flexible are less biased, but are more sensitive to changes in the training data (higher variance).
Linear regressions are biased towards a linear form, but change little when the
training data changes.
k-NN with a small k has low bias, but high variance as the training data changes.
Flexibility versus Interpretability, Bias versus Variance
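The variance side of this trade-off can be seen in a small simulation. The sketch below repeatedly draws a noisy training set from the same straight line, then compares how much the prediction at one point moves around for a linear fit and for 1-nearest-neighbour regression. The true line, noise level, seed, and helper names are all assumptions for the example, and because the true relationship is linear it does not show the bias side.

```python
import random
from statistics import pvariance

def fit_line(xs, ys):
    """Ordinary least squares for a single feature; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def one_nn_predict(xs, ys, x0):
    """1-nearest-neighbour regression: copy the response of the closest x."""
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x0))
    return ys[nearest]

rng = random.Random(42)   # fixed seed so the run is repeatable
xs = list(range(10))
x0 = 4                    # point where both models are asked to predict
linear_preds, knn_preds = [], []

for _ in range(500):
    # New training set each round: same true line, fresh noise.
    ys = [2.0 * x + rng.gauss(0.0, 1.0) for x in xs]
    intercept, slope = fit_line(xs, ys)
    linear_preds.append(intercept + slope * x0)
    knn_preds.append(one_nn_predict(xs, ys, x0))

print("variance of linear prediction:", pvariance(linear_preds))
print("variance of 1-NN prediction:  ", pvariance(knn_preds))
```

The 1-NN prediction inherits the full noise of a single observation, while the linear fit averages the noise over all ten, which is why its prediction is steadier from one training set to the next.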
“In God we trust, all others bring data.”
–W. Edwards Deming
“All models are wrong, some are useful.”
–George Box
“We are drowning in information and starving for knowledge.”
–Rutherford D. Rogers
