Distinction between outliers and influential data points
In this section, we learn the distinction between outliers and influential data points. In short:
An outlier is a data point whose response y does not follow the general trend of the rest
of the data.
A data point is influential if it unduly influences any part of a regression analysis, such
as the predicted responses, the estimated slope coefficients, or the hypothesis test results.
Note that — for our purposes — we consider a data point to be an outlier only if it is extreme
with respect to the other y values, not the x values.
One advantage of the case in which we have only one predictor is that we can look at simple
scatter plots in order to identify any outliers and influential data points. Let's take a look at a few
examples that should help to clarify the distinction between the two types of extreme values.

Example #1
Based on the definitions above, do you think the following data set contains any outliers? Or, any
influential data points?

You got it! All of the data points follow the general trend of the rest of the data, so there are no
outliers in the y direction. And none of the data points would appear to unduly influence the
location of the best fitting line.
Example #2
Now, how about this example? Do you think the following data set contains any outliers? Or,
any influential data points? (Wink! Wink!)

Of course! Because the blue data point does not follow the general trend of the rest of the data, it
would be considered an outlier. But, is the blue data point influential? An easy way to determine
if the data point is influential is to find the best fitting line twice: once with the blue data point
included and once with the blue data point excluded. The following plot illustrates the two best fitting
lines:

Wow — it's hard to even tell the two estimated regression equations apart! The dashed line
represents the estimated regression equation with the blue data point included, while the solid
line represents the estimated regression equation with the blue data point excluded. The
slopes of the two lines are very similar — 5.04 and 5.12, respectively.
[Regression output with the blue data point excluded is not reproduced here.]

In short, the predicted responses, estimated slope coefficients, and hypothesis test results are not
affected by the inclusion of the blue data point. Therefore, the data point is not deemed
influential. In summary, the blue data point is not influential, but still an outlier.
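
The "fit the line twice" check described in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are hypothetical (they are not the data set plotted above), the helper name `slopes_with_and_without` is made up for this sketch, and NumPy's `polyfit` is used as one convenient way to obtain a least squares line.

```python
import numpy as np

def slopes_with_and_without(x, y, index):
    """Return (slope using all points, slope with the point at `index` removed)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # polyfit with deg=1 returns [slope, intercept] of the least squares line.
    slope_all, _ = np.polyfit(x, y, deg=1)
    keep = np.arange(x.size) != index
    slope_excl, _ = np.polyfit(x[keep], y[keep], deg=1)
    return float(slope_all), float(slope_excl)

# Hypothetical data: ten points exactly on y = 2 + 5x, plus one extra point
# (the last one) that sits well above the trend near the middle of the x range.
x = np.append(np.arange(1.0, 11.0), 5.0)
y = np.append(2.0 + 5.0 * np.arange(1.0, 11.0), 60.0)

slope_all, slope_excl = slopes_with_and_without(x, y, index=10)
print(f"slope with the point: {slope_all:.2f}, without it: {slope_excl:.2f}")
```

With data like these, the extra point is an outlier in the y direction, yet the two slopes stay close, which is the pattern this example describes.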

Example #3
Now, how about this example? Do you think the following data set contains any outliers? Or,
any influential data points?

In this case, the blue data point does follow the general trend of the rest of the data. Therefore, it
is not deemed an outlier here. But, is the blue data point influential? It certainly appears to be far
removed from the rest of the data — in the x direction. Is that sufficient to make the data point
influential?
The following plot illustrates two best fitting lines — one obtained when the blue data point is
included and one obtained when the blue data point is excluded:

Again, it's hard to even tell the two estimated regression equations apart! The dashed line
represents the estimated regression equation with the blue data point included, while the solid
line represents the estimated regression equation with the blue data point excluded. The
slopes of the two lines are very similar — 4.93 and 5.12, respectively.

[Regression output with the blue data point excluded is not reproduced here.]

Here, there are hardly any side effects at all of including the blue data point.
In short, the predicted responses, estimated slope coefficients, and hypothesis test results are not
affected by the inclusion of the blue data point. Therefore, the data point is not deemed
influential. In summary, the blue data point is not influential, nor is it an outlier.
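
How far a point is "removed in the x direction" can be quantified. One standard quantity, which this section does not name, is the leverage; for a single predictor it is h_i = 1/n + (x_i - mean(x))^2 / sum_j (x_j - mean(x))^2. The sketch below uses hypothetical data and a made-up helper name `leverages`. It shows a point with an extreme x value that still follows the trend, so that removing it leaves the slope unchanged.

```python
import numpy as np

def leverages(x):
    """Leverage of each point in a simple linear regression with an intercept."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return 1.0 / x.size + centered**2 / np.sum(centered**2)

# Hypothetical data: every point, including the extreme one at x = 25,
# lies exactly on y = 2 + 5x, so the extreme point follows the trend.
x = np.append(np.arange(1.0, 11.0), 25.0)
y = 2.0 + 5.0 * x

h = leverages(x)
slope_all = np.polyfit(x, y, deg=1)[0]
slope_excl = np.polyfit(x[:-1], y[:-1], deg=1)[0]

print("largest leverage belongs to point", int(np.argmax(h)))
print(f"slope with the point: {slope_all:.2f}, without it: {slope_excl:.2f}")
```

An extreme x value alone is therefore not enough to make a point influential; the point also has to pull the line away from where the remaining data would put it.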
Example #4
One last example! Do you think the following data set contains any outliers? Or, any influential
data points?

That's right: in this case, the blue data point is most certainly an outlier and influential! The
blue data point does not follow the general trend of the rest of the data. Below are the two best fitting
lines — one obtained when the blue data point is included and one obtained when the blue data
point is excluded:

The two lines are (not surprisingly) substantially different. The dashed line represents the estimated regression
equation with the blue data point included, while the solid line represents the estimated
regression equation with the blue data point excluded. The existence of the blue data point
markedly reduces the slope of the regression line, dropping it from 5.12 to 3.32.
[Regression output with the blue data point excluded is not reproduced here.]

Here, the predicted responses and estimated slope coefficients are clearly affected by the
presence of the blue data point. In this case, the blue data point is deemed both influential and an
outlier.
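
The same comparison can be repeated for every point in turn, which makes an influential point stand out without having to pick it by eye. The following is a minimal sketch with hypothetical data and a made-up helper name `leave_one_out_slope_changes`. It reports, for each point, the slope with all points minus the slope with that point removed.

```python
import numpy as np

def leave_one_out_slope_changes(x, y):
    """For each point: (slope with all points) - (slope with that point removed)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    full_slope = np.polyfit(x, y, deg=1)[0]
    changes = np.empty(x.size)
    for i in range(x.size):
        keep = np.arange(x.size) != i
        changes[i] = full_slope - np.polyfit(x[keep], y[keep], deg=1)[0]
    return changes

# Hypothetical data: ten points exactly on y = 2 + 5x, plus one point that is
# extreme in x (x = 25) and far below the trend (y = 60 rather than 127).
x = np.append(np.arange(1.0, 11.0), 25.0)
y = np.append(2.0 + 5.0 * np.arange(1.0, 11.0), 60.0)

changes = leave_one_out_slope_changes(x, y)
most_influential = int(np.argmax(np.abs(changes)))
print("point with the largest effect on the slope:", most_influential)
print(f"slope change caused by that point: {changes[most_influential]:.2f}")
```

Here the last point is both off the trend and extreme in x, and including it pulls the slope down far more than any other point does.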

Summary
The above examples — through the use of simple plots — have highlighted the distinction
between outliers and influential data points. We have seen an example in which a data point was
an outlier but not influential. That is, not every outlier strongly influences the regression
analysis. It is your job as a regression analyst to always determine if your regression analysis is
unduly influenced by one or a few data points.
Of course, the easy situation occurs for simple linear regression, when we can rely on simple
scatter plots to elucidate matters. Unfortunately, we don't have that luxury in the case of multiple
linear regression. In that situation, we have to rely on various measures to help us determine
whether a data point is an outlier, influential or both.
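
This section does not say which measures it has in mind. As one commonly used possibility, and purely as an assumption on our part, the sketch below computes Cook's distance, which combines each point's residual with its leverage: D_i = e_i^2 h_ii / (p s^2 (1 - h_ii)^2), where p is the number of estimated coefficients and s^2 is the mean squared error. The data and the helper name `cooks_distance` are hypothetical.

```python
import numpy as np

def cooks_distance(predictors, response):
    """Cook's distance for each observation of a least squares fit with an intercept."""
    y = np.asarray(response, dtype=float)
    X = np.column_stack([np.ones(y.size), np.asarray(predictors, dtype=float)])
    n, p = X.shape
    # Hat matrix H = X (X'X)^{-1} X'; its diagonal holds the leverages.
    hat = X @ np.linalg.solve(X.T @ X, X.T)
    h = np.diag(hat)
    residuals = y - hat @ y
    mse = residuals @ residuals / (n - p)
    return residuals**2 / (p * mse) * h / (1.0 - h) ** 2

# Hypothetical data with two predictors: twelve points on y = 1 + 2*x1 - 3*x2,
# after which the response of the last point is shifted upward by 20.
x1 = np.arange(12.0)
x2 = np.arange(12) % 4
predictors = np.column_stack([x1, x2])
response = 1.0 + 2.0 * x1 - 3.0 * x2
response[-1] += 20.0

distances = cooks_distance(predictors, response)
print("observation with the largest Cook's distance:", int(np.argmax(distances)))
```

With more than one predictor no single scatter plot shows the whole picture, so a per-observation number of this kind takes over the role that the plots played in the examples above.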
