Mohamed, Hassan Mohamed Hussein
Business administration department
Faculty of Commerce
Cairo University
Egypt
2016
Data screening and cleaning
Agenda
• Importance
• Data screening steps
• Data cleaning
  ➢ Missing data
  ➢ Normality
  ➢ Linearity
  ➢ Outliers
  ➢ Multicollinearity
  ➢ Homoscedasticity
Hassan Mohamed Cairo University- Statistical
Package, 2016
Importance
Where should you clean your data in your research process?
• Data cleaning and screening is the step that directly follows data entry; you must not start your analysis until it is done.
• Why data screening matters:
  ➢ It is very easy to make mistakes when entering data.
  ➢ Some errors can mess up your analysis.
  ➢ So it is important to spend time checking for mistakes at the start, rather than trying to repair the damage later. Ask another person to check your data.
Data screening steps
1) Check for abnormal data (values outside the valid range) in the frequencies table.
2) Go back to the original questionnaire and correct them.
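The two screening steps can be sketched outside a statistical package as well. The following is only an illustration in plain Python: the case IDs, the responses and the valid range of 1 to 5 are all hypothetical and not taken from the slides.

```python
from collections import Counter

# Hypothetical responses to a 1-5 Likert item, keyed by case ID.
responses = {101: 4, 102: 5, 103: 55, 104: 2, 105: 0, 106: 3}
VALID_RANGE = range(1, 6)  # assumed valid codes: 1, 2, 3, 4, 5

# Step 1: frequency table, flagging any value outside the valid range.
frequencies = Counter(responses.values())
for value in sorted(frequencies):
    flag = "" if value in VALID_RANGE else "  <-- out of range"
    print(f"{value:>3}: {frequencies[value]}{flag}")

# Step 2: list the case IDs to look up in the original questionnaires.
suspect_ids = sorted(cid for cid, v in responses.items() if v not in VALID_RANGE)
print("Check questionnaires for cases:", suspect_ids)
```

The code only locates the suspicious cases; the correction itself still has to come from the original questionnaire.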
Data cleaning
• Data cleaning includes:
  ➢ Missing data
  ➢ Normality
  ➢ Linearity
  ➢ Outliers
  ➢ Multicollinearity
  ➢ Homoscedasticity
Missing data
- If the missing data come from data entry:
  • Detect them from the frequencies of the variable (the number of missing values).
  • Sort your data in ascending or descending order.
  • This gives you the IDs of the cases with missing values.
  • Go back to the source and try to fill them in.
  • Run your descriptive analysis again.
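As a rough sketch of the detection step, the snippet below counts the missing entries per variable and lists the affected case IDs. The data set, the variable names and the use of `None` as the missing marker are assumptions made for illustration only.

```python
# Hypothetical data set: one dict per case; None marks a missing entry.
cases = [
    {"id": 1, "age": 34, "income": 5200},
    {"id": 2, "age": None, "income": 4100},
    {"id": 3, "age": 29, "income": None},
    {"id": 4, "age": None, "income": 3900},
]
variables = ["age", "income"]

# For each variable, collect the IDs of the cases with a missing value.
missing_ids = {
    var: [c["id"] for c in cases if c[var] is None] for var in variables
}

for var, ids in missing_ids.items():
    print(f"{var}: {len(ids)} missing, case IDs {ids}")
```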
Missing data (cont.)
- If the missing data come from respondent errors:
  ➢ The respondent's answer was ambiguous.
  ➢ The respondent forgot to answer the question.
• And if the missing data are more than 10% of the total values of the variable that has missing data, then do not attempt to treat the missing data.
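The 10% rule is easy to check per variable. The sketch below assumes missing entries are stored as `None`; the variable and its values are hypothetical. The slides do not say how to treat a share of exactly 10%, so the comparison used here (strictly below 10%) is an assumption.

```python
def missing_share(values):
    """Return the fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)

THRESHOLD = 0.10  # the 10% cut-off stated on the slides

# Hypothetical variable with 20 responses, 3 of them missing.
satisfaction = [4, 5, None, 3, 4, 4, 2, None, 5, 3,
                4, 4, 5, 3, None, 2, 4, 5, 3, 4]

share = missing_share(satisfaction)
treatable = share < THRESHOLD  # assumption: exactly 10% counts as too much
print(f"missing: {share:.0%}, below the 10% threshold: {treatable}")
```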
Missing data (cont.)
• If the missing values are less than 10%, you can deal with them:
1. Substitute the neutral value (Malhotra, 2010).
2. Substitute an imputed value (Hair et al., 2010):
  • Imputation using only valid data:
    ➢ Complete data: exclude cases listwise (least preferable under 10% of missing data).
    ➢ All available data.
Missing data (cont.)
• Imputation using known replacement values:
  ➢ Case substitution.
  ➢ Hot and cold deck imputation (most similar case, or best known value).
• Imputation by calculating replacement values: replace with…
  ➢ Mean substitution.
  ➢ Regression imputation (prediction equation from the valid data).
• This option should never be used, as it can severely distort the results of your analysis.
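One way a calculated replacement value can distort an analysis is visible with mean substitution: the mean is unchanged, but the spread of the variable shrinks. The scores and the number of missing cases below are hypothetical.

```python
from statistics import mean, stdev

observed = [2, 4, 4, 5, 7, 8]  # hypothetical valid scores
n_missing = 3                  # hypothetical number of missing scores

fill = mean(observed)          # the value used for mean substitution
imputed = observed + [fill] * n_missing

print("mean  before/after:", mean(observed), mean(imputed))
print("stdev before/after:", round(stdev(observed), 3), round(stdev(imputed), 3))
```

Because every filled-in case sits exactly on the mean, the standard deviation after substitution is smaller than before, which in turn affects any statistic built on the variance.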
Missing data (cont.)
Or
• Exclude cases pairwise (recommended):
  ➢ Excludes a case only if it is missing the data required for the specific analysis; the case is still included in any other analysis (Pallant, 2011).
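The difference between listwise and pairwise exclusion can be sketched as follows. The cases and the variable names `x`, `y` and `z` are hypothetical, and `complete_on` is an illustrative helper, not a function of any statistical package.

```python
# Hypothetical data set: None marks a missing value.
cases = [
    {"id": 1, "x": 3, "y": 7, "z": 1},
    {"id": 2, "x": None, "y": 6, "z": 2},
    {"id": 3, "x": 5, "y": None, "z": 4},
    {"id": 4, "x": 4, "y": 8, "z": None},
    {"id": 5, "x": 2, "y": 5, "z": 3},
]
ALL_VARS = ["x", "y", "z"]

def complete_on(cases, variables):
    """Keep the cases that have a value for every listed variable."""
    return [c for c in cases if all(c[v] is not None for v in variables)]

# Listwise: a case is dropped if it misses a value on ANY variable.
listwise_ids = [c["id"] for c in complete_on(cases, ALL_VARS)]

# Pairwise: for an analysis of x and y only, just x and y must be present.
pairwise_xy_ids = [c["id"] for c in complete_on(cases, ["x", "y"])]

print("listwise:", listwise_ids)
print("pairwise (x, y):", pairwise_xy_ids)
```

Case 4 is lost under listwise exclusion because of its missing `z`, yet it still contributes to the analysis of `x` and `y` under pairwise exclusion.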
Normality
• The shape of the data distribution for an individual metric variable.
• "Normal" describes a symmetrical, bell-shaped curve, with the greatest frequency of scores in the middle and smaller frequencies towards the extremes.
• It is a must for any parametric analysis.
• Departures from the normal distribution can be neglected if the sample size is more than 50 respondents.
Normality (cont.)
• Normality measures:
  ➢ Kurtosis:
    • Peakedness (leptokurtic) or flatness (platykurtic) of the distribution compared with the normal distribution.
    • In a normal distribution the kurtosis value is zero (values within ±10 are allowed).
  ➢ Skewness:
    • The balance of the distribution.
    • Positive skew (scores clustered to the left, tail to the right; right-skewed) or negative skew (scores clustered to the right, tail to the left; left-skewed).
    • In a normal distribution the skewness value is zero (values within ±3 are allowed).
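For readers who want to see what the two measures compute, here is a minimal sketch using the simple moment-based formulas. Statistical packages usually apply a small-sample correction, so their values will differ slightly; the function names and the sample data are illustrative only.

```python
def central_moment(xs, k):
    """k-th central moment: the mean of (x - mean) ** k."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** k for x in xs) / len(xs)

def skewness(xs):
    """Moment-based skewness; 0 for a symmetric distribution."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5

def excess_kurtosis(xs):
    """Moment-based kurtosis minus 3, so a normal distribution gives 0."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 10]  # hypothetical scores with a long right tail

print("skewness, symmetric:   ", skewness(symmetric))
print("skewness, right-tailed:", skewness(right_tailed))
print("excess kurtosis, symmetric:", excess_kurtosis(symmetric))
```

A long tail to the right gives a positive skewness value, and mirroring the data gives the same value with a negative sign.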
Normality (cont.)
• Compare the 5% trimmed mean with the mean: a large difference suggests that extreme scores are influencing the mean.
• Kolmogorov-Smirnov and Shapiro-Wilk significance values greater than 0.05 indicate normality, but these tests are very sensitive when the sample size is more than 200.
• Look for the bell shape in the histogram.
Transformation can fix a non-normal distribution.
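The comparison of the 5% trimmed mean with the ordinary mean can be sketched like this. It is a simplified version that drops whole cases from each end; a statistical package may weight partial cases differently. The scores are hypothetical.

```python
from statistics import mean

def trimmed_mean(xs, percent=5):
    """Mean after dropping the lowest and highest `percent` % of cases."""
    ordered = sorted(xs)
    k = len(ordered) * percent // 100  # whole cases cut from each end
    return mean(ordered[k:len(ordered) - k])

# Hypothetical sample: 19 ordinary scores and one extreme score.
scores = list(range(1, 20)) + [200]

print("mean:           ", mean(scores))
print("5% trimmed mean:", trimmed_mean(scores))
```

With 20 cases, one case is cut from each end, so the single extreme score no longer pulls the average upwards.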
Linearity
• It is an assumption of multivariate techniques based on correlational measures of association, including multiple regression (Hair et al., 2010).
• The relationship between the two variables should be linear. This means that when you look at a scatterplot of scores you should see roughly a straight line, not a curve (curvilinear) (Pallant, 2011).
• Transformation can overcome the curvilinearity issue (Hair et al., 2010).
Linearity (cont.)
• So, you should not transform your data to avoid a non-normal distribution if your sample is more than 50.
• But you should transform the data to avoid curvilinearity.
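To illustrate how a transformation can straighten a curvilinear relationship, the sketch below compares the Pearson correlation before and after a log transformation of the predictor. The data are hypothetical, and the log is only one of several possible transformations; which one fits depends on the shape of the curve.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical scores: y rises quickly at first, then flattens out.
x = [1, 2, 4, 8, 16, 32, 64]
y = [0.1, 1.0, 2.1, 2.9, 4.0, 5.1, 6.0]

r_raw = pearson_r(x, y)
r_log = pearson_r([math.log(v) for v in x], y)

print(f"r with raw x:    {r_raw:.3f}")
print(f"r with log of x: {r_log:.3f}")
```

The correlation is higher after the transformation because the relationship between log(x) and y is close to a straight line.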
Outliers
• These are case scores that are extreme and therefore have a much higher impact on the outcome of any statistical analysis.
• An outlier is not necessarily an error in your data, but it can make your data unrepresentative of the population (e.g., income).
• Outliers can be detected using box plots.
• Outliers come from (Hair et al., 2010; Tabachnick & Fidell, 1996):
  ➢ A mistake in data entry (a 6 was entered as 66, etc.).
  ➢ The missing-values code was not specified, so missing values are read as case entries (e.g., 99 in SPSS).
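A box plot flags points that lie far outside the box. A common convention, assumed here and not stated on the slides, is to flag values more than 1.5 times the interquartile range beyond the quartiles. The sketch uses hypothetical income figures, and the quartile method of Python's `statistics.quantiles` may differ slightly from the one a statistical package uses.

```python
from statistics import quantiles

# Hypothetical monthly incomes (in hundreds); one value is far from the rest.
incomes = [21, 23, 24, 25, 26, 27, 28, 29, 30, 31, 120]

q1, _median, q3 = quantiles(incomes, n=4)  # the three quartile cut points
iqr = q3 - q1

# Assumed box-plot convention: fences at 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [v for v in incomes if v < lower_fence or v > upper_fence]
print(f"fences: [{lower_fence}, {upper_fence}]  outliers: {outliers}")
```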
Outliers (cont.)
• Outliers come from (Hair et al., 2010; Tabachnick & Fidell, 1996):
  ➢ A mistake in data entry (a 6 was entered as 66, etc.).
  ➢ The missing-values code was not specified, so missing values are read as case entries (e.g., 99 in SPSS).
  ➢ The outlier is not part of the population from which you intended to sample:
    • Extraordinary event (remove it).
    • Extraordinary observation (decide based on your valid cases; leans towards elimination).
    • Neutral value for all variables (leans towards retention).
Outliers (cont.)
• The outlier is part of the population you wanted, but in the distribution it is seen as an extreme case.
• In this case you have three choices:
1) Delete the extreme cases.
2) Change the outliers' scores so that they are still extreme but fit within a normal distribution (for example, make each one unit larger or smaller than the last case that fits in the distribution).
3) If the outliers seem to be part of an overall non-normal distribution, then a transformation can be done, but first check for normality.
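The second choice can be sketched as follows for outliers at the high end. The scores, the cut-off of 39 and the helper name `pull_in_high_outliers` are hypothetical; how the cut-off is chosen is left to the analyst.

```python
def pull_in_high_outliers(scores, upper_cutoff, unit=1):
    """Replace scores above `upper_cutoff` with the largest remaining score plus `unit`."""
    inside = [s for s in scores if s <= upper_cutoff]
    ceiling = max(inside) + unit  # still the most extreme value, but close to the rest
    return [s if s <= upper_cutoff else ceiling for s in scores]

# Hypothetical scores with one extreme case and an assumed cut-off of 39.
scores = [21, 23, 24, 25, 26, 27, 28, 29, 30, 31, 120]
adjusted = pull_in_high_outliers(scores, upper_cutoff=39)

print("original:", scores)
print("adjusted:", adjusted)
```

The adjusted case keeps its rank as the highest score, so the ordering of cases is preserved while its pull on the mean is reduced.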
Outliers (cont.)
• Outliers should be retained to ensure generalizability to the population, unless they are not representative of the population.
• So, again, you should not transform your data to avoid a non-normal distribution if your sample is more than 50.
• But you should transform the data to deal with outliers.
Thank You