Data cleaning and screening

Cairo University, Faculty of commerce, Business administration department, Pre-master class, Methodological studies.

Data cleaning and screening

  1. Data screening and cleaning. Mohamed, Hassan Mohamed Hussein, Business Administration Department, Faculty of Commerce, Cairo University, Egypt, 2016.
  2. Agenda
     • Importance
     • Data screening steps
     • Data cleaning
       • Missing data
       • Normality
       • Linearity
       • Outliers
       • Multicollinearity
       • Homoscedasticity
     Hassan Mohamed, Cairo University, Statistical Package, 2016
  3. Importance
     Where in your research process should you clean your data?
     • Data cleaning and screening is the step that directly follows data entry; you must not start your analysis until it is done.
     Why data screening matters:
     • It is very easy to make mistakes when entering data.
     • Some errors can mess up your analysis.
     • So it is important to spend time checking for mistakes at the start rather than trying to repair the damage later. Ask another person to check your data.
  4. Data screening steps
     1) Check the frequencies table for abnormal (out-of-range) values.
     2) Go back to the original questionnaire and correct them.
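The two screening steps above can be sketched outside a statistics package as well. This is a minimal illustration, assuming a hypothetical 5-point item coded 1 to 5 and case IDs equal to row positions; the data and the names `frequency_table` and `out_of_range_cases` are made up for the example.

```python
from collections import Counter

# Hypothetical responses to a 5-point item; 55 and 0 are keying errors.
responses = [1, 2, 5, 4, 3, 55, 2, 4, 0, 3]
VALID_CODES = range(1, 6)  # assumed valid codes: 1..5


def frequency_table(values):
    """Return {value: count}, sorted by value, like a frequencies table."""
    return dict(sorted(Counter(values).items()))


def out_of_range_cases(values, valid_codes):
    """Return (case_id, value) pairs whose value is not a valid code.

    Case IDs are 1-based row positions here (an assumption).
    """
    return [(i, v) for i, v in enumerate(values, start=1) if v not in valid_codes]


print(frequency_table(responses))
print(out_of_range_cases(responses, VALID_CODES))
```

The listed case IDs are the ones to look up in the original questionnaires.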
  5. Data cleaning
     Data cleaning includes:
     • Missing data
     • Normality
     • Linearity
     • Outliers
     • Multicollinearity
     • Homoscedasticity
  6. Missing data
     If the missing data come from data entry:
     • Detect them from the frequencies table of the variable (the number of missing values).
     • Sort your data in ascending or descending order.
     • This gives you the IDs of the cases with missing values.
     • Go back to the source and try to fill them in.
     • Run your descriptive analysis again.
  7. Missing data (cont.)
     If the missing data come from respondent errors:
     • The respondent gave an ambiguous answer.
     • The respondent forgot to answer the question.
     • If the missing data are more than 10% of the total values of the variable, then do not treat the missing data.
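Finding the IDs of missing values and applying the 10% rule can be sketched as follows. The data set, the variable names, and the function `missing_report` are hypothetical; `None` stands in for a missing answer.

```python
# Hypothetical data set: 20 respondents, None marks a missing answer.
cases = [{"id": i, "age": 20 + i, "income": 1000 * i} for i in range(1, 21)]
cases[3]["age"] = None            # respondent 4 skipped "age"
for idx in (1, 8, 14):            # respondents 2, 9 and 15 skipped "income"
    cases[idx]["income"] = None


def missing_report(rows, variable, threshold=0.10):
    """Return the IDs with a missing value and the share of missing values.

    `threshold` is the 10% cut-off from the slide.
    """
    ids = [row["id"] for row in rows if row[variable] is None]
    share = len(ids) / len(rows)
    return {"missing_ids": ids, "share": share, "above_threshold": share > threshold}


for variable in ("age", "income"):
    print(variable, missing_report(cases, variable))
```

In this made-up example "age" falls under the 10% cut-off and "income" falls above it.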
  8. Missing data (cont.)
     • If the missing values are less than 10%, you can deal with them in one of these ways:
       1. Substitute the neutral value (Malhotra, 2010).
       2. Substitute an imputed value (Hair et al., 2010):
          • Imputation using only valid data:
            • Complete data, i.e. exclude cases listwise (least preferable when under 10% of the data are missing).
            • All available data.
  9. Missing data (cont.)
     • Imputation using known replacement values:
       • Case substitution.
       • Hot and cold deck imputation (most similar case, or best known value).
     • Imputation by calculating replacement values (Replace with……):
       • Mean substitution.
       • Regression imputation (a prediction equation built from the valid data).
       • This "Replace with……" option should never be used, as it can severely distort the results of your analysis.
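One way to see the kind of distortion the warning above refers to is to compare the variance before and after mean substitution. This is a small sketch with made-up scores: the mean stays the same, but the variance shrinks because every imputed case sits exactly on the mean.

```python
from statistics import mean, variance

# Hypothetical scores; None marks a missing answer.
observed = [2, 4, 4, 5, 7, 9, None, None, None]

valid = [v for v in observed if v is not None]
fill_value = mean(valid)

# Mean substitution: every missing case receives the mean of the valid cases.
imputed = [fill_value if v is None else v for v in observed]

print("mean     before:", mean(valid), "after:", mean(imputed))
print("variance before:", variance(valid), "after:", variance(imputed))
```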
  10. Missing data (cont.)
      Or:
      • Exclude cases pairwise (recommended): a case is excluded only if it is missing the data required for the specific analysis; it is still included in any other analysis (Pallant, 2011).
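The difference between listwise and pairwise exclusion can be sketched with a few hypothetical cases. The variable names `x`, `y`, `z` and both helper functions are illustrative only.

```python
# Hypothetical cases; None marks a missing value.
rows = [
    {"id": 1, "x": 2.0, "y": 1.0, "z": 5.0},
    {"id": 2, "x": 4.0, "y": None, "z": 6.0},
    {"id": 3, "x": 6.0, "y": 3.0, "z": None},
    {"id": 4, "x": 8.0, "y": 4.0, "z": 9.0},
    {"id": 5, "x": None, "y": 5.0, "z": 7.0},
]


def listwise(rows, variables):
    """Keep only cases with valid data on every listed variable."""
    return [r for r in rows if all(r[v] is not None for v in variables)]


def pairwise(rows, a, b):
    """Keep cases with valid data on the two variables this analysis needs."""
    return [r for r in rows if r[a] is not None and r[b] is not None]


print("listwise:", [r["id"] for r in listwise(rows, ["x", "y", "z"])])
print("pairwise x-y:", [r["id"] for r in pairwise(rows, "x", "y")])
print("pairwise x-z:", [r["id"] for r in pairwise(rows, "x", "z")])
```

Listwise exclusion drops a case from every analysis as soon as one value is missing, while pairwise exclusion keeps it wherever the needed values are present.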
  11. Normality
      • The shape of the data distribution for an individual metric variable.
      • "Normal" describes a symmetrical, bell-shaped curve with the greatest frequency of scores in the middle and smaller frequencies towards the extremes.
      • It is required for any parametric analysis.
      • Departures from normality can be treated as negligible if the sample size is more than 50 respondents.
  12. Normality (cont.)
      Normality measures:
      • Kurtosis:
        • The peakedness (leptokurtic) or flatness (platykurtic) of the distribution compared to the normal distribution.
        • In a normal distribution the (excess) kurtosis value is zero (values within ±10 are allowed).
      • Skewness:
        • The balance of the distribution.
        • Positive skew (right-skewed, with the tail towards high values) or negative skew (left-skewed, with the tail towards low values).
        • In a normal distribution the skewness value is zero (values within ±3 are allowed).
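Skewness and kurtosis can be sketched with plain moment formulas. Note the assumption: this uses the simple moment-based definitions, while statistical packages typically report bias-adjusted sample versions, so values will differ somewhat in small samples. The cut-offs ±3 and ±10 are the ones given on the slide; the data and function names are hypothetical.

```python
from statistics import fmean


def central_moment(values, k):
    """k-th central moment (divides by n)."""
    m = fmean(values)
    return fmean([(v - m) ** k for v in values])


def skewness(values):
    """Moment-based skewness: 0 for a symmetric distribution."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution scores 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


def within_limits(values, skew_limit=3, kurtosis_limit=10):
    """Apply the slide's cut-offs: |skewness| <= 3 and |kurtosis| <= 10."""
    return abs(skewness(values)) <= skew_limit and abs(excess_kurtosis(values)) <= kurtosis_limit


symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 10]  # long tail towards high values

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_tailed), within_limits(right_tailed))
```

A long tail towards high values gives a positive skewness, and a long tail towards low values gives a negative one.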
  13. Normality (cont.)
      • Compare the 5% trimmed mean with the mean; the two values should be close.
      • Kolmogorov-Smirnov and Shapiro-Wilk significance values greater than 0.05 indicate normality, but these tests are very sensitive when the sample size is more than 200.
      • The histogram should form a bell shape.
      • Transformation can fix a non-normal distribution.
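The trimmed-mean comparison can be sketched as below. This is a simplified version that drops whole cases from each end of the sorted data; a statistics package may handle fractional cases differently, so treat the exact figures as illustrative. The scores are made up.

```python
from statistics import fmean


def trimmed_mean(values, proportion=0.05):
    """Mean after dropping the lowest and highest `proportion` of cases.

    Simplification: only whole cases are dropped from each end.
    """
    ordered = sorted(values)
    k = int(len(ordered) * proportion + 1e-9)  # cases to drop per end
    return fmean(ordered[k:len(ordered) - k])


# Hypothetical scores: 1..19 plus one extreme value.
scores = list(range(1, 20)) + [200]

print("mean:", fmean(scores))
print("5% trimmed mean:", trimmed_mean(scores))
```

A large gap between the two values suggests that extreme scores are pulling the mean.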
  14. Linearity
      • It is an assumption of multivariate techniques based on correlational measures of association, including multiple regression (Hair et al., 2010).
      • The relationship between the two variables should be linear: when you look at a scatterplot of scores you should see roughly a straight line, not a curve (curvilinear) (Pallant, 2011).
      • Transformation can overcome curvilinearity (Hair et al., 2010).
  15. Linearity (cont.)
      • So you should not transform your data to correct a non-normal distribution if your sample is larger than 50.
      • But you should transform the data to correct curvilinearity.
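How a transformation straightens a curved relationship can be sketched with a hand-written Pearson correlation. The data are hypothetical (y doubles with each step of x), and the log transformation is just one possible choice; a scatterplot remains the main check for linearity.

```python
from math import log2, sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical curvilinear data: y doubles every time x goes up by one.
x = list(range(1, 11))
y = [2 ** v for v in x]

r_raw = pearson_r(x, y)
r_transformed = pearson_r(x, [log2(v) for v in y])  # log-transform y

print("r before transformation:", round(r_raw, 3))
print("r after transformation: ", round(r_transformed, 3))
```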
  16. Outliers
      • These are case scores that are extreme and therefore have a much higher impact on the outcome of any statistical analysis.
      • An outlier is not necessarily an error in your data, but it makes your data unrepresentative of its population (for example, income).
      • Outliers can be detected using box plots.
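The box plot check can be sketched numerically. This assumes the common box plot convention of flagging points more than 1.5 times the interquartile range beyond the quartiles; quartile methods differ between packages, so borderline cases may be flagged differently elsewhere. The scores are made up and include the "6 entered as 66" kind of value.

```python
from statistics import quantiles


def boxplot_fences(values, k=1.5):
    """Lower and upper fences: quartiles minus/plus k times the IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, k=1.5):
    """Values that fall outside the box plot fences."""
    low, high = boxplot_fences(values, k)
    return [v for v in values if v < low or v > high]


# Hypothetical scores on a 7-point scale with one suspicious entry.
scores = [3, 4, 4, 5, 5, 5, 6, 6, 7, 66]

print(boxplot_fences(scores))
print(find_outliers(scores))
```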
  17. Outliers (cont.)
      Outliers come from (Hair et al., 2010; Tabachnick & Fidell, 1996):
      • A mistake in data entry (a 6 was entered as 66, etc.).
      • The missing-values code was not specified, so missing values are being read as case entries (for example, 99 in SPSS).
      • The outlier is not part of the population from which you intended to sample:
        • Extraordinary event (remove it).
        • Extraordinary observation (decide based on your valid cases; leaning towards elimination).
        • Neutral value for all variables (leaning towards retention).
  18. Outliers (cont.)
      • The outlier is part of the population you wanted, but in the distribution it is seen as an extreme case.
      • In this case you have three choices:
        1) Delete the extreme cases.
        2) Change the outliers' scores so that they are still extreme but fit within a normal distribution (for example, make each one a unit larger or smaller than the last case that fits in the distribution).
        3) If the outliers seem to be part of an overall non-normal distribution, then a transformation can be done, but first check for normality.
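The second choice above (changing the score) can be sketched as follows. The slide does not say how to decide which cases "fit in the distribution"; this sketch assumes the box plot fences (1.5 times the interquartile range) for that purpose. The incomes and the function name `pull_in_outliers` are hypothetical.

```python
from statistics import quantiles


def pull_in_outliers(values, k=1.5, unit=1):
    """Replace each outlier with a score one unit beyond the most extreme
    non-outlying case, so it stays extreme but less influential.

    Assumption: "outlier" means outside the box plot fences.
    """
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    inside = [v for v in values if low <= v <= high]
    floor, ceiling = min(inside) - unit, max(inside) + unit
    return [floor if v < low else ceiling if v > high else v for v in values]


# Hypothetical monthly incomes (in thousands) with one extreme case.
incomes = [30, 32, 35, 36, 38, 40, 41, 43, 45, 250]

print(pull_in_outliers(incomes))
```

The adjusted case remains the highest score in the variable, only by a much smaller margin.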
  19. Outliers (cont.)
      • Outliers should be retained to ensure generalizability to the population, unless they are not representative of the population.
      • So, again, you should not transform your data to correct a non-normal distribution if your sample is larger than 50.
      • But you should transform the data to deal with outliers.
  20. Thank You