Mohamed, Hassan Mohamed Hussein
Business administration department
Faculty of Commerce
Cairo University
Egypt
2016
Data screening and cleaning
Agenda
 Importance.
 Data screening steps.
 Data cleaning
 Missing data
 Normality
 Linearity
 Outliers
 Multicollinearity
 Homoscedasticity
Hassan Mohamed Cairo University- Statistical
Package, 2016
Importance.
Where should you clean your data in your research process?
 Data cleaning and screening is the step that directly follows data entry, and you must not start your analysis before doing it.
 Data screening importance:
 It is very easy to make mistakes when entering data.
 Some errors can mess up your analysis.
 So it is important to spend the time checking for mistakes at the start, rather than trying to repair the damage later. Try asking another person to check your data.
Data screening steps
1) Check for abnormal data (values that are out of range) in the frequencies table.
2) Go back to the original questionnaire and correct them.
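The two screening steps can be sketched in a few lines of Python. This is only an illustration, not the SPSS procedure itself: the respondent IDs, the answers, and the 1 to 5 valid range are assumed for the example.

```python
from collections import Counter

# Hypothetical answers to one 5-point item, keyed by respondent ID.
responses = {1: 4, 2: 5, 3: 55, 4: 2, 5: 0, 6: 3, 7: 4}
VALID_CODES = range(1, 6)  # assumption: valid codes are 1..5


def frequency_table(values):
    """Return {value: count} sorted by value, like a frequencies table."""
    return dict(sorted(Counter(values).items()))


def out_of_range_ids(data, valid_codes):
    """Return the IDs whose recorded value is not a valid code."""
    return [rid for rid, value in data.items() if value not in valid_codes]


freq = frequency_table(responses.values())
suspect_ids = out_of_range_ids(responses, VALID_CODES)

print(freq)         # the values 0 and 55 stand out in the table
print(suspect_ids)  # IDs to look up in the original questionnaires
```

The list of suspect IDs is what you take back to the paper questionnaires in step 2.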
Data cleaning
 Data cleaning includes:
 Missing data
 Normality
 Linearity
 Outliers
 Multicollinearity
 Homoscedasticity
Missing data
- If the missing data comes from data entry:
 You can detect it from the frequencies of the variable (number of missing values).
 Then sort your data in ascending or descending order.
 This gives you the IDs of the cases with missing values.
 Go back to the source and try to fill them in.
 Run your descriptive analysis again.
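A minimal sketch of the same routine outside SPSS, assuming a small hypothetical data set in which `None` marks a missing value (the variable names and IDs are invented for illustration):

```python
# Hypothetical data set: one dict per respondent, None marks a missing value.
rows = [
    {"id": 101, "age": 34, "income": 5200},
    {"id": 102, "age": None, "income": 4100},
    {"id": 103, "age": 29, "income": None},
    {"id": 104, "age": None, "income": 3900},
]


def missing_count(rows, variable):
    """Number of cases with no value for the variable."""
    return sum(1 for row in rows if row[variable] is None)


def missing_ids(rows, variable):
    """Sort so that missing cases come first, then read off their IDs."""
    ordered = sorted(rows, key=lambda row: row[variable] is not None)
    return [row["id"] for row in ordered if row[variable] is None]


for variable in ("age", "income"):
    print(variable, missing_count(rows, variable), missing_ids(rows, variable))
```

After the gaps are filled from the questionnaires, running the same counts again should show zero missing values for the corrected variables.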
Missing data (cont.)
- If the missing data comes from respondent errors:
 The respondent's answer was ambiguous.
 The respondent forgot to answer the question.
• And if the missing data are more than 10% of the total values of the variable that has missing data, then do not try to treat the missing data.
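The 10% rule is simple arithmetic, sketched below under the assumption that missing values are stored as `None`. The slides do not say what to do at exactly 10%, so the sketch only flags values strictly above the threshold.

```python
def missing_percent(values):
    """Percentage of entries that are missing (None)."""
    return 100.0 * sum(v is None for v in values) / len(values)


def exceeds_threshold(values, threshold=10.0):
    """True when the share of missing values is above the threshold."""
    return missing_percent(values) > threshold


# Hypothetical variables with 20 cases each.
few_missing = [3] * 19 + [None]       # 1 of 20 missing
many_missing = [3] * 17 + [None] * 3  # 3 of 20 missing

print(missing_percent(few_missing), exceeds_threshold(few_missing))
print(missing_percent(many_missing), exceeds_threshold(many_missing))
```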
Missing data (cont.)
• If the missing values are less than 10%, you can deal with them in one of two ways:
1. Substitute the neutral value. (Malhotra, 2010)
2. Substitute an imputed value: (Hair et al., 2010)
 Imputation using only valid data: Exclude cases
listwise
 Complete data. (Least preferable under 10% of
missing data)
 All available data.
Missing data (cont.)
 Imputation using known replacement values:
 Case substitution.
 Hot and Cold Deck imputation (most similar case, or
best known value)
 Imputation by calculating replacement values: Replace
with……
 Mean substitution
 Regression imputation (prediction equation of the
valid data)
 This option should never be used, as it can severely
distort the results of your analysis.
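One way to see how a calculated replacement value can distort results is to look at mean substitution on a toy variable. The scores below are invented, and this is a sketch rather than any package's procedure: the mean stays the same, but the variance shrinks because every filled-in case sits exactly at the centre.

```python
from statistics import mean, variance

observed = [2, 4, 4, 5, 7, 8]        # hypothetical valid scores
with_gaps = observed + [None, None]  # two respondents did not answer


def mean_substitute(values):
    """Replace each missing value with the mean of the valid values."""
    valid_mean = mean(v for v in values if v is not None)
    return [valid_mean if v is None else v for v in values]


filled = mean_substitute(with_gaps)

print(mean(observed), mean(filled))          # the mean is unchanged
print(variance(observed), variance(filled))  # the variance is smaller
```

A smaller variance feeds into every correlation and significance test that uses the variable, which is why the choice of replacement method matters.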
Missing data (cont.)
Or
 Exclude cases pairwise (recommended)
 Excludes a case only if it is missing the data required for the specific analysis. The case is still included in any other analysis for which it has the data. (Pallant, 2011)
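The difference between excluding cases listwise and pairwise can be sketched as follows. The three variables `x`, `y`, `z` and their values are hypothetical, and `None` marks a missing value.

```python
# Hypothetical cases with three variables; None marks a missing value.
rows = [
    {"x": 1, "y": 2, "z": 5},
    {"x": 2, "y": None, "z": 4},
    {"x": 3, "y": 6, "z": None},
    {"x": 4, "y": 8, "z": 2},
]


def listwise(rows):
    """Keep only cases that are complete on every variable."""
    return [r for r in rows if all(v is not None for v in r.values())]


def pairwise(rows, a, b):
    """Keep cases that have both variables needed for this one analysis."""
    return [r for r in rows if r[a] is not None and r[b] is not None]


print(len(listwise(rows)))            # cases available to every analysis
print(len(pairwise(rows, "x", "y")))  # cases available to the x-y analysis
print(len(pairwise(rows, "x", "z")))  # cases available to the x-z analysis
```

Pairwise exclusion keeps more cases per analysis, at the cost of each analysis being based on a slightly different subsample.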
Normality
 The shape of the data distribution for an individual
metric variable.
 Used to describe a symmetrical, bell-shaped curve,
which has the greatest frequency of scores in the
middle with smaller frequencies towards the extremes
 It is a requirement for any parametric analysis.
 The normality requirement can be neglected if the sample size is more than 50 respondents.
Normality (Cont.)
 Normality measures:
 Kurtosis:
 Peakedness (leptokurtic) or flatness (platykurtic) of the distribution compared to the normal distribution.
 In a normal distribution the kurtosis value is zero (values up to ±10 are allowed).
 Skewness:
 The balance of the distribution.
 Positive skew (right-skewed, long tail toward high values) or negative skew (left-skewed, long tail toward low values).
 In a normal distribution the skewness value is zero (values up to ±3 are allowed).
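A rough sketch of the two measures, using the plain moment formulas (skewness = m3 / m2^1.5, excess kurtosis = m4 / m2^2 - 3). Statistical packages often apply a small-sample correction, so their reported values can differ slightly from these. The sample data and the helper name `within_slide_limits` are made up for illustration; the ±3 and ±10 limits are the ones quoted above.

```python
def central_moment(values, k):
    """k-th central moment: the mean of (x - mean) ** k."""
    m = sum(values) / len(values)
    return sum((v - m) ** k for v in values) / len(values)


def skewness(values):
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    # Subtracting 3 makes the normal distribution score zero.
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


def within_slide_limits(values, skew_limit=3.0, kurt_limit=10.0):
    """Check both measures against the limits quoted on the slide."""
    return (abs(skewness(values)) <= skew_limit
            and abs(excess_kurtosis(values)) <= kurt_limit)


symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 10]       # long tail toward high values
left_tailed = [-v for v in right_tailed]  # mirror image

print(skewness(symmetric))     # zero for a symmetric sample
print(skewness(right_tailed))  # positive: right-skewed
print(skewness(left_tailed))   # negative: left-skewed
```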
Normality (Cont.)
 Compare the 5% trimmed mean with the mean: if the two values are close, extreme scores have little influence.
 Kolmogorov-Smirnov and Shapiro-Wilk significance values of more than 0.05 indicate normality, but these tests are very sensitive when the sample size is more than 200.
 The histogram forms a bell shape.
Transformation can fix a non-normal distribution.
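Two of these checks are easy to sketch by hand. The first part compares the mean with a 5% trimmed mean; the second shows a log transformation removing positive skew. The data are invented, the trimming rule (drop 5% of cases from each end) is an assumption about how the trimmed mean is computed, and the skewness uses the plain moment formula.

```python
import math


def trimmed_mean(values, percent=5):
    """Mean after dropping `percent`% of the cases from each end."""
    ordered = sorted(values)
    k = len(ordered) * percent // 100  # cases to drop per tail
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def skewness(values):
    """Moment-based skewness, m3 / m2 ** 1.5."""
    m = sum(values) / len(values)
    m2 = sum((v - m) ** 2 for v in values) / len(values)
    m3 = sum((v - m) ** 3 for v in values) / len(values)
    return m3 / m2 ** 1.5


# Hypothetical scores: 19 ordinary values and one extreme value.
scores = list(range(10, 29)) + [500]
print(sum(scores) / len(scores), trimmed_mean(scores))  # far apart

# Hypothetical right-skewed variable and its log transformation.
raw = [2 ** i for i in range(8)]  # 1, 2, 4, ..., 128
logged = [math.log(v) for v in raw]
print(skewness(raw), skewness(logged))  # positive, then close to zero
```

A large gap between the mean and the trimmed mean points to extreme scores worth inspecting before deciding on a transformation.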
Linearity
 It matters for multivariate techniques based on correlational measures of association, including multiple regression. (Hair et al., 2010)
 The relationship between the two variables should be linear. This means that when you look at a scatterplot of scores you should see a straight line (roughly), not a curve (curvilinear). (Pallant, 2011)
 Transformation can overcome the curvilinearity issue. (Hair et al., 2010)
Linearity (cont.)
 So, you shouldn't transform your data to avoid a non-normal distribution if your sample is more than 50.
 But you should transform the data to avoid curvilinearity.
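A small sketch of why a transformation helps with curvilinearity. The data are artificial (y is exactly x squared), and the square-root transformation is chosen only because it suits this made-up curve; the right transformation depends on the shape of the real relationship.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from its definition."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


xs = list(range(1, 11))
ys = [x ** 2 for x in xs]  # a curved (curvilinear) relationship

r_raw = pearson_r(xs, ys)
r_transformed = pearson_r(xs, [math.sqrt(y) for y in ys])

print(r_raw)          # high, but the scatterplot would still show a curve
print(r_transformed)  # the transformed relationship is a straight line
```

A high correlation on the raw data does not prove linearity, so the scatterplot remains the primary check.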
Outliers
 These are case scores that are extreme and therefore have a much higher impact on the outcome of any statistical analysis.
 An outlier is not necessarily an error in your data, but it makes your data non-representative of its population (e.g. income).
 Can be detected using box plots.
 Outliers come from: (Hair et al., 2010; Tabachnick & Fidell, 1996)
 A mistake in data entry (a 6 was entered as 66, etc.).
 The missing-values code was not specified, so missing values are read as case entries (e.g. 99 in SPSS).
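The box plot check can be approximated numerically with the common 1.5 × IQR rule. This is a sketch under assumptions: the scores are invented (with a 6 mistyped as 66), and `statistics.quantiles` may compute quartiles a little differently from the hinges a given package draws on its box plot.

```python
from statistics import quantiles


def iqr_fences(values, k=1.5):
    """Lower and upper fences: k * IQR beyond the first and third quartile."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, k=1.5):
    """Values that fall outside the fences."""
    lower, upper = iqr_fences(values, k)
    return [v for v in values if v < lower or v > upper]


# Hypothetical item scores; the last one was meant to be a 6.
scores = [3, 4, 4, 5, 5, 5, 6, 6, 7, 66]

print(iqr_fences(scores))
print(find_outliers(scores))
```

An unspecified missing-values code such as 99 would be flagged in the same way, which is a hint to check the variable definition before treating the case as a genuine outlier.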
Outliers (cont.)
 The outlier is not part of the population from which you intended to sample:
 Extraordinary event (remove it).
 Extraordinary observation (decide based on your valid cases; leans toward elimination).
 Neutral value for all variables (leans toward retention).
Outliers (cont.)
 The outlier is part of the population you wanted, but in the distribution it is seen as an extreme case.
 In this case you have three choices:
1) Delete the extreme cases.
2) Change the outliers' scores so that they are still extreme but fit within a normal distribution (for example, make each one a unit larger or smaller than the last case that fits in the distribution).
3) If the outliers seem to be part of an overall non-normal distribution, then a transformation can be done, but first check for normality.
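Choice 2 can be sketched as follows. The function name, the scores, and the cut-off values are all hypothetical; in particular, how you decide which cases "fit in the distribution" is left open by the slide, so the lower and upper limits are simply passed in.

```python
def pull_in_outliers(values, lower, upper, unit=1):
    """Move each outlier to one unit beyond the most extreme fitting case.

    `lower` and `upper` are assumed cut-offs for what still fits in the
    distribution (for example, box plot fences).
    """
    inside = [v for v in values if lower <= v <= upper]
    low_edge, high_edge = min(inside), max(inside)
    adjusted = []
    for v in values:
        if v > upper:
            adjusted.append(high_edge + unit)  # still the largest score
        elif v < lower:
            adjusted.append(low_edge - unit)   # still the smallest score
        else:
            adjusted.append(v)
    return adjusted


# Hypothetical scores with one extreme but genuine case.
scores = [3, 4, 4, 5, 5, 5, 6, 6, 7, 40]
print(pull_in_outliers(scores, lower=0.625, upper=9.625))
```

The adjusted case keeps its rank as the most extreme score, but it no longer dominates the mean and variance.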
Outliers (cont.)
 The outliers should be retained to ensure generalizability to the population, unless they are not representative of the population.
 So, again, you shouldn't transform your data to avoid a non-normal distribution if your sample is more than 50.
 But you should transform the data to deal with outliers.
Thank You
Data cleaning and screening

  • 1. Mohamed, Hassan Mohamed Hussein Business administration department Faculty of Commerce Cairo University Egypt 2016 Data screening and cleaning
  • 2. Agenda  Importance.  Data screening steps.  Data cleaning  Missing data  Normality  Linearity  Outliers  Multicollinearity  Homoscedasticity Hassan Mohamed Cairo University- Statistical Package, 2016
  • 3. Importance. Where you should clean your data in your research process?  Data cleaning and screening is the step that directly follows data entry and you must not start your analysis unless doing it.  Data screening importance:  It is very easy to make mistakes when entering data.  Some errors can miss up your analysis.  So, it is important to spend the time for checking for the mistakes initially, rather than trying to repair the damage later, try another person to check your data. Hassan Mohamed Cairo University- Statistical Package, 2016
  • 4. Data screening steps 1) Check out the abnormal data (data within out of range) from frequencies table. 2) Go back to the original questionnaire and correct them. Hassan Mohamed Cairo University- Statistical Package, 2016
  • 5. Data cleaning
      - Data cleaning includes:
          - Missing data
          - Normality
          - Linearity
          - Outliers
          - Multicollinearity
          - Homoscedasticity
  • 6. Missing data
      - If the missing data comes from data entry:
          - You can detect it from the frequencies of the variable (the number of missing values).
          - Then sort your data in ascending or descending order.
          - This gives you the IDs of the cases with missing values.
          - Go back to the source and try to fill them in.
          - Run your descriptive analysis again.
  • 7. Missing data (cont.)
      - If the missing data comes from respondent errors:
          - The respondent's answer was ambiguous.
          - The respondent forgot to answer the question.
      - And if the missing data are more than 10% of the total values of the variable that has missing data, then do not treat the missing data.
  • 8. Missing data (cont.)
      - If the missing values are less than 10%, you can deal with them:
          1. Substitute a neutral value. (Malhotra, 2010)
          2. Substitute an imputed value. (Hair et al., 2010)
              - Imputation using only valid data:
                  - Complete data: exclude cases listwise. (Least preferable when under 10% of the data are missing.)
                  - All available data.
  • 9. Missing data (cont.)
      - Imputation using known replacement values:
          - Case substitution.
          - Hot and cold deck imputation (most similar case, or best known value).
      - Imputation by calculating replacement values: replace with...
          - Mean substitution.
          - Regression imputation (prediction equation based on the valid data).
          - This option should never be used, as it can severely distort the results of your analysis.
  • 10. Missing data (cont.)
      - Or exclude cases pairwise (recommended):
          - A case is excluded only if it is missing the data required for the specific analysis. It is still included in any other analysis. (Pallant, 2011)
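The 10% rule and the difference between listwise and pairwise exclusion can be made concrete with a small sketch. This is an illustration under assumptions, not SPSS output: the 20 generated cases, the variable names, and the helper functions are hypothetical, and missing values are represented as `None`.

```python
# Hypothetical data set: 20 cases, with None marking a missing value.
rows = [
    {"id": i, "age": 20 + i, "income": 1000 * i, "satisfaction": 1 + i % 5}
    for i in range(1, 21)
]
rows[2]["age"] = None            # case 3: age missing (1 of 20 = 5%)
for index in (4, 9, 14):         # cases 5, 10, 15: income missing (3 of 20 = 15%)
    rows[index]["income"] = None

VARIABLES = ("age", "income", "satisfaction")


def missing_share(data, variable):
    """Proportion of cases with a missing value on one variable."""
    return sum(row[variable] is None for row in data) / len(data)


def exclude_listwise(data, variables=VARIABLES):
    """Keep only cases that are complete on every variable."""
    return [row for row in data if all(row[v] is not None for v in variables)]


def exclude_pairwise(data, needed):
    """Keep cases that are complete on the variables one analysis needs."""
    return [row for row in data if all(row[v] is not None for v in needed)]


for variable in VARIABLES:
    share = missing_share(rows, variable)
    verdict = "under 10%, can be treated" if share < 0.10 else "10% or more"
    print(f"{variable}: {share:.0%} missing ({verdict})")

print("Listwise cases:", len(exclude_listwise(rows)))
print("Pairwise cases for age vs satisfaction:",
      len(exclude_pairwise(rows, ("age", "satisfaction"))))
```

Listwise exclusion drops every case with any missing value, while pairwise exclusion only drops a case from the analyses that actually need the missing variable, which is why it keeps more data.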
  • 11. Normality
      - The shape of the data distribution for an individual metric variable.
      - Used to describe a symmetrical, bell-shaped curve, which has the greatest frequency of scores in the middle and smaller frequencies towards the extremes.
      - It is required for any parametric analysis.
      - Departures from normality can be treated as negligible if the sample size is more than 50 respondents.
  • 12. Normality (cont.)
      - Normality measures:
          - Kurtosis:
              - The peakedness (leptokurtic) or flatness (platykurtic) of the distribution compared to the normal distribution.
              - In a normal distribution the kurtosis value is zero (values up to ±10 are allowed).
          - Skewness:
              - The balance of the distribution.
              - Positive skew (right-skewed, tail to the right) or negative skew (left-skewed, tail to the left).
              - In a normal distribution the skewness value is zero (values up to ±3 are allowed).
  • 13. Normality (cont.)
      - Compare the 5% trimmed mean with the mean value.
      - Kolmogorov-Smirnov and Shapiro-Wilk significance values greater than 0.05 indicate normality, but these tests are very sensitive when the sample size is more than 200.
      - The histogram should form a bell shape.
      - Transformation can fix a non-normal distribution.
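To show what the skewness and kurtosis values measure, here is a minimal sketch using simple moment-based formulas. It is an assumption-laden illustration: SPSS reports sample-adjusted statistics, so its figures will differ slightly from these, and the data and function names are hypothetical. The ±3 and ±10 limits are the ones quoted in the slides.

```python
from statistics import fmean


def central_moment(values, k):
    """k-th central moment: the mean of (value - mean) ** k."""
    mean = fmean(values)
    return fmean([(v - mean) ** k for v in values])


def skewness(values):
    """Moment-based skewness; 0 for a perfectly symmetric distribution."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution scores 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


def within_slide_limits(values, skew_limit=3.0, kurt_limit=10.0):
    """Apply the limits from the slides: |skewness| <= 3, |kurtosis| <= 10."""
    return (abs(skewness(values)) <= skew_limit
            and abs(excess_kurtosis(values)) <= kurt_limit)


symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 2, 2, 3, 10]   # long tail to the right

print("Symmetric skewness:", skewness(symmetric))
print("Right-tailed skewness:", skewness(right_tailed))
print("Symmetric data within limits:", within_slide_limits(symmetric))
```

A long right tail gives a positive skewness value and a long left tail a negative one; a flat distribution such as the symmetric example gives a negative kurtosis value.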
  • 14. Linearity
      - It is an assumption of multivariate techniques based on correlational measures of association, including multiple regression. (Hair et al., 2010)
      - The relationship between the two variables should be linear. This means that when you look at a scatterplot of scores you should see (roughly) a straight line, not a curve (a curvilinear relationship). (Pallant, 2011)
      - Transformation can overcome the curvilinearity issue. (Hair et al., 2010)
  • 15. Linearity (cont.)
      - So you should not transform your data to correct a non-normal distribution if your sample is larger than 50.
      - But you should transform the data to avoid curvilinearity.
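As a rough illustration of how a transformation can straighten a curvilinear relationship, the sketch below compares the Pearson correlation before and after a log transformation. The exponential data and the choice of a log transformation are assumptions made for the example; a correlation coefficient does not replace the scatterplot check described above.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical curvilinear data: y grows exponentially with x.
x = list(range(1, 11))
y = [math.exp(0.5 * value) for value in x]

r_raw = pearson_r(x, y)                          # curved relationship
r_log = pearson_r(x, [math.log(v) for v in y])   # after log-transforming y

print(f"r before transformation: {r_raw:.3f}")
print(f"r after log transformation: {r_log:.3f}")
```

Because the linear correlation understates a curved relationship, the coefficient rises once the transformed scores fall on a straight line.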
  • 16. Outliers
      - These are case scores that are extreme and therefore have a much higher impact on the outcome of any statistical analysis.
      - An outlier is not an error in your data, but it makes your data unrepresentative of its population (e.g., income).
      - Outliers can be detected using box plots.
      - Outliers come from: (Hair et al., 2010; Tabachnick & Fidell, 1996)
          - A mistake in data entry (a 6 was entered as 66, etc.).
          - The missing-values code was not specified, so missing values are being read as case entries (e.g., 99 in SPSS).
  • 17. Outliers (cont.)
      - Outliers come from (continued): (Hair et al., 2010; Tabachnick & Fidell, 1996)
          - The outlier is not part of the population from which you intended to sample:
              - Extraordinary event (remove it).
              - Extraordinary observation (decide depending on your valid cases; lean toward eliminating it).
              - Neutral value for all variables (lean toward retaining it).
  • 18. Outliers (cont.)
      - The outlier is part of the population you wanted, but in the distribution it is seen as an extreme case.
      - In this case you have three choices:
          1) Delete the extreme cases.
          2) Change the outliers' scores so that they are still extreme but fit within a normal distribution (for example, make each one a unit larger or smaller than the last case that fits in the distribution).
          3) If the outliers seem to be part of an overall non-normal distribution, then a transformation can be done, but first check for normality.
  • 19. Outliers (cont.)
      - Outliers should be retained to ensure generalizability to the population, unless they are not representative of the population.
      - So, again, you should not transform your data to correct a non-normal distribution if your sample is larger than 50.
      - But you should transform the data to deal with outliers.
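Box-plot detection and the second choice above (keeping a score extreme but pulling it in) can be sketched as follows. This is a hedged example: the 1.5 × IQR fence is the usual box-plot convention rather than something stated in the slides, the quartiles come from Python's `statistics.quantiles` and may differ slightly from the hinges SPSS uses, and the income figures and function names are hypothetical.

```python
from statistics import quantiles


def box_plot_fences(values, k=1.5):
    """Lower and upper fences: k * IQR beyond the first and third quartile."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, k=1.5):
    """Return the values that fall outside the box-plot fences."""
    low, high = box_plot_fences(values, k)
    return [v for v in values if v < low or v > high]


def pull_in_outliers(values, k=1.5):
    """Replace each outlier with a score one unit beyond the last case
    that still fits, so it stays extreme without dominating the analysis."""
    low, high = box_plot_fences(values, k)
    inside = [v for v in values if low <= v <= high]
    floor, ceiling = min(inside) - 1, max(inside) + 1
    return [floor if v < low else ceiling if v > high else v for v in values]


# Hypothetical monthly incomes (in hundreds); 120 is an extreme case.
incomes = [20, 22, 23, 25, 26, 27, 29, 30, 31, 120]

outliers = find_outliers(incomes)
adjusted = pull_in_outliers(incomes)

print("Outliers:", outliers)
print("Adjusted scores:", adjusted)
```

The extreme income stays the largest score after adjustment but sits one unit above the highest ordinary case, which limits its influence on the mean and on correlations.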
  • 20. Thank You