DATA ENTRY USING IBM SPSS Yusuf O.B. Biostatistician KAIMRC-WR
Lecture Outline• What is SPSS• Uses of SPSS• Preparing to enter data• Preparing a data dictionary• Data Structures• Errors in data• Data Cleaning
What is SPSS• Statistical Package for the Social Sciences
Uses of SPSS• Data entry• Data cleaning and editing• Data analysis• Data presentation• Data Importing and Exporting• SPSS Data Library
Preparing a Data Dictionary• What is a data dictionary? – A book/document containing all variables and the codes/categories assigned to them – Also contains how the variables will be entered and other remarks necessary – Specifies width/ length of variables – Specifies how missing values will be assigned
Data coding• Translation of responses on the questionnaires or data collection sheets to specific categories for the purpose of analysis.• Assignment of numbers to the various levels of the variables.• The workload is light for pre-coded questionnaires
• Coding is important but tedious for open-ended questions.
• Need to assign numerical codes to categorical data before entering• For example, you may choose to assign codes of 1, 2, 3 and 4 to categories of “no pain”, “mild pain”, “moderate pain” and “severe pain” respectively
• These codes can be put in the questionnaire when collecting the data.• For binary data e.g. yes/no answers, it is often convenient to assign codes 1 (e.g. for yes) and 0 or 2 (for no).
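A minimal sketch of the coding step, written in Python rather than SPSS. The names `PAIN_CODES`, `YES_NO_CODES` and `code_response` are illustrative assumptions, not part of the lecture. It translates questionnaire responses to numeric codes with a lookup table and refuses any response that has no code yet:

```python
# Hypothetical codebook: category text -> numeric code, as in the pain example.
PAIN_CODES = {"no pain": 1, "mild pain": 2, "moderate pain": 3, "severe pain": 4}
YES_NO_CODES = {"yes": 1, "no": 0}  # 2 could be used for "no"; pick one and keep it


def code_response(response, codebook):
    """Return the numeric code for a response, or raise if it is not in the codebook."""
    key = response.strip().lower()
    if key not in codebook:
        raise ValueError(f"Response {response!r} has no code; update the codebook first")
    return codebook[key]


responses = ["Mild pain", "severe pain ", "No pain"]
coded = [code_response(r, PAIN_CODES) for r in responses]
print(coded)  # [2, 4, 1]
```

Raising an error for an unknown response, instead of guessing a code, keeps the codebook as the single place where categories are defined.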
NEED FOR CODING GUIDE/Data Dictionary• Prepare data in format to allow use of computers for statistical analysis.• Prepare code book or data dictionary for the questionnaire.• Specify range of values expected.
• The unit of measurement should be consistent for all observations on a variable, e.g. weight should be recorded in kg or in pounds, but not both interchangeably• Time: days or hours? For example, length of hospital stay
Example of data dictionary
Variable          Variable name   Value labels        Width   Remark
1. Age            AGE             ---------           2       Missing=99
2. Sex            SEX             1=male, 2=female    1       -----
3. Do you smoke?  SMOKE           1=YES, 2=NO         1       Missing=9
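The same dictionary can also be kept in machine-readable form and used to screen records as they are entered. The sketch below is in Python, not SPSS, and the structure `DATA_DICTIONARY` and the function `check_record` are assumptions made for illustration. Only the variable names, codes, widths and missing-value codes come from the example above.

```python
# Hypothetical machine-readable version of the example data dictionary.
DATA_DICTIONARY = {
    "AGE": {"label": "Age", "values": None, "width": 2, "missing": 99},
    "SEX": {"label": "Sex", "values": {1: "male", 2: "female"}, "width": 1, "missing": None},
    "SMOKE": {"label": "Do you smoke?", "values": {1: "yes", 2: "no"}, "width": 1, "missing": 9},
}


def check_record(record, dictionary=DATA_DICTIONARY):
    """Return a list of problems found in one record (an empty list means none)."""
    problems = []
    for name, spec in dictionary.items():
        if name not in record:
            problems.append(f"{name}: not entered")
            continue
        value = record[name]
        if value == spec["missing"]:
            continue  # the declared missing-value code is always acceptable
        if len(str(value)) > spec["width"]:
            problems.append(f"{name}: {value} is wider than {spec['width']} digits")
        if spec["values"] is not None and value not in spec["values"]:
            problems.append(f"{name}: {value} is not a defined code")
    return problems


print(check_record({"AGE": 23, "SEX": 1, "SMOKE": 2}))  # []
print(check_record({"AGE": 99, "SEX": 3, "SMOKE": 9}))  # ['SEX: 3 is not a defined code']
```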
Example• Topic: Smoking among medical students• 200 questionnaires/records• 6 questions/variables
Variables in questionnaire• Serial Number• Age• Place of residence 1=on campus, 2=off campus• Sex 1=male, 2=female• Do you smoke? 1=yes, 2=no
SPSS windows• Data editor – For data entry – For statistical analysis• Viewer – Results are displayed
Data Editor• Two views – Data view: for data entry – Variable view: to define variable characteristics
Preparing Data Structures in SPSS• Variable view – Variable names – Variable types – Value labels – Variable width – Columns – Measure
Data Entry• Use of computer packages such as SPSS – Improves the accuracy and speed of data analysis – Makes it easy to check for errors, produce graphical summaries and generate new variables• Log data in as they arrive• Back up frequently
• Problems with dates and times: – dates and times should be entered in a consistent manner, e.g. as day/month/year or month/day/year but not interchangeably. – It is important to find out what format the statistical package can read
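One way to enforce a single date convention at entry is to parse every date against one fixed format and flag anything that does not fit. This is a minimal Python sketch. The day/month/year convention and the names `DATE_FORMAT` and `parse_entry_date` are assumptions chosen for illustration.

```python
from datetime import datetime

# Assumed convention for this sketch: every date is entered as day/month/year.
DATE_FORMAT = "%d/%m/%Y"


def parse_entry_date(text):
    """Parse a date typed as dd/mm/yyyy; return None if it does not fit the format."""
    try:
        return datetime.strptime(text.strip(), DATE_FORMAT).date()
    except ValueError:
        return None


print(parse_entry_date("05/03/2014"))  # 2014-03-05 (5 March, not 3 May)
print(parse_entry_date("03/25/2014"))  # None: there is no month 25, so the entry is flagged
print(parse_entry_date("30/02/2014"))  # None: 30 February is not a real date
```

A check like this cannot catch an entry such as 05/03/2014 typed in month/day/year order, because it is also a valid day/month/year date. That is why the convention has to be agreed before entry starts.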
Handling missing data• Decide what to do with missing values before data are entered.• In most cases, a symbol or code is needed to represent a missing value• Statistical packages deal with missing values in different ways• Some use special characters (e.g. a full stop or asterisk) to indicate missing values, whereas others require you to define your own code for a missing value (commonly used values are 9, 99 or 999)• The value that is chosen should be one that is not possible for that variable
• For example, when entering a categorical variable with four categories (coded 1, 2, 3 and 4), you may choose the value 9 to represent missing values.• However, if the variable is the age of a child, then a different code should be chosen, because 9 is a possible age.
• If a large proportion of the data are missing, then the results are likely to be unreliable• Reasons why data are missing should always be investigated: how much is missing and why?• If missing data tend to cluster around a particular variable, or in a particular subgroup of individuals, then it may indicate that the variable is not applicable or has not been measured for that group of individuals
• Then the group of individuals should be excluded from any analysis on that variable• Or it may be that the data are simply sitting on a piece of paper in someone's drawer and are yet to be entered!
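The two points above, choosing a missing-value code that cannot be a real value and finding out how much is missing and where, can be sketched in Python as follows. The records are made-up illustrative data. The function names and the assumed child age range of 0 to 17 years are assumptions, not part of the lecture.

```python
# Hypothetical coded records; 99 marks a missing AGE and 9 a missing SMOKE answer.
MISSING_CODES = {"AGE": 99, "SMOKE": 9}
records = [
    {"AGE": 21, "SEX": 1, "SMOKE": 2},
    {"AGE": 99, "SEX": 2, "SMOKE": 9},
    {"AGE": 24, "SEX": 2, "SMOKE": 9},
    {"AGE": 22, "SEX": 1, "SMOKE": 1},
]


def missing_code_is_safe(code, possible_values):
    """A missing-value code is safe only if it can never be a real value."""
    return code not in possible_values


def count_missing(rows, variable, code):
    """Count how many rows carry the missing-value code for one variable."""
    return sum(1 for row in rows if row[variable] == code)


# 9 is safe for a four-category variable but not for a child's age in years.
print(missing_code_is_safe(9, {1, 2, 3, 4}))  # True
print(missing_code_is_safe(9, range(0, 18)))  # False

# How much is missing, overall and within one subgroup (females, SEX == 2)?
for name, code in MISSING_CODES.items():
    print(name, count_missing(records, name, code), "of", len(records))
females = [row for row in records if row["SEX"] == 2]
print(count_missing(females, "SMOKE", 9), "of", len(females), "females lack SMOKE")
```

In this made-up data every missing SMOKE answer falls in one subgroup, which is the kind of clustering that should prompt the question "why?" before any analysis.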
Errors in data• In any study, there is always the potential for errors to occur in a data set, either at the outset when taking the measurements, or when collecting and entering data into a computer• It is hard to eliminate all of these errors• But one can reduce the number of typing errors by checking the data carefully once they have been entered.
Common sources of error• “Not applicable” or “blank” coded as “0”• Typing errors on data entry, e.g. 18 instead of 81• Column shift: data for one variable entered under the adjacent column• Coding errors• Loss of concentration
Data cleaning• Two-step process: – detection of errors – correction of errors
Detecting errors• Check for completeness and correctness of records• Specify admissible values during data entry• Range checks: permissible responses• Statistical editing
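How to Detect Errors via Statistical editing• Produce descriptive statistics for all variables.• Check the frequency distribution of each variable, e.g. if sex is coded 1=male, 2=female, then a value of 3 must be an error• A standard deviation higher than the mean (for a variable that cannot be negative) suggests outlying observations; check for them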
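Statistical editing can be illustrated outside SPSS with a few lines of Python. The data below are made up and contain two deliberate errors. This is a sketch of the idea only, not the SPSS Frequencies or Descriptives procedure.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical entered data containing two deliberate errors.
sex = [1, 2, 2, 1, 3, 2, 1]          # 3 is not a defined code
age = [21, 23, 22, 24, 20, 221, 22]  # 221 is probably a typing error

# Frequency distribution: every distinct value with its count.
print(sorted(Counter(sex).items()))  # [(1, 3), (2, 3), (3, 1)]

# Descriptive statistics for a numerical variable.
print(min(age), max(age))            # 20 221
sd_exceeds_mean = stdev(age) > mean(age)
print(sd_exceeds_mean)               # True: a hint to look for outlying observations
```

The frequency table exposes the undefined code 3, and the maximum of 221 together with the large standard deviation points to the suspicious age.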
Quality Control• Record verification (double entry) – Does not rule out the possibility that the same error was made on both occasions – The disadvantage of this approach is that it takes twice as long to enter the data, which may have major cost or time implications• Creating check files• Random checking: records are selected at random but should represent all forms being entered
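Double entry amounts to comparing two independently typed copies of the same records and listing every disagreement for checking against the paper form. The sketch below is in Python. The function `compare_entries` and the sample records are hypothetical, and both files are assumed to hold the same records in the same order.

```python
def compare_entries(first_entry, second_entry):
    """Return (record_index, variable, first_value, second_value) for every mismatch."""
    mismatches = []
    for index, (a, b) in enumerate(zip(first_entry, second_entry)):
        for variable in a:
            if a[variable] != b.get(variable):
                mismatches.append((index, variable, a[variable], b.get(variable)))
    return mismatches


# Hypothetical: the same two questionnaires typed by two operators.
entry_1 = [{"SERIAL": 1, "AGE": 18, "SEX": 1}, {"SERIAL": 2, "AGE": 22, "SEX": 2}]
entry_2 = [{"SERIAL": 1, "AGE": 81, "SEX": 1}, {"SERIAL": 2, "AGE": 22, "SEX": 2}]

print(compare_entries(entry_1, entry_2))  # [(0, 'AGE', 18, 81)]
```

The comparison shows only that the two entries disagree, not which one is right, and it stays silent if both operators made the same mistake.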
Error checking• Categorical data: relatively easy, since values that are not allowable must be errors• Check the frequency distribution of each variable, e.g. if sex is coded 1=male, 2=female, then a value of 3 must be an error• Numerical data: produce descriptive statistics for all variables.• A standard deviation higher than the mean (for a variable that cannot be negative) suggests outlying observations; check for them• Range checks: upper and lower limits can be specified for each variable
• Dates: accuracy is not easy to check, but some values must be incorrect, for example 30th February, any day of the month greater than 31, or any month greater than 12• Apply logical checks: – date of birth should correspond to the patient's age – subjects should usually have been born before entering the study (at least in most studies) – patients who have died should not appear on subsequent follow-up visits – there should be no pregnant men
• With all error checks, a value should only be corrected if there is evidence that a mistake has been made• Do not change values simply because they look unusual; investigate
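Three of the logical checks listed above (date of birth against age, birth before study entry, and no pregnant men) can be sketched in Python as follows. The field names `birth_date`, `entry_date`, `age`, `sex` and `pregnant` are hypothetical, and the function only reports warnings. It never changes a value, in line with the rule that corrections need evidence.

```python
from datetime import date

MALE, FEMALE = 1, 2  # sex codes from the example questionnaire


def logical_checks(rec):
    """Return a list of warnings for one record; nothing is changed automatically."""
    warnings = []
    if rec["birth_date"] >= rec["entry_date"]:
        warnings.append("born on or after study entry")
    # Age in completed years at study entry, derived from the date of birth.
    had_birthday = (rec["entry_date"].month, rec["entry_date"].day) >= (
        rec["birth_date"].month, rec["birth_date"].day)
    derived_age = rec["entry_date"].year - rec["birth_date"].year - (0 if had_birthday else 1)
    if derived_age != rec["age"]:
        warnings.append(f"age {rec['age']} does not match date of birth ({derived_age})")
    if rec["sex"] == MALE and rec["pregnant"]:
        warnings.append("male recorded as pregnant")
    return warnings


record = {
    "birth_date": date(1990, 6, 15),
    "entry_date": date(2014, 3, 1),
    "age": 32,
    "sex": MALE,
    "pregnant": True,
}
print(logical_checks(record))
# ['age 32 does not match date of birth (23)', 'male recorded as pregnant']
```

Each warning is a prompt to go back to the source document. Here the recorded age of 32 may be a transposition of 23, but only the questionnaire can confirm that.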
Summary• Experience comes with practice• Input influences output