RSS 2012 Data Entry SPSS

DATA ENTRY USING IBM SPSS

Yusuf O.B.
Biostatistician
KAIMRC-WR

Lecture Outline
• What is SPSS
• Uses of SPSS
• Preparing to enter data
• Preparing a data dictionary
• Data Structures
• Errors in data
• Data Cleaning

What is SPSS

Statistical Package for the Social Sciences

Uses of SPSS

• Data entry
• Data cleaning and editing
• Data analysis
• Data presentation
• Data Importing and Exporting
• SPSS Data Library

Preparing a Data Dictionary
• What is a data dictionary?
– A book/document containing all variables and
the codes/categories assigned to them
– Also contains how the variables will be
entered and other remarks necessary
– Specifies width/ length of variables
– Specifies how missing values will be
assigned

Data coding
• Translation of responses on the
questionnaires or data collection sheets to
specific categories for the purpose of
analysis.
• Assignment of numbers to the various
levels of the variables.
• The workload is light for pre-coded
questionnaires.

• Important and tedious for open ended
questions.

• Need to assign numerical codes to
categorical data before entering
• For example, you may choose to assign
codes of 1, 2, 3 and 4 to categories of “no
pain”, “mild pain”, “moderate pain” and
“severe pain” respectively

• These codes can be put in the
questionnaire when collecting the data.
• For binary data e.g. yes/no answers, it is
often convenient to assign codes 1 (e.g.
for yes) and 0 or 2 (for no).
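As a minimal sketch of the coding step (written in Python purely for illustration, not SPSS syntax), the pain categories above can be translated to their numerical codes with a lookup table. The names PAIN_CODES and code_response are hypothetical, and the sample responses are made up.

```python
# Hypothetical lookup table: category text -> numerical code (codes from the example above).
PAIN_CODES = {
    "no pain": 1,
    "mild pain": 2,
    "moderate pain": 3,
    "severe pain": 4,
}

def code_response(response, codes=PAIN_CODES):
    """Return the numerical code for a response, or None if it is not a known category."""
    return codes.get(response.strip().lower())

# Made-up responses as they might be written on a questionnaire.
responses = ["Mild pain", "severe pain", "no pain", "unbearable"]
coded = [code_response(r) for r in responses]
print(coded)  # [2, 4, 1, None]
```

A response that matches no category comes back as None, so it can be reviewed rather than silently given a code.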

Need for a Coding Guide/Data Dictionary
• Prepare data in format to allow use of
computers for statistical analysis.
• Prepare code book or data dictionary for
the questionnaire.
• Specify range of values expected.

• Unit of measurement should be consistent
for all observations on a variable. E.g.
weight should be recorded in kg or in
pounds, but not both interchangeably
• Time: days? Hours?
– For example length of hospital stay

Example of data dictionary
Variable Name      Variable label   Value labels   Width   Remark
1. Age             AGE              ---------      2       Missing=99
2. Sex             SEX              1=male         1       -----
                                    2=female
3. Do you smoke?   SMOKE            1=YES          1       Missing=9
                                    2=NO
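One way to picture the data dictionary is as a lookup structure. The sketch below holds the three example rows in a Python dict and uses it to translate stored codes back to their meaning. The field names ("item", "values", "width", "missing") and the describe function are assumptions made for illustration, not SPSS features.

```python
# Hypothetical in-memory version of the example data dictionary.
DATA_DICTIONARY = {
    "AGE": {"item": "Age", "values": None, "width": 2, "missing": 99},
    "SEX": {"item": "Sex", "values": {1: "male", 2: "female"}, "width": 1, "missing": None},
    "SMOKE": {"item": "Do you smoke?", "values": {1: "YES", 2: "NO"}, "width": 1, "missing": 9},
}

def describe(name, value):
    """Translate a stored code back to its meaning using the dictionary."""
    entry = DATA_DICTIONARY[name]
    if entry["missing"] is not None and value == entry["missing"]:
        return "missing"
    if entry["values"] is None:
        return str(value)  # no value labels: report the number itself
    return entry["values"].get(value, "invalid code")

print(describe("SEX", 2))    # female
print(describe("SMOKE", 9))  # missing
print(describe("SEX", 3))    # invalid code
```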

Example
• Topic: Smoking among medical students
• 200 questionnaires/records
• 6 questions/variables

Variables in questionnaire
• Serial Number
• Age
• Place of residence
1=on campus, 2=off campus
• Sex
1=male, 2=female
• Do you smoke?
1=yes, 2=no

• At what age did you start to
smoke? -------

Exercise

– Complete the data dictionary above

SPSS windows
• Data editor
– For data entry
– For statistical analysis
• Viewer
– Results are displayed

Data Editor
• Two views
– Data view: for data entry
– Variable view: to define variable
characteristics

Preparing Data Structures in
SPSS
• Variable view
– Variable names
– Variable types
– Value labels
– Variable width
– Columns
– Measure

Data Entry
• Use of computer packages such as SPSS
– Improves the accuracy and speed of data analysis
– Makes it easy to check for errors, produce graphical
summaries and generate new variables
• Log in data as they arrive
• Back up frequently

• Problems with dates and times:
– dates and times should be entered in a
consistent manner, e.g. as day/month/year or
month/day/year but not interchangeably.
– It is important to find out what format the
statistical package can read
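To illustrate why one agreed date format matters, the sketch below (Python, for illustration only) parses dates strictly as day/month/year and rejects anything that does not fit. DATE_FORMAT and parse_date are hypothetical names, and the choice of day/month/year is an assumption.

```python
from datetime import datetime

DATE_FORMAT = "%d/%m/%Y"  # assumed convention: day/month/year

def parse_date(text, fmt=DATE_FORMAT):
    """Return a date if text matches the agreed format, otherwise None."""
    try:
        return datetime.strptime(text, fmt).date()
    except ValueError:
        return None

print(parse_date("03/04/2012"))  # 2012-04-03, i.e. 3 April, not 4 March
print(parse_date("04/13/2012"))  # None: read as day/month/year there is no month 13
```

Note that a date such as 03/04/2012 is valid under both conventions, so mixing formats can go undetected; only a consistent convention at entry prevents this.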

Handling missing data

• Consider what to do with missing values before data is
entered.
• In most cases, need to use some symbol to represent a
missing value
• Statistical packages deal with missing values in different
ways
• Some use special characters (e.g. a full stop or asterisk)
to indicate missing values, whereas others require you to
define your own code for a missing value (commonly
used values are 9, 99 or 999)
• The value that is chosen should be one that is not
possible for that variable

• For example when entering a categorical
variable with four categories (coded 1,2,3,
and 4), you may choose the value 9 to
represent missing values.
• However, if the variable is the age of a child,
9 is a possible value, so a different code
(e.g. 99) should be chosen.
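A short sketch of why the missing-value code must be declared before analysis (Python, for illustration; the variable names and ages are made up). If the code 99 were treated as a real age, it would distort the mean.

```python
# Hypothetical missing-value codes per variable.
MISSING_CODES = {"PAIN": 9, "AGE": 99}

def recode_missing(variable, value):
    """Replace the variable's missing-value code with None; keep real values."""
    return None if value == MISSING_CODES.get(variable) else value

ages = [4, 9, 99, 12]  # made-up ages of children; 99 stands for "missing"
clean_ages = [recode_missing("AGE", a) for a in ages]
print(clean_ages)  # [4, 9, None, 12]

observed = [a for a in clean_ages if a is not None]
mean_age = sum(observed) / len(observed)  # mean of the 3 observed ages
naive_mean = sum(ages) / len(ages)        # wrongly treats 99 as a real age
print(naive_mean)  # 31.0
```

The age 9 is kept as a real value here because the missing code for AGE is 99, not 9.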

• If a large proportion of data is missing, then the results
are likely to be unreliable

• Reasons why data are missing should always be
investigated: how much is missing and why?

• If missing data tend to cluster around a particular
variable, or in a particular subgroup of individuals, then
it may indicate that the variable is not applicable or has
not been measured for that group of individuals

• In that case, the group of individuals should be
excluded from any analysis on that
variable
• Or it may be that the data are simply sitting
on a piece of paper in someone’s drawer
and are yet to be entered!

Errors in data
• In any study, there is always the potential
for errors to occur in a data set, either at
the outset when taking the measurements,
or when collecting, and entering data onto
a computer

• It is hard to eliminate all of these errors

• But one can reduce the number of typing
errors by checking the data carefully once
they have been entered.

Common sources of error
• “not applicable” or “blank” coded as “0”
• typing errors on data entry, e.g. 18 instead
of 81
• column shift: data for one variable
entered under the adjacent column
• coding errors
• loss of concentration

Data cleaning
Two-step process:
• detection of errors
• correction of errors

Detecting errors
• Check for completeness and
correctness of records.
• Indicate admissible values during
data entry.
• Range checks: permissible responses.

• Statistical editing

How to Detect Errors via
Statistical editing
• Produce descriptive statistics for all
variables.
• Check the frequency distribution of each
variable, e.g. if sex is coded 1=male, 2=female,
a value of 3 must be an error.
• If the standard deviation is higher than the mean,
check for outlying observations.
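The two checks above can be sketched as follows (Python standard library, for illustration only; all data values are made up). The first part tabulates a coded variable and lists codes that are not allowed; the second flags a numerical variable whose standard deviation exceeds its mean.

```python
from collections import Counter
from statistics import mean, stdev

# Frequency distribution of a coded variable (1=male, 2=female).
sex = [1, 2, 2, 1, 3, 2, 1]
freq = Counter(sex)
invalid = {code: n for code, n in freq.items() if code not in {1, 2}}
print(invalid)  # {3: 1} -> one record has the impossible code 3

# Descriptive statistics for a numerical variable (length of stay in days).
stay_days = [2, 3, 1, 4, 2, 90]
m, sd = mean(stay_days), stdev(stay_days)
needs_review = sd > m  # a hint to look for outliers, not proof of an error
print(m, round(sd, 1), needs_review)
```

Here the value 90 pulls the standard deviation above the mean, which is the cue to go back to the source record and check it.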

Quality Control

• Record verification (double entry)
– Does not rule out the possibility that the same error
has been made on both occasions
– Disadvantage of this approach is that it takes twice as
long to enter the data, which may have major cost or
time implications
• Creating check files
• Random checking: selection at random, but the sample
should represent all forms being entered
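A minimal sketch of record verification by double entry (Python, for illustration only). It assumes both entry passes are stored as lists of records in the same order; compare_entries and the sample records are hypothetical.

```python
def compare_entries(first, second):
    """Return (record index, variable, first value, second value) for each mismatch."""
    mismatches = []
    for i, (a, b) in enumerate(zip(first, second)):
        for var in a:
            if a[var] != b.get(var):
                mismatches.append((i, var, a[var], b.get(var)))
    return mismatches

# Made-up data: the same two questionnaires entered twice.
first_pass = [{"AGE": 18, "SEX": 1}, {"AGE": 21, "SEX": 2}]
second_pass = [{"AGE": 81, "SEX": 1}, {"AGE": 21, "SEX": 2}]

print(compare_entries(first_pass, second_pass))  # [(0, 'AGE', 18, 81)]
```

A mismatch only shows that the two passes disagree; the original questionnaire is still needed to decide which value is correct.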

Error checking
• Categorical data: relatively easy, since values that are
not allowable must be errors
• Check the frequency distribution of each variable, e.g. if sex
is coded 1=male, 2=female, a value of 3 must be an error

• Numerical data: produce descriptive statistics for all
variables.
• If the standard deviation is higher than the mean, check for
outlying observations

• Range checks: upper and lower limits can be specified for
each variable
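A range check can be sketched as a table of lower and upper limits per variable (Python, for illustration only). The limits and the record below are made-up assumptions; real limits would come from the data dictionary.

```python
# Hypothetical (lower, upper) limits for each variable.
LIMITS = {"AGE": (15, 40), "SEX": (1, 2)}

def out_of_range(record, limits=LIMITS):
    """Return the names of variables whose values fall outside their limits."""
    flagged = []
    for var, (low, high) in limits.items():
        value = record.get(var)
        if value is not None and not (low <= value <= high):
            flagged.append(var)
    return sorted(flagged)

record = {"SERIAL": 7, "AGE": 81, "SEX": 2}  # made-up record
print(out_of_range(record))  # ['AGE']
```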

• Dates: accuracy is not easy to check, but some
values must be incorrect, for example 30 February,
any day of the month greater than 31, or any
month greater than 12
• Apply logical checks:
– date of birth should correspond to patient’s age
– subjects should usually have been born before
entering the study (at least in most studies)
– patients who have died should not appear on
subsequent follow-up visits
– there should be no pregnant men
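Some of the logical checks above can be sketched as a function that returns a list of problems for one record (Python, for illustration only). The field names ("birth", "enrolled", "age", "sex", "pregnant") and the sample record are hypothetical; sex is assumed to be coded 1=male, 2=female.

```python
from datetime import date

def logical_errors(rec):
    """Return a list of logical inconsistencies found in one record."""
    errors = []
    birth, enrolled = rec["birth"], rec["enrolled"]
    if birth >= enrolled:
        errors.append("born on or after study entry")
    # Age in completed years at study entry, derived from the date of birth.
    had_birthday = (enrolled.month, enrolled.day) >= (birth.month, birth.day)
    derived_age = enrolled.year - birth.year - (0 if had_birthday else 1)
    if derived_age != rec["age"]:
        errors.append("age does not match date of birth")
    if rec["sex"] == 1 and rec["pregnant"]:
        errors.append("pregnant male")
    return errors

# Made-up record with two inconsistencies.
record = {"birth": date(1990, 5, 20), "enrolled": date(2012, 3, 1),
          "age": 22, "sex": 1, "pregnant": True}
print(logical_errors(record))
# ['age does not match date of birth', 'pregnant male']
```

A flagged record is a prompt to investigate, not a licence to change the value.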

• With all error checks, a value should only
be corrected if there is evidence that a
mistake has been made
• Do not change values simply because
they look unusual; investigate

Summary
• Experience comes with practice
• Input influences output
