This document discusses various methods for analyzing quantitative data, including coding data, creating a codebook, entering data into a grid format for analysis, checking data for accuracy, and using computers and statistical software to analyze data. It covers descriptive statistics for one and two variables, such as frequency distributions, measures of central tendency and variation, scatterplots, cross-tabulations, and measures of association between two variables.
UNIVARIATE & BIVARIATE ANALYSIS
UNIVARIATE BIVARIATE & MULTIVARIATE
UNIVARIATE ANALYSIS
-One variable analysed at a time
BIVARIATE ANALYSIS
-Two variables analysed at a time
MULTIVARIATE ANALYSIS
-More than two variables analysed at a time
TYPES OF ANALYSIS
DESCRIPTIVE ANALYSIS
INFERENTIAL ANALYSIS
DESCRIPTIVE ANALYSIS
Transformation of raw data
Facilitates easy understanding and interpretation
Deals with summary measures relating to the sample data
e.g., what is the average age of the sample?
INFERENTIAL ANALYSIS
Carried out after descriptive analysis
Inferences about population parameters are drawn from sample results
Generalizes results from the sample to the population
e.g., is the average age of the population different from 35?
DESCRIPTIVE ANALYSIS OF UNIVARIATE DATA
1. Prepare a frequency distribution of each variable
Missing Data
Situation where certain questions are left unanswered
Analysis of multiple responses
Measures of central tendency
3 measures of central tendency:
1. Mean
2. Median
3. Mode
MEAN
Arithmetic average of a variable
Appropriate for interval and ratio scale data
x̄ = Σx / n
MEDIAN
The middle value of the data
Computed for ratio-, interval-, or ordinal-scale data
Data needs to be arranged in ascending or descending order
MODE
Point of maximum frequency
Should not be computed for ordinal or interval data unless grouped.
Widely used in business
MEASURES OF DISPERSION
Measures of central tendency do not explain the distribution of a variable
4 measures of dispersion:
1. Range
2. Variance and standard deviation
3. Coefficient of variation
4. Relative and absolute frequencies
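As an illustration of the first three dispersion measures, here is a minimal Python sketch using only the standard library; the sample ages are invented for the example and are not from the original slides.

    import statistics

    ages = [21, 25, 30, 34, 40]            # hypothetical sample data

    data_range = max(ages) - min(ages)     # range: max minus min
    var = statistics.variance(ages)        # sample variance
    sd = statistics.stdev(ages)            # sample standard deviation
    cv = sd / statistics.mean(ages)        # coefficient of variation

    print(data_range, var, round(sd, 2), round(cv, 3))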
DESCRIPTIVE ANALYSIS OF BIVARIATE DATA
There are three types of measures used:
1. Cross tabulation
2. Spearman's rank correlation coefficient
3. Pearson's linear correlation coefficient
Cross Tabulation
Responses to two questions are combined
Spearman's rank-order correlation coefficient
Used in the case of ordinal data
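A brief sketch of these three bivariate measures in Python, assuming pandas and SciPy are available; the variables and values below are invented for illustration, not taken from the slides.

    import pandas as pd
    from scipy.stats import spearmanr, pearsonr

    df = pd.DataFrame({
        "gender":   ["M", "F", "F", "M", "F", "M"],
        "employed": ["yes", "yes", "no", "no", "yes", "yes"],
        "rank_a":   [1, 2, 3, 4, 5, 6],        # ordinal data
        "rank_b":   [2, 1, 4, 3, 6, 5],
        "age":      [23, 31, 27, 45, 38, 29],  # ratio data
        "income":   [25, 40, 32, 60, 48, 35],
    })

    # 1. Cross tabulation: responses to two questions combined
    print(pd.crosstab(df["gender"], df["employed"]))

    # 2. Spearman's rank correlation, for ordinal data
    rho, _ = spearmanr(df["rank_a"], df["rank_b"])

    # 3. Pearson's linear correlation, for interval/ratio data
    r, _ = pearsonr(df["age"], df["income"])

    print(round(rho, 2), round(r, 2))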
2. DEALING WITH DATA
CODING DATA: systematically reorganizing raw data into a format that is machine-readable (i.e., easy to analyze using computers).
Coding can be a clerical task when the data are recorded as numbers on well-organized recording sheets, but it is very difficult when, for example, a researcher wants to code answers to open-ended survey questions into numbers, in a process similar to latent content analysis.
3. DEALING WITH DATA
Researchers use a coding system and a codebook for data coding:
Coding system: a set of rules stating that certain numbers are assigned to variable attributes.
Example: a researcher codes males as 1 and females as 2.
Codebook: a document describing the coding system and the location of data for variables in a format that computers can use.
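To make the idea concrete, a coding system and codebook can be represented as a simple mapping. This is only an illustrative sketch, not part of the original slides; the variable names and codes are invented.

    # Codebook: documents which numbers (coding system) stand for
    # which attributes of each variable
    codebook = {
        "gender": {1: "Male", 2: "Female"},
        "employment": {1: "Employed", 2: "Unemployed", 3: "Student"},
    }

    # Decode one coded case using the codebook
    raw_response = {"gender": 2, "employment": 3}
    decoded = {var: codebook[var][code] for var, code in raw_response.items()}
    print(decoded)   # {'gender': 'Female', 'employment': 'Student'}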
4. DEALING WITH DATA
Precoding: placing the code categories on the questionnaire.
If a researcher does not precode, his first step after collecting data is to create a codebook. He also gives each case an identification number to keep track of the cases. Next, he transfers the information from each questionnaire into a format that computers can read.
5. ENTERING DATA
Most computer programs designed for data analysis need the data in a grid format.
In the grid, each row represents a respondent, subject, or case; rows are called DATA RECORDS because each is a record of the data for a single case.
Columns or sets of columns represent specific variables.
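A minimal sketch of the grid format, assuming pandas; the case IDs, codes, and variables are hypothetical. Each row is one data record and each column is one variable.

    import pandas as pd

    # Each row = one case (data record); each column = one variable
    data = pd.DataFrame(
        [
            [1, 1, 34],   # case 1: gender code 1, age 34
            [2, 2, 28],   # case 2: gender code 2, age 28
            [3, 2, 41],   # case 3: gender code 2, age 41
        ],
        columns=["id", "gender", "age"],
    )
    print(data)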
6. CLEANING DATA
Accuracy is extremely important when coding data. Errors made in coding can cause misleading results, so it is highly recommended to recheck all coding.
Coding can be verified in two ways:
Possible code cleaning (wild code checking): checking the categories of all variables for impossible codes.
Contingency cleaning (consistency checking): cross-classifying two variables and looking for logically impossible combinations.
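A sketch of both checks, assuming pandas; the variables, valid codes, and consistency rule below are invented for illustration.

    import pandas as pd

    df = pd.DataFrame({
        "gender":   [1, 2, 7, 1],   # valid codes: 1 = male, 2 = female
        "pregnant": [2, 1, 2, 1],   # 1 = yes, 2 = no
    })

    # Possible code cleaning (wild code checking):
    # flag codes outside the set defined in the codebook
    wild = df[~df["gender"].isin([1, 2])]
    print(wild)           # the row with the impossible code 7

    # Contingency cleaning (consistency checking):
    # cross-classify two variables and flag impossible combinations,
    # e.g., a case coded both male (1) and pregnant (1)
    inconsistent = df[(df["gender"] == 1) & (df["pregnant"] == 1)]
    print(inconsistent)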
7. COMPUTERS AND SOCIAL RESEARCH
Researchers use computers to perform specialized tasks more efficiently and effectively.
Examples: organizing data, calculating statistics, writing reports.
8. COMPUTERS AND SOCIAL RESEARCH
Computers began as mechanical devices for sorting cards that had holes punched into them.
Each hole was punched in a specific location and represented information about a variable.
These machines organized vast amounts of information more quickly, reliably, and efficiently than paper-and-pencil methods.
IBM CARDS: thin cardboard cards that had 80 columns and 12 rows, or 960 spaces for information.
9. COMPUTERS AND SOCIAL RESEARCH
COMPUTER TERMINAL is a simple typewriter-like
device connected to a mainframe computer, which
people use to type data and instructions directly into the
computer.
MICROCOMPUTERS have replaced mainframes for
many tasks, have brought computing to more people, and
have stimulated new uses for computers.
Three basic parts:
Monitor (CRT or VDT)
Keyboard
CPU (Central Processing Unit)
10. COMPUTERS AND SOCIAL RESEARCH
Information can get into a microcomputer in 4 ways:
Some is built into the computer memory itself
User can type it on the keyboard
It can come across a telephone line and into the computer
through a modem
It can be stored on floppy disks or data travellers, which
computers can read.
11. HOW COMPUTERS HELP THE RESEARCHER
Purpose of the computer: to write reports, to organize
large amounts of data, and to compute statistical
measures.
Computers help researchers perform tasks faster and
with greater accuracy than by hand.
Without the appropriate computer, a researcher cannot
analyze data from a large-scale research project or
calculate complicated statistics.
SPSS (Statistical Package for the Social Sciences):
statistical software most social researchers use,
specifically designed for analyzing quantitative social
science data.
12. STATISTICS
a branch of applied mathematics used to manipulate and
summarize the features of numbers.
Descriptive statistics describe numerical data. They can
be categorized by the number of variables involved:
Univariate
Bivariate
Multivariate
13. RESULTS WITH ONE VARIABLE (UNIVARIATE)
Frequency Distributions
Measures of Central Tendency
Measures of Variation
14. RESULTS WITH ONE VARIABLE
Univariate describes one variable. The easiest way to
describe the numerical data of one variable is through
frequency distribution.
FREQUENCY DISTRIBUTION can be used with
nominal-, ordinal-, interval-, or ratio-level data and takes
many forms.
Example: data from 400 respondents can be
summarized by the gender of the respondents with a raw
count or a percentage frequency distribution.
15. GENDER COUNT OF RESPONDENTS

RAW COUNT FREQUENCY DISTRIBUTION
Gender    Frequency
Male      100
Female    300
Total     400

PERCENTAGE FREQUENCY DISTRIBUTION
Gender    Frequency
Male      25%
Female    75%
Total     100%
[Bar chart: frequency distribution of male and female respondents]
16. RESULTS WITH ONE VARIABLE
Common types of graphic representations:
Histogram
Bar chart
Pie chart
For interval- or ratio-level data, the researcher groups the
information into categories; the grouped categories should
be mutually exclusive.
Interval- or ratio-level data are often plotted in a
frequency polygon.
17. RESULTS WITH ONE VARIABLE
MEASURES OF CENTRAL TENDENCY (see the sketch below)
Mean: the arithmetic average
most widely used measure of central tendency
can be used only with interval- or ratio-level data
Median: the middle point
the 50th percentile, or the point at which half the cases
are above it and half below it
can be used with ordinal-, interval-, or ratio-level
data (not nominal-level)
Mode: the most common or frequently occurring number
easiest to use; can be used with nominal-, ordinal-,
interval-, or ratio-level data
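A small sketch of the three measures using Python's standard statistics module; the scores are made up for illustration:

```python
# Mean, median, and mode with Python's statistics module (illustrative data).
import statistics

scores = [2, 3, 3, 5, 7, 9]

print(statistics.mean(scores))     # mean: arithmetic average -> 4.833...
print(statistics.median(scores))   # median: middle point -> 4.0
print(statistics.mode(scores))     # mode: most frequent value -> 3
# statistics.multimode(scores) returns every mode when there is more than one
```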
18. RESULTS WITH ONE VARIABLE
If the frequency distribution forms a NORMAL or
BELL-SHAPED CURVE, the three measures of central
tendency equal each other.
[Diagram: bell-shaped curve; the mean, median, and mode coincide at the center. Horizontal axis: lowest to highest values of the variable; vertical axis: number of cases]
19. RESULTS WITH ONE VARIABLE
SKEWED DISTRIBUTION: the measures of central
tendency are not equal.
[Diagrams: two skewed distributions showing the relative positions of the mode, median, and mean]
20. RESULTS WITH ONE VARIABLE
If most cases have lower scores with a few extreme high
scores, the mean will be the highest, the median in the
middle, and the mode the lowest.
If most cases have higher scores with a few extreme low
scores, the mean will be the lowest, the median in the
middle, and the mode the highest.
In general, the median is best for skewed
distributions, although the mean is used in most other
statistics.
21. RESULTS WITH ONE VARIABLE
MEASURES OF VARIATION: a one-number summary
of the spread of a distribution
Three basic ways to measure variation:
Range
Percentile
Standard deviation
22. RESULTS WITH ONE VARIABLE
RANGE: the simplest measure
the distance between the largest and smallest scores
for ordinal-, interval-, and ratio-level data
PERCENTILES: tell the score at a specific place within
the distribution
for ordinal-, interval-, and ratio-level data
STANDARD DEVIATION: the most difficult measure of
dispersion to compute, yet also the most
comprehensive and widely used
for interval- or ratio-level data
based on the mean; gives an “average
distance” between all scores and the mean
(Range and percentiles are computed in the sketch below; the standard deviation is worked through on the following slides.)
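A brief sketch of the first two measures with numpy; the scores happen to be the eight years-of-schooling values used in the standard-deviation example a few slides below:

```python
# Range and percentiles with numpy (data from the worked example below).
import numpy as np

scores = np.array([15, 12, 12, 10, 16, 18, 8, 9])

print(scores.max() - scores.min())   # range: largest minus smallest -> 10
print(np.percentile(scores, 50))     # 50th percentile (the median) -> 12.0
print(np.percentile(scores, 25))     # 25th percentile
```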
23. RESULTS WITH ONE VARIABLE
STEPS IN COMPUTING THE STANDARD DEVIATION (see the sketch below)
1. Compute the mean.
2. Subtract the mean from each score.
3. Square the resulting difference for each score.
4. Total up the squared differences to get the sum of
squares.
5. Divide the sum of squares by the number of cases to get
the variance.
6. Take the square root of the variance, which is the
standard deviation.
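The six steps translate directly into code; a sketch in plain Python, using the eight years-of-schooling scores from the worked example on the next slide:

```python
# The six steps for the standard deviation, written out literally.
scores = [15, 12, 12, 10, 16, 18, 8, 9]

mean = sum(scores) / len(scores)          # step 1: compute the mean -> 12.5
deviations = [x - mean for x in scores]   # step 2: subtract the mean from each score
squared = [d ** 2 for d in deviations]    # step 3: square each difference
sum_of_squares = sum(squared)             # step 4: total the squared differences -> 88.0
variance = sum_of_squares / len(scores)   # step 5: divide by the number of cases -> 11.0
std_dev = variance ** 0.5                 # step 6: take the square root -> 3.3166...

print(mean, sum_of_squares, variance, round(std_dev, 3))
```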
24. RESULTS WITH ONE VARIABLE
The standard deviation and the mean are used to create
z-scores.
Z-scores let a researcher compare two or more
distributions or groups.
A z-score, also called a standardized score, expresses
points or scores on a frequency distribution in terms of the
number of standard deviations from the mean. Scores
are expressed in terms of their relative position within a
distribution, not as absolute values.
25. EXAMPLE OF COMPUTING THE STANDARD DEVIATION
[8 RESPONDENTS, VARIABLE = YEARS OF SCHOOLING]

Score   Score - Mean   (Score - Mean)²
15        2.5            6.25
12       -0.5            0.25
12       -0.5            0.25
10       -2.5            6.25
16        3.5           12.25
18        5.5           30.25
 8       -4.5           20.25
 9       -3.5           12.25

Mean = 12.5            Sum of squares = 88
Variance = sum of squares / no. of cases = 88 / 8 = 11
Standard deviation = square root of the variance = √11 = 3.317 years
26. SYMBOLS AND FORMULA
X = score of a case
∑ = sigma, the symbol for summation
N = number of cases
X̄ (X-bar) = the mean

Standard deviation = √( ∑(X - X̄)² / N )
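A brief sketch applying this formula and then converting each score to a z-score, using the eight years-of-schooling scores from the worked example (plain Python, no libraries):

```python
# The formula above, plus z-scores (mean 12.5, SD ~3.317).
scores = [15, 12, 12, 10, 16, 18, 8, 9]
mean = sum(scores) / len(scores)
sd = (sum((x - mean) ** 2 for x in scores) / len(scores)) ** 0.5

# z = (score - mean) / standard deviation
z_scores = [round((x - mean) / sd, 2) for x in scores]
print(z_scores)   # e.g. a score of 16 is about +1.06 SDs above the mean
```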
27. RESULTS WITH TWO VARIABLES (BIVARIATE)
The Scattergram
Cross-tabulation/ Percentaged Table
Measures of Association
28. RESULTS WITH TWO VARIABLES
Bivariate statistics let a researcher consider two
variables together and describe the relationship between
them.
Statistical relationships are based on two ideas:
Covariation: things that are associated
Example: life expectancy and income
Independence: no association or relationship between two
variables
Example: number of siblings and life expectancy
29. RESULTS WITH TWO VARIABLES
SCATTERGRAM: a graph on which a researcher plots
each case or observation, where each axis represents
the value of one variable.
used for interval- or ratio-level data; rarely for ordinal;
never for nominal
independent variable = horizontal axis (x)
dependent variable = vertical axis (y)
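A minimal sketch of a scattergram with matplotlib, reusing the income and life-expectancy example from the covariation slide; the numbers are invented:

```python
# Scattergram: independent variable on x, dependent variable on y.
import matplotlib.pyplot as plt

income = [20, 35, 50, 65, 80]            # independent variable (x axis), in thousands
life_expectancy = [68, 72, 75, 77, 80]   # dependent variable (y axis), in years

plt.scatter(income, life_expectancy)
plt.xlabel("Income (thousands)")
plt.ylabel("Life expectancy (years)")
plt.title("Scattergram")
plt.show()
```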
30. RESULTS WITH TWO VARIABLES
Three aspects of a bivariate relationship in a scattergram:
Form:
Independence: a random scatter with no pattern, or a straight line
parallel to one of the axes
Linear: a straight line that can be visualized in the middle of a maze of
cases running from one corner to another
Curvilinear: the centre of the maze of cases forms a U curve (right
side up or upside down) or an S curve
Direction:
positive or negative
Precision:
the amount of spread in the points on the graph
high precision occurs when the points hug the line that summarizes the
relationship
low precision occurs when the points are widely spread around the line
31. RESULTS WITH TWO VARIABLES
PERCENTAGED TABLES present the same
information as a scattergram in a more condensed form.
Cross-tabulation: cases are organized in the table on
the basis of two variables at the same time.
Bivariate tables usually contain percentages.
32. RESULTS WITH TWO VARIABLES
CONSTRUCTING PERCENTAGED TABLES (see the sketch below)
1. Start with the raw data.
2. Create a compound frequency distribution (CFD):
figure out all possible combinations of variable categories
make a mark next to the combination category into which each
case falls
add up the marks for the number of cases in each combination
category
3. Set up the parts of the table (labeling rows and columns).
4. Place each number from the CFD in the cell of the
table that corresponds to the combination of variable
categories.
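A sketch of steps 2-4 with pandas: crosstab builds the compound frequency distribution and drops each count into the matching cell; the raw data are made up:

```python
# From raw data to a cross-tabulation of counts.
import pandas as pd

raw = pd.DataFrame({
    "age_group": ["Under 30", "Under 30", "30-45", "46-60", "46-60", "61+"],
    "attitude":  ["Agree", "Disagree", "Agree", "No opinion", "Disagree", "Disagree"],
})

# margins=True adds the row and column totals (the marginals).
table = pd.crosstab(raw["attitude"], raw["age_group"], margins=True)
print(table)
```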
33. THE PARTS OF A TABLE
Title
Label row and column variable and give a name to each
of the variable categories.
Marginals: the totals of the columns and rows
Body of the table
Cell of a table
If there is missing information (cases in which a
respondent refused to answer, ended interview, said “I
don’t know”, etc.), report the number of missing cases
near the table to account for all original cases
For percentaged tables, provide the number of cases or
N on which percentages are computed in parentheses
near the total of 100%. This makes it possible to go back
and forth from a percentaged table to a raw count table
and vice versa.
34. RAW COUNT TABLE
AGE GROUP BY ATTITUDE ABOUT CHANGING THE DRINKING AGE

Attitude     Under 30   30-45   46-60   61 and older   Total
Agree        20         10      4       3              37
No opinion   3          10      10      2              25
Disagree     3          5       21      10             39
Total        26         25      35      15             101
Missing cases = 8
35. COLUMN PERCENTAGED TABLE
AGE GROUP BY ATTITUDE ABOUT CHANGING THE DRINKING AGE

Attitude     Under 30   30-45   46-60   61 and older   Total
Agree        76.9%      40%     11.4%   20%            36.6%
No opinion   11.5%      40%     28.6%   13.3%          24.8%
Disagree     11.5%      20%     60%     66.7%          38.6%
Total        99.9%      100%    100%    100%           100%
(N)          (26)       (25)    (35)    (15)           (101)
Missing cases = 8
36. ROW PERCENTAGED TABLE
AGE GROUP BY ATTITUDE ABOUT CHANGING THE DRINKING AGE

Attitude     Under 30   30-45   46-60   61 and older   Total (N)
Agree        54.1%      27%     10.8%   8.1%           100 (37)
No opinion   12%        40%     40%     8%             100 (25)
Disagree     7.7%       12.8%   53.8%   25.6%          99.9 (39)
Total        25.7%      24.8%   34.7%   14.9%          100.1 (101)
Missing cases = 8
37. RESULTS WITH TWO VARIABLES
Three ways to percentage a table (see the sketch below):
By row
By column
For the total
The first two are most often used; percentaging for the total is rare.
Which is best?
Either can be appropriate.
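A sketch of all three with pandas' normalize option (same made-up raw data as the sketch above):

```python
# Percentaging a cross-tabulation by column, by row, or for the total.
import pandas as pd

raw = pd.DataFrame({
    "age_group": ["Under 30", "Under 30", "30-45", "46-60", "46-60", "61+"],
    "attitude":  ["Agree", "Disagree", "Agree", "No opinion", "Disagree", "Disagree"],
})

by_column = pd.crosstab(raw["attitude"], raw["age_group"], normalize="columns")
by_row    = pd.crosstab(raw["attitude"], raw["age_group"], normalize="index")
by_total  = pd.crosstab(raw["attitude"], raw["age_group"], normalize="all")

print((by_column * 100).round(1))   # column percentages, as in slide 35
```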
38. RESULTS WITH TWO VARIABLES
MEASURES OF ASSOCIATION: a single number that
expresses the strength, and often the direction, of a
relationship; it condenses information about a bivariate
relationship into a single number.
Many measures are named with letters of the Greek
alphabet (lambda, gamma, tau, chi-squared, and rho).
The emphasis is on interpreting the measures, not on
their calculation.
39. SUMMARY OF MEASURES OF ASSOCIATION

Measure           Greek symbol   Type of data       High association   Independence
Lambda            λ              Nominal            1.0                0
Gamma             γ              Ordinal            +1.0, -1.0         0
Tau (Kendall's)   τ              Ordinal            +1.0, -1.0         0
Rho               ρ              Interval, ratio    +1.0, -1.0         0
Chi-squared       χ²             Nominal, ordinal   Infinity           0
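A sketch computing two of the listed measures with scipy; the data are invented, and the point is the interpretation (values near ±1.0 mean strong association, 0 means independence):

```python
# Two measures of association from the summary table.
from scipy import stats

# Kendall's tau for two ordinal variables:
tau, p = stats.kendalltau([1, 2, 3, 4, 5], [1, 3, 2, 5, 4])
print("tau =", round(tau, 2))       # -> 0.6

# Chi-squared for a nominal cross-tabulation (a table of raw counts):
observed = [[20, 10], [3, 10]]
chi2, p, dof, expected = stats.chi2_contingency(observed)
print("chi-squared =", round(chi2, 2))
```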
40. MORE THAN TWO VARIABLES (MULTIVARIATE)
Trivariate percentaged tables
Multiple regression analysis
41. MORE THAN TWO VARIABLES
STATISTICAL CONTROL: control variables
A researcher controls for alternative explanations in
multivariate analysis by introducing a third variable.
Example: a bivariate table showing that taller teens like
baseball more than shorter ones
may be spurious, since male teens are taller than
female teens, and males tend to like baseball more than
females.
To test whether the relationship is due to sex, a
researcher must control for gender.
42. MORE THAN TWO VARIABLES
TRIVARIATE PERCENTAGED TABLES
A trivariate table has a bivariate table of the independent
and dependent variable for each category of the control
variable; these bivariate tables are called partials (see the
sketch below).
The number of partials depends on the number of categories
in the control variable.
Trivariate tables have 3 limitations:
They are difficult to interpret if a control variable has more than 4
categories.
Control variables can be at any level of measurement, but
interval or ratio control variables must be grouped, and how
cases are grouped can affect the interpretation of effects.
The total number of cases is a limiting factor because the cases
are divided among the cells of the partials.
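A sketch of how partials can be produced with pandas: one bivariate crosstab per category of the control variable; the data and variable names are hypothetical:

```python
# One partial table per category of the control variable.
import pandas as pd

df = pd.DataFrame({
    "sex":      ["M", "M", "M", "F", "F", "F"],   # control variable
    "age":      ["Under 30", "30-45", "46-60", "Under 30", "30-45", "46-60"],
    "attitude": ["Agree", "Agree", "Disagree", "Agree", "No opinion", "Disagree"],
})

for category, partial in df.groupby("sex"):
    print(f"Partial table for {category}:")
    print(pd.crosstab(partial["attitude"], partial["age"]), "\n")
```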
43. MORE THAN TWO VARIABLES
P = number of cells in the partials
C = number of cells in the bivariate relationship
N = number of categories in the control variable
P = C × N
Example: a control variable with 3 categories and a
bivariate table with 12 cells give
P = 12 × 3 = 36 cells
An average of 5 cases per cell is recommended, so the
researcher will need 5 × 36 = 180 cases at minimum.
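The cell count and minimum sample size from the example, computed directly:

```python
# P = C x N, and the recommended minimum number of cases.
C = 12             # cells in the bivariate table
N = 3              # categories in the control variable
P = C * N          # -> 36 cells in the partials

min_cases = 5 * P  # about 5 cases per cell -> 180 cases at minimum
print(P, min_cases)
```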
44. COMPOUND FREQUENCY DISTRIBUTION FOR A TRIVARIATE TABLE

MALES
Age            Attitude      No. of cases
Under 30       Agree         10
Under 30       No opinion    1
Under 30       Disagree      2
30-45          Agree         5
30-45          No opinion    5
30-45          Disagree      2
46-60          Agree         2
46-60          No opinion    5
46-60          Disagree      11
61 and older   Agree         3
61 and older   No opinion    0
61 and older   Disagree      5
Subtotal                     51
Missing on either variable   4
No. of males                 55

FEMALES
Age            Attitude      No. of cases
Under 30       Agree         10
Under 30       No opinion    2
Under 30       Disagree      1
30-45          Agree         5
30-45          No opinion    5
30-45          Disagree      3
46-60          Agree         2
46-60          No opinion    5
46-60          Disagree      10
61 and older   Agree         0
61 and older   No opinion    2
61 and older   Disagree      5
Subtotal                     50
Missing on either variable   4
No. of females               54
45. PARTIAL TABLE FOR MALES
Attitude     Under 30   30-45   46-60   61 and older   Total
Agree        10         5       2       3              20
No opinion   1          5       5       0              11
Disagree     2          2       11      5              20
Total        13         12      18      8              51
Missing cases = 4

PARTIAL TABLE FOR FEMALES
Attitude     Under 30   30-45   46-60   61 and older   Total
Agree        10         5       2       0              17
No opinion   2          5       5       2              14
Disagree     1          3       10      5              19
Total        13         13      17      7              50
Missing cases = 4
46. MORE THAN TWO VARIABLES
Elaboration paradigm: a system for reading percentaged
trivariate tables.
It describes the pattern that emerges when a control
variable is introduced.
Four terms describe how the partial tables compare to
the initial bivariate table, or how the original bivariate
relationship changes after the control variable is
considered.
47. SUMMARY OF THE ELABORATION PARADIGM

Pattern name          Pattern seen when comparing partials to the original bivariate table
Replication           Same relationship in both partials as in the bivariate table
Specification         Bivariate relationship seen in only one of the partial tables
Interpretation        Bivariate relationship weakens greatly or disappears in the partial tables (control variable is intervening)
Explanation           Bivariate relationship weakens greatly or disappears in the partial tables (control variable comes before the independent variable)
Suppressor variable   No bivariate relationship; the relationship appears in the partial tables only
48. EXAMPLES OF ELABORATION PATTERNS

REPLICATION
Bivariate table           Partial (control = low)   Partial (control = high)
      Low    High               Low    High               Low    High
Low   85%    15%          Low   84%    16%          Low   86%    14%
High  15%    85%          High  16%    84%          High  14%    86%

INTERPRETATION OR EXPLANATION
Bivariate table           Partial (control = low)   Partial (control = high)
      Low    High               Low    High               Low    High
Low   85%    15%          Low   45%    55%          Low   55%    45%
High  15%    85%          High  55%    45%          High  45%    55%
49. EXAMPLES OF ELABORATION PATTERNS

SPECIFICATION
Bivariate table           Partial (control = low)   Partial (control = high)
      Low    High               Low    High               Low    High
Low   85%    15%          Low   95%    5%           Low   50%    50%
High  15%    85%          High  5%     95%          High  50%    50%

SUPPRESSOR VARIABLE
Bivariate table           Partial (control = low)   Partial (control = high)
      Low    High               Low    High               Low    High
Low   54%    46%          Low   84%    16%          Low   14%    86%
High  46%    54%          High  16%    84%          High  86%    14%
50. MORE THAN TWO VARIABLES
Replication pattern: the easiest to understand; the
partials replicate or reproduce the same relationship that
existed in the bivariate table before the control variable
was considered.
The control variable has no effect.
Specification pattern: occurs when one partial replicates
the initial bivariate relationship but the other partials do not.
The researcher can specify the category of the control
variable in which the initial relationship persists.
Example: college grades and automobiles
51. MORE THAN TWO VARIABLES
Interpretation pattern: describes the situation in which
the control variable intervenes between the original
independent and dependent variables.
Explanation pattern: looks the same as interpretation;
the difference is the temporal order of the control variable,
which comes before the independent variable in the
initial bivariate relationship.
Suppressor variable pattern: occurs when the bivariate
tables suggest independence but a relationship appears
in one or both of the partials.
52. MORE THAN TWO VARIABLES
MULTIPLE REGRESSION: a statistical technique that
is quickly computed with appropriate statistical software.
Regression results measure the direction and size of the
effect of each variable on a dependent variable. The
effect is measured precisely and given a numerical
value.
The effect on the dependent variable is measured by a
standardized coefficient, the Greek letter beta (β). In the
bivariate case, beta equals the correlation coefficient rho (ρ).
53. MORE THAN TWO VARIABLES
Researchers use the beta regression coefficient to
determine whether control variables have an effect.
Example: the bivariate correlation between X and Y is
0.75. The researcher then statistically considers four
control variables (see the sketch below). If the beta
remains at 0.75, the four control variables have no effect.
However, if the beta for X and Y gets smaller, the control
variables have some effect.
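A sketch of this logic on synthetic data: standardizing the variables first makes ordinary least-squares coefficients standardized betas, so the beta for X can be compared before and after a control variable is added. Everything here (the data-generating numbers, the variable names) is invented for illustration:

```python
# Comparing the standardized beta for X with and without a control variable.
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
control = 0.6 * x + rng.normal(size=n)            # control correlated with x
y = 0.75 * x + 0.4 * control + rng.normal(size=n)

def standardize(v):
    return (v - v.mean()) / v.std()

xs, cs, ys = standardize(x), standardize(control), standardize(y)

# Beta for X alone (equals the bivariate correlation between X and Y):
beta_alone, *_ = np.linalg.lstsq(xs.reshape(-1, 1), ys, rcond=None)

# Beta for X once the control variable is added:
beta_controlled, *_ = np.linalg.lstsq(np.column_stack([xs, cs]), ys, rcond=None)

print(round(beta_alone[0], 2), round(beta_controlled[0], 2))
# The beta for X shrinks, indicating the control variable has an effect.
```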
54. INFERENTIAL STATISTICS
use probability theory to test hypotheses formally, permit
inferences from a sample to a population, and test
whether descriptive results are likely to be due to random
factors or to a real relationship
are a more powerful type of statistics than descriptive
statistics
rely on principles from probability sampling, in which a
researcher uses a random process to select a subset of
cases from the entire population
are used when researchers conduct various statistical tests
(e.g. a t-test or an F-test; see the sketch below)
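A sketch of one such test, a two-sample t-test with scipy; the two samples are made up:

```python
# A two-sample t-test: are the group means likely to differ in the population?
from scipy import stats

group_a = [12, 15, 14, 10, 13, 16]
group_b = [9, 11, 10, 12, 8, 10]

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(round(t_stat, 2), round(p_value, 4))
# If p_value < .05, the difference is statistically significant at the .05 level.
```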
55. INFERENTIAL STATISTICS
Statistical significance: the results are not likely to be due
to chance factors.
It indicates the probability of finding a relationship in the
sample when there is none in the population.
It uses probability theory and specific statistical tests to
tell a researcher whether the results are likely to be
produced by random error in random sampling.
56. INFERENTIAL STATISTICS
Levels of significance (usually .05, .01, or .001) are a way
of talking about the likelihood that the results are due to
chance factors, that is, that a relationship appears in the
sample when there is none in the population.
If a researcher says that results are significant at the .05
level, this means:
results like these are due to chance factors only 5 in 100
times
there is a 95% chance that the sample results are not due to
chance factors alone, but reflect the population accurately
the odds of such results based on chance alone are .05, or 5%
one can be 95% confident that the results are due to a real
relationship in the population, not chance factors
57. INFERENTIAL STATISTICS
Type I error: occurs when the researcher says that a
relationship exists when in fact none exists
falsely rejecting a null hypothesis
Type II error: occurs when a researcher says that a
relationship does not exist when in fact it does
falsely accepting a null hypothesis
                           True situation in the world
What the researcher says   No relationship   Causal relationship
No relationship            No error          Type II error
Causal relationship        Type I error      No error
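A small simulation sketch tying the two ideas together: when the null hypothesis is true (two samples drawn from the same population), a test at the .05 level should commit a Type I error about 5% of the time. The sample sizes and number of trials are arbitrary choices:

```python
# Simulating the Type I error rate of a t-test at the .05 level.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
trials = 2000
rejections = 0

for _ in range(trials):
    a = rng.normal(size=30)   # two samples from the SAME population,
    b = rng.normal(size=30)   # so no real relationship exists
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:              # "relationship found" where none exists
        rejections += 1

print(rejections / trials)    # close to 0.05, the Type I error rate
```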