ANALYZING QUANTITATIVE DATAPresented by: Jean Marie Villamor
DEALING WITH DATACODING DATA systematically reorganizing raw data into a format that is machine-readable (i.e. easy to analyze using computers) Coding can be a clerical task when the data are recorded as numbers on well-organized recording sheets, but it is very difficult when, for example, a researcher wants to code answers to open-ended survey questions into numbers in a process similar to latent content analysis.
DEALING WITH DATA Researchers uses coding system and a codebook for data coding: Coding system set of rules stating that certain numbers are assigned to variable attributes. example: a researcher codes males as 1 and females as 2. Codebook a document describing the coding system and the location of data for variables in a format that computers can use.
DEALING WITH DATA Precoding placing the code categories on the questionnaire. If a researcher does not precode, his first step after collecting data is to create a codebook. He also gives each case an identification number to keep track of the cases. Next he transfers the information from each questionnaire into a format that computers can read.
ENTERING DATA Most computer programs designed for data analysis need the data in a grid format. In the grid, each row represents a respondent, subject, or case DATA RECORDS because each is a record of data for a single case Column or sets of columns represents specific variables.
CLEANING DATA Accuracy is extremely important when coding data. Errors made in coding can cause misleading results. Highly recommended to recheck all coding. Coding can be verified in two ways: Possible code cleaning (wild code checking) involves checking the categories of all variables for impossible codes Contingency cleaning (consistency checking) involves cross-classifying two variables and looking for logically impossible combinations.
COMPUTERS AND SOCIAL RESEARCH Researchers use computers to perform specialized tasks more efficiently and effectively. Example: organize data, calculate statistics, write reports
COMPUTERS AND SOCIAL RESEARCH Computer began as mechanical devices for sorting cards that had holes punched into them. Each hole were punched in specific locations and represented information about a variable. These machines organized vast amounts of information more quickly, reliably, and efficiently than the paper-and- pencil methods. IBM CARDS thin cardboard cards that had 80 columns and 12 rows, or 960 spaces for information.
COMPUTERS AND SOCIAL RESEARCH COMPUTER TERMINAL is a simple typewriter-like device connected to a mainframe computer, which people use to type data and instructions directly into the computer. MICROCOMPUTERS have replaced mainframes for many tasks, to more people, and have stimulated new uses form computers. Three basic parts: Monitor (CRT or VDT) Keyboard CPU (Central Processing Unit)
COMPUTERS AND SOCIAL RESEARCH Information can get into a microcomputer in 4 ways: Some is built into the computer memory itself User can type it on the keyboard It can come across a telephone line and into the computer through a modem It can be stored on floppy disks, or data travellers which computers can read.
HOW COMPUTERS HELP THE RESEARCHER Purpose of the computer: to write reports, to organize large amounts of data, and to compute statistical measures. Computers help researchers perform tasks faster and with greater accuracy than by hand. Without the appropriate computer, a researcher cannot analyze data from a large-scale research project or calculate complicated statistics. SPSS statistical package for the social sciences is a statistical software most social researchers use that is specifically designed for analyzing quantitative social science data.
STATISTICS a branch of applied mathematics used to manipulate and summarize the features of numbers. Descriptive statistics describe numerical data. They can be categorized by the number of variables involved: Univariate Bivariate Multivariate
RESULTS WITH ONE VARIABLE (UNIVARIATE) Frequency Distributions Measures of Central Tendency Measures of Variation
RESULTS WITH ONE VARIABLE Univariate describes one variable. The easiest way to describe the numerical data of one variable is through frequency distribution.FREQUENCY DISTRIBUTION can be used with nominal-, ordinal-, interval-, or ratio-level data and takes many forms.Example: A data having 400 respondents can be summarized on the gender of the respondents with a raw count or a percentage frequency distribution.
GENDER COUNT OF RESPONDENTSRAW COUNT FREQUENCY PERCENTAGE FREQUENCYDISTRIBUTIONDistributionGender Frequency Gender FrequencyMale 100 Male 25%Female 300 Female 75%Total 400 Total 100%BAR CHART Females Column2 Column1 Males frequency distribution 0 20 40 60 80
RESULTS WITH ONE VARIABLE Common types of graphic representations: Histogram Bar chart Pie chart For interval- or ratio-level data researcher groups the information into categories; grouped categories should be mutually exclusive Interval- or ratio-level data are often plotted in a frequency polygon.
RESULTS WITH ONE VARIABLE MEASURES OF CENTRAL TENDENCY Mean arithmetic average most widely used measure of central tendency can be used only with interval- or ratio-level data. Median middle point 50th percentile, or the point at which half the cases are above it and half below it can be used with ordinal-, interval-, or ratio-level data (not nominal-level) Mode easiest to use and can be used with nominal, ordinal , interval, or ratio data most common or frequently occurring number
RESULTS WITH ONE VARIABLE If the frequency distribution forms a NORMAL or BELL- SHAPED CURVE, the three measures of central tendency equal each other number of cases mean , median, mode Lowest values of variables highest
RESULTS WITH ONE VARIABLE SKEWED DISTRIBUTION measures of central tendency are not equal mode median mean mode median mean
RESULTS WITH ONE VARIABLE If most cases have lower scores with a few extreme high scores, the mean will be the highest, the median in the middle, and the mode the lowest. If most cases have higher scores with a few extreme low scores, the mean will be the lowest, the median in the middle, and the mode the highest. In general, the median is best for skewed distributions, although the mean is used in most other statistics.
RESULTS WITH ONE VARIABLE MEASURES OF VARIATION one- number summary of a distribution Three basic ways: Range Percentile Standard deviation
RESULTS WITH ONE VARIABLE RANGE simplest largest and smallest score for ordinal-, interval-, and ratio-level PERCENTILES tell the score at a specific place within the distribution for ordinal-, interval-, and ratio-level STANDARD DEVIATION most difficult to compute measure of dispersion; yet is also the most comprehensive and widely used. interval or ratio level based on the mean and gives an “average distance” between all scores and the mean
RESULTS WITH ONE VARIABLESTEPS IN COMPUTING THE STANDARD DEVIATION1. Compute the mean.2. Subtract the mean from each score.3. Square the resulting difference for each score.4. Total up the squared differences to get the sum of squares.5. Divide the sum of squares by the number of cases to get the variance.6. Take the square root of the variance, which is the standard deviation.
RESULTS WITH ONE VARIABLE Standard deviation and the mean are used to create z- scores. Z-scores let a researcher compare two or more distribution or groups. also called a standardized score, expresses points or scores on a frequency distribution in terms of a number of standard deviations from the mean. Scores are in terms of their relative position within a distribution, not as absolute values.
EXAMPLE OF COMPUTING THE STANDARD DEVIATION[8 RESPONDENTS, VARIABLE = YEARS OF SCHOOLING] score score- mean squared (score-mean) 15 2.5 6.25 12 -0.5 .25 12 -0.5 .25 10 -2.5 6.25 16 3.5 12.25 18 5.5 30.25 8 -4.5 20.25 9 -3.5 12.25 Mean = 12.5 sum of squares = 88Variance = sum of squares = 88 = 11 no. of cases 8Standard deviation = square root of variance = √11 = 3.317 years
Symbols: Formula:X = score of case Standard deviation √ ∑(x-x )2∑ = sigma for sum NN = number of casesX = mean
RESULTS WITH TWO VARIABLES (BIVARIATE) The Scattergram Cross-tabulation/ Percentaged Table Measures of Association
RESULTS WITH TWO VARIABLES Bavariate statistics let a researcher consider two variables together and describe the relationship between variables Statistical relationships are based on two ideas: Covariation things that are associated Example: life expectancy and income Independence no association or relationship between two variables Example: number of siblings and life expectancy
RESULTS WITH TWO VARIABLES SCATTERGRAM a graph on which a researcher plots each case or observation, where each axis represents the value of one variable. used for interval- or ratio-level data; rarely for ordinal; never for nominal independent variable = horizontal axis (x) dependent variable = vertical axis (y)
RESULTS WITH TWO VARIABLES Three aspects of a bivariate relationship in a scattergram: Form Independence random scatter with no pattern, or a straight line that is parallel to one of the axis Linear straight line that can be visualized in the middle of a maze of cases running from one corner to another Curvilinear centre of a maze of cases could form a U curve, right side up or upside down, or an S curve Direction Positive and negative Precision Amount of spread in the points on the graph High level occurs when the points hug the line that summarizes the relationships Low level occurs when the points are widely spread around the line
RESULTS WITH TWO VARIABLES PERCENTAGED TABLES presents the same information as a scattergram in a more condensed form Cross-tabulation cases are organized in the table on the basis of two variables at the same time. Bavariate tables usually contain percentages.
RESULTS WITH TWO VARIABLESCONSTRUCTING PERCENTAGE TABLES1. Raw data2. Compound frequency distribution (CFD) Figure all possible combinations of variable categories Make a mark next to the combination category into which each case falls Add up the marks for the number of cases in a combination category3. Set up the parts of a table (labeling rows and columns)4. Each number from the CFD is placed in a cell in the table that corresponds to the combination of variable categories.
THE PARTS OF A TABLE Title Label row and column variable and give a name to each of the variable categories. Marginals totals of the columns and rows Body of the table Cell of a table If there is missing information (cases in which a respondent refused to answer, ended interview, said “I don’t know”, etc.), report the number of missing cases near the table to account for all original cases For percentaged tables, provide the number of cases or N on which percentages are computed in parentheses near the total of 100%. This makes it possible to go back and forth from a percentaged table to a raw count table and vice versa.
RAW COUNT TABLEAGE GROUP BY ATTITUDE ABOUT CHANGING THE DRINKING AGE Age group Attitude Under 30 30-45 46-60 61 and total older Agree 20 10 4 3 37No opinion 3 (4) 10 10 2 25 Disagree 3 5 21 10 39 _____ _____ _____ _____ _____ Total 26 25 35 15 101 Missing =8cases (6)
COLUMN PERCENTAGED TABLEAGE GROUP BY ATTITUDE ABOUT CHANGING THE DRINKING AGE Age group Attitude Under 30 30-45 46-60 61 and Total older Agree 76.9 % 40% 11.14% 20% 36.6%No opinion 11.5 % 40 % 28.6 % 13.3 % 24.8% Disagree 11.5% 20% 60% 66.7% 38.6% _____ ______ ______ ______ _____ Total 99.9% 100% 100% 100% 100% (N) (26) (25) (35) (15) (101) Missing =8 cases
ROW PERCENTAGED TABLEAGE GROUP BY ATTITUDE ABOUT CHANGING THE DRINKING AGE Age groupAttitude Under 30 30-45 46-60 61 and Total (N) older Agree 54.1% 27% 10.8% 8.1% 100 (37) No 12% 40% 40% 8% 100 (25)opinionDisagree 7.7% 12.8% 53.8% 25.6% 99.9 (39) _____ ______ _____ _____ _____ Total 25.7% 24.8% 34.7% 14.9% 100.1 (101)Missing =8cases
RESULTS WITH TWO VARIABLES Three ways to percentage a table: By row By column For the total First two are most often used; last rareWhich is best? Either can be appropriate.
RESULTS WITH TWO VARIABLES MEASURES OF ASSOCIATION single number that expresses the strength, and often the direction, of a relationship; condenses information about a bivariate relationship into a single number Many measures are called by letters of the greek alphabet (lambda, gamma, tau, chi (squared), and rho). The emphasis is on interpreting the measures, not on their calculation.
SUMMARY OF MEASURES OF ASSOCIATION Measure Greek Type of data High Independence symbol association Lambda λ Nominal 1.0 0 Gamma γ Ordinal +1.0, -1.0 0 Tau τ Ordinal +1.0, -1.0 0(Kendall’s) Rho ρ Interval, ratio +1.0, -1.0 0Chi-squared χ2 Nominal, Infinity 0 ordinal
MORE THAN TWO VARIABLES (MULTIVARIATE) Trivariate percentaged tables Multiple regression analysis
MORE THAN TWO VARIABLES STATISTICAL CONTROL control variables Researcher controls for alternative explanations in multivariate analysis by introducing a third variable. Example: Bavariate table showing that taller teens like baseball more than shorter ones May be spurious since male teens are taller than females, and males tend to like baseball more than females. To test whether the relationship is due to sex, a researcher must control for gender
MORE THAN TWO VARIABLES TRIAVARIATE PERCENTAGED TABLES Trivariate tables has a bivariate table of the independent and dependent variable for each category of the control variable. partials Number of partials depends on the number of categories in the control variable. Trivariate tables have 3 limitations: Difficult to interpret if a control variable has more than 4 categories. Control variables can be at any level of measurement, but interval or ratio control variables must be grouped, and how cases are grouped can affect the interpretations of effects. Total number of cases is a limiting factor because the cases are divided among cells in partial.
MORE THAN TWO VARIABLESP Number of cells in the partials C number of cells in the bivariate relationship N number of categories in the control variable P=CxNExample: Control variable having 3 categories, and a bivariate table having 12 cells. P= 12 x 3 = 36 cellsAn average of 5 cases per cell is recommended, so the researcher will need 5 x 6 = 180 cases at minimum
COMPOUND FREQUENCY DISTRIBUTION FOR TRIVARIATE TABLEMales FemalesAge Attitude No. of cases Age Attitude No. of casesUnder 30 Agree 10 Under 30 Agree 10Under 30 No option 1 Under 30 No option 2Under 30 Disagree 2 Under 30 Disagree 130-45 Agree 5 30-45 Agree 530-45 No option 5 30-45 No option 530-45 Disagree 2 30-45 Disagree 346-60 Agree 2 46-60 Agree 246-60 No option 5 46-60 No option 546-60 Disagree 11 46-60 Disagree 1061 and older Agree 3 61 and older Agree 061 and older No option 0 61 and older No option 261 and older Disagree 5____ 61 and older Disagree 5______ subtotal 51 subtotal 50Missing on either variable 4____ Missing on either variable 4______No. of males 55 No. of females 54
PARTIAL TABLE FOR MALESAttitude Under 30 30-45 46-60 61 and older Total Agree 10 5 2 3 20No option 1 5 5 0 11Disagree 2 2 11 5 20 Total 13 12 18 8 51Missing =4cases PARTIAL TABLE FOR FEMALESAttitude Under 30 30-45 46-60 61 and older Total Agree 10 5 2 0 17No option 2 5 5 2 14Disagree 1 3 10 5 19 Total 13 13 17 7Missing =4cases
MORE THAN TWO VARIABLES Elaboration paradigm system for reading percentaged trivariate tables describes the pattern that emerges when a control variable is introduced Four terms describe how the partial tables compare to the initial bivariate table, or how the original bivariate relationship changes after the control variable is considered.
SUMMARY OF THE ELABORATION PARADIGMPattern name Pattern seen when comparing partials to the original bivariate tableReplication Same relationship in both partials as in bivariate tableSpecification Bivariate relationship seen in one of the partial tablesInterpretation Bivariate relationship weakens greatly or disappears in the partial tables (control variable intervening)Explanation Bivariate relationship weakens greatly or disappears in the partial tables (control variable before independent)Suppressor variable No bivariate relationship, relationship appears in partial tables only
EXAMPLES OF ELABORATION PATTERNS REPLICATION Bivariate table Partials Control = low Control = high Low High Low High Low HighLow 85% 15% Low 84% 16% 86% 14%High 15% 85% High 16$ 84% 14% 86% INTERPRETATION OR EXPLANATION Bivariate table Partials Control = low Control = high Low High Low High Low HighLow 85% 15% Low 45% 55% 55% 45%High 15% 85% High 55% 45% 45% 55%
EXAMPLES OF ELABORATION PATTERNS SPECIFICATION Bivariate table Partials Control = low Control = high Low High Low High Low High Low 85% 15% Low 95% 5% 50% 50%High 15% 85% High 5% 95% 50% 50% SUPPRESSOR VARIABLE Bivariate table Partials Control = low Control = high Low High Low High Low High Low 54% 46% Low 84% 16% 14% 86%High 46% 54% High 16% 84% 86% 14%
MORE THAN TWO VARIABLES Replication pattern easiest to understand; when the partials replicate or reproduce the same relationship that existed in the bivariate table before considering the control variable control variable has no effect Specification pattern occurs when one partial replicates the initial bivariate relationship, but other partials do not. researcher can specify the category of the control variable in which the initial relationship persistsExample: college grades and automobiles
MORE THAN TWO VARIABLES Interpretation pattern describes the situation in which the control variable intervenes between the original independent and dependent variable. Explanation pattern looks the same as interpretation; difference is the temporal order of the control variable control variable comes before the independent variable in the initial bivariate relationshipSuppressor variable pattern occurs when the bivariate tables suggest independence but a relationship appears in one or both of the partials
MORE THAN TWO VARIABLES MULTIPLE REGRESSION statistical technique which is quickly computed by an appropriate statistics software Regression results measure the direction and size of the effect of each variable on a dependent variable. The effect is measured precisely and given a numerical value. The effect on the dependent variable is measured by a standardized coefficient or the Greek letter beta (β). It is equal to the ρ correlation coefficient.
MORE THAN TWO VARIABLES Researchers use the beta regression coefficient to determine whether control variables have an effect.Example: the bivariate correlation between X and Y is 0.75. Researcher statistically considers four control variables. If the beta remains at 0.75, then the four control variables have no effect. However, if the beta for X and Y gets smaller, it indicates that the control variables have an effect.
INFERENTIAL STATISTICS use probability theory to test hypotheses formally, permit inferences from a sample to a population, and test whether descriptive results are likely to be due to random factors or to a real relationship. are a more powerful type of statistics than descriptive statistics rely on principles from probability sampling, where a researcher uses a random process to select a subset of cases from the entire population. used when researchers conduct various statistical test (e.g. t-test or an F-test)
INFERENTIAL STATISTICS Statistical significance results are not likely to be due to chance factors indicates the probability of finding a relationship in the sample where there is none in the population uses probability theory and specific statistical tests to tell a researcher whether the results are likely to be produced by random error in random sampling
INFERENTIAL STATISTICS Levels of significance (usually .05, .01, or .001) is a way of talking about the likelihood that the results are due to chance factors –that is, a relationship appears in the sample when there is none in the population. If a researcher says that results are significant at the .05 level, this means: Results like these are due to a chance factors only 5 in 100 times There is a 95% chance that the sample results are not due to chance factors alone, but reflect the population accurately The odds of such results based on chance alone are .05 or 5% One can be 95% confident that the results are due to a real relationship in the population, not chance factors
INFERENTIAL STATISTICS Type I error occurs when the researcher says that a relationship exists when in fact none exists falsely rejecting a null hypothesis Type II error occurs when a researcher says that a relationship does not exist when in fact it does. falsely accepting a null hypothesis True situation in the worldWhat the researcher says No relationship Causal relationshipNo relationship No error Type II errorCausal relationship Type I error No error