Unit 5.pptx

  1. 1. UNIT 5
  2. 2. Topics to be covered • Unit-V: Data Analysis: Editing, Coding, Tabular representation of data, Graphical Representation of Data. • Questionnaire Construction, Content Analysis, Validity and Reliability Test. • Descriptive Statistics and Probability: Measures of Central Tendency, Dispersion, Skewness & Kurtosis, Probability and Laws, Random Variable, Expectation. • Probability Distribution and Sampling: Discrete, Binomial, Poisson, Continuous, Normal Sampling Distribution, Statistical Estimation. • Multivariate Data Analysis: Factor Analysis, Cluster Analysis, Discriminant Analysis, Multidimensional Scaling, Conjoint Analysis.
  3. 3. Fieldwork/Data Collection Process: Selecting Field Workers → Training Field Workers → Supervising Field Workers → Validating Fieldwork → Evaluating Field Workers
  4. 4. Selection of Field Workers The researcher should: • Develop job specifications for the project, taking into account the mode of data collection. • Decide what characteristics the field workers should have. • Recruit appropriate individuals.
  5. 5. General Qualifications of Field Workers • Healthy. Field workers must have the stamina required to do the job. • Outgoing. The interviewers should be able to establish rapport with the respondents. • Communicative. Effective speaking and listening skills are a great asset. • Pleasant appearance. If the field worker's physical appearance is unpleasant or unusual, the data collected may be biased. • Educated. Interviewers must have good reading and writing skills. • Experienced. Experienced interviewers are likely to do a better job.
  6. 6. Training of Field Workers • Making the Initial Contact – Interviewers should be trained to make opening remarks that will convince potential respondents that their participation is important. • Asking the Questions 1. Be thoroughly familiar with the questionnaire. 2. Ask the questions in the order in which they appear in the questionnaire. 3. Use the exact wording given in the questionnaire. 4. Read each question slowly. 5. Repeat questions that are not understood. 6. Ask every applicable question. 7. Follow instructions, skip patterns, probe carefully.
  7. 7. Training of Field Workers • Probing – Some commonly used probing techniques: 1. Repeating the question. 2. Repeating the respondent's reply. 3. Using a pause or silent probe. 4. Boosting or reassuring the respondent. 5. Eliciting clarification. 6. Using objective/neutral questions or comments.
  8. 8. Training of Field Workers • Recording the Answers – Guidelines for recording answers to unstructured questions: 1. Record responses during the interview. 2. Use the respondent's own words. 3. Do not summarize or paraphrase the respondent's answers. 4. Include everything that pertains to the question objectives. 5. Include all probes and comments. 6. Repeat the response as it is written down. • Terminating the Interview – The respondent should be left with a positive feeling about the interview.
  9. 9. Guidelines on Interviewer Training: The Council of American Survey Research Organizations Training should be conducted under the direction of supervisory personnel and should cover the following: 1) The research process: how a study is developed, implemented & reported. 2) Importance of interviewers; need for honesty, objectivity & professionalism. 3) Confidentiality of the respondent & client. 4) Familiarity with market research terminology. 5) Importance of following the exact wording & recording responses verbatim. 6) Purpose & use of probing & clarifying techniques. 7) The reason for & use of classification & respondent information questions. 8) A review of samples of instructions & questionnaires. 9) Importance of the respondent’s positive feelings about survey research. An interviewer must be trained in the interviewing techniques outlined above.
  10. 10. Guidelines on Supervision: The Council of American Survey Research Organizations All research projects should be properly supervised. It is the data collection agency’s responsibility to: 1) Properly supervise interviews. 2) See that an agreed-upon proportion of interviewers’ telephone calls are monitored. 3) Be available to report on the status of the project daily to the project director, unless otherwise instructed. 4) Keep all studies, materials, and findings confidential. 5) Notify concerned parties if the anticipated schedule is not met. 6) Attend all interviewer briefings. 7) Keep current & accurate records of the interviewing progress. 8) Make sure all interviewers have all materials in time. 9) Edit each questionnaire. 10) Provide consistent & positive feedback to the interviewers. 11) Not falsify any work.
  11. 11. Supervision of Field Workers Supervision of field workers means making sure that they are following the procedures and techniques in which they were trained. Supervision involves quality control and editing, sampling control, control of cheating, and central office control. • Quality Control and Editing – This requires checking to see if the field procedures are being properly implemented. • Sampling Control – The supervisor attempts to ensure that the interviewers are strictly following the sampling plan. • Control of Cheating – Cheating can be minimized through proper training, supervision, and validation. • Central Office Control – Supervisors provide quality and cost-control information to the central office.
  12. 12. Validation of Fieldwork Validation: • The supervisors call 10 - 25% of the respondents to inquire whether the field workers actually conducted the interviews. • The supervisors ask about the length and quality of the interview, reaction to the interviewer, and basic demographic data. • The demographic information is cross-checked against the information reported by the interviewers on the questionnaires.
  13. 13. Evaluation of Field Workers • Cost and Time. The interviewers can be compared in terms of the total cost (salary and expenses) per completed interview. • Response Rates. It is important to monitor response rates on a timely basis so that corrective action can be taken if these rates are too low. • Quality of Interviewing. To evaluate interviewers on the quality of interviewing, the supervisor must directly observe the interviewing process. • Quality of Data. The completed questionnaires of each interviewer should be evaluated for the quality of data.
  14. 14. Data Preparation Process: Prepare Preliminary Plan of Data Analysis → Check Questionnaire → Edit → Code → Transcribe → Clean Data → Statistically Adjust the Data → Select Data Analysis Strategy
  15. 15. Questionnaire Checking A questionnaire returned from the field may be unacceptable for several reasons. • Parts of the questionnaire may be incomplete. • The pattern of responses may indicate that the respondent did not understand or follow the instructions. • The responses show little variance. • One or more pages are missing. • The questionnaire is received after the preestablished cutoff date. • The questionnaire is answered by someone who does not qualify for participation.
  16. 16. Editing Treatment of Unsatisfactory Results • Returning to the Field – The questionnaires with unsatisfactory responses may be returned to the field, where the interviewers recontact the respondents. • Assigning Missing Values – If returning the questionnaires to the field is not feasible, the editor may assign missing values to unsatisfactory responses. • Discarding Unsatisfactory Respondents – In this approach, the respondents with unsatisfactory responses are simply discarded.
  17. 17. Coding Coding means assigning a code, usually a number, to each possible response to each question. The code includes an indication of the column position (field) and data record it will occupy. Coding Questions • Fixed field codes, which mean that the number of records for each respondent is the same and the same data appear in the same column(s) for all respondents, are highly desirable. • If possible, standard codes should be used for missing data. Coding of structured questions is relatively simple, since the response options are predetermined. • In questions that permit a large number of responses, each possible response option should be assigned a separate column.
  18. 18. Coding Guidelines for Coding Unstructured Questions: • Category codes should be mutually exclusive and collectively exhaustive. • Only a few (10% or less) of the responses should fall into the “other” category. • Category codes should be assigned for critical issues even if no one has mentioned them. • Data should be coded to retain as much detail as possible.
  19. 19. Codebook A codebook contains coding instructions and the necessary information about variables in the data set. A codebook generally contains the following information: • column number • record number • variable number • variable name • question number • instructions for coding
  20. 20. Coding Questionnaires • The respondent code and the record number appear on each record in the data. • The first record contains the additional codes: project code, interviewer code, date and time codes, and validation code. • It is a good practice to insert blanks between parts.
  21. 21. Restaurant Preference
      ID   PREFER.   QUALITY   QUANTITY   VALUE   SERVICE   INCOME
       1      2         2         3         1        3        6
       2      6         5         6         5        7        2
       3      4         4         3         4        5        3
       4      1         2         1         1        2        5
       5      7         6         6         5        4        1
       6      5         4         4         5        4        3
       7      2         2         3         2        3        5
       8      3         3         4         2        3        4
       9      7         6         7         6        5        2
      10      2         3         2         2        2        5
      11      2         3         2         1        3        6
      12      6         6         6         6        7        2
      13      4         4         3         3        4        3
      14      1         1         3         1        2        4
      15      7         7         5         5        4        2
      16      5         5         4         5        5        3
      17      2         3         1         2        3        4
      18      4         4         3         3        3        3
      19      7         5         5         7        5        5
      20      3         2         2         3        3        3
  22. 22. SPSS Variable View of the Data
  23. 23. Codebook Excerpt
      Column Number   Variable Number   Variable Name   Question Number   Coding Instructions
      1               1                 ID                                1 to 20 as coded
      2               2                 Preference      1                 Input the number circled. 1 = Weak Preference, 7 = Strong Preference
      3               3                 Quality         2                 Input the number circled. 1 = Poor, 7 = Excellent
      4               4                 Quantity        3                 Input the number circled. 1 = Poor, 7 = Excellent
      5               5                 Value           4                 Input the number circled. 1 = Poor, 7 = Excellent
      6               6                 Service         5                 Input the number circled. 1 = Poor, 7 = Excellent
  24. 24. Codebook Excerpt (Cont.)
      Column Number   Variable Number   Variable Name   Question Number   Coding Instructions
      7               7                 Income          6                 Input the number circled. 1 = Less than $20,000, 2 = $20,000 to 34,999, 3 = $35,000 to 49,999, 4 = $50,000 to 74,999, 5 = $75,000 to 99,999, 6 = $100,000 or more
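The codebook can be expressed directly as data. The following is a minimal sketch, not the deck's own procedure: the names `FIELDS`, `INCOME_LABELS`, and `parse_record` are illustrative assumptions, records are assumed to be whitespace-separated in codebook column order, and the top income label is assumed to read "$100,000 or more".

```python
# Field order follows the codebook excerpt (columns 1 to 7).
FIELDS = ["ID", "PREFERENCE", "QUALITY", "QUANTITY", "VALUE", "SERVICE", "INCOME"]

# Income codes from the codebook; the last label is assumed to be "$100,000 or more".
INCOME_LABELS = {
    1: "Less than $20,000",
    2: "$20,000 to 34,999",
    3: "$35,000 to 49,999",
    4: "$50,000 to 74,999",
    5: "$75,000 to 99,999",
    6: "$100,000 or more",
}


def parse_record(line):
    """Split one whitespace-separated record into a dict of integer codes."""
    values = [int(token) for token in line.split()]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


# Respondent 1 from the restaurant preference table.
record = parse_record("1 2 2 3 1 3 6")
income_label = INCOME_LABELS[record["INCOME"]]
print(record["ID"], income_label)
```

Keeping the field order and value labels in one place means the same definitions can drive both data entry and later decoding.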
  25. 25. Example of Questionnaire Coding
      PART D (Record #7). Column positions are shown in parentheses.
      Finally, in this part of the questionnaire we would like to ask you some background information for classification purposes.
      1. This questionnaire was answered by (29): 1. _____ Primarily the male head of household 2. _____ Primarily the female head of household 3. _____ Jointly by the male and female heads of household
      2. Marital Status (30): 1. _____ Married 2. _____ Never Married 3. _____ Divorced/Separated/Widowed
      3. What is the total number of family members living at home? _____ (31-32)
      4. Number of children living at home: a. Under six years _____ (33) b. Over six years _____ (34)
      5. Number of children not living at home _____ (35)
      6. Number of years of formal education which you (and your spouse, if applicable) have completed (please circle). Scale headings: High School; College (Undergraduate, Graduate).
         a. You: 8 or less 9 10 11 12 13 14 15 16 17 18 19 20 21 22 or more (36-37)
         b. Spouse: 8 or less 9 10 11 12 13 14 15 16 17 18 19 20 21 22 or more (38-39)
      7. a. Your age: _____ (40-41) b. Age of spouse (if applicable): _____ (42-43)
      8. If employed, please indicate your household's occupations by checking the appropriate category. Male Head (44), Female Head (45): 1. Professional and technical 2. Managers and administrators 3. Sales workers 4. Clerical and kindred workers 5. Craftsman/operative/laborers 6. Homemakers 7. Others (please specify) 8. Not applicable
      9. Is your place of residence presently owned by household? (46): 1. Owned _____ 2. Rented _____
      10. How many years have you been residing in the greater Atlanta area? _____ years (47-48)
  26. 26. Data Transcription [Figure: raw data are transcribed via CATI/CAPI, keypunching via CRT terminal, digital technology, optical recognition, or bar code and other technologies; keypunched data pass through verification to correct keypunching errors; the transcribed data are held in computer memory, on disks, or in other storage.]
  27. 27. Data Cleaning Consistency Checks Consistency checks identify data that are out of range, logically inconsistent, or have extreme values. • Computer packages like SPSS, SAS, EXCEL and MINITAB can be programmed to identify out-of-range values for each variable and print out the respondent code, variable code, variable name, record number, column number, and out-of-range value. • Extreme values should be closely examined.
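An out-of-range check of the kind described above can be sketched in a few lines. This is an illustrative sketch rather than the output of SPSS, SAS, EXCEL, or MINITAB: the valid ranges are taken from the codebook excerpt, while `find_out_of_range` and the miskeyed respondent 99 are hypothetical.

```python
# Valid code ranges per variable, taken from the codebook excerpt.
VALID_RANGES = {
    "PREFERENCE": (1, 7),
    "QUALITY": (1, 7),
    "QUANTITY": (1, 7),
    "VALUE": (1, 7),
    "SERVICE": (1, 7),
    "INCOME": (1, 6),
}


def find_out_of_range(records, valid_ranges=VALID_RANGES):
    """Return (respondent ID, variable name, value) for each out-of-range code."""
    problems = []
    for rec in records:
        for var, (low, high) in valid_ranges.items():
            value = rec[var]
            if not low <= value <= high:
                problems.append((rec["ID"], var, value))
    return problems


records = [
    # Respondents 1 and 2 from the restaurant preference table.
    {"ID": 1, "PREFERENCE": 2, "QUALITY": 2, "QUANTITY": 3, "VALUE": 1, "SERVICE": 3, "INCOME": 6},
    {"ID": 2, "PREFERENCE": 6, "QUALITY": 5, "QUANTITY": 6, "VALUE": 5, "SERVICE": 7, "INCOME": 2},
    # Hypothetical miskeyed record, added only to trigger the check.
    {"ID": 99, "PREFERENCE": 9, "QUALITY": 4, "QUANTITY": 4, "VALUE": 0, "SERVICE": 5, "INCOME": 3},
]

problems = find_out_of_range(records)
for respondent_id, var, value in problems:
    print(f"respondent {respondent_id}: {var} = {value} is out of range")
```

Only the hypothetical record is flagged; the two real records pass because every code lies inside its codebook range.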
  28. 28. Data Cleaning Treatment of Missing Responses • Substitute a Neutral Value – A neutral value, typically the mean response to the variable, is substituted for the missing responses. • Substitute an Imputed Response – The respondents' pattern of responses to other questions are used to impute or calculate a suitable response to the missing questions. • In casewise deletion, cases, or respondents, with any missing responses are discarded from the analysis. • In pairwise deletion, instead of discarding all cases with any missing values, the researcher uses only the cases or respondents with complete responses for each calculation.
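The difference between these treatments is easiest to see on a small data set. The sketch below uses hypothetical ratings (not the restaurant data) with `None` marking a missing response; the function names are assumptions made for illustration.

```python
from statistics import mean

# Hypothetical ratings for five respondents; None marks a missing response.
data = {
    "QUALITY": [2, 5, None, 2, 6],
    "QUANTITY": [3, 6, 3, None, 6],
    "VALUE": [1, 5, 4, 1, 5],
}


def mean_substitute(values):
    """Neutral-value treatment: replace each missing response with the variable mean."""
    observed = [v for v in values if v is not None]
    neutral = mean(observed)
    return [neutral if v is None else v for v in values]


def casewise_rows(table):
    """Casewise deletion: keep only respondents with no missing response at all."""
    n = len(next(iter(table.values())))
    return [i for i in range(n) if all(col[i] is not None for col in table.values())]


def pairwise_rows(table, var_a, var_b):
    """Pairwise deletion: keep respondents complete on the two variables in use."""
    pairs = zip(table[var_a], table[var_b])
    return [i for i, (a, b) in enumerate(pairs) if a is not None and b is not None]


quality_filled = mean_substitute(data["QUALITY"])
complete_cases = casewise_rows(data)
quality_value_cases = pairwise_rows(data, "QUALITY", "VALUE")
print(quality_filled, complete_cases, quality_value_cases)
```

Casewise deletion keeps three respondents here, while a QUALITY and VALUE calculation under pairwise deletion keeps four, which is why pairwise deletion can base different calculations on different sample sizes.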
  29. 29. Selecting a Data Analysis Strategy [Figure: the data analysis strategy is shaped by the known characteristics of the data, the properties of statistical techniques, and the background and philosophy of the researcher.]
  30. 30. Measures of Center and Location: Overview • Mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$ (sample), $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$ (population) • Median • Mode • Weighted Mean: $\bar{X}_W = \frac{\sum w_i x_i}{\sum w_i}$ (sample), $\mu_W = \frac{\sum w_i x_i}{\sum w_i}$ (population)
  31. 31. Mean (Arithmetic Average) • The Mean is the arithmetic average of data values • Sample mean ($n$ = sample size): $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$ • Population mean ($N$ = population size): $\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N}$
  32. 32. Mean (Arithmetic Average) (continued) • The most common measure of central tendency • Mean = sum of values divided by the number of values • Affected by extreme values (outliers) • Example without an outlier: $\frac{1 + 2 + 3 + 4 + 5}{5} = \frac{15}{5} = 3$ • Example with an outlier: $\frac{1 + 2 + 3 + 4 + 10}{5} = \frac{20}{5} = 4$
  33. 33. Median • In an ordered array, the median is the “middle” number, i.e., the number that splits the distribution in half • The median is not affected by extreme values • [Figure: two dot plots on a 0 to 10 axis, each with Median = 3]
  34. 34. Median (continued) • To find the median, sort the n data values from low to high (sorted data is called a data array) • Find the value in the i = (1/2)n position • The ith position is called the Median Index Point • If i is not an integer, round up to the next highest integer • If i is an integer, the median is the average of the values in positions i and i + 1
  35. 35. Median Example (continued) • Data array: 4, 4, 5, 5, 9, 11, 12, 14, 16, 19, 22, 23, 24 • Note that n = 13 • Find the i = (1/2)n position: i = (1/2)(13) = 6.5 • Since 6.5 is not an integer, round up to 7 • The median is the value in the 7th position: Md = 12
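The median index point procedure can be sketched as below. The function name `median_by_index` is an assumption, and for an integer i the sketch averages the values in positions i and i + 1, which is the usual convention for an even number of observations.

```python
import math


def median_by_index(values):
    """Median via the median index point i = n / 2 (positions are 1-based)."""
    data = sorted(values)  # the sorted values form the data array
    n = len(data)
    i = n / 2
    if i != int(i):
        # i is not an integer: round up and take the value at that position.
        position = math.ceil(i)
        return data[position - 1]
    # i is an integer: average the values at positions i and i + 1.
    position = int(i)
    return (data[position - 1] + data[position]) / 2


data_array = [4, 4, 5, 5, 9, 11, 12, 14, 16, 19, 22, 23, 24]
md = median_by_index(data_array)
print(md)  # 12, the value in the 7th position
```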
  36. 36. Mode • A measure of location • The value that occurs most often • Not affected by extreme values • Used for either numerical or categorical data • There may be no mode • There may be several modes • [Figure: dot plot on a 0 to 14 axis with Mode = 5; dot plot on a 0 to 6 axis with no mode]
  37. 37. Shape of a Distribution • Describes how data is distributed • Symmetric or skewed • Left-Skewed: Mean < Median (longer tail extends to left) • Symmetric: Mean = Median • Right-Skewed: Median < Mean (longer tail extends to right)
  38. 38. Weighted Mean • Used when values are grouped by frequency or relative importance • Example: Sample of 26 Repair Projects
      Days to Complete   Frequency
      5                   4
      6                  12
      7                   8
      8                   2
      Weighted Mean Days to Complete: $\bar{X}_W = \frac{\sum w_i x_i}{\sum w_i} = \frac{(4 \times 5) + (12 \times 6) + (8 \times 7) + (2 \times 8)}{4 + 12 + 8 + 2} = \frac{164}{26} = 6.31$ days
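The repair-project example can be checked with a short sketch; `weighted_mean` is an illustrative helper name, with the frequencies used as weights.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


days_to_complete = [5, 6, 7, 8]
frequency = [4, 12, 8, 2]  # weights: number of repair projects per value

result = weighted_mean(days_to_complete, frequency)
print(round(result, 2))  # 6.31 (164 / 26)
```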
  39. 39. Which measure of location is the “best”? • Mean is generally used, unless extreme values (outliers) exist • Then Median is often used, since the median is not sensitive to extreme values. • Example: Median home prices may be reported for a region – less sensitive to outliers
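A quick way to see this sensitivity is to compare the two measures on the deck's own five-value examples, one with the value 5 and one where it is replaced by the outlier 10. The sketch uses Python's standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean, median

without_outlier = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 10]  # the largest value is replaced by an outlier

mean_shift = mean(with_outlier) - mean(without_outlier)
median_shift = median(with_outlier) - median(without_outlier)

print(mean(without_outlier), mean(with_outlier))      # 3 and 4
print(median(without_outlier), median(with_outlier))  # 3 and 3
```

The mean moves from 3 to 4 while the median stays at 3, which is why the median is preferred when outliers are present.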
  40. 40. Measures of Central Tendency: Ungrouped Data • Measures of central tendency yield information about “particular places or locations in a group of numbers.” • Common Measures of Location • Mode • Median • Mean
  41. 41. Mean of Grouped Data • Weighted average of class midpoints • Class frequencies are the weights • $\mu_{\text{grouped}} = \frac{\sum fM}{\sum f} = \frac{\sum fM}{N} = \frac{f_1 M_1 + f_2 M_2 + f_3 M_3 + \cdots + f_i M_i}{f_1 + f_2 + f_3 + \cdots + f_i}$
  42. 42. Calculation of Grouped Mean
      Class Interval   Frequency (f)   Class Midpoint (M)   fM
      20-under 30       6              25                    150
      30-under 40      18              35                    630
      40-under 50      11              45                    495
      50-under 60      11              55                    605
      60-under 70       3              65                    195
      70-under 80       1              75                     75
      Total            50                                   2150
      $\mu_{\text{grouped}} = \frac{\sum fM}{\sum f} = \frac{2150}{50} = 43.0$
  43. 43. Median of Grouped Data
  44. 44. Mode of Grouped Data • Midpoint of the modal class • Modal class has the greatest frequency
      Class Interval   Frequency
      20-under 30       6
      30-under 40      18
      40-under 50      11
      50-under 60      11
      60-under 70       3
      70-under 80       1
  45. 45. Mean, Median and Mode • Q. The frequency distribution below represents the weights in pounds of a sample of packages carried last month by a small airfreight company. Compute the sample mean, median and mode.
      Class       Frequency
      10-10.9      1
      11-11.9      4
      12-12.9      6
      13-13.9      8
      14-14.9     12
      15-15.9     11
      16-16.9      8
      17-17.9      7
      18-18.9      6
      19-19.9      2
  46. 46. Mean, Median and Mode • The frequency distribution below represents the salaries (in hundreds of rupees) of an MNC's employees for last year. Compute the mean, median and mode salary.
      Class (rupees in hundreds)   Frequency
      0–49.99                       78
      50.00–99.99                  123
      100.00–149.99                187
      150.00–199.99                 82
      200.00–249.99                 51
      250.00–299.99                 47
      300.00–349.99                 13
      350.00–399.99                  9
      400.00–449.99                  6
      450.00–499.99                  4
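The three grouped-data measures can be sketched together for the class intervals above. The grouped mean and mode follow the definitions given on the slides; the grouped median uses the standard interpolation formula, median = L + ((N/2 − cf) / f) × W, which is an assumption here because the transcript shows no formula for it (L is the lower limit of the median class, cf the cumulative frequency before it, f its frequency, W its width). Function names are illustrative.

```python
# (lower limit, upper limit, frequency) for each class interval.
classes = [
    (20, 30, 6),
    (30, 40, 18),
    (40, 50, 11),
    (50, 60, 11),
    (60, 70, 3),
    (70, 80, 1),
]


def grouped_mean(groups):
    """Frequency-weighted average of the class midpoints."""
    total_f = sum(f for _, _, f in groups)
    return sum(f * (low + high) / 2 for low, high, f in groups) / total_f


def grouped_median(groups):
    """Interpolate inside the class that contains the N/2-th observation."""
    half = sum(f for _, _, f in groups) / 2
    cumulative = 0
    for low, high, f in groups:
        if cumulative + f >= half:
            return low + (half - cumulative) / f * (high - low)
        cumulative += f
    raise ValueError("no classes supplied")


def grouped_mode(groups):
    """Midpoint of the modal class (the class with the greatest frequency)."""
    low, high, _ = max(groups, key=lambda g: g[2])
    return (low + high) / 2


g_mean = grouped_mean(classes)
g_median = grouped_median(classes)
g_mode = grouped_mode(classes)
print(g_mean, round(g_median, 2), g_mode)
```

For these data the median class is 40-under 50, since the cumulative frequency reaches 24 before it and 35 after it, with N/2 = 25.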
  47. 47. Measures of Variation • Range • Variance: Population Variance, Sample Variance • Standard Deviation: Population Standard Deviation, Sample Standard Deviation
  48. 48. Variation • Measures of variation give information on the spread or variability of the data values. • [Figure: distributions with the same center but different variation]
  49. 49. Range • Simplest measure of variation • Difference between the largest and the smallest observations: Range = $x_{\text{maximum}} - x_{\text{minimum}}$ • Example: for data plotted on a 0 to 14 axis with smallest value 1 and largest value 14, Range = 14 - 1 = 13
  50. 50. Disadvantages of the Range • Ignores the way in which data are distributed: two differently distributed data sets spanning 7 to 12 both have Range = 12 - 7 = 5 • Sensitive to outliers: 1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5 has Range = 5 - 1 = 4, while 1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120 has Range = 120 - 1 = 119
  51. 51. Variance • Average of squared deviations of values from the mean • Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$ • Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
  52. 52. Standard Deviation • Most commonly used measure of variation • Shows variation about the mean • Has the same units as the original data • Population standard deviation: $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$ • Sample standard deviation: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
  53. 53. Introduction to Probability Distributions • Random Variable • Represents a possible numerical value from a random event • Takes on different values based on chance • Types of random variables: Discrete Random Variable, Continuous Random Variable
  54. 54. Discrete Random Variable • A discrete random variable is a variable that can assume only a countable number of values • Many possible outcomes: number of complaints per day; number of TVs in a household; number of rings before the phone is answered • Only two possible outcomes: gender (male or female); defective (yes or no); spreads peanut butter first vs. spreads jelly first
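The only difference between the population and sample versions is the divisor, N versus n - 1. The sketch below makes that explicit on a small illustrative data set (the values 1 to 5 are chosen for easy arithmetic; function names are assumptions).

```python
import math


def population_variance(values):
    """Average squared deviation from the mean, dividing by N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def sample_variance(values):
    """Sum of squared deviations from the mean, dividing by n - 1."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)


data = [1, 2, 3, 4, 5]  # mean 3, squared deviations 4, 1, 0, 1, 4

data_range = max(data) - min(data)
pop_var = population_variance(data)   # 10 / 5 = 2.0
samp_var = sample_variance(data)      # 10 / 4 = 2.5
pop_sd = math.sqrt(pop_var)
samp_sd = math.sqrt(samp_var)
print(data_range, pop_var, samp_var, round(pop_sd, 3), round(samp_sd, 3))
```

Python's standard `statistics` module offers `pvariance`, `variance`, `pstdev`, and `stdev` for the same four quantities.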
  55. 55. Continuous Random Variable • A continuous random variable is a variable that can assume any value on a continuum (can assume an uncountable number of values) • thickness of an item • time required to complete a task • temperature of a solution • height, in inches • These can potentially take on any value, depending only on the ability to measure accurately.
  56. 56. Discrete Random Variables • Can only assume a countable number of values • Examples: • Roll a die twice. Let x be the number of times 4 comes up (then x could be 0, 1, or 2 times) • Toss a coin 5 times. Let x be the number of heads (then x = 0, 1, 2, 3, 4, or 5)
  57. 57. Discrete Probability Distribution • Experiment: Toss 2 Coins. Let x = # heads. • 4 possible outcomes: TT, TH, HT, HH • Probability Distribution:
      x Value   Probability
      0         1/4 = .25
      1         2/4 = .50
      2         1/4 = .25
  58. 58. Probability Distributions • Discrete Probability Distributions: Binomial, Poisson • Continuous Probability Distributions: Normal
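The two-coin distribution can be reproduced by enumerating the equally likely outcomes. This is a small sketch using only the standard library; the expected-value line is an addition that applies the usual definition E(x) = Σ x·P(x), in keeping with the "Expectation" topic listed for this unit.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All equally likely outcomes of tossing 2 coins: HH, HT, TH, TT.
outcomes = list(product("HT", repeat=2))

# Count how many outcomes give each number of heads.
counts = Counter(outcome.count("H") for outcome in outcomes)

# Probability of each x = number of heads.
distribution = {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}

# Expected number of heads: sum of x * P(x).
expected_heads = sum(x * p for x, p in distribution.items())

for x, p in distribution.items():
    print(x, p, float(p))
print("E(x) =", expected_heads)
```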
  59. 59. Continuous Probability Distributions • A continuous random variable is a variable that can assume any value on a continuum (can assume an uncountable number of values) • thickness of an item • time required to complete a task • temperature of a solution • height, in inches • These can potentially take on any value, depending only on the ability to measure accurately.
  60. 60. Factor Analysis • Factor analysis is a general name denoting a class of procedures primarily used for data reduction and summarization. • Factor analysis is an interdependence technique in that an entire set of interdependent relationships is examined without making the distinction between dependent and independent variables. • Factor analysis is used in the following circumstances: • To identify underlying dimensions, or factors, that explain the correlations among a set of variables. • To identify a new, smaller, set of uncorrelated variables to replace the original set of correlated variables in subsequent multivariate analysis (regression or discriminant analysis). • To identify a smaller set of salient variables from a larger set for use in subsequent multivariate analysis.
  61. 61. Factor Analysis Model Mathematically, each variable is expressed as a linear combination of underlying factors. The covariation among the variables is described in terms of a small number of common factors plus a unique factor for each variable. If the variables are standardized, the factor analysis model may be represented as: $X_i = A_{i1}F_1 + A_{i2}F_2 + A_{i3}F_3 + \cdots + A_{im}F_m + V_iU_i$ where $X_i$ = ith standardized variable; $A_{ij}$ = standardized multiple regression coefficient of variable i on common factor j; $F$ = common factor; $V_i$ = standardized regression coefficient of variable i on unique factor i; $U_i$ = the unique factor for variable i; $m$ = number of common factors
  62. 62. Factor Analysis Model The unique factors are uncorrelated with each other and with the common factors. The common factors themselves can be expressed as linear combinations of the observed variables: $F_i = W_{i1}X_1 + W_{i2}X_2 + W_{i3}X_3 + \cdots + W_{ik}X_k$ where $F_i$ = estimate of ith factor; $W_i$ = weight or factor score coefficient; $k$ = number of variables
  63. 63. Factor Analysis Model • It is possible to select weights or factor score coefficients so that the first factor explains the largest portion of the total variance. • Then a second set of weights can be selected, so that the second factor accounts for most of the residual variance, subject to being uncorrelated with the first factor. • This same principle could be applied to selecting additional weights for the additional factors.
  64. 64. Statistics Associated with Factor Analysis • Bartlett's test of sphericity. Bartlett's test of sphericity is a test statistic used to examine the hypothesis that the variables are uncorrelated in the population. In other words, the population correlation matrix is an identity matrix; each variable correlates perfectly with itself (r = 1) but has no correlation with the other variables (r = 0). • Correlation matrix. A correlation matrix is a lower triangle matrix showing the simple correlations, r, between all possible pairs of variables included in the analysis. The diagonal elements, which are all 1, are usually omitted.
  65. 65. Statistics Associated with Factor Analysis • Communality. Communality is the amount of variance a variable shares with all the other variables being considered. This is also the proportion of variance explained by the common factors. (Rule of thumb: 0.5.) • Eigenvalue. The eigenvalue represents the total variance explained by each factor. (Rule of thumb: > 1.) • Factor loadings. Factor loadings are simple correlations between the variables and the factors. (Rule of thumb: > 0.5.) • Factor loading plot. A factor loading plot is a plot of the original variables using the factor loadings as coordinates. • Factor matrix. A factor matrix contains the factor loadings of all the variables on all the factors extracted.
  66. 66. Statistics Associated with Factor Analysis • Factor scores. Factor scores are composite scores estimated for each respondent on the derived factors. • Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy. The Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy is an index used to examine the appropriateness of factor analysis. High values (between 0.5 and 1.0) indicate factor analysis is appropriate. Values below 0.5 imply that factor analysis may not be appropriate. • Percentage of variance. The percentage of the total variance attributed to each factor. (Rule of thumb: > 60%.) • Scree plot. A scree plot is a plot of the eigenvalues against the number of factors in order of extraction. (Rule of thumb: eigenvalue >= 1.)
  67. 67. Conducting Factor Analysis: Problem Formulation → Construction of the Correlation Matrix → Method of Factor Analysis → Determination of Number of Factors → Rotation of Factors → Interpretation of Factors → Calculation of Factor Scores or Selection of Surrogate Variables → Determination of Model Fit
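The eigenvalue and percentage-of-variance rules of thumb can be illustrated with a short sketch. The eigenvalues below are hypothetical numbers chosen for a six-variable problem (with standardized variables they sum to the number of variables); they are not results from this deck, and the variable names are assumptions.

```python
# Hypothetical eigenvalues for six standardized variables (illustrative only).
eigenvalues = [2.7, 2.2, 0.4, 0.3, 0.2, 0.2]

total_variance = sum(eigenvalues)  # about 6, one unit per standardized variable

# Percentage of total variance attributed to each factor.
pct_variance = [100 * ev / total_variance for ev in eigenvalues]

# Rule of thumb: retain factors whose eigenvalue exceeds 1.
retained = [ev for ev in eigenvalues if ev > 1]

# Cumulative percentage of variance explained by the retained factors.
cumulative_pct = sum(100 * ev / total_variance for ev in retained)

print(len(retained), round(cumulative_pct, 1))
```

With these numbers two factors are retained and together explain about 81.7% of the variance, which clears the 60% guideline.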
  68. 68. Conducting Factor Analysis: Formulate the Problem • The objectives of factor analysis should be identified. • The variables to be included in the factor analysis should be specified based on past research, theory, and judgment of the researcher. It is important that the variables be appropriately measured on an interval or ratio scale. • An appropriate sample size should be used. As a rough guideline, there should be at least four or five times as many observations (sample size) as there are variables.
  69. 69. Correlation Matrix
      Variables   V1       V2       V3       V4       V5       V6
      V1           1.000
      V2          -0.530    1.000
      V3           0.873   -0.155    1.000
      V4          -0.086    0.572   -0.248    1.000
      V5          -0.858    0.020   -0.778   -0.007    1.000
      V6           0.004    0.640   -0.018    0.640   -0.136    1.000
  70. 70. Conducting Factor Analysis: Construct the Correlation Matrix • The analytical process is based on a matrix of correlations between the variables. • Bartlett's test of sphericity can be used to test the null hypothesis that the variables are uncorrelated in the population: in other words, the population correlation matrix is an identity matrix. If this hypothesis cannot be rejected, then the appropriateness of factor analysis should be questioned. • Another useful statistic is the Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy. Small values of the KMO statistic indicate that the correlations between pairs of variables cannot be explained by other variables and that factor analysis may not be appropriate.
  71. 71. Determine the Method of Factor Analysis • In principal components analysis, the total variance in the data is considered. The diagonal of the correlation matrix consists of unities, and full variance is brought into the factor matrix. Principal components analysis is recommended when the primary concern is to determine the minimum number of factors that will account for maximum variance in the data for use in subsequent multivariate analysis. The factors are called principal components. • In common factor analysis, the factors are estimated based only on the common variance. Communalities are inserted in the diagonal of the correlation matrix. This method is appropriate when the primary concern is to identify the underlying dimensions and the common variance is of interest. This method is also known as principal axis factoring.
  72. 72. Scree Plot [Figure: eigenvalue (vertical axis, 0.0 to 3.0) plotted against component number (horizontal axis, 1 to 6)]
  73. 73. A Classification of Univariate Techniques
      Univariate Techniques
        Metric Data
          One Sample: t test, Z test
          Two or More Samples
            Independent: Two-group test, Z test, One-Way ANOVA
            Related: Paired t test
        Non-numeric Data
          One Sample: Frequency, Chi-Square, K-S, Runs, Binomial
          Two or More Samples
            Independent: Chi-Square, Mann-Whitney, Median, K-S, K-W ANOVA
            Related: Sign, Wilcoxon, McNemar, Chi-Square
  74. 74. A Classification of Multivariate Techniques
      Multivariate Techniques
        Dependence Technique
          One Dependent Variable: Cross-Tabulation, Analysis of Variance and Covariance, Multiple Regression, 2-Group Discriminant/Logit, Conjoint Analysis
          More Than One Dependent Variable: Multivariate Analysis of Variance, Canonical Correlation, Multiple Discriminant Analysis, Structural Equation Modeling and Path Analysis
        Interdependence Technique
          Variable Interdependence: Factor Analysis, Confirmatory Factor Analysis
          Interobject Similarity: Cluster Analysis, Multidimensional Scaling
  75. 75. Correlation • The correlation, r, summarizes the strength of association between two metric (interval or ratio scaled) variables, say X and Y. • It is an index used to determine whether a linear or straight-line relationship exists between X and Y. • As it was originally proposed by Karl Pearson, it is also known as the Pearson correlation coefficient. It is also referred to as simple correlation, bivariate correlation, or merely the correlation coefficient.
  76. 76. Factors influencing correlation • Chance coincidence • Influence of a third variable • Mutual influence
  77. 77. Types of correlations • Positive/Negative correlation • Linear/Non-linear correlation • Simple/partial/multiple correlation • Simple correlation: between x and y • Partial correlation: between x and y with z held constant • Multiple correlation: three or more variables.
  78. 78. Methods of correlation analysis • Scatter plot • Karl Pearson correlation • Rank correlation • Method of least squares
  79. 79. Correlation • r varies between -1.0 and +1.0. • The correlation coefficient between two variables will be the same regardless of their underlying units of measurement.
  80. 80. Karl Pearson Coefficient of Correlation • Formula: $r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}$
  81. 81. Calculate correlation coefficient (Karl Pearson coefficient of correlation) • Find the correlation between the number unemployed and the index of production. • Ans: r =
      Year   Index of production   Number unemployed (in lakhs)
      1991   100                   15
      1992   102                   12
      1993   104                   13
      1994   107                   11
      1995   105                   12
      1996   112                   12
      1997   103                   19
      1998    99                   26
  82. 82. Calculate correlation coefficient (Karl Pearson coefficient of correlation) • Find the correlation between age and number of sick days. • Ans: r =
      Employee   Age   Sick days
       1         30    1
       2         32    0
       3         35    2
       4         40    5
       5         48    2
       6         50    4
       7         52    6
       8         55    5
       9         57    7
      10         61    8
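Both exercises can be worked with a short sketch of the product-moment formula, r = Σ(x − x̄)(y − ȳ) / √(Σ(x − x̄)² · Σ(y − ȳ)²). The function and variable names are illustrative assumptions; the data are the two tables above.

```python
import math


def pearson_r(x, y):
    """Pearson correlation from deviations about the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Index of production and number unemployed (in lakhs), 1991 to 1998.
production = [100, 102, 104, 107, 105, 112, 103, 99]
unemployed = [15, 12, 13, 11, 12, 12, 19, 26]

# Age and number of sick days for ten employees.
age = [30, 32, 35, 40, 48, 50, 52, 55, 57, 61]
sick_days = [1, 0, 2, 5, 2, 4, 6, 5, 7, 8]

r_unemployment = pearson_r(production, unemployed)  # about -0.619
r_sick_days = pearson_r(age, sick_days)             # about 0.870
print(round(r_unemployment, 3), round(r_sick_days, 3))
```

The first pair shows a moderate negative association (unemployment tends to fall as production rises); the second shows a strong positive association between age and sick days.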
  83. 83. Spearman's Rank Correlation $\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$ where: $\rho$ = rank correlation coefficient; $d_i$ = difference between the two ranks of each observation; $n$ = number of observations
  84. 84. Rank correlation of the following
      Year   Index of production (x)   Number unemployed (in lakhs) (y)
      1991   100                       15
      1992   102                       12
      1993   104                       13
      1994   107                       11
      1995   105                       12
      1996   112                       12
      1997   103                       19
      1998    99                       26
      ρ =
      Employee   Age (x)   Sick days (y)
       1         30        1
       2         32        0
       3         35        2
       4         40        5
       5         48        2
       6         50        4
       7         52        6
       8         55        5
       9         57        7
      10         61        8
      ρ =
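A sketch of the rank correlation for the two data sets follows, using ρ = 1 − 6Σd² / (n(n² − 1)). Both y columns contain tied values, so the sketch assigns tied observations the average of their ranks; that tie-handling choice is an assumption, and with ties the simple formula is only an approximation to the exact rank correlation. Function names are illustrative.

```python
def average_ranks(values):
    """Rank from smallest (1) to largest; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho from squared rank differences."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(average_ranks(x), average_ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


production = [100, 102, 104, 107, 105, 112, 103, 99]
unemployed = [15, 12, 13, 11, 12, 12, 19, 26]
age = [30, 32, 35, 40, 48, 50, 52, 55, 57, 61]
sick_days = [1, 0, 2, 5, 2, 4, 6, 5, 7, 8]

rho_unemployment = spearman_rho(production, unemployed)  # about -0.714
rho_sick_days = spearman_rho(age, sick_days)             # about 0.909
print(round(rho_unemployment, 3), round(rho_sick_days, 3))
```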
  85. 85. Regression Analysis Regression analysis examines associative relationships between a metric dependent variable and one or more independent variables in the following ways: • Determine whether the independent variables explain a significant variation in the dependent variable: whether a relationship exists. • Determine how much of the variation in the dependent variable can be explained by the independent variables: strength of the relationship. • Determine the structure or form of the relationship: the mathematical equation relating the independent and dependent variables. • Predict the values of the dependent variable. • Control for other independent variables when evaluating the contributions of a specific variable or set of variables. • Regression analysis is concerned with the nature and degree of association between variables and does not imply or assume any causality.
  86. 86. Formulas • $y = mx + b$ • $y$ = dependent variable • $x$ = independent variable • $b$ = intercept • $m$ = slope of line • Slope of line: $m = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}$ • Line intercept: $b = \frac{\sum y - m\sum x}{n}$
  87. 87. Linear regression of the following
      Year   Index of production (x)   Number unemployed (in lakhs) (y)
      1991   100                       15
      1992   102                       12
      1993   104                       13
      1994   107                       11
      1995   105                       12
      1996   112                       12
      1997   103                       19
      1998    99                       26
      Employee   Age (x)   Sick days (y)
       1         30        1
       2         32        0
       3         35        2
       4         40        5
       5         48        2
       6         50        4
       7         52        6
       8         55        5
       9         57        7
      10         61        8
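The least-squares line y = mx + b for both data sets can be sketched as follows. The sketch assumes the standard raw-sum formulas, m = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²) and b = (Σy − mΣx) / n; the function name is illustrative.

```python
def least_squares_line(x, y):
    """Return (slope m, intercept b) of the least-squares line y = m*x + b."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = (sum_y - m * sum_x) / n
    return m, b


production = [100, 102, 104, 107, 105, 112, 103, 99]
unemployed = [15, 12, 13, 11, 12, 12, 19, 26]
age = [30, 32, 35, 40, 48, 50, 52, 55, 57, 61]
sick_days = [1, 0, 2, 5, 2, 4, 6, 5, 7, 8]

m_unemp, b_unemp = least_squares_line(production, unemployed)
m_sick, b_sick = least_squares_line(age, sick_days)

print(f"unemployed = {m_unemp:.4f} * production + {b_unemp:.3f}")
print(f"sick days  = {m_sick:.4f} * age + {b_sick:.3f}")
```

The fitted slopes are about -0.767 lakh unemployed per index point of production and about 0.211 sick days per year of age. As the regression slide notes, these describe association only and do not imply causality.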