SPSS Analysis: Conjoint, Cluster, Regression, PCA, Discriminant


  • 1. Conjoint Analysis:
Conjoint analysis is a marketing research technique designed to help determine the preferences of customers. It is used to analyse how customers value different attributes of a product (or service) and thus gives an insight into the trade-offs they are willing to make among the various attributes. Put simply, it tells how much each feature of a product is worth to consumers.
The study involves surveying people on a certain set of attribute combinations, which the survey-takers rank or rate by preference. The analysis then models customer preferences for the different combinations of attributes. The attributes are termed factors and their different values are termed levels.
In the example we use to demonstrate conjoint analysis in SPSS, we analyse data on carpet cleaners, with Price, Brand, Money-back guarantee, Package design and Seal as the attributes on which consumers state their preferences. Using two data sets, we calculate the part-worths and decide on the weightage of each of the attributes.

Variable name   Variable label           Value labels
package         package design           A*, B*, C*
brand           brand name               K2R, Glory, Bissell
price           price                    $1.19, $1.39, $1.59
seal            Good Housekeeping seal   no, yes
money           money-back guarantee     no, yes

Syntax to import the data and run the analysis:

GET FILE='C:\Users\Abhi\Desktop\carpet_plan.sav'.
DATASET NAME DataSet1 WINDOW=FRONT.
GET FILE='C:\Users\Abhi\Desktop\carpet_prefs.sav'.
DATASET NAME DataSet2 WINDOW=FRONT.
CONJOINT PLAN='C:\Users\Abhi\Desktop\carpet_plan.sav'
  /DATA='C:\Users\Abhi\Desktop\carpet_prefs.sav'
  /SEQUENCE=PREF1 PREF2 PREF3 PREF4 PREF5 PREF6 PREF7 PREF8 PREF9 PREF10
   PREF11 PREF12 PREF13 PREF14 PREF15 PREF16 PREF17 PREF18 PREF19 PREF20
   PREF21 PREF22
  /SUBJECT=ID
  /FACTORS=PACKAGE BRAND (DISCRETE) PRICE (LINEAR LESS)
   SEAL (LINEAR MORE) MONEY (LINEAR MORE)
  /PRINT=SUMMARYONLY.
  • 2. Model Description
          N of Levels   Relation to Ranks or Scores
package   3             Discrete
brand     3             Discrete
price     3             Linear (less)
seal      2             Linear (more)
money     2             Linear (more)

Calculation of the part-worth of each attribute:

Utilities
                   Utility Estimate   Std. Error
package   A*       -2.233             .192
          B*        1.867             .192
          C*         .367             .192
brand     K2R        .367             .192
          Glory     -.350             .192
          Bissell   -.017             .192
price     $1.19    -6.595             .988
          $1.39    -7.703            1.154
          $1.59    -8.811            1.320
seal      no        2.000             .287
          yes       4.000             .575
money     no        1.250             .287
  • 3.      yes     2.500             .575
(Constant)         12.870            1.282

This table shows the utility (part-worth) scores and their standard errors for each factor level. Higher utility values indicate greater preference. For the discrete factors (package and brand), the part-worths within each attribute sum to zero. With respect to brand, K2R is preferred over Glory and Bissell. As expected, there is an inverse relationship between price and utility, with higher prices corresponding to lower utility. The presence of a seal of approval or a money-back guarantee corresponds to a higher utility.

The total utility of a combination is obtained by summing the relevant part-worths and the constant (see the sketch following this table). If the cleaner had package design C*, brand Bissell, price $1.59, a seal of approval, and a money-back guarantee, the total utility would be:
0.367 + (−0.017) + (−8.811) + 4.000 + 2.500 + 12.870 = 10.909

Importance:

Importance Values
package   35.635
brand     14.911
price     29.410
seal      11.172
money      8.872

Package is the most important attribute, followed by price; the money-back guarantee is of least concern to the consumer. The values are computed by taking the utility range for each factor separately and dividing by the sum of the utility ranges for all factors. The values thus represent percentages and have the property that they sum to 100.
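To make the two calculations above concrete, here is a minimal sketch in Python (illustrative only, not SPSS output), with the part-worths transcribed from the Utilities table. Note that SPSS computes importance per subject and then averages, so importance derived from these pooled utilities only approximates the Importance Values table.

# Minimal sketch: total utility and importance recomputed from the
# part-worths in the Utilities table above.
part_worths = {
    "package": {"A*": -2.233, "B*": 1.867, "C*": 0.367},
    "brand":   {"K2R": 0.367, "Glory": -0.350, "Bissell": -0.017},
    "price":   {"$1.19": -6.595, "$1.39": -7.703, "$1.59": -8.811},
    "seal":    {"no": 2.000, "yes": 4.000},
    "money":   {"no": 1.250, "yes": 2.500},
}
CONSTANT = 12.870  # the (Constant) row of the Utilities table

def total_utility(profile):
    """Sum of the chosen levels' part-worths plus the constant."""
    return CONSTANT + sum(part_worths[f][level] for f, level in profile.items())

print(round(total_utility({"package": "C*", "brand": "Bissell",
                           "price": "$1.59", "seal": "yes",
                           "money": "yes"}), 3))  # 10.909

# Importance: each factor's utility range as a share of the sum of all ranges.
ranges = {f: max(w.values()) - min(w.values()) for f, w in part_worths.items()}
for f, r in ranges.items():
    print(f, round(100 * r / sum(ranges.values()), 2))
# SPSS averages per-subject importances, so these pooled values differ slightly
# from the Importance Values table.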
  • 4. Coefficients
        B Coefficient Estimate
price   -5.542
seal     2.000
money    1.250

For a linear factor, the utility of a level is obtained by multiplying the level's value by the coefficient. For example, the predicted utility for a price of $1.19 was listed as −6.595 in the utilities table; this is simply the price value, 1.19, multiplied by the price coefficient, −5.542 (see the sketch below).

SPSS also reports correlations between the observed and estimated preferences as a measure of fit; that table is not reproduced in this transcript.

Preference Scores of Simulations
Card Number   ID   Score
1             1    10.258
2             2    14.292
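The level-times-coefficient relation can be checked in a few lines. A quick sketch (Python, illustrative):

# Sketch: utilities of linear factors are level value x coefficient.
price_coef = -5.542  # from the Coefficients table
for price in (1.19, 1.39, 1.59):
    print(f"${price}: utility = {price * price_coef:.3f}")
# -6.595, -7.703, -8.812 -- matching the Utilities table up to rounding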
  • 5. The real power of conjoint analysis is the ability to predict preference for product profiles that weren't rated by the subjects. These are referred to as simulation cases (a computable sketch of the three probability models follows the reversal tables below).

Preference Probabilities of Simulations
Card Number   ID   Maximum Utility   Bradley-Terry-Luce   Logit
1             1    30.0%             43.1%                30.9%
2             2    70.0%             56.9%                69.1%

The maximum utility model determines the probability as the number of respondents predicted to choose the profile divided by the total number of respondents. For each respondent, the predicted choice is simply the profile with the largest total utility.

Number of Reversals
Factor    price       3
          money       2
          seal        2
          brand       0
          package     0
Subject   1   Subject 1    1
          2   Subject 2    2
          3   Subject 3    0
          4   Subject 4    0
          5   Subject 5    0
          6   Subject 6    1
  • 6.    7   Subject 7    0
          8   Subject 8    0
          9   Subject 9    1
         10   Subject 10   2

This table displays the number of reversals for each factor and for each subject. For example, three subjects showed a reversal for price; that is, they preferred product profiles with higher prices.

Reversal Summary
N of Reversals   N of Subjects
1                3
2                2
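For readers who want the three simulation-probability models in computable form, here is a small Python sketch. It is illustrative only: the subject utilities below are hypothetical, not taken from the carpet data.

# Sketch: the three simulation-probability models reported by SPSS.
# sim_utils[s] = subject s's total utilities for (card 1, card 2); hypothetical.
import math

sim_utils = [(10.1, 14.5), (12.0, 11.2), (9.8, 13.3)]
n = len(sim_utils)

# Maximum utility: share of subjects whose highest-utility profile is this card.
max_utility = [sum(u[c] == max(u) for u in sim_utils) / n for c in (0, 1)]
# Bradley-Terry-Luce: utility / sum of utilities, averaged across subjects.
btl = [sum(u[c] / sum(u) for u in sim_utils) / n for c in (0, 1)]
# Logit: exp(utility) / sum of exp(utilities), averaged across subjects.
logit = [sum(math.exp(u[c]) / sum(map(math.exp, u)) for u in sim_utils) / n
         for c in (0, 1)]
print(max_utility, btl, logit)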
  • 7. Q. Perform Discriminant Analysis on the given dataset.

The dataset chosen contains statistics on a set of people who have been given bank loans, recording their various characteristics and whether or not they defaulted.

Discriminant
Notes
Output Created    04-Apr-2013 18:39:05
Input Data        E:\VGSOM\STUDY\SECOND SEM\BRM\SPSS16\Samples\bankloan.sav
Active Dataset    DataSet1
File Label        Bank Loan Default
Filter            <none>
Weight            <none>
Split File        <none>
N of Rows in Working Data File   850
Missing Value Handling: User-defined missing values are treated as missing in the analysis phase. In the analysis phase, cases with no user- or system-missing values for any predictor variable are used; cases with user-missing, system-missing, or out-of-range values for the grouping variable are always excluded.
Syntax:
GET FILE='E:\VGSOM\STUDY\SECOND SEM\BRM\SPSS16\Samples\bankloan.sav'.
DATASET NAME DataSet1 WINDOW=FRONT.
DISCRIMINANT
  /GROUPS=default(0 1)
  /VARIABLES=employ address age
  /ANALYSIS ALL
  /PRIORS EQUAL
  /STATISTICS=MEAN STDDEV UNIVF BOXM COEFF CORR TABLE
  /PLOT=COMBINED
  /PLOT=CASES
  /CLASSIFY=NONMISSING POOLED MEANSUB.
Resources: Processor Time 00:00:00.047; Elapsed Time 00:00:00.121

[DataSet1] E:\VGSOM\STUDY\SECOND SEM\BRM\SPSS16\Samples\bankloan.sav
  • 8. Warnings
All-Groups Stacked Histogram is no longer displayed.

Analysis Case Processing Summary
Unweighted Cases                                            N     Percent
Valid                                                       700   82.4
Excluded   Missing or out-of-range group codes              150   17.6
           At least one missing discriminating variable       0     .0
           Both missing or out-of-range group codes and
           at least one missing discriminating variable       0     .0
           Total                                            150   17.6
Total                                                       850   100.0

Group Statistics
Previously defaulted                          Mean    Std. Deviation   Valid N (Unweighted)   Valid N (Weighted)
No      Years with current employer           9.51    6.664            517                    517.000
        Years at current address              8.95    7.001            517                    517.000
        Age in years                          35.51   7.708            517                    517.000
Yes     Years with current employer           5.22    5.543            183                    183.000
        Years at current address              6.39    5.925            183                    183.000
        Age in years                          33.01   8.518            183                    183.000
Total   Years with current employer           8.39    6.658            700                    700.000
        Years at current address              8.28    6.825            700                    700.000
        Age in years                          34.86   7.997            700                    700.000
  • 9. Tests of Equality of Group Means
                              Wilks' Lambda   F        df1   df2   Sig.
Years with current employer   .920            60.759   1     698   .000
Years at current address      .973            19.402   1     698   .000
Age in years                  .981            13.482   1     698   .000

Pooled Within-Groups Matrices
Correlation                   Years with current employer   Years at current address   Age in years
Years with current employer   1.000                         .292                       .524
Years at current address       .292                        1.000                       .588
Age in years                   .524                         .588                      1.000

This matrix shows the correlations between the predictors. The largest correlation, .588, is between Years at current address and Age in years.

Analysis 1
Box's Test of Equality of Covariance Matrices

Log Determinants
Previously defaulted   Rank   Log Determinant
No                     3      11.012
Yes                    3      10.501
Pooled within-groups   3      10.919
The ranks and natural logarithms of determinants printed are those of the group covariance matrices.

Test Results
Box's M   28.171
F         Approx.   4.665
          df1       6
          df2       7.335E5
          Sig.      .000
Tests the null hypothesis of equal population covariance matrices.
  • 10. Summary of Canonical Discriminant Functions

Eigenvalues
Function   Eigenvalue   % of Variance   Cumulative %   Canonical Correlation
1          .100a        100.0           100.0          .301
a. First 1 canonical discriminant functions were used in the analysis.

Wilks' Lambda
Test of Function(s)   Wilks' Lambda   Chi-square   df   Sig.
1                     .909            66.251       3    .000

Standardized Canonical Discriminant Function Coefficients
                              Function 1
Years with current employer   .980
Years at current address      .436
Age in years                  -.330
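As a check on these summary statistics: for a single discriminant function, the canonical correlation and Wilks' lambda are simple transformations of the eigenvalue. A short sketch (Python, illustrative):

# Sketch: for one discriminant function,
# canonical correlation = sqrt(eig / (1 + eig)) and Wilks' lambda = 1 / (1 + eig).
import math

eig = 0.100  # from the Eigenvalues table (rounded)
print(round(math.sqrt(eig / (1 + eig)), 3))  # 0.302, vs .301 reported
print(round(1 / (1 + eig), 3))               # 0.909, as reported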
  • 11. Structure Matrix
                              Function 1
Years with current employer   .934
Years at current address      .528
Age in years                  .440
Pooled within-groups correlations between discriminating variables and standardized canonical discriminant functions. Variables ordered by absolute size of correlation within function.

Functions at Group Centroids
Previously defaulted   Function 1
No                     .188
Yes                    -.530
Unstandardized canonical discriminant functions evaluated at group means.

Classification Statistics

Classification Processing Summary
Processed                                             850
Excluded   Missing or out-of-range group codes          0
           At least one missing discriminating
           variable                                     0
Used in Output                                        850
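With equal priors (as specified by /PRIORS EQUAL) and a single function, a case can be classified by comparing its discriminant score to the midpoint of the two group centroids. A hedged sketch (Python; the score below is hypothetical):

# Sketch: classification by proximity to the group centroids above,
# valid here because the two groups have equal prior probabilities.
centroids = {"No": 0.188, "Yes": -0.530}
cutoff = sum(centroids.values()) / 2   # midpoint, about -0.171
score = 0.05                           # hypothetical discriminant score
print("No" if score > cutoff else "Yes")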
  • 12. Prior Probabilities for Groups
Previously defaulted   Prior   Cases Used in Analysis (Unweighted)   (Weighted)
No                     .500    517                                   517.000
Yes                    .500    183                                   183.000
Total                  1.000   700                                   700.000

Classification Function Coefficients
                              Previously defaulted
                              No        Yes
Years with current employer   -.192     -.302
Years at current address      -.302     -.348
Age in years                  .797      .827
(Constant)                    -12.588   -12.444
Fisher's linear discriminant functions

Classification Results a
                                   Predicted Group Membership
Previously defaulted               No     Yes    Total
Original   Count   No              300    217    517
                   Yes             44     139    183
                   Ungrouped cases 81     69     150
           %       No              58.0   42.0   100.0
                   Yes             24.0   76.0   100.0
                   Ungrouped cases 54.0   46.0   100.0
a. 62.7% of original grouped cases correctly classified.

The classification table shows how well the model performs. Of the 183 cases that actually defaulted, 139 (76.0%) were correctly predicted to default and 44 were misclassified; of the 517 that did not default, 300 (58.0%) were correctly classified and 217 were misclassified. Overall, 62.7% of the original grouped cases were classified correctly.
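Equivalently, Fisher's classification function coefficients from the table above can be applied directly: score each group's function and assign the case to the group with the larger score. A sketch (Python; the applicant's values are hypothetical):

# Sketch: classification with Fisher's linear discriminant functions,
# coefficients transcribed from the table above.
coef = {
    "No":  {"const": -12.588, "employ": -0.192, "address": -0.302, "age": 0.797},
    "Yes": {"const": -12.444, "employ": -0.302, "address": -0.348, "age": 0.827},
}

def classify(employ, address, age):
    """Score both group functions; the predicted group has the larger score."""
    scores = {g: c["const"] + c["employ"] * employ
                 + c["address"] * address + c["age"] * age
              for g, c in coef.items()}
    return max(scores, key=scores.get), scores

print(classify(employ=10, address=8, age=35))  # hypothetical applicant -> "No"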
  • 13. Q. Perform Factor Analysis on the given dataset.

The dataset chosen contains responses to a fictional statistics-anxiety questionnaire: students' answers about their ease of use, liking and usage of SPSS in statistics.

Using the scree plot, I have chosen 5 factors.

Since a student may give related answers depending upon the choices, I considered the variables to be inter-related and hence used Oblimin (oblique) rotation. For example, a student who gives high points to the variable "I have little experience of computers" is likely to give high points to "All computers hate me", as the variables are somewhat correlated.
  • 14. Using the options of SPSS the following Pattern Matrix was generated.

Pattern Matrix a
                                                               Component
                                                               1      2      3      4      5
I have little experience of computers                          .903
SPSS always crashes when I try to use it                       .732
All computers hate me                                          .684
I worry that I will cause irreparable damage because of
my incompetence with computers                                 .662
Computers have minds of their own and deliberately go
wrong whenever I use them                                      .581
People try to tell you that SPSS makes statistics easier
to understand but it doesn't                                   .446
Computers are out to get me                                    .333
My friends are better at SPSS than I am                               .661
My friends are better at statistics than me                           .655
If I'm good at statistics my friends will think I'm a nerd            .622
My friends will think I'm stupid for not being able to
cope with SPSS                                                        .504   .330
Everybody looks at me when I use SPSS                                 .358   .358
I can't sleep for thoughts of eigen vectors                                  -.728
I wake up under my duvet thinking that I am trapped
under a normal distribution                                    .324          -.543
  • 15.                                                        Component
                                                               1      2      3      4      5
Computers are useful only for playing games                    .359   .393   -.366
Standard deviations excite me                                  .301   .356   .315
I have never been good at mathematics                                               -.855
I did badly at mathematics at school                                                -.736
I slip into a coma whenever I see an equation                                       -.722
Statistics makes me cry                                                                    -.772
I don't understand statistics                                                              -.730
I weep openly at the mention of central tendency                                           -.664
I dream that Pearson is attacking me with correlation
coefficients                                                                               -.564
Extraction Method: Principal Component Analysis.
Rotation Method: Oblimin with Kaiser Normalization.
a. Rotation converged in 15 iterations.

The total variance explained by each factor is given below:

Total Variance Explained
            Rotation Sums of Squared Loadings a
Component   Total
1           5.522
2           2.452
3           2.383
4           3.535
5           4.913
  • 16. Extraction Method: Principal Component Analysis.

The percentage of total variance explained by a factor is calculated by taking its sum of squared loadings, dividing by the number of variables (23 here) and multiplying by 100 (see the sketch below).

Hence the factoring would be as follows, depending on the loading values:

Factor   Variable Nos.
1        1, 2, 3, 4, 5, 6, 7, 14
2        8, 9, 10
3        13
4        17, 18, 19
5        20, 21, 22, 23

Variables 11, 12, 15 and 16 have very close loadings on different factors, which is undesirable, as such a variable is assessing more than one construct; variable 15 has exactly the same value in Factor 2 and Factor 3. These variables are said to have split loadings and are therefore listed separately:

Factor   Variable Nos.
2        11, 15, 16
3        12, 15

As split loadings are present, this is not a simple structure.

Factor 1: anxiety about the usage of computers. Its rotation sum of squared loadings is 5.522, about 24.0% of the total variance, and it loads 8 of the variables.
Factor 2: students' view of their understanding of statistics and SPSS relative to their peers. It accounts for about 10.7% of the total variance and loads 3 variables; it also split-loads variables 11, 16 and 15.
Factor 3: anxiety about eigenvectors. It accounts for about 10.4% of the total variance and loads only 1 variable directly, while split-loading variables 12 and 15.
Factor 4: students' interest in mathematics. It accounts for about 15.4% of the total variance and loads 3 variables.
Factor 5: dislike of statistics. It accounts for about 21.4% of the total variance and loads 4 variables.
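The percentage-of-variance arithmetic above can be verified with a few lines (Python, illustrative; 23 is the number of questionnaire items):

# Sketch: % of total variance = (rotation sum of squared loadings / 23) * 100.
ss_loadings = {1: 5.522, 2: 2.452, 3: 2.383, 4: 3.535, 5: 4.913}
N_ITEMS = 23

for comp, ss in ss_loadings.items():
    print(f"Component {comp}: {100 * ss / N_ITEMS:.1f}% of total variance")
# 24.0%, 10.7%, 10.4%, 15.4%, 21.4%. With an oblique rotation the components
# are correlated, so these percentages are approximate and overlap somewhat.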
  • 17. CLUSTER ANALYSIS

Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar, in some sense or another, to each other than to those in other groups (clusters). It is a main task of exploratory data mining and a common technique for statistical data analysis, used in many fields including machine learning, pattern recognition, image analysis, information retrieval and bioinformatics.

Proximities
Notes
Output Created    02-Apr-2013 22:00:05
Input Data        C:\Users\devmaletia\Downloads\ClusterAnonFaculty.sav
Active Dataset    DataSet3
Filter            <none>
Weight            <none>
Split File        <none>
N of Rows in Working Data File   44
Missing Value Handling: User-defined missing values are treated as missing. Statistics are based on cases with no missing values for any variable used.
Syntax:
PROXIMITIES Salary FTE Rank Articles Experience
  /MATRIX OUT('C:\Users\DEVMAL~1\AppData\Local\Temp\spss6496\spssclus.tmp')
  /VIEW=CASE
  /MEASURE=SEUCLID
  /PRINT NONE
  /ID=Name
  /STANDARDIZE=VARIABLE Z.
Resources: Processor Time 00:00:00.078; Elapsed Time 00:00:00.082; Workspace Bytes 11152
Files Saved: Matrix File C:\Users\DEVMAL~1\AppData\Local\Temp\spss6496\spssclus.tmp
  • 18. The variables which I have used in the dataset are as follows:
• Name – the faculty member's name (faculty salaries are public information under North Carolina state law).
• Salary – annual salary in dollars, from the university report available in One Stop.
• FTE – full-time-equivalent work load for the faculty member.
• Rank – where 1 = adjunct, 2 = visiting, 3 = assistant, 4 = associate, 5 = professor.
• Articles – number of published scholarly articles, excluding things like comments in newsletters, abstracts in proceedings, and the like.
• Experience – number of years working as a full-time faculty member in a Department of Psychology.
• ArticlesAPD – number of published articles as listed in the university's Academic Publications.
• Sex – biological sex from physical appearance.

In the first step SPSS computes, for each pair of cases, the squared Euclidian distance between the cases. This is, quite simply, the sum across variables (from i = 1 to v) of the squared difference between the score on variable i for the one case (Xi) and the score on variable i for the other case (Yi). The two cases separated by the smallest Euclidian distance are identified and classified together into the first cluster. At this point there is one cluster with two cases in it.

Next SPSS re-computes the squared Euclidian distances between each entity (case or cluster) and each other entity. When one or both of the compared entities is a cluster, SPSS computes the averaged squared Euclidian distance between members of the one entity and members of the other entity (average linkage; see the sketch below). The two entities with the smallest squared Euclidian distance are classified together, and SPSS repeats this step until all of the cases have been clustered into one big cluster.

The output obtained can be seen below:

Case Processing Summary a
                  Cases
Valid             Missing           Total
N     Percent     N     Percent     N     Percent
44    100.0%      0     .0%         44    100.0%
a. Squared Euclidean Distance used
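The agglomeration procedure just described can be sketched in a few lines of Python. This is illustrative only: the four two-variable cases below are hypothetical stand-ins for the 44 z-scored faculty records.

# Sketch: squared Euclidean distance + average linkage, merging the closest
# pair of entities at each stage, as described above.
import itertools

cases = {"A": [0.5, -1.2], "B": [0.6, -1.1], "C": [2.0, 0.3], "D": [1.9, 0.4]}
clusters = {name: [vec] for name, vec in cases.items()}  # start: one case each

def sq_euclid(x, y):
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def avg_linkage(c1, c2):
    """Average squared distance between members of the two clusters."""
    pairs = [(x, y) for x in c1 for y in c2]
    return sum(sq_euclid(x, y) for x, y in pairs) / len(pairs)

while len(clusters) > 1:
    (a, b), d = min((((a, b), avg_linkage(clusters[a], clusters[b]))
                     for a, b in itertools.combinations(clusters, 2)),
                    key=lambda t: t[1])
    print(f"merge {a} + {b} at distance {d:.3f}")
    clusters[a + "+" + b] = clusters.pop(a) + clusters.pop(b)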
  • 19. On the first step SPSS clustered case 32 with case 33; the squared Euclidian distance between these two cases is 0.000. At stages 2-4 SPSS creates three more clusters, each containing two cases. At stage 5 SPSS adds case 39 to the cluster that already contains cases 37 and 38. By the 43rd stage all cases have been clustered into one entity. The results can be seen below:

Average Linkage (Between Groups)

Agglomeration Schedule
                                         Stage Cluster First Appears
Stage   Cluster 1   Cluster 2   Coefficients   Cluster 1   Cluster 2   Next Stage
1       32          33          .000           0           0           9
2       41          42          .000           0           0           6
3       43          44          .000           0           0           6
4       37          38          .000           0           0           5
5       37          39          .001           4           0           7
6       41          43          .002           2           3           27
7       36          37          .003           0           5           27
8       20          22          .007           0           0           11
9       30          32          .012           0           1           13
10      21          26          .012           0           0           14
11      20          25          .031           8           0           12
12      16          20          .055           0           11          14
13      29          30          .065           0           9           26
14      16          21          .085           12          10          20
15      11          18          .093           0           0           22
16      8           9           .143           0           0           25
17      17          24          .144           0           0           20
18      13          23          .167           0           0           22
19      14          15          .232           0           0           32
20      16          17          .239           14          17          23
21      7           12          .279           0           0           28
22      11          13          .441           15          18          29
23      16          27          .451           20          0           26
24      3           10          .572           0           0           28
25      6           8           .702           0           16          36
26      16          29          .768           23          13          35
27      36          41          .858           7           6           33
  • 20.
28      3           7           .904           24          21          31
29      11          28          .993           22          0           30
30      5           11          1.414          0           29          34
31      3           4           1.725          28          0           36
32      14          31          1.928          19          0           34
33      36          40          2.168          27          0           40
34      5           14          2.621          30          32          35
35      5           16          2.886          34          26          37
36      3           6           3.089          31          25          38
37      5           19          4.350          35          0           39
38      1           3           4.763          0           36          41
39      5           34          5.593          37          0           42
40      35          36          8.389          0           33          43
41      1           2           8.961          38          0           42
42      1           5           11.055         41          39          43
43      1           35          17.237         42          40          0

Cluster Membership
Case              5 Clusters   4 Clusters   3 Clusters   2 Clusters
1: Rosalyn        1            1            1            1
2: Lawrence       2            2            1            1
3: Sunila         1            1            1            1
4: Randolph       1            1            1            1
5: Mickey         3            3            2            1
6: Louis          1            1            1            1
7: Tony           1            1            1            1
8: Raul           1            1            1            1
9: Catalina       1            1            1            1
10: Johnson       1            1            1            1
11: Beulah        3            3            2            1
12: Martina       1            1            1            1
13: Marie         3            3            2            1
14: Ernest        3            3            2            1
15: Christopher   3            3            2            1
16: Ernie         3            3            2            1
17: Christa       3            3            2            1
  • 21.
18: Linette       3            3            2            1
19: Bo            3            3            2            1
20: Carla         3            3            2            1
21: Alberto       3            3            2            1
22: Christina     3            3            2            1
23: Jonah         3            3            2            1
24: Tucker        3            3            2            1
25: Shanta        3            3            2            1
26: Melissa       3            3            2            1
27: Jenna         3            3            2            1
28: Johnny        3            3            2            1
29: Cleatus       3            3            2            1
30: Jonas         3            3            2            1
31: Tad           3            3            2            1
32: Amaryllis     3            3            2            1
33: Nathan        3            3            2            1
34: Deanna        3            3            2            1
35: Willy         4            4            3            2
36: Deana         5            4            3            2
37: Dea           5            4            3            2
38: Claude        5            4            3            2
39: Amanda        5            4            3            2
40: Boris         5            4            3            2
41: Garrett       5            4            3            2
42: Stew          5            4            3            2
43: Bree          5            4            3            2
44: Karma         5            4            3            2

Vertical Icicle:
The full vertical icicle cannot be displayed in this document, but its results are described below.

For the two-cluster solution, one cluster consists of ten cases (Boris through Willy, followed by a column with no X's). These were our adjunct (part-time) faculty (excepting one), and the second cluster consists of everybody else.

For the three-cluster solution, the cluster of adjunct faculty remains and the others split into two: Deanna through Mickey were our junior faculty and Lawrence through Rosalyn our senior faculty.

For the four-cluster solution, one case (Lawrence) forms a cluster of his own.
  • 22. Dendrogram

It displays essentially the same information that is found in the agglomeration schedule, but in graphic form.

* * * H I E R A R C H I C A L   C L U S T E R   A N A L Y S I S * * *
Dendrogram using Average Linkage (Between Groups)
[Dendrogram, rescaled distance cluster combine 0-25; the tree is omitted here. Its merge order matches the agglomeration schedule above.]
  • 23. Multiple Regression Analysis

In this analysis we use a data file created by randomly sampling 400 elementary schools from the California Department of Education's API 2000 dataset. This data file contains a measure of school academic performance as well as other attributes of the elementary schools, such as class size, enrolment and poverty.

We now perform a regression analysis using api00 as the outcome variable and the variables acs_k3, meals and full as predictors. These measure the academic performance of the school (api00), the average class size in kindergarten through 3rd grade (acs_k3), the percentage of students receiving free meals (meals), which is an indicator of poverty, and the percentage of teachers who have full teaching credentials (full). We expect that better academic performance will be associated with lower class size, fewer students receiving free meals, and a higher percentage of teachers having full teaching credentials. The output is as follows:

Regression
Notes
Output Created    02-Apr-2013 21:48:19
Input Data        C:\Users\Divij\Desktop\SPSS Data\elemapi.sav
Active Dataset    DataSet5
Filter            <none>
Weight            <none>
Split File        <none>
N of Rows in Working Data File   400
Missing Value Handling: User-defined missing values are treated as missing. Statistics are based on cases with no missing values for any variable used.
Syntax:
regression
  /dependent api00
  /method=enter acs_k3 meals full.
Resources: Processor Time 00:00:00.063; Elapsed Time 00:00:00.026; Memory Required 2284 bytes; Additional Memory Required for Residual Plots 0 bytes
  • 24. Variables Entered/Removed b
Model   Variables Entered                                           Variables Removed   Method
1       pct full credential, avg class size k-3, pct free meals a   .                   Enter
a. All requested variables entered.
b. Dependent Variable: api 2000

Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .821a   .674       .671                64.153
a. Predictors: (Constant), pct full credential, avg class size k-3, pct free meals

ANOVA b
Model          Sum of Squares   df    Mean Square   F         Sig.
1 Regression   2634884.261      3     878294.754    213.407   .000a
  Residual     1271713.209      309   4115.577
  Total        3906597.470      312
a. Predictors: (Constant), pct full credential, avg class size k-3, pct free meals
b. Dependent Variable: api 2000

Coefficients a
                        Unstandardized Coefficients   Standardized Coefficients
Model                   B         Std. Error          Beta     t         Sig.
1 (Constant)            906.739   28.265                       32.080    .000
  avg class size k-3    -2.682    1.394               -.064    -1.924    .055
  pct free meals        -3.702    .154                -.808    -24.038   .000
  pct full credential   .109      .091                .041     1.197     .232
a. Dependent Variable: api 2000
  • 25. Let's test the three predictors on whether they are statistically significant and, if so, the direction of the relationship. The average class size (acs_k3, b = -2.682) is not significant (p = 0.055), but only just so, and the coefficient is negative, which would indicate that larger class sizes are related to lower academic performance, which is what we would expect. Next, the effect of meals (b = -3.702, p = .000) is significant and its coefficient is negative, indicating that the greater the proportion of students receiving free meals, the lower the academic performance. We cannot say that free meals are causing lower academic performance; the meals variable is highly related to income level and functions more as a proxy for poverty. Thus, higher levels of poverty are associated with lower academic performance. Finally, the percentage of teachers with full credentials (full, b = 0.109, p = .232) seems to be unrelated to academic performance. This would seem to indicate that the percentage of teachers with full credentials is not an important factor in predicting academic performance, which is unexpected.

From these results, we would conclude that lower class sizes are related to higher performance, that fewer students receiving free meals is associated with higher performance, and that the percentage of teachers with full credentials was not related to academic performance in the schools. Before we write this up as our finding, we should run checks to make sure we can firmly stand behind these results.

Examining Data

Step 1) To start examining the data, we look at the first 10 data points for the variables included in our regression analysis, paying attention to the number of missing data points.

api00   acs_k3   meals   full
693     16       67      76.00
570     15       92      79.00
546     17       97      68.00
571     20       90      87.00
478     18       89      87.00
858     20       .       100.00
918     19       .       100.00
831     20       .       96.00
860     20       .       100.00
737     21       29      96.00

Number of cases read: 10   Number of cases listed: 10

We see that among the first 10 observations, we have four missing values for meals. Keeping this in mind, we can use the descriptives command with /var=all to get descriptive statistics for all of the variables, paying special attention to the number of valid cases for meals.

Step 2)

Descriptive Statistics
                                        N     Minimum   Maximum   Mean      Std. Deviation
school number                           400   58        6072      2866.81   1543.811
district number                         400   41        796       457.73    184.823
  • 26.
api 2000                                400   369       940       647.62    142.249
api 1999                                400   333       917       610.21    147.136
growth 1999 to 2000                     400   -69       134       37.41     25.247
pct free meals                          315   6         100       71.99     24.386
english language learners               400   0         91        31.45     24.839
year round school                       400   0         1         .23       .421
pct 1st year in school                  399   2         47        18.25     7.485
avg class size k-3                      398   -21       25        18.55     5.005
avg class size 4-6                      397   20        50        29.69     3.841
parent not hsg                          400   0         100       21.25     20.676
parent hsg                              400   0         100       26.02     16.333
parent some college                     400   0         67        19.71     11.337
parent college grad                     400   0         100       19.70     16.471
parent grad school                      400   0         67        8.64      12.131
avg parent ed                           381   1.00      4.62      2.6685    .76379
pct full credential                     400   .42       100.00    66.0568   40.29793
pct emer credential                     400   0         59        12.66     11.746
number of students                      400   130       1570      483.47    226.448
Percentage free meals in 3 categories   400   1         3         2.02      .819
Valid N (listwise)                      295

Examine the output for the variables we used in our regression analysis above, namely api00, acs_k3, meals and full. For api00, the values range from 369 to 940 and there are 400 valid values. For acs_k3, the average class size ranges from -21 to 25 and there are 2 missing values; an average class size of -21 sounds wrong. The variable meals ranges from 6% getting free meals to 100% getting free meals, so these values seem reasonable, but there are only 315 valid values for this variable. The percentage of teachers being fully credentialed ranges from .42 to 100, and all of the values are valid.

This has uncovered a number of peculiarities worthy of further examination. We now obtain a corrected data set from the same source; it has all the data corrected and is free from the shortcomings diagnosed above. We run another multiple regression on the new data set.
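The same screening can be done outside SPSS. A hedged pandas sketch (Python), assuming the pyreadstat package is available so pandas can read the .sav file:

# Sketch: count missing values and flag impossible values, e.g. a negative
# class size, mirroring the checks described above.
import pandas as pd

df = pd.read_spss("elemapi.sav")           # assumes pyreadstat is installed
print(df[["api00", "acs_k3", "meals", "full"]].isna().sum())
print(df.loc[df["acs_k3"] < 0, "acs_k3"])  # negative class sizes are errors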
  • 27. New Multiple Regression Analysis

For this multiple regression example, we will regress the dependent variable, api00, on all of the predictor variables in the data set.

Regression
Notes
Output Created    02-Apr-2013 22:54:47
Input Data        C:\Users\Divij\Desktop\SPSS Data\elemapi2.sav
Active Dataset    DataSet8
Filter            <none>
Weight            <none>
Split File        <none>
N of Rows in Working Data File   400
Missing Value Handling: User-defined missing values are treated as missing. Statistics are based on cases with no missing values for any variable used.
Syntax:
regression
  /dependent api00
  /method=enter ell meals yr_rnd mobility acs_k3 acs_46 full emer enroll.
Resources: Processor Time 00:00:00.031; Elapsed Time 00:00:00.022; Memory Required 4724 bytes; Additional Memory Required for Residual Plots 0 bytes
  • 28. 1 number of students, avg class size 4-6, pct 1st year in school, avg class size k-3, pct emer . Enter credential, english language learners, year round school, pct free meals, pct full a credentiala. All requested variables entered.b. Dependent Variable: api 2000Model Summary Adjusted R Std. Error of theModel R R Square Square Estimate a1 .919 .845 .841 56.768a. Predictors: (Constant), number of students, avg class size 4-6, pct1st year in school, avg class size k-3, pct emer credential, englishlanguage learners, year round school, pct free meals, pct fullcredential bANOVAModel Sum of Squares df Mean Square F Sig. a1 Regression 6740702.006 9 748966.890 232.409 .000 Residual 1240707.781 385 3222.618 Total 7981409.787 394a. Predictors: (Constant), number of students, avg class size 4-6, pct 1st year in school, avgclass size k-3, pct emer credential, english language learners, year round school, pct freemeals, pct full credentialb. Dependent Variable: api 2000 aCoefficients StandardizedModel Unstandardized Coefficients Coefficients t Sig.
  • 29. B Std. Error Beta1 (Constant) 758.942 62.286 12.185 .000 english language learners -.860 .211 -.150 -4.083 .000 pct free meals -2.948 .170 -.661 -17.307 .000 year round school -19.889 9.258 -.059 -2.148 .032 pct 1st year in school -1.301 .436 -.069 -2.983 .003 avg class size k-3 1.319 2.253 .013 .585 .559 avg class size 4-6 2.032 .798 .055 2.546 .011 pct full credential .610 .476 .064 1.281 .201 pct emer credential -.707 .605 -.058 -1.167 .244 number of students -.012 .017 -.019 -.724 .469a. Dependent Variable: api 2000 1) Examining the output from this regression analysis. As with the simple regression, we look to the p-value of the F-test to see if the overall model is significant. With a p-value of zero to three decimal places, the model is statistically significant. The R-squared is 0.845, meaning that approximately 85% of the variability of api00 is accounted for by the variables in the model. In this case, the adjusted R-squared indicates that about 84% of the variability ofapi00 is accounted for by the model, even after taking into account the number of predictor variables in the model. The coefficients for each of the variables indicates the amount of change one could expect in api00 given a one-unit change in the value of that variable, given that all other variables in the model are held constant. For example, consider the variable ell. We would expect a decrease of 0.86 in the api00 score for every one unit increase in ell, assuming that all other variables in the model are held constant. 2) R-Square is the proportion of variance in the dependent variable (api00) which can be predicted from the independent variables (ell, meals, yr_rnd, mobility, acs_k3, acs_46, full, emer and enroll). This value indicates that 84% of the variance in api00 can be predicted from the variables ell, meals,yr_rnd, mobility, acs_k3, acs_46, full, emer and enroll. 3) The beta coefficients are used by some researchers to compare the relative strength of the various predictors within the model. Because the beta coefficients are all measured in standard deviations, instead of the units of the variables, they can be compared to one another. In other words, the beta coefficients are the coefficients that you would obtain if the outcome and predictor variables were all transformed to standard scores, also cal led z- scores, before running the regression. In this example, meals has the largest Beta coefficient, -0.661, and acs_k3 has the smallest Beta, 0.013. Thus, a one standard deviation increase in meals leads to a 0.661 standard deviation decrease in predicted api00, with the other variables held constant. And, a one standard deviation increase in acs_k3, in turn, leads to a 0.013 standard deviation increase api00 with the other variables in the model held constant. 4) The adjusted R-square attempts to yield a more honest value to estimate the R-squared for the population. The value of R-square was .8446, while the value of Adjusted R-square was
  • 30. 5) The F value is the Mean Square Regression (748966.890) divided by the Mean Square Residual (3222.61761), yielding F = 232.41. The p-value associated with this F value is very small (0.0000). These values are used to answer the question "Do the independent variables reliably predict the dependent variable?". The p-value is compared to your alpha level (typically 0.05) and, if smaller, you can conclude "Yes, the independent variables reliably predict the dependent variable".

6) These are the degrees of freedom associated with the sources of variance. The total variance has N-1 degrees of freedom (DF). In this case, there were N = 395 observations, so the DF for total is 394.
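Points 4) to 6) are pure arithmetic on the ANOVA table above; a short sketch (Python, illustrative) reproduces them:

# Sketch: the ANOVA arithmetic described in points 4-6, using the sums of
# squares and degrees of freedom from the ANOVA table above.
ss_regression, df_regression = 6740702.006, 9
ss_residual,   df_residual   = 1240707.781, 385

ms_regression = ss_regression / df_regression   # 748966.89
ms_residual   = ss_residual / df_residual       # 3222.62
f_value = ms_regression / ms_residual           # about 232.41

ss_total = ss_regression + ss_residual          # 7981409.787
r_squared = ss_regression / ss_total            # about 0.8446
n = df_regression + df_residual + 1             # 395 observations
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - df_regression - 1)  # 0.8409
print(f"F = {f_value:.2f}, R^2 = {r_squared:.4f}, adj R^2 = {adj_r_squared:.4f}")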