Spss analysis conjoint_cluster_regression_pca_discriminant

Conjoint Analysis:

Conjoint analysis is a marketing research technique for determining customer preferences. It is
used to analyse how customers value the different attributes of a product (or service), and so
gives insight into the trade-offs they are willing to make among those attributes. Put simply, it
tells us how much each feature of a product is worth to consumers.

The study surveys people with a set of attribute combinations, which the survey-takers rank or
score by preference. The analysis then models customer preferences for the different combinations
of attributes. The attributes are termed factors and their possible values are termed levels.

In the example used here to run conjoint analysis in SPSS, we analyse data on a carpet cleaner,
with Price, Brand, Money-back guarantee, Package design and Seal as the attributes on which
consumers base their preferences. Using two data sets (the plan of product profiles and the
respondents' preferences), we calculate the part-worths and determine the weight that respondents
give to each attribute.

Variable name   Variable label            Value label
package         package design            A*, B*, C*
brand           brand name                K2R, Glory, Bissell
price           price                     $1.19, $1.39, $1.59
seal            Good Housekeeping seal    no, yes
money           money-back guarantee      no, yes

Code to import the data and run the analysis:

* Open the plan file (product profiles) and the preference data.
GET
  FILE='C:\Users\Abhi\Desktop\carpet_plan.sav'.
DATASET NAME DataSet1 WINDOW=FRONT.
GET
  FILE='C:\Users\Abhi\Desktop\carpet_prefs.sav'.
DATASET NAME DataSet2 WINDOW=FRONT.
* Estimate part-worths: package and brand are discrete, the rest are linear.
CONJOINT PLAN='C:\Users\Abhi\Desktop\carpet_plan.sav'
  /DATA='C:\Users\Abhi\Desktop\carpet_prefs.sav'
  /SEQUENCE=PREF1 PREF2 PREF3 PREF4 PREF5 PREF6 PREF7 PREF8 PREF9 PREF10 PREF11
    PREF12 PREF13 PREF14 PREF15 PREF16 PREF17 PREF18 PREF19 PREF20 PREF21 PREF22
  /SUBJECT=ID
  /FACTORS=PACKAGE BRAND (DISCRETE)
    PRICE (LINEAR LESS)
    SEAL (LINEAR MORE) MONEY (LINEAR MORE)
  /PRINT=SUMMARYONLY.

Model Description

           N of Levels   Relation to Ranks or Scores
package    3             Discrete
brand      3             Discrete
price      3             Linear (less)
seal       2             Linear (more)
money      2             Linear (more)

Calculation of the part-worth of each attribute

Utilities

                       Utility Estimate   Std. Error
package      A*        -2.233             .192
             B*         1.867             .192
             C*          .367             .192
brand        K2R         .367             .192
             Glory      -.350             .192
             Bissell    -.017             .192
price        $1.19     -6.595             .988
             $1.39     -7.703            1.154
             $1.59     -8.811            1.320
seal         no         2.000             .287
             yes        4.000             .575
money        no         1.250             .287
             yes        2.500             .575
(Constant)             12.870            1.282

This table shows the utility (part-worth) scores and their standard errors for each factor level. Higher
utility values indicate greater preference. For the discrete factors (package and brand), the part-worths
of the levels sum to zero, apart from rounding. Among the brands, K2R is therefore preferred to Glory
and Bissell. As expected, there is an inverse relationship between price and utility, with higher prices
corresponding to lower utility. The presence of a seal of approval or a money-back guarantee
corresponds to a higher utility.

The total utility of a combination is the sum of the part-worths of its levels plus the constant. For
example, if the cleaner had package design C*, brand Bissell, price $1.59, a seal of approval, and a
money-back guarantee, the total utility would be:

0.367 + (−0.017) + (−8.811) + 4.000 + 2.500 + 12.870 = 10.909.
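The same sum can be reproduced outside SPSS. The following is a minimal Python sketch, not SPSS output; the dictionary layout and the function name total_utility are our own illustrative choices, while the numbers are copied from the Utilities table.

```python
# Part-worths copied from the Utilities table; the layout is illustrative.
UTILITIES = {
    "package": {"A*": -2.233, "B*": 1.867, "C*": 0.367},
    "brand": {"K2R": 0.367, "Glory": -0.350, "Bissell": -0.017},
    "price": {"$1.19": -6.595, "$1.39": -7.703, "$1.59": -8.811},
    "seal": {"no": 2.000, "yes": 4.000},
    "money": {"no": 1.250, "yes": 2.500},
}
CONSTANT = 12.870


def total_utility(profile):
    """Add the part-worth of each chosen level to the constant."""
    return CONSTANT + sum(UTILITIES[factor][level] for factor, level in profile.items())


# The profile discussed in the text.
profile = {"package": "C*", "brand": "Bissell", "price": "$1.59",
           "seal": "yes", "money": "yes"}
print(round(total_utility(profile), 3))
```

Any other combination of levels can be scored by changing the entries of profile.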

Importance:

Importance Values

package    35.635
brand      14.911
price      29.410
seal       11.172
money       8.872

We can see that package has the highest importance, followed by price. The money-back guarantee is
of least concern to consumers. The values are computed by taking the utility range for each factor
separately and dividing it by the sum of the utility ranges for all factors. The values thus represent
percentages and have the property that they sum to 100.
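The range-based formula can be sketched in Python as follows. This is an illustration under assumptions: it applies the formula to the summary utilities from the Utilities table, whereas SPSS appears to compute importance for each respondent and then average, so the percentages printed here are not expected to match the Importance Values table exactly. The function name importance is ours.

```python
# Summary part-worths copied from the Utilities table.
UTILITIES = {
    "package": {"A*": -2.233, "B*": 1.867, "C*": 0.367},
    "brand": {"K2R": 0.367, "Glory": -0.350, "Bissell": -0.017},
    "price": {"$1.19": -6.595, "$1.39": -7.703, "$1.59": -8.811},
    "seal": {"no": 2.000, "yes": 4.000},
    "money": {"no": 1.250, "yes": 2.500},
}


def importance(utilities):
    """Utility range of each factor as a percentage of the sum of all ranges."""
    ranges = {f: max(levels.values()) - min(levels.values())
              for f, levels in utilities.items()}
    total = sum(ranges.values())
    return {f: 100.0 * r / total for f, r in ranges.items()}


scores = importance(UTILITIES)
for factor, pct in scores.items():
    print(f"{factor:8s} {pct:6.2f}")
```

Even on the summary utilities, package comes out as the most important factor, which agrees with the ordering in the table.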

Coefficients

          B Coefficient Estimate
price     -5.542
seal       2.000
money      1.250

The utility for a particular factor level is determined by multiplying the level by the coefficient. For
example, the predicted utility for a price of $1.19 was listed as −6.595 in the utilities table. This is
simply the value of the price level, 1.19, multiplied by the price coefficient, −5.542.
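As a quick check of this rule, the sketch below multiplies each level by its coefficient in Python. The function name linear_utility is ours, and the coding of seal and money as 1 (no) and 2 (yes) is inferred from the Utilities table rather than stated in the output. Because the printed coefficient is rounded, the last digit can differ slightly from the table.

```python
def linear_utility(level_value, coefficient):
    """Utility of a level of a linear factor: level value times B coefficient."""
    return level_value * coefficient


PRICE_COEF = -5.542   # from the Coefficients table
SEAL_COEF = 2.000
MONEY_COEF = 1.250

for price in (1.19, 1.39, 1.59):
    print(f"price ${price:.2f}: {linear_utility(price, PRICE_COEF):.3f}")

# Assumed coding: no = 1, yes = 2 (inferred from the utilities 2.000 and 4.000).
for label, code in (("no", 1), ("yes", 2)):
    print(f"seal {label}: {linear_utility(code, SEAL_COEF):.3f}",
          f"money {label}: {linear_utility(code, MONEY_COEF):.3f}")
```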

This table provides measures of the correlation between the observed and estimated preferences.

Preference Scores of Simulations

Card Number   ID   Score
1             1    10.258
2             2    14.292

The real power of conjoint analysis is the ability to predict preference for product profiles that
weren't rated by the subjects. These are referred to as simulation cases.

Preference Probabilities of Simulations

Card Number   ID   Maximum Utility   Bradley-Terry-Luce   Logit
1             1    30.0%             43.1%                30.9%
2             2    70.0%             56.9%                69.1%

The maximum utility model determines the probability as the number
of respondents predicted to choose the profile divided by the total
number of respondents. For each respondent, the predicted choice is
simply the profile with the largest total utility.
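The maximum utility rule can be sketched in Python as below. The per-respondent utilities are hypothetical numbers invented for illustration (the SPSS output does not list them); they are chosen so that 3 of 10 respondents prefer the first profile, which reproduces the 30% / 70% split in the table. Ties are broken in favour of the first profile, which is our assumption.

```python
def max_utility_probabilities(utilities_by_respondent):
    """Share of respondents whose highest total utility falls on each profile."""
    n_profiles = len(utilities_by_respondent[0])
    wins = [0] * n_profiles
    for utilities in utilities_by_respondent:
        # Index of the profile with the largest total utility (first one on ties).
        best = max(range(n_profiles), key=lambda i: utilities[i])
        wins[best] += 1
    n_respondents = len(utilities_by_respondent)
    return [w / n_respondents for w in wins]


# Hypothetical total utilities: one row per respondent, one column per profile.
hypothetical = [
    [12.0, 10.5], [11.2, 9.8], [13.1, 12.9],                  # prefer profile 1
    [9.0, 14.0], [10.1, 15.2], [8.7, 13.3], [11.0, 16.4],     # prefer profile 2
    [9.9, 12.2], [10.4, 14.9], [7.5, 15.0],
]
probabilities = max_utility_probabilities(hypothetical)
print(probabilities)
```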

Number of Reversals

Factor    price         3
          money         2
          seal          2
          brand         0
          package       0
Subject   1    Subject 1     1
          2    Subject 2     2
          3    Subject 3     0
          4    Subject 4     0
          5    Subject 5     0
          6    Subject 6     1
          7    Subject 7     0
          8    Subject 8     0
          9    Subject 9     1
          10   Subject 10    2

This table displays the number of reversals for each factor and for each subject. For example, three
subjects showed a reversal for price. That is, they preferred product profiles with higher prices.

Reversal Summary

N of Reversals   N of Subjects
1                3
2                2

Q. Perform Discriminant Analysis on the given dataset.
The chosen dataset describes people who were given bank loans: whether or not they defaulted, together with various characteristics of each person.

Discriminant

Notes
Output Created                          04-Apr-2013 18:39:05
Comments
Input                    Data           E:VGSOMSTUDYSECOND SEMBRMSPSS16Samplesbankloan.sav
                         Active Dataset DataSet1
                         File Label     Bank Loan Default
                         Filter         <none>
                         Weight         <none>
                         Split File     <none>
                         N of Rows in Working Data File   850
Missing Value Handling   Definition of Missing   User-defined missing values are treated as
                                                 missing in the analysis phase.
                         Cases Used     In the analysis phase, cases with no user- or
                                        system-missing values for any predictor variable are
                                        used. Cases with user-, system-missing, or
                                        out-of-range values for the grouping variable are
                                        always excluded.
Syntax
GET
  FILE='E:VGSOMSTUDYSECOND SEMBRMSPSS16Samplesbankloan.sav'.
DATASET NAME DataSet1 WINDOW=FRONT.
DISCRIMINANT
  /GROUPS=default(0 1)
  /VARIABLES=employ address age
  /ANALYSIS ALL
  /PRIORS EQUAL
  /STATISTICS=MEAN STDDEV UNIVF BOXM COEFF CORR TABLE
  /PLOT=COMBINED
  /PLOT=CASES
  /CLASSIFY=NONMISSING POOLED MEANSUB.

Resources                Processor Time 00:00:00.047
                         Elapsed Time   00:00:00.121

[DataSet1] E:VGSOMSTUDYSECOND SEMBRMSPSS16Samplesbankloan.sav

Warnings
All-Groups Stacked Histogram is no longer displayed.

Analysis Case Processing Summary
Unweighted Cases                                                N      Percent
Valid                                                           700    82.4
Excluded   Missing or out-of-range group codes                  150    17.6
           At least one missing discriminating variable         0      .0
           Both missing or out-of-range group codes and
           at least one missing discriminating variable         0      .0
           Total                                                150    17.6
Total                                                           850    100.0

Group Statistics
                                                                   Valid N (listwise)
Previously defaulted                    Mean     Std. Deviation    Unweighted   Weighted
No      Years with current employer     9.51     6.664             517          517.000
        Years at current address        8.95     7.001             517          517.000
        Age in years                    35.51    7.708             517          517.000
Yes     Years with current employer     5.22     5.543             183          183.000
        Age in years                    33.01    8.518             183          183.000
Total   Years with current employer     8.39     6.658             700          700.000
        Age in years                    34.86    7.997             700          700.000

Tests of Equality of Group Means
                               Wilks' Lambda   F        df1   df2   Sig.
Years with current employer    .920            60.759   1     698   .000
Years at current address       .973            19.402   1     698   .000
Age in years                   .981            13.482   1     698   .000

Pooled Within-Groups Matrices
                                             Years with         Years at
                                             current employer   current address   Age in years
Correlation   Years with current employer    1.000              .292              .524
              Years at current address       .292               1.000             .588
              Age in years                   .524               .588              1.000

This matrix shows the correlations between the predictors. The largest correlation (.588) is
between years at current address and age in years.

Analysis 1
Box's Test of Equality of Covariance Matrices

Log Determinants
Previously defaulted    Rank   Log Determinant
No                      3      11.012
Yes                     3      10.501
Pooled within-groups    3      10.919
The ranks and natural logarithms of determinants printed are those of the group covariance
matrices.

Test Results
Box's M          28.171
F    Approx.     4.665
     df1         6
     df2         7.335E5
     Sig.        .000
Tests null hypothesis of equal population covariance matrices.

Summary of Canonical Discriminant Functions

Eigenvalues
Function   Eigenvalue   % of Variance   Cumulative %   Canonical Correlation
1          .100 (a)     100.0           100.0          .301
a. First 1 canonical discriminant functions were used in the analysis.

Wilks' Lambda
Test of Function(s)   Wilks' Lambda   Chi-square   df   Sig.
1                     .909            66.251       3    .000

Standardized Canonical Discriminant Function Coefficients
                               Function 1
Years with current employer    .980
Years at current address       .436
Age in years                   -.330

Structure Matrix
                               Function 1
Years with current employer    .934
Years at current address       .528
Age in years                   .440
Pooled within-groups correlations between discriminating variables and standardized canonical
discriminant functions. Variables ordered by absolute size of correlation within function.

Functions at Group Centroids
Previously defaulted   Function 1
No                     .188
Yes                    -.530
Unstandardized canonical discriminant functions evaluated at group means

Classification Statistics

Classification Processing Summary
Processed                                                      850
Excluded   Missing or out-of-range group codes                 0
           At least one missing discriminating variable        0
Used in Output                                                 850

Prior Probabilities for Groups
                                 Cases Used in Analysis
Previously defaulted   Prior     Unweighted   Weighted
No                     .500      517          517.000
Yes                    .500      183          183.000
Total                  1.000     700          700.000

Classification Function Coefficients
                               Previously defaulted
                               No         Yes
Years with current employer    -.192      -.302
Years at current address       -.302      -.348
Age in years                   .797       .827
(Constant)                     -12.588    -12.444
Fisher's linear discriminant functions

Classification Results (a)

                                      Predicted Group Membership
           Previously defaulted       No       Yes      Total
Original   Count   No                 300      217      517
                   Yes                44       139      183
                   Ungrouped cases    81       69       150
           %       No                 58.0     42.0     100.0
                   Yes                24.0     76.0     100.0
                   Ungrouped cases    54.0     46.0     100.0
a. 62.7% of original grouped cases correctly classified.

The discriminant analysis shows that persons who have previously defaulted are mostly predicted to
default this time as well, while those who have not defaulted earlier are mostly predicted not to
default. This follows from the counts: among previous defaulters, correctly classified cases
outnumber misclassified ones (139 > 44), and the same holds for non-defaulters (300 > 217).
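The percentages in the table, including the 62.7% in the footnote, follow directly from the counts. A minimal Python sketch of that arithmetic is given below; the dictionary layout and variable names are our own, and only the grouped cases (not the 150 ungrouped ones) enter the overall rate.

```python
# Counts copied from the Classification Results table:
# outer key = actual group, inner key = predicted group.
confusion = {
    "No": {"No": 300, "Yes": 217},
    "Yes": {"No": 44, "Yes": 139},
}

correct = sum(confusion[group][group] for group in confusion)
total = sum(sum(row.values()) for row in confusion.values())
overall_pct = 100.0 * correct / total

# Share of each actual group that was classified correctly.
per_group_pct = {
    group: 100.0 * row[group] / sum(row.values())
    for group, row in confusion.items()
}

print(f"overall: {overall_pct:.1f}%")
for group, pct in per_group_pct.items():
    print(f"previously defaulted = {group}: {pct:.1f}% correct")
```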

Q. Perform Factor Analysis on the given dataset.

The chosen dataset is a fictional statistics-anxiety questionnaire. It contains students' responses
about their ease of use, liking and usage of SPSS in statistics.

Based on the scree plot, I retained 5 factors.

Because a student is likely to give related answers across items, I considered the variables to be
inter-related and therefore used Oblimin rotation. For example, a student who gives a high score
for “I have little experience of computers” is likely to give a high score for “All computers
hate me” as well, since the two variables are somewhat correlated.

Using these options in SPSS, the following Pattern Matrix was generated.
Pattern Matrix (a)

Loadings on components 1 to 5 (blank cells omitted)

No.   Loading(s)         Item
1     .903               I have little experience of computers
2     .732               SPSS always crashes when I try to use it
3     .684               All computers hate me
4     .662               I worry that I will cause irreparable damage because of my incompetence with computers
5     .581               Computers have minds of their own and deliberately go wrong whenever I use them
6     .446               People try to tell you that SPSS makes statistics easier to understand but it doesn't
7     .333               Computers are out to get me
8     .661               My friends are better at SPSS than I am
9     .655               My friends are better at statistics than me
10    .622               If I'm good at statistics my friends will think I'm a nerd
11    .504 .330          My friends will think I'm stupid for not being able to cope with SPSS
12    .358 .358          Everybody looks at me when I use SPSS
13    -.728              I can't sleep for thoughts of eigen vectors
14    .324 -.543         I wake up under my duvet thinking that I am trapped under a normal distribution
15    .359 .393 -.366    Computers are useful only for playing games
16    .301 .356 .315     Standard deviations excite me
17    -.855              I have never been good at mathematics
18    -.736              I did badly at mathematics at school
19    -.722              I slip into a coma whenever I see an equation
20    -.772              Statistics makes me cry
21    -.730              I don't understand statistics
22    -.664              I weep openly at the mention of central tendency
23    -.564              I dream that Pearson is attacking me with correlation coefficients

Extraction Method: Principal Component Analysis.
Rotation Method: Oblimin with Kaiser Normalization.

a. Rotation converged in 15 iterations.

The total variance explained by each factor is given below.

Total Variance Explained

            Rotation Sums of Squared Loadings
Component   Total
1           5.522
2           2.452
3           2.383
4           3.535
5           4.913

Extraction Method: Principal Component Analysis.

The percentage of variance for a factor is calculated by dividing its sum of squared loadings by
the number of variables and multiplying by 100.
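Applied to the table above, the calculation can be sketched in Python as follows. The count of 23 variables is taken from the number of items in the Pattern Matrix, and the names are our own. One caveat: with an oblique rotation such as Oblimin the components are correlated, so these percentages overlap and should not be added up to a total.

```python
# Rotation sums of squared loadings, copied from the table above.
SUMS_OF_SQUARED_LOADINGS = {1: 5.522, 2: 2.452, 3: 2.383, 4: 3.535, 5: 4.913}
N_VARIABLES = 23  # number of questionnaire items listed in the Pattern Matrix


def percent_of_variance(ssl, n_variables):
    """Sum of squared loadings divided by the number of variables, times 100."""
    return 100.0 * ssl / n_variables


percentages = {
    component: round(percent_of_variance(ssl, N_VARIABLES), 2)
    for component, ssl in SUMS_OF_SQUARED_LOADINGS.items()
}
for component, pct in percentages.items():
    print(f"component {component}: {pct:.2f}%")
```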

Hence, based on the loading values, the variables are assigned to factors as follows.

Factor   Variable Nos.
1        1, 2, 3, 4, 5, 6, 7, 14
2        8, 9, 10
3        13
4        17, 18, 19
5        20, 21, 22, 23

Variables 11, 12, 15 and 16 have similar loadings on more than one factor. This is undesirable,
because each of these variables is assessing more than one construct. Variable 12, for example, has
exactly the same loading (.358) on two components. These variables are said to have split loadings.

They are therefore listed separately.

Factor   Variable Nos.
2        11, 16, 15
3        12, 15

Because split loadings are present, this is not a simple structure.

The percentages below apply the formula given earlier (sum of squared loadings divided by the 23
variables, times 100). Because the Oblimin components are correlated, the percentages overlap and
should not be added together.

Factor 1: Anxiety about the usage of computers accounts for 24.01% of the total variance and loads
on 8 of the variables.

Factor 2: Students' view of their understanding of statistics and SPSS relative to their peers
accounts for 10.66% of the total variance and loads on 3 variables. Variables 11, 16 and 15 also
split-load on it.

Factor 3: Anxiety about eigenvectors accounts for only 10.36% of the total variance and loads
directly on only 1 variable, while variables 12 and 15 split-load on it.

Factor 4: Students' interest in mathematics accounts for 15.37% of the total variance and loads on
3 variables.

Factor 5: Dislike of statistics accounts for 21.36% of the total variance and loads on 4 variables.

CLUSTER ANALYSIS

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same
group (called cluster) are more similar (in some sense or another) to each other than to those in other groups
(clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis
used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and
bioinformatics.

Proximities

Notes

Output Created                          02-Apr-2013 22:00:05
Comments
Input                    Data           C:Users dev
                                        maletiaDownloadsClusterAnonFaculty.sav
                         Active Dataset DataSet3
                         Filter         <none>
                         Weight         <none>
                         Split File     <none>
                         N of Rows in Working Data File   44
Missing Value Handling   Definition of Missing   User-defined missing values are treated as missing.
                         Cases Used     Statistics are based on cases with no missing values
                                        for any variable used.
Syntax
PROXIMITIES Salary FTE Rank Articles Experience
  /MATRIX OUT('C:UsersDEVMAL~1AppDataLocalTempspss6496spssclus.tmp')
  /VIEW=CASE
  /MEASURE=SEUCLID
  /PRINT NONE
  /ID=Name
  /STANDARDIZE=VARIABLE Z.

Resources                Processor Time    00:00:00.078
                         Elapsed Time      00:00:00.082
                         Workspace Bytes   11152
Files Saved              Matrix File       C:Users DEVMAL~1AppDataLocalTempspss6496spssclus.tmp

The variables in the dataset are as follows:
• Name -- Although faculty salaries are public information under North Carolina state law
• Salary – annual salary in dollars, from the university report available in One Stop.
• FTE – full-time-equivalent work load for the faculty member.
• Rank – where 1 = adjunct, 2 = visiting, 3 = assistant, 4 = associate, 5 = professor.
• Articles – number of published scholarly articles, excluding things like comments in newsletters,
abstracts in proceedings, and the like.
• Experience – number of years working as a full-time faculty member in a Department of Psychology.
• ArticlesAPD – number of published articles as listed in the university’s Academic Publications
• Sex – biological sex from physical appearance.

In the first step SPSS computes, for each pair of cases, the squared Euclidean distance between the
cases. This is simply the sum across variables (from i = 1 to v) of the squared difference between
the score on variable i for the one case (Xi) and the score on variable i for the other case (Yi),
that is, the sum of (Xi − Yi)². The two cases separated by the smallest squared Euclidean distance
are identified and classified together into the first cluster. At this point there is one cluster
with two cases in it.
Next SPSS re-computes the squared Euclidean distances between each entity (case or cluster) and each
other entity. When one or both of the compared entities is a cluster, SPSS computes the average
squared Euclidean distance between members of the one entity and members of the other entity. The
two entities with the smallest squared Euclidean distance are classified together. SPSS then
re-computes the distances between each entity and each other entity, and again the two with the
smallest distance are classified together. This continues until all of the cases have been clustered
into one big cluster.
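The procedure described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the SPSS implementation: the five two-variable points are hypothetical, the function names are ours, and the z-score standardisation that the PROXIMITIES syntax requests is skipped for brevity.

```python
from itertools import combinations


def squared_euclidean(x, y):
    """Sum over variables of the squared difference between two cases."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def average_linkage(points):
    """Merge the two closest entities until one cluster remains.

    Returns a list of (members_a, members_b, distance), one entry per stage.
    """
    clusters = [[i] for i in range(len(points))]
    history = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Average squared distance over all pairs of members.
            pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
            d = sum(squared_euclidean(points[i], points[j]) for i, j in pairs) / len(pairs)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        history.append((sorted(clusters[a]), sorted(clusters[b]), d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return history


# Hypothetical cases measured on two variables.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
history = average_linkage(points)
for stage, (left, right, distance) in enumerate(history, start=1):
    print(stage, left, right, distance)
```

Each printed line plays the role of one row of an agglomeration schedule: the two entities combined at that stage and the coefficient (distance) at which they were joined.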

The output obtained can be seen below:

Case Processing Summary (a)

                         Cases
Valid              Missing            Total
N      Percent     N      Percent     N      Percent
44     100.0%      0      .0%         44     100.0%

a. Squared Euclidean Distance used

At the first stage SPSS clustered case 32 with case 33. The squared Euclidean distance between these
two cases is 0.000. At stages 2-4 SPSS creates three more clusters, each containing two cases. At
stage 5 SPSS adds case 39 to the cluster that already contains cases 37 and 38. By the 43rd stage
all cases have been clustered into one entity.
The results can be seen below:

Average Linkage (Between Groups)

Agglomeration Schedule

         Cluster Combined                        Stage Cluster First Appears
Stage    Cluster 1   Cluster 2   Coefficients    Cluster 1   Cluster 2   Next Stage
1        32          33          .000            0           0           9
2        41          42          .000            0           0           6
3        43          44          .000            0           0           6
4        37          38          .000            0           0           5
5        37          39          .001            4           0           7
6        41          43          .002            2           3           27
7        36          37          .003            0           5           27
8        20          22          .007            0           0           11
9        30          32          .012            0           1           13
10       21          26          .012            0           0           14
11       20          25          .031            8           0           12
12       16          20          .055            0           11          14
13       29          30          .065            0           9           26
14       16          21          .085            12          10          20
15       11          18          .093            0           0           22
16       8           9           .143            0           0           25
17       17          24          .144            0           0           20
18       13          23          .167            0           0           22
19       14          15          .232            0           0           32
20       16          17          .239            14          17          23
21       7           12          .279            0           0           28
22       11          13          .441            15          18          29
23       16          27          .451            20          0           26
24       3           10          .572            0           0           28
25       6           8           .702            0           16          36
26       16          29          .768            23          13          35
27       36          41          .858            7           6           33
28       3           7           .904            24          21          31
29       11          28          .993            22          0           30
30       5           11          1.414           0           29          34
31       3           4           1.725           28          0           36
32       14          31          1.928           19          0           34
33       36          40          2.168           27          0           40
34       5           14          2.621           30          32          35
35       5           16          2.886           34          26          37
36       3           6           3.089           31          25          38
37       5           19          4.350           35          0           39
38       1           3           4.763           0           36          41
39       5           34          5.593           37          0           42
40       35          36          8.389           0           33          43
41       1           2           8.961           38          0           42
42       1           5           11.055          41          39          43
43       1           35          17.237          42          40          0

Cluster Membership

Case             5 Clusters   4 Clusters   3 Clusters   2 Clusters
1:Rosalyn        1            1            1            1
2:Lawrence       2            2            1            1
3:Sunila         1            1            1            1
4:Randolph       1            1            1            1
5:Mickey         3            3            2            1
6:Louis          1            1            1            1
7:Tony           1            1            1            1
8:Raul           1            1            1            1
9:Catalina       1            1            1            1
10:Johnson       1            1            1            1
11:Beulah        3            3            2            1
12:Martina       1            1            1            1
13:Marie         3            3            2            1
14:Ernest        3            3            2            1
15:Christopher   3            3            2            1
16:Ernie         3            3            2            1
17:Christa       3            3            2            1
18:Linette       3            3            2            1
19:Bo            3            3            2            1
20:Carla         3            3            2            1
21:Alberto       3            3            2            1
22:Christina     3            3            2            1
23:Jonah         3            3            2            1
24:Tucker        3            3            2            1
25:Shanta        3            3            2            1
26:Melissa       3            3            2            1
27:Jenna         3            3            2            1
28:Johnny        3            3            2            1
29:Cleatus       3            3            2            1
30:Jonas         3            3            2            1
31:Tad           3            3            2            1
32:Amaryllis     3            3            2            1
33:Nathan        3            3            2            1
34:Deanna        3            3            2            1
35:Willy         4            4            3            2
36:Deana         5            4            3            2
37:Dea           5            4            3            2
38:Claude        5            4            3            2
39:Amanda        5            4            3            2
40:Boris         5            4            3            2
41:Garrett       5            4            3            2
42:Stew          5            4            3            2
43:Bree          5            4            3            2
44:Karma         5            4            3            2

Vertical Icicle:
The full vertical icicle plot cannot be displayed in this document, but its results are described
below.
For the two-cluster solution, one cluster consists of ten cases (Boris through Willy, followed by a
column with no X’s). These were our adjunct (part-time) faculty, with one exception, and the second
cluster consists of everybody else.
For the three-cluster solution, the cluster of adjunct faculty remains and the others split into
two: Deanna through Mickey were our junior faculty and Lawrence through Rosalyn our senior faculty.
For the four-cluster solution, one case (Lawrence) forms a cluster of his own.

Dendrogram
It displays essentially the same information that is found in the agglomeration schedule but in graphic form.

* * * * * * * * * * * H I E R A R C H I C A L C L U S T E R A N A L Y S I S * * * * * * * * * *

Dendrogram using Average Linkage (Between Groups)

Rescaled Distance Cluster Combine

C A S E 0 5 10 15 20 25
Label Num +---------+---------+---------+---------+---------+

Amaryllis 32 ─┐
Nathan 33 ─┤
Jonas 30 ─┼─┐
Cleatus 29 ─┘ │
Alberto 21 ─┐ │
Melissa 26 ─┤ │
Carla 20 ─┤ ├─────┐
Christina 22 ─┤ │ │
Shanta 25 ─┤ │ │
Ernie 16 ─┤ │ │
Christa 17 ─┼─┘ │
Tucker 24 ─┤ │
Jenna 27 ─┘ ├───┐
Beulah 11 ─┐ │ │
Linette 18 ─┼─┐ │ │
Marie 13 ─┤ ├─┐ │ │
Jonah 23 ─┘ │ ├─┐ │ │
Johnny 28 ───┘ │ │ │ ├───┐
Mickey 5 ─────┘ ├─┘ │ │
Ernest 14 ─┬───┐ │ │ │
Christopher 15 ─┘ ├─┘ │ ├───────────────┐
Tad 31 ─────┘ │ │ │
Bo 19 ─────────────┘ │ │
Deanna 34 ─────────────────┘ │
Raul 8 ─┬─┐ │
Catalina 9 ─┘ ├─────┐ ├───────────────┐
Louis 6 ───┘ │ │ │
Tony 7 ─┬─┐ ├───┐ │ │
Martina 12 ─┘ ├─┐ │ │ │ │
Sunila 3 ─┬─┘ ├───┘ ├───────────┐ │ │
Johnson 10 ─┘ │ │ │ │ │
Randolph 4 ─────┘ │ ├───────┘ │
Rosalyn 1 ─────────────┘ │ │
Lawrence 2 ─────────────────────────┘ │
Garrett 41 ─┐ │
Stew 42 ─┼─┐ │
Bree 43 ─┤ │ │
Karma 44 ─┘ ├───┐ │
Dea 37 ─┐ │ │ │
Claude 38 ─┤ │ ├─────────────────┐ │
Amanda 39 ─┼─┘ │ │ │
Deana 36 ─┘ │ ├───────────────────────┘
Boris 40 ───────┘ │

Willy 35 ─────────────────────────┘

Multiple Regression Analysis
In this analysis we use a data file that was created by randomly sampling 400 elementary
schools from the California Department of Education's API 2000 dataset. The data file contains a
measure of school academic performance as well as other attributes of the elementary schools, such
as class size, enrolment and poverty.

We first perform a regression analysis using api00 as the outcome variable and the
variables acs_k3, meals and full as predictors. These measure the academic performance of the
school (api00), the average class size in kindergarten through 3rd grade (acs_k3), the percentage of
students receiving free meals (meals), which is an indicator of poverty, and the percentage of
teachers who have full teaching credentials (full). We expect that better academic performance would
be associated with lower class size, fewer students receiving free meals, and a higher percentage of
teachers having full teaching credentials. The output is as follows:

Regression

Notes

Comments
Input                    Data           C:UsersDivijDesktopSPSS Dataelemapi.sav
                         Active Dataset DataSet5
                         Filter         <none>
                         Weight         <none>
                         Split File     <none>
                         N of Rows in Working Data File   400
Missing Value Handling   Definition of Missing   User-defined missing values are treated as missing.
                         Cases Used     Statistics are based on cases with no missing
                                        values for any variable used.
Syntax
regression
  /dependent api00
  /method=enter acs_k3 meals full.

Resources                Memory Required                                  2284 bytes
                         Additional Memory Required for Residual Plots    0 bytes

Variables Entered/Removed (b)

Model   Variables Entered                                              Variables Removed   Method
1       pct full credential, avg class size k-3, pct free meals (a)    .                   Enter

a. All requested variables entered.
b. Dependent Variable: api 2000

Model Summary

Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       .821 (a)   .674       .671                64.153

a. Predictors: (Constant), pct full credential, avg class size k-3, pct free meals

ANOVA

Model            Sum of Squares   df    Mean Square   F         Sig.
1   Regression   2634884.261      3     878294.754    213.407   .000 (a)
    Residual     1271713.209      309   4115.577
    Total        3906597.470      312

a. Predictors: (Constant), pct full credential, avg class size k-3, pct free meals

Coefficients (a)

                             Unstandardized Coefficients   Standardized Coefficients
Model                        B         Std. Error          Beta                        t         Sig.
1   (Constant)               906.739   28.265                                          32.080    .000
    avg class size k-3       -2.682    1.394               -.064                       -1.924    .055
    pct free meals           -3.702    .154                -.808                       -24.038   .000
    pct full credential      .109      .091                .041                        1.197     .232

a. Dependent Variable: api 2000

Let's check whether each of the three predictors is statistically significant and, if so, the
direction of the relationship. The average class size (acs_k3, b=-2.682) is not significant
(p=0.055), but only just, and the coefficient is negative, which would indicate that larger class
sizes are related to lower academic performance, as we would expect. Next, the effect of meals
(b=-3.702, p=.000) is significant and its coefficient is negative, indicating that the greater the
proportion of students receiving free meals, the lower the academic performance. We cannot say that
free meals are causing lower academic performance. The meals variable is highly related to income
level and functions more as a proxy for poverty. Thus, higher levels of poverty are associated with
lower academic performance. Finally, the percentage of teachers with full credentials (full,
b=0.109, p=.232) seems to be unrelated to academic performance. This would seem to indicate that the
percentage of teachers with full credentials is not an important factor in predicting academic
performance, which is unexpected.

From these results, we would tentatively conclude that lower class sizes are related to higher
performance (although this effect falls just short of significance at the .05 level), that fewer
students receiving free meals is associated with higher performance, and that the percentage of
teachers with full credentials was not related to academic performance in the schools. Before we
write this up as our finding, we should do checks to make sure we can firmly stand behind these
results.
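To make the coefficients concrete, the fitted equation can be evaluated for a single school. The Python sketch below is illustrative only: the function name is ours, the coefficients are copied from the Coefficients table, and the example values (class size 16, 67% free meals, 76% full credentials) are those of the first school in the data listing, whose observed api00 is 693.

```python
# Unstandardized coefficients copied from the Coefficients table.
INTERCEPT = 906.739
B_ACS_K3 = -2.682   # avg class size k-3
B_MEALS = -3.702    # pct free meals
B_FULL = 0.109      # pct full credential


def predict_api00(acs_k3, meals, full):
    """Predicted api00 from the fitted three-predictor equation."""
    return INTERCEPT + B_ACS_K3 * acs_k3 + B_MEALS * meals + B_FULL * full


predicted = predict_api00(acs_k3=16, meals=67, full=76)
residual = 693 - predicted  # observed minus predicted
print(round(predicted, 3), round(residual, 3))
```

Raising meals by one point while holding the other two predictors fixed lowers the prediction by 3.702, which is exactly what the coefficient states.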

Examining Data

Step 1)

To start examining the data, we look at the first 10 data points for the variables included in our
regression analysis, paying particular attention to the number of missing data points.

api00 acs_k3 meals full

693 16 67 76.00
570 15 92 79.00
546 17 97 68.00
571 20 90 87.00
478 18 89 87.00
858 20 . 100.00
918 19 . 100.00
831 20 . 96.00
860 20 . 100.00
737 21 29 96.00

Number of cases read: 10 Number of cases listed: 10

We see that among the first 10 observations, we have four missing values for meals. Keeping this in
mind, we can use the descriptives command with /var=all to get descriptive statistics for all of the
variables, and pay special attention to the number of valid cases for meals.
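The count of missing values in the listing can be reproduced with a short Python sketch. The rows are the ten cases listed above, with the "." entries written as None; the variable names and layout are our own.

```python
COLUMNS = ("api00", "acs_k3", "meals", "full")

# The first ten cases from the listing; "." (missing) is represented as None.
rows = [
    (693, 16, 67, 76.0),
    (570, 15, 92, 79.0),
    (546, 17, 97, 68.0),
    (571, 20, 90, 87.0),
    (478, 18, 89, 87.0),
    (858, 20, None, 100.0),
    (918, 19, None, 100.0),
    (831, 20, None, 96.0),
    (860, 20, None, 100.0),
    (737, 21, 29, 96.0),
]

# Count the missing entries in each column.
missing = {
    name: sum(1 for row in rows if row[index] is None)
    for index, name in enumerate(COLUMNS)
}
print(missing)
```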

Step 2)

Descriptive Statistics

                                        N     Minimum   Maximum   Mean      Std. Deviation
school number                           400   58        6072      2866.81   1543.811
district number                         400   41        796       457.73    184.823
api 2000                                400   369       940       647.62    142.249
api 1999                                400   333       917       610.21    147.136
growth 1999 to 2000                     400   -69       134       37.41     25.247
pct free meals                          315   6         100       71.99     24.386
english language learners               400   0         91        31.45     24.839
year round school                       400   0         1         .23       .421
pct 1st year in school                  399   2         47        18.25     7.485
avg class size k-3                      398   -21       25        18.55     5.005
avg class size 4-6                      397   20        50        29.69     3.841
parent not hsg                          400   0         100       21.25     20.676
parent hsg                              400   0         100       26.02     16.333
parent some college                     400   0         67        19.71     11.337
parent college grad                     400   0         100       19.70     16.471
parent grad school                      400   0         67        8.64      12.131
avg parent ed                           381   1.00      4.62      2.6685    .76379
pct full credential                     400   .42       100.00    66.0568   40.29793
pct emer credential                     400   0         59        12.66     11.746
number of students                      400   130       1570      483.47    226.448
Percentage free meals in 3 categories   400   1         3         2.02      .819
Valid N (listwise)                      295

We now examine the output for the variables used in our regression analysis above, namely api00,
acs_k3, meals and full. For api00, the values range from 369 to 940 and there are 400 valid values.
For acs_k3, the average class size ranges from -21 to 25 and there are 2 missing values. An average
class size of -21 cannot be right and points to a data-entry error. The variable meals ranges from
6% to 100% of students getting free meals, so these values seem reasonable, but there are only 315
valid values for this variable. The percentage of teachers with full credentials ranges from .42 to
100, and all of the values are valid.

This has uncovered a number of peculiarities worthy of further examination. We now obtain a
corrected data set from the same source, free of the shortcomings diagnosed above, and run another
multiple regression on it.

New Multiple regression analysis

For this multiple regression example, we will regress the dependent variable, api00, on all of the
predictor variables in the data set.

Regression

Notes

Comments
Input                    Data           C:UsersDivijDesktopSPSS Dataelemapi2.sav
                         Active Dataset DataSet8
                         Filter         <none>
                         Weight         <none>
                         Split File     <none>
                         N of Rows in Working Data File   400
Missing Value Handling   Definition of Missing   User-defined missing values are treated as missing.
                         Cases Used     Statistics are based on cases with no missing
                                        values for any variable used.
Syntax
regression
  /dependent api00
  /method=enter ell meals yr_rnd mobility
    acs_k3 acs_46 full emer enroll.

Resources                Memory Required                                  4724 bytes
                         Additional Memory Required for Residual Plots    0 bytes

Variables Entered/Removed

Model   Variables Entered                                   Variables Removed   Method
1       number of students, avg class size 4-6,             .                   Enter
        pct 1st year in school, avg class size k-3,
        pct emer credential, english language learners,
        year round school, pct free meals,
        pct full credential (a)

a. All requested variables entered.

Model Summary

Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       .919 (a)   .845       .841                56.768

a. Predictors: (Constant), number of students, avg class size 4-6, pct 1st year in school, avg
class size k-3, pct emer credential, english language learners, year round school, pct free
meals, pct full credential

ANOVA

Model            Sum of Squares   df    Mean Square   F         Sig.
1   Regression   6740702.006      9     748966.890    232.409   .000 (a)
    Residual     1240707.781      385   3222.618
    Total        7981409.787      394

a. Predictors: (Constant), number of students, avg class size 4-6, pct 1st year in school, avg
class size k-3, pct emer credential, english language learners, year round school, pct free
meals, pct full credential

Coefficients (a)

                                  Unstandardized Coefficients   Standardized Coefficients
Model                             B         Std. Error          Beta                        t         Sig.
1   (Constant)                    758.942   62.286                                          12.185    .000
    english language learners     -.860     .211                -.150                       -4.083    .000
    pct free meals                -2.948    .170                -.661                       -17.307   .000
    year round school             -19.889   9.258               -.059                       -2.148    .032
    pct 1st year in school        -1.301    .436                -.069                       -2.983    .003
    avg class size k-3            1.319     2.253               .013                        .585      .559
    avg class size 4-6            2.032     .798                .055                        2.546     .011
    pct full credential           .610      .476                .064                        1.281     .201
    pct emer credential           -.707     .605                -.058                       -1.167    .244
    number of students            -.012     .017                -.019                       -.724     .469

a. Dependent Variable: api 2000

1) We now examine the output from this regression analysis. As with the earlier regression, we
look at the p-value of the F-test to see whether the overall model is significant. With a
p-value of zero to three decimal places, the model is statistically significant. The R-squared
is 0.845, meaning that approximately 85% of the variability of api00 is accounted for by the
variables in the model. The adjusted R-squared indicates that about 84% of the variability of
api00 is accounted for by the model, even after taking into account the number of predictor
variables in the model. The coefficient for each variable indicates the amount of change one
could expect in api00 given a one-unit change in the value of that variable, with all other
variables in the model held constant. For example, consider the variable ell. We would expect
a decrease of 0.86 in the api00 score for every one-unit increase in ell, assuming that all
other variables in the model are held constant.

2) R-Square is the proportion of variance in the dependent variable (api00) which can be
predicted from the independent variables (ell, meals, yr_rnd, mobility, acs_k3, acs_46, full,
emer and enroll). The value of .845 indicates that about 84.5% of the variance in api00 can be
predicted from these variables.

3) The beta coefficients are used by some researchers to compare the relative strength of the
various predictors within the model. Because the beta coefficients are all measured in
standard deviations, instead of the units of the variables, they can be compared to one
another. In other words, the beta coefficients are the coefficients that you would obtain if the
outcome and predictor variables were all transformed to standard scores, also called z-scores,
before running the regression. In this example, meals has the largest Beta coefficient in
absolute value, -0.661, and acs_k3 has the smallest, 0.013. Thus, a one standard deviation
increase in meals is associated with a 0.661 standard deviation decrease in predicted api00,
with the other variables held constant. Likewise, a one standard deviation increase in acs_k3
is associated with a 0.013 standard deviation increase in predicted api00, with the other
variables in the model held constant.

4) The adjusted R-square attempts to yield a more honest estimate of the R-squared for the
population. The value of R-square was .8446, while the value of Adjusted R-square was .8409.

5) The F value is the Mean Square Regression (748966.89) divided by the Mean Square
Residual (3222.61761), yielding F=232.41. The p-value associated with this F value is very
small (0.0000). These values are used to answer the question "Do the independent variables
reliably predict the dependent variable?". The p-value is compared to your alpha level
(typically 0.05) and, if smaller, you can conclude "Yes, the independent variables reliably
predict the dependent variable".

6) The df column of the ANOVA table gives the degrees of freedom associated with the sources
of variance. The Total variance has N-1 degrees of freedom (DF). In this case, there were
N=395 observations, so the DF for total is 394.
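The quantities discussed in points 1 to 6 can all be recomputed from the ANOVA table. The Python sketch below does so with the standard formulas for R-squared, adjusted R-squared and F; the variable names are our own and the inputs are copied from the table.

```python
# Sums of squares and degrees of freedom copied from the ANOVA table.
SS_REGRESSION = 6740702.006
SS_RESIDUAL = 1240707.781
DF_REGRESSION = 9      # number of predictors
DF_RESIDUAL = 385

ss_total = SS_REGRESSION + SS_RESIDUAL
df_total = DF_REGRESSION + DF_RESIDUAL      # N - 1
n_observations = df_total + 1

# Proportion of variance in api00 explained by the predictors.
r_squared = SS_REGRESSION / ss_total

# Penalise R-squared for the number of predictors in the model.
adjusted_r_squared = 1 - (1 - r_squared) * df_total / DF_RESIDUAL

# Mean Square Regression divided by Mean Square Residual.
f_value = (SS_REGRESSION / DF_REGRESSION) / (SS_RESIDUAL / DF_RESIDUAL)

print(n_observations, round(r_squared, 4), round(adjusted_r_squared, 4), round(f_value, 2))
```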
