YamadaiR(Categorical Factor Analysis)

Special thanks to @KsTIME19 (on Twitter), who re-printed my PDF files so that they can be read on SlideShare! :)

Published in: Education

  1. Categorical Factor Analysis Using R. Koji Kosugi (小杉考司), Yamadai.R (やまだいあ~る), 2012/10/05
  2. Why we use it
     • You want to run a factor analysis, but were told that a 3-point scale is not acceptable.
     • You collected data on a 5-point scale, but the responses turned out to be skewed.
     • You ran a factor analysis, but item after item had to be dropped.
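  3. FA vs categorical FA. Factor analysis is a multivariate technique that extracts the factors common to a large number of questionnaire items. The computation proceeds as follows:
     1 Build a correlation matrix from the data.
     2 Eigendecompose the correlation matrix.
     3 Decide the number of factors from the eigenvalues, and obtain the factor loadings from the eigenvectors.
     The correlation matrix here consists of Pearson product-moment correlation coefficients, and computing them requires data measured at the interval-scale level or higher.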
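The three steps on this slide can be sketched in base R. This is only an illustration under stated assumptions: the data are simulated (four hypothetical items driven by one common factor), the number of factors is chosen with the eigenvalue-greater-than-one rule rather than the parallel analysis used later in the deck, and the loadings are the unrotated ones obtained directly from the eigendecomposition.

```r
# Simulated data: four hypothetical items sharing one common factor.
set.seed(42)
n <- 300
f <- rnorm(n)
items <- sapply(c(0.8, 0.7, 0.6, 0.5),
                function(l) l * f + sqrt(1 - l^2) * rnorm(n))
colnames(items) <- paste0("item", 1:4)

# Step 1: Pearson correlation matrix.
R <- cor(items)

# Step 2: eigendecomposition of the correlation matrix.
ev <- eigen(R)

# Step 3a: number of factors from the eigenvalues
# (eigenvalue > 1 rule, an assumption made for this sketch).
n_factors <- sum(ev$values > 1)

# Step 3b: unrotated loadings = eigenvectors scaled by sqrt(eigenvalues).
loadings <- ev$vectors %*% diag(sqrt(ev$values))
rownames(loadings) <- colnames(items)

print(round(ev$values, 3))
print(round(loadings[, seq_len(n_factors), drop = FALSE], 3))
```

When all columns of `loadings` are kept, `loadings %*% t(loadings)` reproduces `R`, which is why everything downstream depends on how the correlation matrix was computed.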
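  4. One of the reasons
     • A 3-point scale cannot be regarded as interval-level (statistically, 7 points or more are needed).
     • Skewed data = the categories at the upper or lower end are not being discriminated.
     • The correlation coefficients underlying the analysis are small = the data are skewed, so the variance is small.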
  5. The problem is solved once the correlation coefficient is computed in a way that suits ordinal or nominal scale levels. For example, according to Kano and Miura (2002), there are three options for analyzing ordinal data:
     1 Treat the data as continuous.
     2 Use polychoric correlation coefficients and polyserial correlation coefficients.
     3 Use a method based on the multinomial distribution.
  6. Correlation coefficients for ordinal scales
     • Polychoric correlation (多分相関係数): the correlation between two ordinal variables.
     • Polyserial correlation (多分系列相関係数, also 重双相関係数): the correlation between an ordinal variable and a continuous variable.
     • Tetrachoric correlation (四分相関係数): "tetra" refers to the 2×2 table, so this is the correlation between two binary variables. It is a special case of the polychoric correlation.
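As a rough illustration of the simplest of these coefficients, the tetrachoric correlation for a 2×2 table can be sketched in base R. The function name `tetrachoric_2x2` and the example counts are hypothetical, and the sketch assumes a table with no empty cells; the deck itself relies on `polychoric()` from psych and `hetcor()` from polycor rather than on anything like this.

```r
# Tetrachoric correlation for a 2x2 table of counts (base R only).
# tab[1, 1] holds the cases that fall in the lower category on both variables.
tetrachoric_2x2 <- function(tab) {
  p <- tab / sum(tab)
  a <- qnorm(sum(p[1, ]))  # threshold of the row variable
  b <- qnorm(sum(p[, 1]))  # threshold of the column variable
  # P(xi < a, eta < b) for a standard bivariate normal with correlation rho.
  p_both_low <- function(rho) {
    integrate(function(x) dnorm(x) * pnorm((b - rho * x) / sqrt(1 - rho^2)),
              lower = -Inf, upper = a)$value
  }
  # Find the latent correlation that reproduces the observed cell proportion.
  uniroot(function(rho) p_both_low(rho) - p[1, 1],
          interval = c(-0.999, 0.999), tol = 1e-8)$root
}

# Hypothetical counts, chosen so that both thresholds sit at zero.
tab <- matrix(c(200, 100, 100, 200), nrow = 2)
rho_hat <- tetrachoric_2x2(tab)

# Pearson correlation of the 0/1 codes (phi coefficient) for comparison.
phi <- (tab[1, 1] * tab[2, 2] - tab[1, 2] * tab[2, 1]) /
  sqrt(prod(rowSums(tab), colSums(tab)))

print(c(tetrachoric = rho_hat, phi = phi))
```

For these counts the latent correlation works out to 0.5, while the phi coefficient on the binary codes is 1/3, so the coefficient that models the underlying continuum is the larger of the two.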
  7. Images of latent continuity. Figure: image of latent continuity and expression. Assume that a latent variable ξ lies behind the observed variable x and that ξ is normally distributed. The relation between x and ξ can then be written as
     \[
     x = \begin{cases}
       1 & \xi < a_1 \\
       2 & a_1 \le \xi < a_2 \\
       3 & a_2 \le \xi < a_3 \\
       \vdots & \vdots \\
       s & a_{s-1} \le \xi
     \end{cases} \tag{1}
     \]
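The mapping in equation (1) from the latent variable ξ to the observed category x can be written as a small helper. The function name `categorize` and the threshold values below are hypothetical and chosen only for illustration; base R's `findInterval()` uses left-closed intervals, which matches the a ≤ ξ < a′ convention of the equation.

```r
# Map latent values xi to categories 1..s using thresholds a_1 < ... < a_{s-1}.
# findInterval() returns how many thresholds are <= xi, so adding 1 gives x.
categorize <- function(xi, thresholds) {
  findInterval(xi, thresholds) + 1L
}

# Hypothetical thresholds for a 3-category item.
a <- c(-1, 0.5)
x <- categorize(c(-3, -1, 0, 2), a)
print(x)

# Applied to normally distributed latent scores, uneven thresholds
# produce a skewed distribution of observed categories.
set.seed(1)
print(table(categorize(rnorm(1000), a)))
```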
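  8. Correlation coefficients for ordinal scales
     • The two variables are assumed to be correlated at the level of unobservable latent variables, and that correlation is assumed to be expressed categorically.
     • What has to be estimated, then, is the latent correlation ρ and the thresholds that separate the categories of X and Y.
     • The thresholds can also be approximated from the marginal frequencies of the cross-tabulation (two-step ML).
     ↓
     • The idea is that distortions such as ceiling and floor effects are adjusted appropriately through the thresholds.
     • As a result, categorical correlation coefficients are generally larger than Pearson correlation coefficients (which force an equal-interval assumption onto the data).
     • Larger correlations make it easier to extract factors.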
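A small simulation can illustrate two points from this slide: thresholds can be recovered from marginal frequencies, and Pearson's r computed on the category codes falls short of the latent correlation. Everything here is an assumption made for the sketch (the latent correlation of 0.6, the sample size, and the deliberately skewed thresholds); it shows only the first step of a two-step approach and does not estimate ρ itself.

```r
set.seed(1)
n   <- 5000
rho <- 0.6  # assumed latent correlation

# Two standard-normal latent variables correlated at rho.
xi  <- rnorm(n)
eta <- rho * xi + sqrt(1 - rho^2) * rnorm(n)

# Skewed thresholds: most responses land in the upper categories.
true_cuts <- c(-2, -1.2, -0.4, 0.6)
x <- cut(xi,  c(-Inf, true_cuts, Inf), labels = FALSE, right = FALSE)
y <- cut(eta, c(-Inf, true_cuts, Inf), labels = FALSE, right = FALSE)

# Step 1 of a two-step approach: thresholds from the cumulative
# marginal proportions of the observed categories.
est_cuts <- qnorm(head(cumsum(prop.table(table(x))), -1))

# Pearson correlation on the latent scores vs. on the 1..5 category codes.
latent_r   <- cor(xi, eta)
observed_r <- cor(x, y)

print(round(rbind(true = true_cuts, estimated = est_cuts), 2))
print(round(c(latent = latent_r, pearson_on_codes = observed_r), 3))
```

A polychoric estimate would use thresholds like `est_cuts` to search for the ρ that best reproduces the cross-tabulation of `x` and `y`.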
  9. Follow me with R code... The code follows.
  10. > library(psych)
      > library(polycor)
      > # sample statistics
      > sample <- read.csv("cEFAsample.csv",head=F,na.strings="*")
      > head(sample)
        V1 V2 V3 V4 V5 V6 V7 V8
      1  1  1  1  1  4  1  1  1
      2  3  4  4  1  4  4  1  1
      3  3  4  4  3  4  3  3  4
      4  2  4  5  2  2  4  1  4
      5  2  2  2  3  4  2  2  3
      6  3  3  5  3  3  2  2  3
      > summary(sample)
             V1              V2              V3              V4
       Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000
       1st Qu.:3.500   1st Qu.:4.000   1st Qu.:4.000   1st Qu.:3.000
       Median :4.000   Median :4.000   Median :4.000   Median :4.000
       Mean   :3.913   Mean   :4.127   Mean   :3.901   Mean   :3.853
       3rd Qu.:4.000   3rd Qu.:5.000   3rd Qu.:4.000   3rd Qu.:4.000
       Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000
                                       NA's   :2       NA's   :1
             V5              V6              V7             V8
       Min.   :1.000   Min.   :1.000   Min.   :1.00   Min.   :1.000
       1st Qu.:4.000   1st Qu.:3.000   1st Qu.:2.00   1st Qu.:3.000
       Median :4.000   Median :3.000   Median :3.00   Median :4.000
       Mean   :3.955   Mean   :3.138   Mean   :2.78   Mean   :3.442
       3rd Qu.:5.000   3rd Qu.:4.000   3rd Qu.:4.00   3rd Qu.:4.000
       Max.   :5.000   Max.   :5.000   Max.   :5.00   Max.   :5.000
                                       NA's   :1
      > table(sample$V1)
        1   2   3   4   5
        2  26  61 178  88
      > describe(sample)
         var   n mean   sd median trimmed  mad min max range  skew kurtosis   se
      V1   1 355 3.91 0.87      4    4.00 0.00   1   5     4 -0.70     0.22 0.05
      V2   2 355 4.13 0.78      4    4.22 0.00   1   5     4 -1.11     2.08 0.04
      V3   3 353 3.90 0.78      4    3.95 0.00   1   5     4 -0.76     0.96 0.04
      V4   4 354 3.85 0.90      4    3.94 0.00   1   5     4 -0.82     0.66 0.05
      V5   5 355 3.95 0.87      4    4.04 1.48   1   5     4 -0.71     0.24 0.05
      V6   6 355 3.14 0.95      3    3.16 1.48   1   5     4 -0.22    -0.12 0.05
      V7   7 354 2.78 1.01      3    2.79 1.48   1   5     4  0.14    -0.70 0.05
      V8   8 355 3.44 1.00      4    3.47 1.48   1   5     4 -0.42    -0.28 0.05
  11. > # Pearson cor
      > peason.cor <- cor(sample,use="complete.obs")
      > print(peason.cor,digit=2)
           V1    V2   V3   V4   V5   V6    V7   V8
      V1 1.00 0.380 0.43 0.40 0.26 0.19 0.285 0.26
      V2 0.38 1.000 0.28 0.34 0.27 0.16 0.099 0.21
      V3 0.43 0.277 1.00 0.26 0.21 0.15 0.150 0.16
      V4 0.40 0.339 0.26 1.00 0.42 0.26 0.276 0.23
      V5 0.26 0.265 0.21 0.42 1.00 0.23 0.255 0.22
      V6 0.19 0.157 0.15 0.26 0.23 1.00 0.341 0.39
      V7 0.29 0.099 0.15 0.28 0.26 0.34 1.000 0.41
      V8 0.26 0.212 0.16 0.23 0.22 0.39 0.415 1.00
      > # polychoric cor
      > polychoric.cor <- polychoric(sample)
      > print(polychoric.cor$rho)
                V1        V2        V3        V4        V5        V6        V7
      V1 1.0000000 0.4693292 0.4993862 0.4702445 0.3260640 0.2015360 0.3172379
      V2 0.4693292 1.0000000 0.3661174 0.4283065 0.3544777 0.1925806 0.1164603
      V3 0.4993862 0.3661174 1.0000000 0.3131351 0.2971062 0.1704954 0.1565841
      V4 0.4702445 0.4283065 0.3131351 1.0000000 0.5128292 0.2805638 0.3020316
      V5 0.3260640 0.3544777 0.2971062 0.5128292 1.0000000 0.2612329 0.2785856
      V6 0.2015360 0.1925806 0.1704954 0.2805638 0.2612329 1.0000000 0.3832876
      V7 0.3172379 0.1164603 0.1565841 0.3020316 0.2785856 0.3832876 1.0000000
      V8 0.2939444 0.2544516 0.1885443 0.2562720 0.2513339 0.4138156 0.4444297
                V8
      V1 0.2939444
      V2 0.2544516
      V3 0.1885443
      V4 0.2562720
      V5 0.2513339
      V6 0.4138156
      V7 0.4444297
      V8 1.0000000
      > #
      > # compare, Pearson vs polycor
      > #
      >
      > # FA
      > fa.parallel(peason.cor,n.obs=355)
      Parallel analysis suggests that the number of factors = 3 and the number of components = 2
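The "compare" step on this slide is left empty in the session. One way to make the comparison concrete, without the original data file, is to re-enter a few of the printed coefficients and take the difference. The values below are copied from the output above for V1 to V3 only; the object names `pearson`, `polych`, and `diff_mat` are introduced for this sketch.

```r
vars <- c("V1", "V2", "V3")

# Pearson coefficients as printed (2 to 3 digits).
pearson <- matrix(c(1.000, 0.380, 0.430,
                    0.380, 1.000, 0.277,
                    0.430, 0.277, 1.000),
                  nrow = 3, dimnames = list(vars, vars))

# Polychoric coefficients as printed.
polych <- matrix(c(1.0000000, 0.4693292, 0.4993862,
                   0.4693292, 1.0000000, 0.3661174,
                   0.4993862, 0.3661174, 1.0000000),
                 nrow = 3, dimnames = list(vars, vars))

# Positive entries mean the polychoric coefficient is the larger one.
diff_mat <- round(polych - pearson, 3)
lower <- diff_mat[lower.tri(diff_mat)]
print(diff_mat)
```

With the full objects from the session, the same comparison would presumably be `polychoric.cor$rho - peason.cor`.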
  12. > fa.parallel(polychoric.cor$rho,n.obs=355)
      Parallel analysis suggests that the number of factors = 3 and the number of components =
      > fa.result.peason <- fa(peason.cor,n.obs=355,fm="gls",nfactors=3,rotate="promax")
      > fa.result.polych <- fa(polychoric.cor$rho,n.obs=355,fm="gls",nfactors=3,rotate="promax")
      > print(fa.result.peason,digit=3,sort=T)
      Factor Analysis using method = gls
      Call: fa(r = peason.cor, nfactors = 3, n.obs = 355, rotate = "promax", fm = "gls")
      Standardized loadings (pattern matrix) based upon correlation matrix
         item   GLS2   GLS1   GLS3    h2    u2
      V8    8  0.695  0.073 -0.076 0.468 0.532
      V7    7  0.583  0.032  0.028 0.377 0.623
      V6    6  0.529 -0.062  0.121 0.333 0.667
      V1    1  0.056  0.886 -0.091 0.721 0.279
      V3    3 -0.006  0.453  0.083 0.261 0.739
      V4    4 -0.015  0.017  0.696 0.490 0.510
      V5    5  0.023 -0.145  0.692 0.377 0.623
      V2    2 -0.063  0.289  0.313 0.273 0.727
                             GLS2  GLS1  GLS3
      SS loadings           1.140 1.094 1.065
      Proportion Var        0.143 0.137 0.133
      Cumulative Var        0.143 0.279 0.412
      Proportion Explained  0.346 0.332 0.323
      Cumulative Proportion 0.346 0.677 1.000
       With factor correlations of
            GLS2  GLS1  GLS3
      GLS2 1.000 0.409 0.566
      GLS1 0.409 1.000 0.688
      GLS3 0.566 0.688 1.000
      Test of the hypothesis that 3 factors are sufficient.
      The degrees of freedom for the null model are 28 and the objective function was 1.469 wit
      The degrees of freedom for the model are 7 and the objective function was 0.03
      The root mean square of the residuals (RMSR) is 0.014
      The df corrected root mean square of the residuals is 0.04
      The number of observations was 355 with Chi Square = 10.559 with prob < 0.159
      Tucker Lewis Index of factoring reliability = 0.9706
      RMSEA index = 0.0388 and the 90 % confidence intervals are NA 0.0814
      BIC = -30.546
  13. Fit based upon off diagonal values = 0.995
      Measures of factor score adequacy
                                                     GLS2  GLS1  GLS3
      Correlation of scores with factors            0.830 0.885 0.851
      Multiple R square of scores with factors      0.689 0.783 0.723
      Minimum correlation of possible factor scores 0.378 0.565 0.447
      > print(fa.result.polych,digit=3,sort=T)
      Factor Analysis using method = gls
      Call: fa(r = polychoric.cor$rho, nfactors = 3, n.obs = 355, rotate = "promax", fm = "gls")
      Standardized loadings (pattern matrix) based upon correlation matrix
         item   GLS3   GLS1   GLS2    h2    u2
      V5    5  0.806 -0.179  0.029 0.497 0.503
      V4    4  0.709  0.024  0.020 0.543 0.457
      V2    2  0.383  0.312 -0.069 0.376 0.624
      V1    1 -0.138  0.976  0.069 0.826 0.174
      V3    3  0.145  0.470 -0.028 0.326 0.674
      V7    7 -0.019  0.052  0.657 0.447 0.553
      V8    8 -0.038  0.097  0.650 0.452 0.548
      V6    6  0.143 -0.083  0.555 0.365 0.635
                             GLS3  GLS1  GLS2
      SS loadings           1.319 1.289 1.226
      Proportion Var        0.165 0.161 0.153
      Cumulative Var        0.165 0.326 0.479
      Proportion Explained  0.344 0.336 0.320
      Cumulative Proportion 0.344 0.680 1.000
       With factor correlations of
            GLS3  GLS1  GLS2
      GLS3 1.000 0.716 0.522
      GLS1 0.716 1.000 0.392
      GLS2 0.522 0.392 1.000
      Test of the hypothesis that 3 factors are sufficient.
      The degrees of freedom for the null model are 28 and the objective function was 1.986 wit
      The degrees of freedom for the model are 7 and the objective function was 0.055
      The root mean square of the residuals (RMSR) is 0.017
      The df corrected root mean square of the residuals is 0.048
      The number of observations was 355 with Chi Square = 19.207 with prob < 0.00756
      Tucker Lewis Index of factoring reliability = 0.9265
  14. RMSEA index = 0.0711 and the 90 % confidence intervals are 0.0335 0.1085
      BIC = -21.897
      Fit based upon off diagonal values = 0.995
      Measures of factor score adequacy
                                                     GLS3  GLS1  GLS2
      Correlation of scores with factors            0.882 0.928 0.839
      Multiple R square of scores with factors      0.779 0.862 0.704
      Minimum correlation of possible factor scores 0.557 0.724 0.408
      > #
      > # sample <- subset(sample,select=c("V11","V13","V20","V5","V4","V17","V12","V15"))
      > # write.table(sample,"cEFAsample.csv",sep=",",row.name=F,col.name=F,na="*")
      >
      >
      > # mixed pattern
      > sample.cat <- data.frame(lapply(sample[1:3],factor),sample[4:8])
      > summary(sample.cat)
        V1       V2        V3             V4              V5              V6
       1:  2   1:  3   1   :  2   Min.   :1.000   Min.   :1.000   Min.   :1.000
       2: 26   2: 13   2   : 18   1st Qu.:3.000   1st Qu.:4.000   1st Qu.:3.000
       3: 61   3: 31   3   : 60   Median :4.000   Median :4.000   Median :3.000
       4:178   4:197   4   :206   Mean   :3.853   Mean   :3.955   Mean   :3.138
       5: 88   5:111   5   : 67   3rd Qu.:4.000   3rd Qu.:5.000   3rd Qu.:4.000
                       NA's:  2   Max.   :5.000   Max.   :5.000   Max.   :5.000
                                  NA's   :1
             V7             V8
       Min.   :1.00   Min.   :1.000
       1st Qu.:2.00   1st Qu.:3.000
       Median :3.00   Median :4.000
       Mean   :2.78   Mean   :3.442
       3rd Qu.:4.00   3rd Qu.:4.000
       Max.   :5.00   Max.   :5.000
       NA's   :1
      > hetcor.cor <- hetcor(sample.cat)
      > hetcor.cor$correlations
                V1        V2        V3        V4        V5        V6        V7
      V1 1.0000000 0.4766232 0.4902862 0.4305458 0.2853987 0.2076320 0.3015123
      V2 0.4766232 1.0000000 0.3740222 0.3757560 0.3093574 0.1839428 0.1175596
      V3 0.4902862 0.3740222 1.0000000 0.2752806 0.2491626 0.1686548 0.1583849
      V4 0.4305458 0.3757560 0.2752806 1.0000000 0.4202661 0.2636989 0.2758351
      V5 0.2853987 0.3093574 0.2491626 0.4202661 1.0000000 0.2279503 0.2550014
      V6 0.2076320 0.1839428 0.1686548 0.2636989 0.2279503 1.0000000 0.3414939
      V7 0.3015123 0.1175596 0.1583849 0.2758351 0.2550014 0.3414939 1.0000000
      V8 0.2663878 0.2378540 0.1553257 0.2324400 0.2175612 0.3937855 0.4146257
  15.           V8
      V1 0.2663878
      V2 0.2378540
      V3 0.1553257
      V4 0.2324400
      V5 0.2175612
      V6 0.3937855
      V7 0.4146257
      V8 1.0000000
      > hetcor.cor$type
           [,1]         [,2]         [,3]         [,4]         [,5]
      [1,] ""           "Polychoric" "Polychoric" "Polyserial" "Polyserial"
      [2,] "Polychoric" ""           "Polychoric" "Polyserial" "Polyserial"
      [3,] "Polychoric" "Polychoric" ""           "Polyserial" "Polyserial"
      [4,] "Polyserial" "Polyserial" "Polyserial" ""           "Pearson"
      [5,] "Polyserial" "Polyserial" "Polyserial" "Pearson"    ""
      [6,] "Polyserial" "Polyserial" "Polyserial" "Pearson"    "Pearson"
      [7,] "Polyserial" "Polyserial" "Polyserial" "Pearson"    "Pearson"
      [8,] "Polyserial" "Polyserial" "Polyserial" "Pearson"    "Pearson"
           [,6]         [,7]         [,8]
      [1,] "Polyserial" "Polyserial" "Polyserial"
      [2,] "Polyserial" "Polyserial" "Polyserial"
      [3,] "Polyserial" "Polyserial" "Polyserial"
      [4,] "Pearson"    "Pearson"    "Pearson"
      [5,] "Pearson"    "Pearson"    "Pearson"
      [6,] ""           "Pearson"    "Pearson"
      [7,] "Pearson"    ""           "Pearson"
      [8,] "Pearson"    "Pearson"    ""
      > fa.parallel(hetcor.cor$correlations,n.obs=355)
      Parallel analysis suggests that the number of factors = 3 and the number of components =
      > fa.result.hetcor <- fa(hetcor.cor$correlations,n.obs=355,fm="gls",nfactors=3,rotate="proma
      > print(fa.result.hetcor,digit=3,sort=T)
      Factor Analysis using method = gls
      Call: fa(r = hetcor.cor$correlations, nfactors = 3, n.obs = 355, rotate = "promax", fm = "gls")
      Standardized loadings (pattern matrix) based upon correlation matrix
         item   GLS1   GLS2   GLS3    h2    u2
      V1    1  0.868  0.082 -0.101 0.695 0.305
      V3    3  0.599 -0.029  0.017 0.359 0.641
      V2    2  0.459 -0.058  0.235 0.384 0.616
      V8    8  0.077  0.686 -0.075 0.460 0.540
  16. V7    7  0.020  0.597  0.034 0.391 0.609
      V6    6 -0.031  0.520  0.109 0.328 0.672
      V5    5 -0.131 -0.004  0.735 0.420 0.580
      V4    4  0.078  0.014  0.613 0.459 0.541
                             GLS1  GLS2  GLS3
      SS loadings           1.364 1.144 0.988
      Proportion Var        0.171 0.143 0.123
      Cumulative Var        0.171 0.314 0.437
      Proportion Explained  0.390 0.327 0.283
      Cumulative Proportion 0.390 0.717 1.000
       With factor correlations of
            GLS1  GLS2  GLS3
      GLS1 1.000 0.409 0.704
      GLS2 0.409 1.000 0.552
      GLS3 0.704 0.552 1.000
      Test of the hypothesis that 3 factors are sufficient.
      The degrees of freedom for the null model are 28 and the objective function was 1.705 wit
      The degrees of freedom for the model are 7 and the objective function was 0.046
      The root mean square of the residuals (RMSR) is 0.016
      The df corrected root mean square of the residuals is 0.046
      The number of observations was 355 with Chi Square = 16.046 with prob < 0.0247
      Tucker Lewis Index of factoring reliability = 0.9361
      RMSEA index = 0.0613 and the 90 % confidence intervals are 0.0203 0.0998
      BIC = -25.059
      Fit based upon off diagonal values = 0.994
      Measures of factor score adequacy
                                                     GLS1  GLS2  GLS3
      Correlation of scores with factors            0.892 0.829 0.849
      Multiple R square of scores with factors      0.795 0.688 0.721
      Minimum correlation of possible factor scores 0.590 0.375 0.441
      >
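The three solutions are easier to compare side by side. The sketch below re-enters the communalities (the `h2` column) printed in the three `fa()` outputs, reordered from V1 to V8; the object names `h2`, `mean_h2`, and `gain` are introduced here and are not part of the original session. The mean communality of each solution agrees with its printed "Cumulative Var" to the printed precision.

```r
# Communalities (h2) copied from the three printed solutions, ordered V1..V8.
h2 <- cbind(
  pearson    = c(0.721, 0.273, 0.261, 0.490, 0.377, 0.333, 0.377, 0.468),
  polychoric = c(0.826, 0.376, 0.326, 0.543, 0.497, 0.365, 0.447, 0.452),
  hetcor     = c(0.695, 0.384, 0.359, 0.459, 0.420, 0.328, 0.391, 0.460)
)
rownames(h2) <- paste0("V", 1:8)

# Mean communality = share of total variance explained by the three factors.
mean_h2 <- colMeans(h2)

# Change in communality when moving from Pearson to polychoric input.
gain <- h2[, "polychoric"] - h2[, "pearson"]

print(round(mean_h2, 3))
print(round(gain, 3))
```

In these printed results the polychoric input raises the communality of every item except V8, and the explained variance rises from about 0.41 to about 0.48, in line with the claim on slide 8 that larger correlations make factors easier to extract. The fit statistics move the other way (for example, RMSEA 0.0388 for Pearson against 0.0711 for polychoric), so the gain in loadings does not come with a better-fitting model here.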