「マンガでわかる統計学[因子分析編]」の主成分分析をRでやる

やったのでメモっておく。

「第4章主成分分析」より。

主成分と主成分得点を求める

Step0. データ入力

library(tibble)
ramen <- tribble(
  ~店舗,    ~麺, ~具, ~スープ,
  "二郎",       2, 4, 5,
  "夢田屋",     1, 5, 1,
  "地回",       5, 3, 4,
  "なのはな",   2, 2, 3,
  "花ぶし",     3, 5, 5,
  "昇辰軒",     4, 3, 2,
  "丸蔵らあめん", 4, 4, 3,
  "海楽亭",     1, 2, 1,
  "なるみ家",   3, 3, 2,
  "奏月",       5, 5, 3
)

Step1. 変数ごとに基準化する

library(dplyr)

ramen %>% 
  mutate_if(is.numeric, scale) ->
  ramen_scaled
ramen_scaled

## # A tibble: 10 x 4
##    店舗             麺     具  スープ
##    <chr>         <dbl>  <dbl>   <dbl>
##  1 二郎         -0.671  0.341  1.45  
##  2 夢田屋       -1.34   1.19  -1.31  
##  3 地回          1.34  -0.511  0.759 
##  4 なのはな     -0.671 -1.36   0.0690
##  5 花ぶし        0      1.19   1.45  
##  6 昇辰軒        0.671 -0.511 -0.621 
##  7 丸蔵らあめん  0.671  0.341  0.0690
##  8 海楽亭       -1.34  -1.36  -1.31  
##  9 なるみ家      0     -0.511 -0.621 
## 10 奏月          1.34   1.19   0.0690

Step2. 相関行列を求める

ramen_scaled %>% 
  select(-店舗) %>% 
  cor ->
  ramen_cormat
ramen_cormat

##               麺        具    スープ
## 麺     1.0000000 0.1905002 0.3600411
## 具     0.1905002 1.0000000 0.3004804
## スープ 0.3600411 0.3004804 1.0000000

Step3. 固有値と固有ベクトルを求める

ramen_decomp <- eigen(ramen_cormat)
ramen_decomp

## eigen() decomposition
## $values
## [1] 1.5728539 0.8140083 0.6131378
## 
## $vectors
##            [,1]       [,2]       [,3]
## [1,] -0.5715110  0.6044710  0.5549685
## [2,] -0.5221161 -0.7896069  0.3223595
## [3,] -0.6330639  0.1055260 -0.7668731

※固有ベクトルは定数倍の任意性があるので、テキストと一緒にならない可能性があり、実際なっていない。

Step4

Step3における
- 最大の固有値に対応する固有ベクトル
- 最大から2番目の固有値に対応する固有ベクトル
- をもとに散布図を書く

library(ggplot2)
library(ggrepel)
target <- 1:2
target_vec <- ramen_decomp$vectors[, target]
as.data.frame(target_vec) %>% 
  mutate(item = c("麺", "具", "スープ")) %>% 
  ggplot(aes(x = -V1, y = -V2)) + # テキストに合わせるために反転
  geom_point() +
  geom_text_repel(aes(label = item), family = "IPAexGothic") +
  lims(x = c(0, 1), y = c(-1, 1)) +
  geom_vline(xintercept = 0, col = "gray") +
  geom_hline(yintercept = 0, col = "gray") +
  theme_minimal()

f:id:Rion778:20181003211033p:plain

Step5. 第一主成分と第二主成分を理解する。

第一主成分: 最大の固有値に対応する固有ベクトル。
- 上記例では-0.571511, -0.5221161, -0.6330639
- 頭から順に麺、具、スープに対応している。
第二主成分: 二番目の固有値に対応する固有ベクトル。
- 上記例では0.604471, -0.7896069, 0.105526
- 同様に、n番目の固有値に対応する固有ベクトルが第n主成分となる。

Step6 各個体の第一主成分と第二主成分における座標、つまり各個体の第一主成分得点と第二主成分得点を求める。

ramen_scaled %>% 
  rowwise() %>% 
  mutate(
    score1 = -sum(c(麺, 具, スープ) * ramen_decomp$vectors[, 1]),
    score2 = -sum(c(麺, 具, スープ) * ramen_decomp$vectors[, 2])
  )

## Source: local data frame [10 x 6]
## Groups: <by row>
## 
## # A tibble: 10 x 6
##    店舗             麺     具  スープ score1 score2
##    <chr>         <dbl>  <dbl>   <dbl>  <dbl>  <dbl>
##  1 二郎         -0.671  0.341  1.45    0.712  0.522
##  2 夢田屋       -1.34   1.19  -1.31   -0.974  1.89 
##  3 地回          1.34  -0.511  0.759   0.980 -1.29 
##  4 なのはな     -0.671 -1.36   0.0690 -1.05  -0.678
##  5 花ぶし        0      1.19   1.45    1.54   0.789
##  6 昇辰軒        0.671 -0.511 -0.621  -0.277 -0.744
##  7 丸蔵らあめん  0.671  0.341  0.0690  0.605 -0.144
##  8 海楽亭       -1.34  -1.36  -1.31   -2.31  -0.127
##  9 なるみ家      0     -0.511 -0.621  -0.660 -0.338
## 10 奏月          1.34   1.19   0.0690  1.43   0.124

Step7 第一主成分と第二主成分に基づいてプロットする。

ramen_scaled %>% 
  rowwise() %>% 
  mutate( # テキストに合わせるために-1を掛けて反転させる
    score1 = -sum(c(麺, 具, スープ) * ramen_decomp$vectors[, 1]),
    score2 = -sum(c(麺, 具, スープ) * ramen_decomp$vectors[, 2])
  ) %>% 
  ggplot(aes(x = score1, y = score2)) +
  geom_point() +
  geom_text_repel(aes(label = 店舗), family = "IPAexGothic") +
  geom_vline(xintercept = 0) +
  geom_hline(yintercept = 0) +
  theme_minimal()

f:id:Rion778:20181003211155p:plain

分析結果の精度を確認する

寄与率を計算。

ramen_decomp$values / 3

## [1] 0.5242846 0.2713361 0.2043793

累積寄与率を計算

cumsum(ramen_decomp$values) / 3 * 100

## [1]  52.42846  79.56207 100.00000

同じことをRでやる

Rで、というかprcompで。

ramen %>%
  select(-店舗) %>% 
  prcomp(scale = TRUE) -> result
result

## Standard deviations (1, .., p=3):
## [1] 1.2541347 0.9022241 0.7830312
## 
## Rotation (n x k) = (3 x 3):
##              PC1        PC2        PC3
## 麺     0.5715110 -0.6044710  0.5549685
## 具     0.5221161  0.7896069  0.3223595
## スープ 0.6330639 -0.1055260 -0.7668731

summary(result)

## Importance of components:
##                           PC1    PC2    PC3
## Standard deviation     1.2541 0.9022 0.7830
## Proportion of Variance 0.5243 0.2713 0.2044
## Cumulative Proportion  0.5243 0.7956 1.0000

先程の計算と一致していることを確認（※主成分に定数倍の差はある）。

標準偏差は固有値の平方根に一致する。

sqrt(ramen_decomp$values)

## [1] 1.2541347 0.9022241 0.7830312

ramen_decomp

## eigen() decomposition
## $values
## [1] 1.5728539 0.8140083 0.6131378
## 
## $vectors
##            [,1]       [,2]       [,3]
## [1,] -0.5715110  0.6044710  0.5549685
## [2,] -0.5221161 -0.7896069  0.3223595
## [3,] -0.6330639  0.1055260 -0.7668731

結果はbiplotで可視化できる。

biplot(result, family = "IPAexGothic", xlabs = ramen$店舗)

f:id:Rion778:20181003211949p:plain

ところで、ggplotを使いたいと思ってggfortify、ggbiplot、factoextraの3つのパッケージを試してみたのだが、現状日本語が問題なく使えるのはggfortifyだけのようだった。

ggfortifyはパッケージロード後にautoplotを通じてプロットをするが、引数を確認するには?ggfortify::ggbiplotを見る必要があった。

library(ggfortify)
autoplot(
  result2, 
  loadings = TRUE, 
  loadings.label = TRUE, 
  loadings.label.family = "IPAexGothic",
  loadings.label.repel = TRUE,
  label = TRUE,
  label.family = "IPAexGothic",
  label.repel = TRUE,
  scale = 0
) +
  theme_bw()

f:id:Rion778:20181003212928p:plain