独立成分分析(ICA)

背景

ICA最早应用于盲源信号分离,起源于"鸡尾酒会问题",描述如下:在嘈杂的鸡尾酒会上,许多人在同时交谈,可能还有背景音乐,但人耳却能准确而清晰的听到对方 的话.这种可以从混合声音中选择自己感兴趣的声音而忽略其他声音的现象被称为"鸡尾酒会效应".

目标

原始数据分离或者盲源分离.

特点

信号源独立/信号源呈现非正态分布.

描述

已知:有$n$个人同时说话,有$n$个麦克风同时采集了$m$个点的语音数据. 目标:从$n$个麦克风采集的语音数据中恢复$n$个人的语音. 任务: 找到混淆矩阵$A_{n*n}$. 设源为$S_{n*m}=( S^1,S^2,\cdots,S^m)$,采集数据为$X_{n*m}=(X^1,X^2,\cdots,X^m)$. 设存在混淆矩阵$A_{n*n}$,即权重矩阵,使得下式成立: $$X_{n*m}=A_{n*n}S_{n*m}$$ $X^i$表示第$i$时刻$n$个麦克风采集的语音信号,$X^i \in R^{n}$,$S^{i}$表示第$i$时刻$n$个人发出的语音信号,$S^i \in R^{n}$,有 $$ X_{n*1}^{i}=A_{n*n}S_{n*1}^{i}$$ 在论坛上找到了一张解释ICA的图,如下图1 image.png

要求解矩阵$A_{n*n}$有很多种算法,如FastICA(fixed point)、极大似然的自然梯度法(infomax 算法/Lee et al.的ExtICA算法)等,详细可参考独立成分分析的若干算法及其研究http://www.wanfangdata.com.cn/details/detail.do?_type=degree&id=Y714607#

算法步骤

  1. 将原始数据组成矩阵$X_{n*m}$.
  2. 将$X$的每一行零均值化,即减去这一行的均值.
  3. 对数据进行白化预处理.
  4. 设置参数学习率$\alpha$的值.
  5. 在第$i$时刻求解$W$(W=$A^{-1}$),其中初始$W$可赋值为每行之和为1的随机矩阵.
  6. 根据上一步(5)求得的$W$和$S_{n*1}^{i}=W_{n*n}X_{n*1}^{i}$解得$S_{n*1}^{i}$.
  7. 重复4和5求解所有时刻的源信号$S_{n*1}^{i}$.
  8. 组合各时刻的声源信号,得到$S_{n*m}=(S^1,S^2,\cdots,S^m)$.

在R中有很多包可以做ICA,常用的包为FastICA,ProdenICA,PearsonICA.以下分别用ProdenICA和fastICA做ICA.

通常用fastICA进行ICA,不需对原始数据进行处理,其内置函数fastICA( )会对数据进行中心化、白化预处理.

Usage

fastICA(X, n.comp, alg.typ = c("parallel","deflation"), fun = c("logcosh","exp"), alpha = 1.0, method = c("R","C"), row.norm = FALSE, maxit = 200, tol = 1e-04, verbose = FALSE, w.init = NULL)其中alg.typ=="parallel"表示将所有元素一起提取独立因素;alg.typ=="deflation"表示分别提取独立因素.

In [1]:
library(fastICA)
#example1:unmixing  two independent signals
S <- cbind(sin((1:1000)/20), rep((((1:200)-100)/100), 5))
A <- matrix(c(0.291, 0.6557, -0.5439, 0.5572), 2, 2)
X <- S %*% A

a <- fastICA(X, 2, alg.typ = "parallel", fun = "logcosh", alpha = 1, 
             method = "R", row.norm = FALSE, maxit = 200, 
             tol = 0.0001, verbose = TRUE)

par(mfcol = c(2, 3))
plot(1:1000, S[,1 ], type = "l", main = "Original Signals", 
     xlab = "", ylab = "")
plot(1:1000, S[,2 ], type = "l", xlab = "", ylab = "")
plot(1:1000, X[,1 ], type = "l", main = "Mixed Signals", 
     xlab = "", ylab = "")
plot(1:1000, X[,2 ], type = "l", xlab = "", ylab = "")
plot(1:1000, a$S[,1 ], type = "l", main = "ICA source estimates", 
     xlab = "", ylab = "")
plot(1:1000, a$S[, 2], type = "l", xlab = "", ylab = "")
Warning message:
"package 'fastICA' was built under R version 3.6.3"
Centering

Whitening

Symmetric FastICA using logcosh approx. to neg-entropy function

Iteration 1 tol = 0.1130178

Iteration 2 tol = 0.008571691

Iteration 3 tol = 7.195949e-06

In [1]:
#example 2:using fastICA  to perform projection pursuit on a mixture of bivariate normal distribution(二元正态分布)
library(fastICA)
if(require(MASS)){
  x <- mvrnorm(n = 1000, mu = c(0, 0), Sigma = matrix(c(10, 3, 3, 1), 2, 2))
  x1 <- mvrnorm(n = 1000, mu = c(-1, 2), Sigma = matrix(c(10, 3, 3, 1), 2, 2))
  X <- rbind(x, x1)
  
  a <- fastICA(X, 2, alg.typ = "deflation", fun = "logcosh", alpha = 1,
               method = "R", row.norm = FALSE, maxit = 200, 
               tol = 0.0001, verbose = TRUE)
  
  par(mfrow = c(1, 3))
  plot(a$X, main = "Pre-processed data")
  plot(a$X %*% a$K, main = "PCA components")
  plot(a$S, main = "ICA components")
}
Warning message:
"package 'fastICA' was built under R version 3.6.3"
Loading required package: MASS

Centering

Whitening

Deflation FastICA using logcosh approx. to neg-entropy function

Component 1

Iteration 1 tol = 0.1212494

Iteration 2 tol = 1.67444e-06

Component 2

Iteration 1 tol = 0

用ProDenICA包模拟ICA过程.假设有两组独立同分布的数据,分布为$n$,每组数据个数为1024,混淆矩阵$A_0$.

In [3]:
library(ProDenICA)
# 获取原始矩阵X
p=2
dist="n"
N=1024
A0=mixmat(p)
s<-scale(cbind(rjordan(dist,N),rjordan(dist,N)))
X=s%*%A0
#whiten
X<-scale(X,center=T,scale=F)#center
c<-svd(X)
X<-sqrt(N)*(c$u)#satifiy cov(X)=1
target<-solve(A0)#求A0的逆
target<-diag(c$d)%*%t(c$v)%*%target/sqrt(N)#new mixing matrix
#ProDenICA法估计W1,直接调用ProDenICA包
W0<-matrix(rnorm(2*2),2,2)
W0<-ICAorthW(W0)  
W1<-ProDenICA(X,W0=W0,trace=T,Gfuns=Gpois)$W
amari(W1,target)
fit<-ProDenICA(X, W0=W0,Gfunc=GPois,trace=TRUE, density=TRUE)
W2<-fit$W
amari(W2,target)#计算amari距离
par(mfrow=c(2,1))
plot(fit)
Warning message in model.matrix.default(mt, mf, contrasts):
"non-list contrasts argument ignored"
Warning message in model.matrix.default(mt, mf, contrasts):
"non-list contrasts argument ignored"
Iter 1 G 0.006886266 crit 0.2108242 
Warning message in model.matrix.default(mt, mf, contrasts):
"non-list contrasts argument ignored"
Warning message in model.matrix.default(mt, mf, contrasts):
"non-list contrasts argument ignored"
Iter 2 G 0.01632863 crit 0.1026172 
Warning message in model.matrix.default(mt, mf, contrasts):
"non-list contrasts argument ignored"
Warning message in model.matrix.default(mt, mf, contrasts):
"non-list contrasts argument ignored"
Iter 3 G 0.01916471 crit 0.02231671 
Warning message in model.matrix.default(mt, mf, contrasts):
"non-list contrasts argument ignored"
Warning message in model.matrix.default(mt, mf, contrasts):
"non-list contrasts argument ignored"
Iter 4 G 0.01931842 crit 0.006026472 
Warning message in model.matrix.default(mt, mf, contrasts):
"non-list contrasts argument ignored"
Warning message in model.matrix.default(mt, mf, contrasts):
"non-list contrasts argument ignored"
Iter 5 G 0.01930352 crit 0.001855384 
Warning message in model.matrix.default(mt, mf, contrasts):
"non-list contrasts argument ignored"
Warning message in model.matrix.default(mt, mf, contrasts):
"non-list contrasts argument ignored"
Iter 6 G 0.01931709 crit 0.0004879276 
Warning message in model.matrix.default(mt, mf, contrasts):
"non-list contrasts argument ignored"
Warning message in model.matrix.default(mt, mf, contrasts):
"non-list contrasts argument ignored"
Iter 7 G 0.01930512 crit 7.499904e-05 
Warning message in model.matrix.default(mt, mf, contrasts):
"non-list contrasts argument ignored"
Warning message in model.matrix.default(mt, mf, contrasts):
"non-list contrasts argument ignored"
Iter 8 G 0.01930217 crit 5.658234e-05 
Warning message in model.matrix.default(mt, mf, contrasts):
"non-list contrasts argument ignored"
Warning message in model.matrix.default(mt, mf, contrasts):
"non-list contrasts argument ignored"
Iter 9 G 0.01929947 crit 4.812909e-05 
Warning message in model.matrix.default(mt, mf, contrasts):
"non-list contrasts argument ignored"
Warning message in model.matrix.default(mt, mf, contrasts):
"non-list contrasts argument ignored"
Iter 10 G 0.01929539 crit 2.945552e-06 
Warning message in model.matrix.default(mt, mf, contrasts):
"non-list contrasts argument ignored"
Warning message in model.matrix.default(mt, mf, contrasts):
"non-list contrasts argument ignored"
Iter 11 G 0.01929539 crit 2.840301e-07 
Warning message in model.matrix.default(mt, mf, contrasts):
"non-list contrasts argument ignored"
Warning message in model.matrix.default(mt, mf, contrasts):
"non-list contrasts argument ignored"
Iter 12 G 0.01929539 crit 2.771636e-08 
0.0114814217272098
Warning message in model.matrix.default(mt, mf, contrasts):
"non-list contrasts argument ignored"
Warning message in model.matrix.default(mt, mf, contrasts):
"non-list contrasts argument ignored"
Iter 1 G 0.006886266 crit 0.2108242 
Warning message in model.matrix.default(mt, mf, contrasts):
"non-list contrasts argument ignored"
Warning message in model.matrix.default(mt, mf, contrasts):
"non-list contrasts argument ignored"
Iter 2 G 0.01632863 crit 0.1026172 
Warning message in model.matrix.default(mt, mf, contrasts):
"non-list contrasts argument ignored"
Warning message in model.matrix.default(mt, mf, contrasts):
"non-list contrasts argument ignored"
Iter 3 G 0.01916471 crit 0.02231671 
Warning message in model.matrix.default(mt, mf, contrasts):
"non-list contrasts argument ignored"
Warning message in model.matrix.default(mt, mf, contrasts):
"non-list contrasts argument ignored"
Iter 4 G 0.01931842 crit 0.006026472 
Warning message in model.matrix.default(mt, mf, contrasts):
"non-list contrasts argument ignored"
Warning message in model.matrix.default(mt, mf, contrasts):
"non-list contrasts argument ignored"
Iter 5 G 0.01930352 crit 0.001855384 
Warning message in model.matrix.default(mt, mf, contrasts):
"non-list contrasts argument ignored"
Warning message in model.matrix.default(mt, mf, contrasts):
"non-list contrasts argument ignored"
Iter 6 G 0.01931709 crit 0.0004879276 
Warning message in model.matrix.default(mt, mf, contrasts):
"non-list contrasts argument ignored"
Warning message in model.matrix.default(mt, mf, contrasts):
"non-list contrasts argument ignored"
Iter 7 G 0.01930512 crit 7.499904e-05 
Warning message in model.matrix.default(mt, mf, contrasts):
"non-list contrasts argument ignored"
Warning message in model.matrix.default(mt, mf, contrasts):
"non-list contrasts argument ignored"
Iter 8 G 0.01930217 crit 5.658234e-05 
Warning message in model.matrix.default(mt, mf, contrasts):
"non-list contrasts argument ignored"
Warning message in model.matrix.default(mt, mf, contrasts):
"non-list contrasts argument ignored"
Iter 9 G 0.01929947 crit 4.812909e-05 
Warning message in model.matrix.default(mt, mf, contrasts):
"non-list contrasts argument ignored"
Warning message in model.matrix.default(mt, mf, contrasts):
"non-list contrasts argument ignored"
Iter 10 G 0.01929539 crit 2.945552e-06 
Warning message in model.matrix.default(mt, mf, contrasts):
"non-list contrasts argument ignored"
Warning message in model.matrix.default(mt, mf, contrasts):
"non-list contrasts argument ignored"
Iter 11 G 0.01929539 crit 2.840301e-07 
Warning message in model.matrix.default(mt, mf, contrasts):
"non-list contrasts argument ignored"
Warning message in model.matrix.default(mt, mf, contrasts):
"non-list contrasts argument ignored"
Iter 12 G 0.01929539 crit 2.771636e-08 
0.0114814217272098

主成分分析(PCA)

PCA是一种广泛使用的数据降维算法,主要思想基于SVD以及协方差矩阵的特征根.在R中使用prcomp函数以及princomp函数进行PCA,前者基于SVD,后者基于协方差矩阵的特征根.一般对原始矩阵做中心化与标准化,其目的是去相关性. $$X=(X_1,X_2,\cdots,X_k)$$ $$PC_{i}=\sum_{j=1}^{k}a_{j}*X_{j},i=1,2,\cdots,m(m\leq k)$$

R Packages for PCA

PCA常用包为psych包、FactoMineR包,且可以通过factoextra包和ggbiplot包提取以及可视化PCA.

PCA绘图以及可视化R包

下面的测试数据来自Science:拟南芥三萜化合物特异调控根系微生物组,数据可在http://github.com/YongxinLiu/MicrobiomeStatPlot 【目录223PCA】获取.

In [13]:
dir<-"http://210.75.224.110/github/MicrobiomeStatPlot/Data/Science2019/"#从网页中读取数据
mdata<-read.table(paste0(dir,"metadata.txt"),header=T,row.names=1,
                  sep="\t",comment.char="",stringsAsFactors = F)
head(mdata,3)
odata<-read.table(paste0(dir,"otutab.txt"),header=T,row.names=1,
                  sep="\t",comment.char="",stringsAsFactors = F)
#过滤数据并且排序
sub_mdata<-mdata[rownames(mdata)%in%colnames(odata),]
sub_odata<-odata[,rownames(sub_mdata)]
#基于OTU表PCA分析
#若变量之间的数据处于不同数量级/变量之间的均值/方差很大,建议进行标准化
o.pca<-prcomp(t(sub_odata),scale = T)
library(factoextra)#提取PCA数据
get_eigenvalue(o.pca)[1:3,]
#绘制碎石图即贡献率
fviz_eig(o.pca,addlabels=T)
#绘制变量PCA主成分分析图
fviz_pca_var(o.pca,col.var="contrib",gradient.col=c("#00AFBB","#E7B800",
                                          "#FC4E07"       ))
#分析观测值,先提取出individuals
ind<-get_pca_ind(o.pca)
ind
#样本PCA图,repel避免标签重叠
fviz_pca_ind(o.pca,pointsize="cos2",
             repel = T,col.ind = mdata$Group)
fviz_pca_ind(o.pca,geom.ind="point",col.ind=mdata$Group,
             palette=c("#00AFBB","#E7B800","#FC4E07"),
             addEllipses = T,legend.title="Groups")
A data.frame: 3 × 13
GroupDateSiteCRACRRBarcodeSequenceLinkerPrimerSequenceReversePrimerBatchSpeciesSequencingPlatformDescription
<chr><chr><chr><chr><chr><chr><chr><chr><int><chr><chr><chr><chr>
KO1KO2017/6/30Chaoyang CRA002352 CRR117575ACGCTCGACAAACMGGATTAGATACCCKGACGTCATCCCCACCTTCC1Arabidopsis thalianaNovageneHiSeq2500Knock-out replicate 1
KO2KO2017/6/30Chaoyang CRA002352 CRR117576ATCAGACACGAACMGGATTAGATACCCKGACGTCATCCCCACCTTCC1Arabidopsis thalianaNovageneHiSeq2500Knock-out replicate 2
KO3KO2017/7/2 ChangpingCRA002352 CRR117577ATATCGCGAGAACMGGATTAGATACCCKGACGTCATCCCCACCTTCC1Arabidopsis thalianaNovageneHiSeq2500Knock-out replicate 3
A data.frame: 3 × 3
eigenvaluevariance.percentcumulative.variance.percent
<dbl><dbl><dbl>
Dim.1373.290314.18276414.18276
Dim.2211.3096 8.02848122.21124
Dim.3183.0639 6.95531429.16656
Principal Component Analysis Results for individuals
 ===================================================
  Name       Description                       
1 "$coord"   "Coordinates for the individuals" 
2 "$cos2"    "Cos2 for the individuals"        
3 "$contrib" "contributions of the individuals"

根据碎石图以及累计贡献率确定主成分个数

  1. 一般取累计贡献率大于80%的成分(主要由特征值大小确定)为需要提取的主成分.(主要依据)
  2. 碎石图一般选取第一个拐点前(包括该点所对应的成分)为需要提取的主成分.
  3. 一般主成分分析可直接调用R自带函数prcomp(),形如prcomp(X,scale=T)/princomp(X,cor=T). 前者载荷因子为prcomp.rotation,后者载荷因子为prcomp.loadings.
In [30]:
data(USArrests)
u.pca<-prcomp(USArrests,scale=T)#直接数据矩阵
u.pca$rotation#因子载荷矩阵
summary(u.pca)#summary()可提取u.pca的累计贡献率,有利于判断主成分
screeplot(u.pca,type="l",main=" scree plot")#基于方差的碎石图
#调用princomp()
u1<-princomp(USArrests,cor=T,scores=T)
u1$loadings#因子载荷矩阵
summary(u1)
screeplot(u1,type="l",main="scree plot2")
A matrix: 4 × 4 of type dbl
PC1PC2PC3PC4
Murder-0.5358995 0.4181809-0.3412327 0.64922780
Assault-0.5831836 0.1879856-0.2681484-0.74340748
UrbanPop-0.2781909-0.8728062-0.3780158 0.13387773
Rape-0.5434321-0.1673186 0.8177779 0.08902432
Standard deviations (1, .., p=4):
[1] 1.5748783 0.9948694 0.5971291 0.4164494

Rotation (n x k) = (4 x 4):
                PC1        PC2        PC3         PC4
Murder   -0.5358995  0.4181809 -0.3412327  0.64922780
Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748
UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773
Rape     -0.5434321 -0.1673186  0.8177779  0.08902432
Importance of components:
                          PC1    PC2     PC3     PC4
Standard deviation     1.5749 0.9949 0.59713 0.41645
Proportion of Variance 0.6201 0.2474 0.08914 0.04336
Cumulative Proportion  0.6201 0.8675 0.95664 1.00000
Loadings:
         Comp.1 Comp.2 Comp.3 Comp.4
Murder    0.536  0.418  0.341  0.649
Assault   0.583  0.188  0.268 -0.743
UrbanPop  0.278 -0.873  0.378  0.134
Rape      0.543 -0.167 -0.818       

               Comp.1 Comp.2 Comp.3 Comp.4
SS loadings      1.00   1.00   1.00   1.00
Proportion Var   0.25   0.25   0.25   0.25
Cumulative Var   0.25   0.50   0.75   1.00
Importance of components:
                          Comp.1    Comp.2    Comp.3     Comp.4
Standard deviation     1.5748783 0.9948694 0.5971291 0.41644938
Proportion of Variance 0.6200604 0.2474413 0.0891408 0.04335752
Cumulative Proportion  0.6200604 0.8675017 0.9566425 1.00000000

因子分析(EFA)

EFA是通过发掘隐藏在数据下的一组较少的、更为基本的、无法直接观测的变量,来解释一组已知可观测的变量的相关性,这些虚拟的变量被称为因子.这些虚拟的因子被认为可以解释多个观察可观测变量共有的方差.EFA

模型

$X_{i}=\sum_{j=1}^{p}a_{j}F_{j}+U_{i}$, $X_{i}$是第$i$个可观测变量,$F_{j}$是公共因子,$U_{i}$是$X_{i}$独有的部分.

R Packages for EFA

常用psych包以及FactoMineR包进行因子分析.

用psych包进行因子分析

以六个心理学测试间的相关性分析举例.112人参与了六个测试,分别为general,pictures,blocks,maze,reading,vocab,我们希望能够用一组潜在较少的心理学因素来解释参与者的测试得分.

提取公共因子

fa(r,nfactors= ,n.obs= ,rotate= ,scores= ,fm=)#r为相关系数矩阵或者原始数据矩阵,n.obs为观测值的个数,rotate表示旋转方法,scores=T/F 是否计算因子得分,fm表示提取公共因子的方法,如fm="pa"表示主轴迭代法,fm="ml"表示最大似然法,fm="minres"表示最小残差法(默认).

.

In [57]:
options(digits=2)#以下数据保留两位小数
cov<-ability.cov$cov
cor<-cov2cor(cov)#scales a covariance matrix into the corresponding correlation matrix efficiently
cor
library(psych)
#判断需要提取的公因子个数
fa.parallel(cor,n.obs=112,fa="both",n.iter=20,main="scree plots")#fa="both"显示PCA和EFA的结果,n.iter代表迭代次数
#提取公共因子
#主轴迭代
fa<-fa(cor,nfactors=2,rotate="none",fm="pa")
fa
#因子旋转
fa.v<-fa(cor,nfactors=2,rotate="varimax",fm="pa")#正交旋转
fa.v
fa.v$weights#因子得分
fa.diagram(fa.v,simple=F)#绘制正交结果图
A matrix: 6 × 6 of type dbl
generalpictureblocksmazereadingvocab
general1.000.470.550.340.580.51
picture0.471.000.570.190.260.24
blocks0.550.571.000.450.350.36
maze0.340.190.451.000.180.22
reading0.580.260.350.181.000.79
vocab0.510.240.360.220.791.00
Warning message in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
"The estimated weights for the factor scores are probably incorrect.  Try a different factor score estimation method."
Warning message in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, :
"An ultra-Heywood case was detected.  Examine the results carefully"
Warning message in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
"The estimated weights for the factor scores are probably incorrect.  Try a different factor score estimation method."
Warning message in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
"The estimated weights for the factor scores are probably incorrect.  Try a different factor score estimation method."
Warning message in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, :
"An ultra-Heywood case was detected.  Examine the results carefully"
Parallel analysis suggests that the number of factors =  2  and the number of components =  1 
Factor Analysis using method =  pa
Call: fa(r = cor, nfactors = 2, rotate = "none", fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
         PA1   PA2   h2    u2 com
general 0.75  0.07 0.57 0.432 1.0
picture 0.52  0.32 0.38 0.623 1.7
blocks  0.75  0.52 0.83 0.166 1.8
maze    0.39  0.22 0.20 0.798 1.6
reading 0.81 -0.51 0.91 0.089 1.7
vocab   0.73 -0.39 0.69 0.313 1.5

                       PA1  PA2
SS loadings           2.75 0.83
Proportion Var        0.46 0.14
Cumulative Var        0.46 0.60
Proportion Explained  0.77 0.23
Cumulative Proportion 0.77 1.00

Mean item complexity =  1.5
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are  15  and the objective function was  2.5
The degrees of freedom for the model are 4  and the objective function was  0.07 

The root mean square of the residuals (RMSR) is  0.03 
The df corrected root mean square of the residuals is  0.06 

Fit based upon off diagonal values = 0.99
Measures of factor score adequacy             
                                                   PA1  PA2
Correlation of (regression) scores with factors   0.96 0.92
Multiple R square of scores with factors          0.93 0.84
Minimum correlation of possible factor scores     0.86 0.68
Factor Analysis using method =  pa
Call: fa(r = cor, nfactors = 2, rotate = "varimax", fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
         PA1  PA2   h2    u2 com
general 0.49 0.57 0.57 0.432 2.0
picture 0.16 0.59 0.38 0.623 1.1
blocks  0.18 0.89 0.83 0.166 1.1
maze    0.13 0.43 0.20 0.798 1.2
reading 0.93 0.20 0.91 0.089 1.1
vocab   0.80 0.23 0.69 0.313 1.2

                       PA1  PA2
SS loadings           1.83 1.75
Proportion Var        0.30 0.29
Cumulative Var        0.30 0.60
Proportion Explained  0.51 0.49
Cumulative Proportion 0.51 1.00

Mean item complexity =  1.3
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are  15  and the objective function was  2.5
The degrees of freedom for the model are 4  and the objective function was  0.07 

The root mean square of the residuals (RMSR) is  0.03 
The df corrected root mean square of the residuals is  0.06 

Fit based upon off diagonal values = 0.99
Measures of factor score adequacy             
                                                   PA1  PA2
Correlation of (regression) scores with factors   0.96 0.92
Multiple R square of scores with factors          0.91 0.85
Minimum correlation of possible factor scores     0.82 0.71
A matrix: 6 × 2 of type dbl
PA1PA2
general 0.018 0.215
picture-0.008 0.097
blocks-0.198 0.794
maze 0.019 0.030
reading 0.842-0.224
vocab 0.191-0.020

阅读和词汇在第一因子上载荷较大,画图、积木图案和迷宫在第二因子上载荷较大,非语言的普通智力测量在两个因子上载荷较为平均, 这表明可能存在一个语言智力因子和一个非语言智力因子.

In [63]:
options(digits=2)#以下数据保留两位小数
cov<-ability.cov$cov
cor<-cov2cor(cov)#scales a covariance matrix into the corresponding correlation matrix efficiently
cor
library(psych)
#提取公共因子
#主轴迭代
fa<-fa(cor,nfactors=2,rotate="none",fm="pa")
fa
#因子旋转,允许两个因子相关时,可以使用斜交转轴法
library(GPArotation)
fa.promax <- fa(cor, nfactors = 2, rotate = "promax", fm = "pa")
fa.promax
A matrix: 6 × 6 of type dbl
generalpictureblocksmazereadingvocab
general1.000.470.550.340.580.51
picture0.471.000.570.190.260.24
blocks0.550.571.000.450.350.36
maze0.340.190.451.000.180.22
reading0.580.260.350.181.000.79
vocab0.510.240.360.220.791.00
Factor Analysis using method =  pa
Call: fa(r = cor, nfactors = 2, rotate = "none", fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
         PA1   PA2   h2    u2 com
general 0.75  0.07 0.57 0.432 1.0
picture 0.52  0.32 0.38 0.623 1.7
blocks  0.75  0.52 0.83 0.166 1.8
maze    0.39  0.22 0.20 0.798 1.6
reading 0.81 -0.51 0.91 0.089 1.7
vocab   0.73 -0.39 0.69 0.313 1.5

                       PA1  PA2
SS loadings           2.75 0.83
Proportion Var        0.46 0.14
Cumulative Var        0.46 0.60
Proportion Explained  0.77 0.23
Cumulative Proportion 0.77 1.00

Mean item complexity =  1.5
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are  15  and the objective function was  2.5
The degrees of freedom for the model are 4  and the objective function was  0.07 

The root mean square of the residuals (RMSR) is  0.03 
The df corrected root mean square of the residuals is  0.06 

Fit based upon off diagonal values = 0.99
Measures of factor score adequacy             
                                                   PA1  PA2
Correlation of (regression) scores with factors   0.96 0.92
Multiple R square of scores with factors          0.93 0.84
Minimum correlation of possible factor scores     0.86 0.68
Factor Analysis using method =  pa
Call: fa(r = cor, nfactors = 2, rotate = "promax", fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
          PA1   PA2   h2    u2 com
general  0.37  0.48 0.57 0.432 1.9
picture -0.03  0.63 0.38 0.623 1.0
blocks  -0.10  0.97 0.83 0.166 1.0
maze     0.00  0.45 0.20 0.798 1.0
reading  1.00 -0.09 0.91 0.089 1.0
vocab    0.84 -0.01 0.69 0.313 1.0

                       PA1  PA2
SS loadings           1.83 1.75
Proportion Var        0.30 0.29
Cumulative Var        0.30 0.60
Proportion Explained  0.51 0.49
Cumulative Proportion 0.51 1.00

 With factor correlations of 
     PA1  PA2
PA1 1.00 0.55
PA2 0.55 1.00

Mean item complexity =  1.2
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are  15  and the objective function was  2.5
The degrees of freedom for the model are 4  and the objective function was  0.07 

The root mean square of the residuals (RMSR) is  0.03 
The df corrected root mean square of the residuals is  0.06 

Fit based upon off diagonal values = 0.99
Measures of factor score adequacy             
                                                   PA1  PA2
Correlation of (regression) scores with factors   0.97 0.94
Multiple R square of scores with factors          0.93 0.88
Minimum correlation of possible factor scores     0.86 0.77

正交与斜交的不同之处在于,正交的因子分析重点是因子结构矩阵,即变量与因子的相关系数。而斜交的因子分析会考虑三个矩阵: 因子结构矩阵、因子模式矩阵和因子关联矩阵。 其中,因子模式矩阵即标准化的回归系数矩阵,$PA_1$和$PA_2$栏中的值组成了因子模式矩阵。 因子关联矩阵即因子相关系数矩阵。因子关联矩阵显示两个因子的相关系数为0.55(计算结果与原书上的数据略有差别),相关性很大。 因子结构矩阵(因子荷载阵)可以使用公式$F = P * \Phi$轻松得到,其中$F$是因子载荷阵,$P$为因子模式矩阵,$\Phi$为因子关联矩阵。下面的函数可以用于该乘法计算:

In [64]:
options(digits=2)#以下数据保留两位小数
cov<-ability.cov$cov
cor<-cov2cor(cov)#scales a covariance matrix into the corresponding correlation matrix efficiently
cor
library(psych)
#提取公共因子
#主轴迭代
fa<-fa(cor,nfactors=2,rotate="none",fm="pa")
fa
#因子旋转,允许两个因子相关时,可以使用斜交转轴法
library(GPArotation)
fa.promax <- fa(cor, nfactors = 2, rotate = "promax", fm = "pa")
fa.promax
fsm <- function(oblique) {
    if (class(oblique)[2]=="fa" & is.null(oblique$Phi)) {
        warning("Object doesn't look like oblique EFA")
     } else {
         P <- unclass(oblique$loading)
         F <- P %*% oblique$Phi
         colnames(F) <- c("PA1", "PA2")
         return(F)
     }
}
fsm(fa.promax)
fa.diagram(fa.promax)
A matrix: 6 × 6 of type dbl
generalpictureblocksmazereadingvocab
general1.000.470.550.340.580.51
picture0.471.000.570.190.260.24
blocks0.550.571.000.450.350.36
maze0.340.190.451.000.180.22
reading0.580.260.350.181.000.79
vocab0.510.240.360.220.791.00
Factor Analysis using method =  pa
Call: fa(r = cor, nfactors = 2, rotate = "none", fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
         PA1   PA2   h2    u2 com
general 0.75  0.07 0.57 0.432 1.0
picture 0.52  0.32 0.38 0.623 1.7
blocks  0.75  0.52 0.83 0.166 1.8
maze    0.39  0.22 0.20 0.798 1.6
reading 0.81 -0.51 0.91 0.089 1.7
vocab   0.73 -0.39 0.69 0.313 1.5

                       PA1  PA2
SS loadings           2.75 0.83
Proportion Var        0.46 0.14
Cumulative Var        0.46 0.60
Proportion Explained  0.77 0.23
Cumulative Proportion 0.77 1.00

Mean item complexity =  1.5
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are  15  and the objective function was  2.5
The degrees of freedom for the model are 4  and the objective function was  0.07 

The root mean square of the residuals (RMSR) is  0.03 
The df corrected root mean square of the residuals is  0.06 

Fit based upon off diagonal values = 0.99
Measures of factor score adequacy             
                                                   PA1  PA2
Correlation of (regression) scores with factors   0.96 0.92
Multiple R square of scores with factors          0.93 0.84
Minimum correlation of possible factor scores     0.86 0.68
Factor Analysis using method =  pa
Call: fa(r = cor, nfactors = 2, rotate = "promax", fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
          PA1   PA2   h2    u2 com
general  0.37  0.48 0.57 0.432 1.9
picture -0.03  0.63 0.38 0.623 1.0
blocks  -0.10  0.97 0.83 0.166 1.0
maze     0.00  0.45 0.20 0.798 1.0
reading  1.00 -0.09 0.91 0.089 1.0
vocab    0.84 -0.01 0.69 0.313 1.0

                       PA1  PA2
SS loadings           1.83 1.75
Proportion Var        0.30 0.29
Cumulative Var        0.30 0.60
Proportion Explained  0.51 0.49
Cumulative Proportion 0.51 1.00

 With factor correlations of 
     PA1  PA2
PA1 1.00 0.55
PA2 0.55 1.00

Mean item complexity =  1.2
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are  15  and the objective function was  2.5
The degrees of freedom for the model are 4  and the objective function was  0.07 

The root mean square of the residuals (RMSR) is  0.03 
The df corrected root mean square of the residuals is  0.06 

Fit based upon off diagonal values = 0.99
Measures of factor score adequacy             
                                                   PA1  PA2
Correlation of (regression) scores with factors   0.97 0.94
Multiple R square of scores with factors          0.93 0.88
Minimum correlation of possible factor scores     0.86 0.77
A matrix: 6 × 2 of type dbl
PA1PA2
general0.640.69
picture0.320.61
blocks0.430.91
maze0.250.45
reading0.950.46
vocab0.830.45

上述矩阵即为变量与因子间的相关系数,与正交旋转所得因子荷载阵相比,因为你允许潜在因子相关,所以你会发现该荷载阵列的噪音比较大。斜交需要关注三个矩阵, 相比较而言,斜交更为复杂,但是模型将更符合真实数据。

一张图总结PCA和EFA

image.png