1.聚类

1.1 K-means聚类

In [7]:
#用iris数据集进行聚类分析
iris2<-iris[,1:4]
iris.kmeans<-kmeans(iris2,3)#选择聚类个数为3
iris.kmeans
K-means clustering with 3 clusters of sizes 50, 62, 38

Cluster means:
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1     5.006000    3.428000     1.462000    0.246000
2     5.901613    2.748387     4.393548    1.433871
3     6.850000    3.073684     5.742105    2.071053

Clustering vector:
  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [75] 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3
[112] 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3 2 3 3 3 2 3 3 3 2 3
[149] 3 2

Within cluster sum of squares by cluster:
[1] 15.15100 39.82097 23.87947
 (between_SS / total_SS =  88.4 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      
In [8]:
#用table函数查看分类结果情况
table(iris$Species,iris.kmeans$cluster)
            
              1  2  3
  setosa     50  0  0
  versicolor  0 48  2
  virginica   0 14 36
In [13]:
#上述属性中,最常用的就是centers和cluster属性,即中心点和聚类的分类集合
#下面我们将分类中心及中心点打印出来
plot(iris2$Sepal.Length,iris2$Sepal.Width,col=iris.kmeans$cluster,pch=as.integer(iris$Species))
points(iris.kmeans$centers,pch="X",cex=1.5,col=4)
#图注:颜色是实际聚类结果,形状是原始数据真实分类(共错分16个).

1.2 k-means++(k-means的改进)

1.3 GMM(基于模型的算法)

In [1]:
library(MASS)#载入包
library(ggplot2)
library(ggpubr)
data("geyser")#引入到R空间数据
ggscatter(geyser,x="duration",y="waiting")+geom_density2d()
#画二维概率密度图,在Rstudio里运行后如下所示,这样大体知道数据是来自三个二维正态分布的,将三个二维正态分布看成一个整体,就形成了一个混合模型.
Registered S3 methods overwritten by 'ggplot2':
  method         from 
  [.quosures     rlang
  c.quosures     rlang
  print.quosures rlang
Error in library(ggpubr): there is no package called 'ggpubr'
Traceback:

1. library(ggpubr)

gailvmidutu.png

In [ ]:
#载入包
library(mclust)
library(ggplot2)
library(factoextra)
set.seed(123)#设置随机种子
#混合模型聚类
mc<-Mclust(geyser)
ggplot(geyser,aes(x=duration,y=waiting,colour=mc$classification))+geom_point()
#在Rstudio里运行后如下图所示:

GMM.png

1.4 DBSCAN(基于密度的算法)

In [2]:
library(fpc)
iris2<-iris[-5]
#?dbscan
ds<-dbscan(iris2,eps=0.42,MinPts=5)
table(ds$cluster,iris$Species)
#打印出ds和iris2的聚类散点图
plot(ds,iris2)
   
    setosa versicolor virginica
  0      2         10        17
  1     48          0         0
  2      0         37         0
  3      0          3        33
In [15]:
#打印出iris第一列和第四列为坐标轴的聚类结果
plot(ds,iris2[,c(1,4)])
In [16]:
#另一个表示聚类结果的函数
#?plotcluster
plotcluster(iris2,ds$cluster)

1.5 谱聚类

In [26]:
library(kernlab)
library(ggplot2)
data(spirals)
df<-as.data.frame(spirals)#将数据设置为数据框格式
names(df)<-c("x1","x2")#重命名
ggplot(df,aes(x=x1,y=x2))+geom_point()#查看原始数据
Registered S3 methods overwritten by 'ggplot2':
  method         from 
  [.quosures     rlang
  c.quosures     rlang
  print.quosures rlang

Attaching package: 'ggplot2'

The following object is masked from 'package:kernlab':

    alpha

In [27]:
sc<-specc(spirals,centers=2)
df1<-df
df1$class<-as.factor(sc@.Data)#将类标签和原始数据融合
ggplot(df1,aes(x=x1,y=x2,colour=class))+geom_point()#进行可视化

1.6 层次聚类

In [18]:
dim(iris)
idx<-sample(1:dim(iris)[1],40)
iris3<-iris[idx,-5]
iris3
  1. 150
  2. 5
Sepal.LengthSepal.WidthPetal.LengthPetal.Width
1166.43.25.32.3
1206.02.25.01.5
526.43.24.51.5
1056.53.05.82.2
1216.93.25.72.3
424.52.31.30.3
915.52.64.41.2
195.73.81.70.3
415.03.51.30.3
905.52.54.01.3
655.62.93.61.3
144.33.01.10.1
1225.62.84.92.0
705.62.53.91.1
475.13.81.60.2
1396.03.04.81.8
536.93.14.91.5
1096.72.55.81.8
1367.73.06.12.3
15.13.51.40.2
576.33.34.71.6
85.03.41.50.2
1067.63.06.62.1
646.12.94.71.4
636.02.24.01.0
245.13.31.70.5
696.22.24.51.5
1496.23.45.42.3
314.83.11.60.2
1476.32.55.01.9
464.83.01.40.3
445.03.51.60.6
545.52.34.01.3
384.93.61.40.1
1486.53.05.22.0
1286.13.04.91.8
1087.32.96.31.8
715.93.24.81.8
835.82.73.91.2
766.63.04.41.4
In [25]:
#?hclust
hc<-hclust(dist(iris3),method="ave")
plot(hc,hang=-1,labels=iris$Species[idx])#这里hang=-1使得树的节点在下方对齐
#将树分为3块
rect.hclust(hc,k=3)
#?cutree
groups<-cutree(hc,k=3)

1.7 聚类结果的评价指标

1.8 聚类个数的确定

1.8.1手肘法elbow

In [28]:
#手肘法,这里仍以iris数据集为例
data(iris)
dataset<-iris[,-5]#把第5列去掉
dataset<-scale(dataset)#因为每一列的值差别很大,从1到100多的都有,这样会造成误差,所以需要归一化。
#编写代价函数
wssplot<-function(data,nc=5,seed=1234){
  wss<-(nrow(data)-1)*sum(apply(data,2,var))
  for(i in 1:nc){
    set.seed(seed)
    wss[i]<-sum(kmeans(data,centers=i)$withinss)
  }
  plot(1:nc,wss,type="b",xlab="Number of Clusters",ylab="Within groups sum of squares")}
wssplot(dataset)
#由下图可以看出从一类到四类下降的很快,之后下降的很慢,所以最佳聚类个数为3.

1.8.2Gap-statistic

In [27]:
library(cluster)
set.seed(123)
#?clusGap
gap_clust<-clusGap(dataset,kmeans,10,B=100,verbose=interactive())#这里的dataset是手肘法里定义的将iris第五列去掉后的数据。
install.packages("factoextra")
library(factoextra)
fviz_gap_stat(gap_clust)
#下图是在Rstudio里运行出的结果,可以看出,在聚类数为4的时候gap值取到了最大值,所以最佳聚类数为3.
also installing the dependencies 'viridis', 'ellipse', 'flashClust', 'leaps', 'scatterplot3d', 'dendextend', 'FactoMineR', 'ggpubr'

  There is a binary version available but the source version is later:
       binary source needs_compilation
ggpubr  0.3.0  0.4.0             FALSE

package 'viridis' successfully unpacked and MD5 sums checked
package 'ellipse' successfully unpacked and MD5 sums checked
package 'flashClust' successfully unpacked and MD5 sums checked
package 'leaps' successfully unpacked and MD5 sums checked
package 'scatterplot3d' successfully unpacked and MD5 sums checked
package 'dendextend' successfully unpacked and MD5 sums checked
package 'FactoMineR' successfully unpacked and MD5 sums checked
package 'factoextra' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
	C:\Users\Administrator\AppData\Local\Temp\RtmpIdAud2\downloaded_packages
installing the source package 'ggpubr'

Warning message in install.packages("factoextra"):
"installation of package 'ggpubr' had non-zero exit status"Warning message:
"package 'factoextra' was built under R version 3.6.3"Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
Error in loadNamespace(name): there is no package called 'ggpubr'
Traceback:

1. fviz_gap_stat(gap_clust)
2. ggpubr::ggline
3. getExportedValue(pkg, name)
4. asNamespace(ns)
5. getNamespace(ns)
6. loadNamespace(name)
7. withRestarts(stop(cond), retry_loadNamespace = function() NULL)
8. withOneRestart(expr, restarts[[1L]])
9. doWithOneRestart(return(expr), restart)

Gap-statistic.png