当前位置：网站首页>【机器学习】Kmeans聚类

【机器学习】Kmeans聚类

2022-07-21 10:22:00 【jeffery0207】

写在篇前

Kmeans算法是一种经典的聚类算法，属于无监督学习的范畴。所谓聚类，即指对于给定的一个样本集，按照样本之间的距离大小，将样本集划分为K个簇，且让簇内的点尽量紧密的连在一起，而让簇间的距离尽量的大。

优点：

原理简单
速度快
对大数据集有比较好的伸缩性

缺点：

需要指定聚类数量K
对异常值敏感
对初始值敏感

原理概述

算法流程

以下描述基于 Lloyd’s 算法

设定一个k值，即设定需要聚类多少个簇
随机选择k个质心（centroids）
计算各个样本点到质心的距离，划分簇
重复第2、3步骤，直至迭代次数达到设定的最大值或者质心不再移动

评价准则

The k-means algorithm divides a set of N samples X into K disjoint clusters C, each described by the mean $\mu_j$ of the samples in the cluster. The means are commonly called the cluster “centroids”; note that they are not, in general, points from X, although they live in the same space. The K-means algorithm aims to choose centroids that minimise the inertia, or within-cluster sum of squared criterion:
$\sum\limits_{i=1}^k\sum\limits_{x \in C_i} ||x-\mu_i||_2^2$
attention, $\mu_j$ is just the so-called centroids:
$\mu_i = \frac{1}{|C_i|}\sum\limits_{x \in C_i}x$

值得注意的是：

Inertia假设聚类是凸的和各向同性的（convex and isotropic），但它对细长簇或具有不规则形状数据聚类不佳；
Inertia不是标准化的度量标准：我们只知道较低的值更好，零是最佳的；

算法改进

Kmeans ++

k个初始化的质心的位置选择对最后的聚类结果和运行时间都有很大的影响，因此需要选择合适的k个质心。K-Means++算法就是对K-Means随机初始化质心的方法的优化：

从输入的数据点集合中随机选择一个点作为第一个聚类中心\mu_1
对于数据集中的每一个点x_i，计算它与已选择的聚类中心中最近聚类中心的距离
$D(x_i) = arg\;min||x_i- \mu_r||2^2\;\;r=1,2,...k{selected}$
选择一个新的数据点作为新的聚类中心，选择的原则是：D(x)较大的点，被选取作为聚类中心的概率较大
重复b和c直到选择出k个聚类质心
利用这k个质心来作为初始化质心去运行标准的K-Means算法

elkan

在传统的K-Means算法中，我们在每轮迭代时，要计算所有的样本点到所有的质心的距离，这样会比较的耗时。elkan K-Means算法就是从这块入手加以改进。它的目标是减少不必要的距离的计算。elkan K-Means利用了两边之和大于等于第三边,以及两边之差小于第三边的三角形性质，来减少距离的计算。利用上边的两个规律，elkan K-Means比起传统的K-Means迭代速度有很大的提高。但是如果我们的样本的特征是稀疏的，有缺失值的话，这个方法就不使用了，此时某些距离无法计算，则不能使用该算法。

mini-batch

MiniBatch-KMeans是KMeans算法的一种变体，它使用mini-batch来减少计算时间，同时仍试图优化相同的目标函数。mini-batch是输入数据集的子集，在每次训练迭代中随机采样。它大大减少收敛到局部最优值所需的计算量，并达到大致相同的效果。

算法实现

在本篇主要借助sklearn包提供的接口来实现kmeans算法，具体的实现当然我们可以直接看源码啦~首先看一下构造函数：

    
    def __init__(self,
                 n_clusters=8,  # 即理论部分的k值
                 init='k-means++',  # 质心初始化方法
                 n_init=10,  # 质心初始化次数，最后结果取结果最好的一个
                 max_iter=300,  # 最大迭代次数
                 tol=1e-4,  # 容忍度，即kmeans运行准则收敛的条件
                 precompute_distances='auto',  # 是否需要提前计算距离,并将其放入内存
                 verbose=0,  # 冗长模式，深入看源码你会发现和操作系统底层有关，个人认为与算法本身无关
                 random_state=None,  # 实际上是种子，改参数会传入check_random_state()函数
                 copy_x=True,  # 当并行计算时，必须为True，为数据建立copy
                 n_jobs=1,  # 并行计算进程数，-1时表示占满cpu
                 algorithm='auto'  # ‘auto’, ‘full’, ‘elkan’, 其中 'full’表示用EM方式实现,即传统的kmeans算法计算距离
                )

上面这些参数可以一一对应到上面讲的理论部分，为了进一步了解我们继续深入，看看一个Kmeans对象有哪些属性和方法，同样直接看代码：

#! /usr/bin/python
# _*_ coding: utf-8 _*_
__author__ = 'Jeffery'

import numpy as np
from sklearn.cluster import KMeans

data = np.random.rand(100, 3)  # 生成一个随机数据,shape(100, 3)
estimator = KMeans(n_clusters=3,
                   init='k-means++',
                   n_init=10,
                   max_iter=300,
                   tol=1e-4,
                   precompute_distances='auto', 
                   verbose=0,
                   random_state=None,
                   copy_x=False,
                   n_jobs=1,
                   algorithm='auto') # 对于非稀疏数据，会选择elkan算法

estimator.fit(data)  # 聚类

# --------------------属性-------------------------
# 获取estimator构造函数中设置的属性
print('n_clusters:', estimator.n_clusters)
print('init:', estimator.init)
print('max_iter:', estimator.max_iter)
print('tol:', estimator.tol)
print('precompute_distances:', estimator.precompute_distances)
print('verbose:', estimator.verbose)
print('random_state:', estimator.random_state)
print('copy_x:', estimator.copy_x)
print('n_jobs:', estimator.n_jobs)
print('algorithm:', estimator.algorithm)
print('n_init:', estimator.n_init)

# 其他重要属性
print('n_iter_:', estimator.n_iter_)  # 实际迭代次数
print('cluster_centers_:', estimator.cluster_centers_)  # 获取聚类中心点结果
print('inertia_:', estimator.inertia_)  # 获取聚类准则的总和，越小越好
print('labels_:', estimator.labels_)  # 获取聚类标签结果
# --------------------函数-------------------------
print('get_params:', estimator.get_params(deep=True))  # 与set_params()相对

# 这里的介绍先不讲 fit()、transform()、fit_transform()、fit_predict()、predict()、score()等函数
# 这些函数放在案例里面

案例

这个示例来源于官网，这个示例旨在说明k-means的一些不合适的场景：在前三个图中，输入数据不符合k-means所产生的一些隐含假设，结果产生了不希望的聚类结果；在最后一个图中，k-means返回直观、合理的簇，尽管大小不均匀。这个示例很简单，但是一定要充分看到kmeans的应用关键，比如k选取的重要性、聚类各向异性数据的局限性

#! /usr/bin/python
# _*_ coding: utf-8 _*_
__author__ = 'Jeffery'
__date__ = '2018/11/11 13:55'

import numpy as np
import matplotlib.pyplot as plt

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

plt.figure(figsize=(12, 12))

n_samples = 1500
random_state = 170
X, y = make_blobs(n_samples=n_samples, random_state=random_state)

# Incorrect number of clusters
y_pred = KMeans(n_clusters=2, random_state=random_state).fit_predict(X)

plt.subplot(221)
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.title("Incorrect Number of Blobs")

# Anisotropicly distributed data
transformation = [[0.60834549, -0.63667341], [-0.40887718, 0.85253229]]
X_aniso = np.dot(X, transformation)
y_pred = KMeans(n_clusters=3, random_state=random_state).fit_predict(X_aniso)

plt.subplot(222)
plt.scatter(X_aniso[:, 0], X_aniso[:, 1], c=y_pred)
plt.title("Anisotropicly Distributed Blobs")

# Different variance
X_varied, y_varied = make_blobs(n_samples=n_samples,
                                cluster_std=[1.0, 2.5, 0.5],
                                random_state=random_state)
y_pred = KMeans(n_clusters=3, random_state=random_state).fit_predict(X_varied)

plt.subplot(223)
plt.scatter(X_varied[:, 0], X_varied[:, 1], c=y_pred)
plt.title("Unequal Variance")

# Unevenly sized blobs
X_filtered = np.vstack((X[y == 0][:500], X[y == 1][:100], X[y == 2][:10]))
y_pred = KMeans(n_clusters=3,
                random_state=random_state).fit_predict(X_filtered)

plt.subplot(224)
plt.scatter(X_filtered[:, 0], X_filtered[:, 1], c=y_pred)
plt.title("Unevenly Sized Blobs")

plt.show()