A Rejection Sampling Approach to k-means++
This project contains an implementation of the rejection sampling based RS-k-means++ algorithm for performing the k-means++ seeding [AV07] . The implementation is compatible with NumPy and scikit-learn.
Installation
This package can be directly installed from PyPI using
pip install rskpp
Algorithm Reference
rskmeanspp(data: np.array, k: int, m : int) -> np.array:
Parameters:
data (numpy.array): Dataset of shape (n, d), where n is the number of data points and d is the number of features.
k (int): The number of clusters.
m (int): The upper bound on the number of rejection sampling iterations.
Returns:
(numpy.array): A NumPy array of shape (k, d) containing the k cluster centers.
Example Usage
The rskmeanspp function performs the seeding, which can used in conjunction with the k-means algorithm (also commonly known as Lloyd’s iterations) from scikit-learn.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from rskpp.rskpp import rskmeanspp
# Generate synthetic dataset with 3 clusters
n_samples = 500
n_features = 2
n_clusters = 3
data, _ = make_blobs(n_samples=n_samples, centers=n_clusters, n_features=n_features, random_state=42)
# Apply rskpp function
k = n_clusters # Number of clusters
m = 200 # Upper bound on rejection sampling iterations
centers = rskmeanspp(data, k, m)
# Apply KMeans using rskmeanspp centers as initialization
kmeans = KMeans(n_clusters=k, init=centers, n_init=1, random_state=42)
kmeans.fit(data)
# Plot dataset and final cluster centers
plt.scatter(data[:, 0], data[:, 1], s=10, label="Data points")
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c='red', marker='x', s=100, label="Final centers")
plt.legend()
plt.title("K-Means with RS-k-means++ Initialization")
plt.savefig("cluster-plot.png")
plt.show()
License
The project is licensed under the MIT License.
David Arthur and Sergei Vassilvitskii. “k-means++: the advantages of careful seeding”.In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms.SODA ’07. New Orleans, Louisiana: Society for Industrial and Applied Mathematics, 2007,pp. 1027–1035. isbn: 9780898716245. url: https://theory.stanford.edu/~sergei/papers/kMeansPP-soda.pdf.