A Rejection Sampling Approach to k-means++

This project contains an implementation of the rejection sampling based RS-k-means++ algorithm for performing the k-means++ seeding [AV07] . The implementation is compatible with NumPy and scikit-learn.

Installation

This package can be directly installed from PyPI using

pip install rskpp

Algorithm Reference

rskmeanspp(data: np.array, k: int, m : int) -> np.array:

Parameters:

  • data (numpy.array): Dataset of shape (n, d), where n is the number of data points and d is the number of features.

  • k (int): The number of clusters.

  • m (int): The upper bound on the number of rejection sampling iterations.

Returns:

  • (numpy.array): A NumPy array of shape (k, d) containing the k cluster centers.

Example Usage

The rskmeanspp function performs the seeding, which can used in conjunction with the k-means algorithm (also commonly known as Lloyd’s iterations) from scikit-learn.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from rskpp.rskpp import rskmeanspp

# Generate synthetic dataset with 3 clusters
n_samples = 500
n_features = 2
n_clusters = 3

data, _ = make_blobs(n_samples=n_samples, centers=n_clusters, n_features=n_features, random_state=42)

# Apply rskpp function
k = n_clusters  # Number of clusters
m = 200  # Upper bound on rejection sampling iterations

centers = rskmeanspp(data, k, m)

# Apply KMeans using rskmeanspp centers as initialization
kmeans = KMeans(n_clusters=k, init=centers, n_init=1, random_state=42)
kmeans.fit(data)

# Plot dataset and final cluster centers
plt.scatter(data[:, 0], data[:, 1], s=10, label="Data points")
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c='red', marker='x', s=100, label="Final centers")

plt.legend()
plt.title("K-Means with RS-k-means++ Initialization")
plt.savefig("cluster-plot.png")
plt.show()

License

The project is licensed under the MIT License.

[AV07]

David Arthur and Sergei Vassilvitskii. “k-means++: the advantages of careful seeding”.In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms.SODA ’07. New Orleans, Louisiana: Society for Industrial and Applied Mathematics, 2007,pp. 1027–1035. isbn: 9780898716245. url: https://theory.stanford.edu/~sergei/papers/kMeansPP-soda.pdf.