A Rejection Sampling Approach to k-means++
===========================================


This project contains an implementation of the rejection sampling based RS-k-means++ algorithm for performing the k-means++ seeding [AV07]_ . The implementation is compatible with NumPy and scikit-learn. 

Installation
------------

This package can be directly installed from PyPI using

.. code-block:: python 

   pip install rskpp


Algorithm Reference 
---------------------


.. code-block:: python  

   rskmeanspp(data: np.array, k: int, m : int) -> np.array:

**Parameters:**

- **data** (`numpy.array`): Dataset of shape (n, d), where `n` is the number of data points and `d` is the number of features.
- **k** (`int`): The number of clusters.
- **m** (`int`): The upper bound on the number of rejection sampling iterations.

**Returns:**

- (`numpy.array`): A NumPy array of shape (k, d) containing the `k` cluster centers.


Example Usage 
--------------

The ``rskmeanspp`` function performs the seeding, which can used in conjunction with the k-means algorithm (also commonly known as Lloyd's iterations) from scikit-learn. 


.. code-block:: python  

   import numpy as np
   import matplotlib.pyplot as plt
   from sklearn.datasets import make_blobs
   from sklearn.cluster import KMeans
   from rskpp.rskpp import rskmeanspp

   # Generate synthetic dataset with 3 clusters
   n_samples = 500
   n_features = 2
   n_clusters = 3

   data, _ = make_blobs(n_samples=n_samples, centers=n_clusters, n_features=n_features, random_state=42)

   # Apply rskpp function
   k = n_clusters  # Number of clusters
   m = 200  # Upper bound on rejection sampling iterations

   centers = rskmeanspp(data, k, m)

   # Apply KMeans using rskmeanspp centers as initialization
   kmeans = KMeans(n_clusters=k, init=centers, n_init=1, random_state=42)
   kmeans.fit(data)

   # Plot dataset and final cluster centers
   plt.scatter(data[:, 0], data[:, 1], s=10, label="Data points")
   plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c='red', marker='x', s=100, label="Final centers")

   plt.legend()
   plt.title("K-Means with RS-k-means++ Initialization")
   plt.savefig("cluster-plot.png")
   plt.show()


License 
--------

The project is licensed under the MIT License.


.. [AV07] David Arthur and Sergei Vassilvitskii. “k-means++: the advantages of careful seeding”.In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms.SODA ’07. New Orleans, Louisiana: Society for Industrial and Applied Mathematics, 2007,pp. 1027–1035. isbn: 9780898716245. url: https://theory.stanford.edu/~sergei/papers/kMeansPP-soda.pdf.


.. toctree::
   :maxdepth: 2

   description
   reference 
   example
   experiments