Hello,
My colleague and I have been working on a project to implement the KMeans|| initialization in a distributed environment, for which we decided to use Dask. As part of this effort, we compared the performance of our code against the KMeans implementation in the dask_ml library.
While reviewing the dask_ml KMeans class, I noticed that standard (unweighted) KMeans is used for the centroid re-clustering step. After further investigation, we decided to incorporate weights in two parts of our computation:
- KMeans++ initialization
- Weighted average during centroid re-clustering
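To make the two weighted steps concrete, here is a minimal NumPy sketch of what we mean; the function names and signatures are our own illustration, not the actual code in our repository or in dask_ml. The first function samples seeds with probability proportional to weight times squared distance to the nearest existing seed, and the second recomputes each centroid as the weighted mean of its assigned points.

```python
import numpy as np

def weighted_kmeans_pp(X, weights, k, seed=None):
    """Pick k seeds from X; each point's selection probability is
    proportional to its weight times the squared distance to the
    nearest already-chosen seed (weighted k-means++)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # First seed: sample proportionally to the weights alone.
    probs = weights / weights.sum()
    centers = [X[rng.choice(n, p=probs)]]
    for _ in range(1, k):
        # Squared distance from every point to its nearest chosen seed.
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        p = weights * d2
        p = p / p.sum()
        centers.append(X[rng.choice(n, p=p)])
    return np.array(centers)

def weighted_centroid_update(X, weights, labels, k):
    """Recompute each centroid as the weighted average of the points
    assigned to it (the re-clustering step mentioned above)."""
    centers = np.zeros((k, X.shape[1]))
    for j in range(k):
        mask = labels == j
        w = weights[mask]
        centers[j] = (w[:, None] * X[mask]).sum(axis=0) / w.sum()
    return centers
```

In the KMeans|| setting, `X` would be the oversampled candidate centers and `weights` the number of original points closest to each candidate; re-clustering those candidates with the weights preserved is what distinguishes this from plain KMeans on the candidates.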
Although our implementation is currently less efficient per iteration than the one in dask_ml, we were able to outperform it overall when clustering a blob dataset. We believe this is because the weighted initialization leads to fewer clustering iterations, rather than any direct code optimization.
If you’re interested, feel free to check out our repository and take a look at the code:
github_repo