Dask_ml KMeans optimization

Hello,

My colleague and I were working on a project to implement the KMeans|| initialization in a distributed environment, for which we decided to use Dask. As part of this effort, we compared the performance of our code against the KMeans implementation in the dask_ml library.

While reviewing the dask_ml KMeans class, we noticed that standard (unweighted) KMeans is used for centroid re-clustering. After further investigation, we decided to incorporate weights in two parts of our computation:

  • KMeans++ initialization
  • Weighted average during centroid re-clustering
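
For illustration only, here is a minimal single-machine NumPy sketch of what those two weighted steps could look like. It is neither the code from the repository nor the dask_ml API. The function names `weighted_kmeans_pp` and `weighted_centroid_update` are hypothetical, and the sketch assumes each candidate point carries a non-negative weight (for example, the number of original samples it represents).

```python
import numpy as np

def weighted_kmeans_pp(points, weights, k, rng):
    """Pick k seeds by D^2 sampling, with each point's probability scaled by its weight."""
    n = points.shape[0]
    # First seed: sample proportionally to weight alone.
    first = rng.choice(n, p=weights / weights.sum())
    centers = [points[first]]
    # Squared distance from every point to its nearest chosen seed.
    d2 = np.sum((points - centers[0]) ** 2, axis=1)
    for _ in range(1, k):
        probs = weights * d2
        total = probs.sum()
        if total == 0:
            # Every point coincides with a chosen seed: fall back to weights only.
            probs = weights / weights.sum()
        else:
            probs = probs / total
        idx = rng.choice(n, p=probs)
        centers.append(points[idx])
        d2 = np.minimum(d2, np.sum((points - points[idx]) ** 2, axis=1))
    return np.array(centers)

def weighted_centroid_update(points, weights, labels, k):
    """Recompute each centroid as the weighted mean of the points assigned to it."""
    dim = points.shape[1]
    sums = np.zeros((k, dim))
    # Accumulate weight * point into the row of the assigned cluster.
    np.add.at(sums, labels, points * weights[:, None])
    totals = np.bincount(labels, weights=weights, minlength=k)
    centers = np.zeros((k, dim))
    nonempty = totals > 0  # empty clusters are left at the origin in this sketch
    centers[nonempty] = sums[nonempty] / totals[nonempty, None]
    return centers
```

With all weights equal to 1, both functions reduce to ordinary KMeans++ seeding and the ordinary mean update.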

Although our implementation is currently less efficient than the one in dask_ml, we were able to outperform it when clustering a blob dataset. We believe this is because our version needs fewer clustering iterations, not because the code itself is better optimized.

If you’re interested, feel free to check out our repository and take a look at the code:
github_repo


Thanks for sharing @tusca99!

Also, feel free to provide feedback or propose changes directly on the dask-ml GitHub repository!
