Hello,
My colleague and I have been working on a project to implement the KMeans|| initialization in a distributed environment, for which we decided to use Dask. As part of this effort, we compared the performance of our code against the KMeans implementation in the dask_ml library.
While reviewing the dask_ml KMeans class, I noticed that standard (unweighted) KMeans is used for the centroid re-clustering step. After further investigation, we decided to incorporate weights in two parts of our computation:
- KMeans++ initialization
- Weighted average during centroid re-clustering
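To make the two weighted steps concrete, here is a minimal NumPy sketch of what we mean; the function names and signatures are our own illustration, not the actual code in our repository or in dask_ml. The first function samples seeds with probability proportional to weight times squared distance to the nearest existing seed, and the second recomputes each centroid as the weighted mean of its assigned points.

```python
import numpy as np

def weighted_kmeans_pp(X, weights, k, seed=None):
    """Pick k seeds from X; each point's selection probability is
    proportional to its weight times the squared distance to the
    nearest already-chosen seed (weighted k-means++)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # First seed: sample proportionally to the weights alone.
    probs = weights / weights.sum()
    centers = [X[rng.choice(n, p=probs)]]
    for _ in range(1, k):
        # Squared distance from every point to its nearest chosen seed.
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        p = weights * d2
        p = p / p.sum()
        centers.append(X[rng.choice(n, p=p)])
    return np.array(centers)

def weighted_centroid_update(X, weights, labels, k):
    """Recompute each centroid as the weighted average of the points
    assigned to it (the re-clustering step mentioned above)."""
    centers = np.zeros((k, X.shape[1]))
    for j in range(k):
        mask = labels == j
        w = weights[mask]
        centers[j] = (w[:, None] * X[mask]).sum(axis=0) / w.sum()
    return centers
```

In the KMeans|| setting, `X` would be the oversampled candidate centers and `weights` the number of original points closest to each candidate; re-clustering those candidates with the weights preserved is what distinguishes this from plain KMeans on the candidates.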
Although our implementation is currently less efficient per iteration than the one in dask_ml, we were able to outperform it overall when clustering a blob dataset. We believe this is because the weighted initialization leads to fewer clustering iterations, rather than any direct code optimization.
If you’re interested, feel free to check out our repository and take a look at the code:
github_repo