How parallelism works in dask-ml and what are its performance gain blocks

vineel · May 24, 2023, 6:01am

Hi @guillaumeeb,
I have following doubts:

why dask-ml algorithms like kmeans, linear regression are not upstreamed(contributed) to stock scikit-learn?
On single node, how parallelism works in dask-ml compared to oneTBB, openmp threading modules?
What are the main performance gain blocks in daskml compared to the stock scikit-learn implementation?

guillaumeeb · May 30, 2023, 3:22am

Hi @vineel, welcome to Dask community forum!

I’m not sure if I’m the best person for a complete answer, ccing @TomAugspurger to complete.

dask-ml depends on sklearn, it uses a lot of sklearn algorithms, and it build upon it to propose distributed algorithm optimized for Dask. dask-ml algorithms require Dask, sickit-learn needs to be able to work without Dask.
Dask-ml doesn’t try to parallelize low level code like you would do with OpenMP directives. It either parallelize a single model learning by chunking the data and using partial fitting on each chunk of data, either parallelize multiple model learning for performing a GridSearch for example. Some other algorithm that can well be parallelized like RandomForest (where you can fit multiple tree models at the same time) also benefit from dask-ml.
dask-ml is well suited if you have a lot of data and a model that can be learn by chunks, or if you need to learn in parallel several models that fit in memory. See Dask-ML — dask-ml 2022.5.28 documentation.

TomAugspurger · May 30, 2023, 11:57am

For K-means, Dask-ML implements the k-means|| initialization strategy. My (possibly wrong understanding) is that it’s going to be worse than k-means++ for data that fit in memory (like scikit-learn typically focuses on). I think the linear models / minimizers like admm can take a while to converge, but I haven’t looked into that deeply. Other algorithms implemented first in Dask-ML, like Hyperband, have I think since been implemented in scikit-learn.

@guillaumeeb’s answer covers the rest. Dask-ML isn’t really useful for single-node parallelism, and the performance gains will typically come when you have a task that’s compute- or memory-bound that can be distributed on a cluster (so not too much communication).

vineel · June 27, 2023, 9:25am

Hi @TomAugspurger & @guillaumeeb,
Thanks for the response and clarifying the doubts.

Topic		Replies	Views
What speedups can I expect from training with DaskML? dask-ml	2	442	January 4, 2022
Basic question about parallelizing different model fits with dask(-ml) Distributed dask-ml	0	192	November 17, 2022
Clarification on Distributed Dask ML (Is ML really Distributed?) Distributed dask-ml	1	237	August 23, 2022
Performance Problem (or missing understanding) with LinearRegression Dask Array dask-array , dask-ml	3	51	December 18, 2024
Dask - ml /Dask -sql with pytorch,keras or tensorflow dask-ml	1	176	August 24, 2023

How parallelism works in dask-ml and what are its performance gain blocks

Related topics