How parallelism works in dask-ml and what are its performance gain blocks

Hi @guillaumeeb,
I have following doubts:

  1. why dask-ml algorithms like kmeans, linear regression are not upstreamed(contributed) to stock scikit-learn?
  2. On single node, how parallelism works in dask-ml compared to oneTBB, openmp threading modules?
  3. What are the main performance gain blocks in daskml compared to the stock scikit-learn implementation?

Hi @vineel, welcome to Dask community forum!

I’m not sure if I’m the best person for a complete answer, ccing @TomAugspurger to complete.

  1. dask-ml depends on sklearn, it uses a lot of sklearn algorithms, and it build upon it to propose distributed algorithm optimized for Dask. dask-ml algorithms require Dask, sickit-learn needs to be able to work without Dask.
  2. Dask-ml doesn’t try to parallelize low level code like you would do with OpenMP directives. It either parallelize a single model learning by chunking the data and using partial fitting on each chunk of data, either parallelize multiple model learning for performing a GridSearch for example. Some other algorithm that can well be parallelized like RandomForest (where you can fit multiple tree models at the same time) also benefit from dask-ml.
  3. dask-ml is well suited if you have a lot of data and a model that can be learn by chunks, or if you need to learn in parallel several models that fit in memory. See Dask-ML — dask-ml 2022.5.28 documentation.
  1. For K-means, Dask-ML implements the k-means|| initialization strategy. My (possibly wrong understanding) is that it’s going to be worse than k-means++ for data that fit in memory (like scikit-learn typically focuses on). I think the linear models / minimizers like admm can take a while to converge, but I haven’t looked into that deeply. Other algorithms implemented first in Dask-ML, like Hyperband, have I think since been implemented in scikit-learn.

@guillaumeeb’s answer covers the rest. Dask-ML isn’t really useful for single-node parallelism, and the performance gains will typically come when you have a task that’s compute- or memory-bound that can be distributed on a cluster (so not too much communication).