What is the relationship between number of workers and partitions in dask_ml.linear_models?

I have a dataset with 148 partitions of more than 2 GB each, on which I would like to run dask_ml.linear_model.LogisticRegression on a cluster with 4 GB of memory per worker and 160 cores.

When I execute
`.fit(df_X.to_dask_array(), df_y.to_dask_array())`, I only see 4 workers working and a lot of data exchange.

Could anyone please explain the relationship between the number of workers and the number of partitions when fitting a dask_ml linear model estimator?

Hi @rdanger and welcome!

It’s really difficult to tell what the problem is here without more information.

How is your Dask cluster created, and with which parameters? Workers with only 4 GB of RAM for 160 cores sounds odd.
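For reference, this is the kind of cluster-creation snippet that would help diagnose the issue. The parameter values below are purely hypothetical placeholders, not the asker's actual configuration:

```python
# Hypothetical example only: the asker's real cluster setup is unknown.
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(
    n_workers=4,           # number of worker processes
    threads_per_worker=2,  # threads per worker process
    memory_limit="4GB",    # memory limit per worker
)
client = Client(cluster)
print(client)  # shows worker/thread/memory summary
```

With a deployed cluster (Kubernetes, SLURM, etc.) the equivalent configuration parameters would be the ones to share.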

What is the code you’re trying to execute, could you provide a MWE? Or at least a code snippet?

2 GB per partition also sounds big; could you try reading the data with a smaller chunk size?
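As a rough back-of-the-envelope check (the 256 MB target below is an assumed value, just a common rule of thumb, not something from the thread):

```python
import math

def estimated_partitions(total_gb: float, target_mb: int) -> int:
    """Estimate how many partitions a dataset splits into at a target chunk size."""
    return math.ceil(total_gb * 1024 / target_mb)

# 148 partitions of ~2 GB each is roughly 296 GB of data.
total_gb = 148 * 2
print(estimated_partitions(total_gb, 256))
# In Dask, repartitioning to a target size can be done with, e.g.:
#   ddf = ddf.repartition(partition_size="256MB")
```

Smaller partitions also leave headroom on 4 GB workers, which otherwise cannot hold even two 2 GB partitions at once.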

In normal situations, your input partitions should be distributed across all the workers you have, and workers compute partial results on the partitions they hold. So you shouldn’t have more workers than partitions, but with fewer workers than partitions, the partitions should be spread approximately equally between workers.
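The balance described above can be sketched with a pure-Python analogy (no Dask needed). The real scheduler assigns tasks dynamically, so round-robin is only an approximation, and the 40 workers below are a hypothetical count:

```python
# Minimal sketch: round-robin assignment of partitions to workers,
# as a rough stand-in for Dask's dynamic scheduling.
from collections import Counter

def distribute(n_partitions: int, n_workers: int) -> Counter:
    # Counter maps worker index -> number of partitions assigned to it.
    return Counter(p % n_workers for p in range(n_partitions))

loads = distribute(148, 40)  # 148 partitions over 40 hypothetical workers
print(sorted(loads.values()))  # loads differ by at most one partition
```

If only 4 of many workers are active, that usually points at the cluster configuration or at the data layout, not at the estimator itself.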


Thank you very much @guillaumeeb. Your answer has helped me configure my cluster correctly and repartition my data sensibly.
