Using processes instead of threads for workers

After seeing GIL contention in the new GIL monitoring tab in the Dask dashboard, I've identified some work that might benefit from processes instead of threads.

Is there a way to configure Dask to use processes instead of threads for each task when using the dask-operator / dask-kubernetes / dask-helm, without modifying the startup command?

I am thinking of something like setting an environment variable before startup, for example:

"DASK_DISTRIBUTED__WORKER__PROCESS__USE_THREADS": "False"

Is this a possibility?
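For context, the mechanism I have in mind is Dask's usual translation of DASK_-prefixed environment variables into config keys (double underscores become dots, and values are parsed as YAML). A minimal sketch of that mechanism, using a key that I know exists:

```python
import os
import dask

# Dask turns DASK_-prefixed environment variables into config keys:
# double underscores become dots and the value is parsed as YAML.
os.environ["DASK_DISTRIBUTED__WORKER__DAEMON"] = "False"
dask.config.refresh()  # re-read config sources, including the environment

print(dask.config.get("distributed.worker.daemon"))  # -> False
```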

@jacobtomlinson maybe you have some insight here?

Hi @ljstrnadiii,

Could you be a little more precise about what you can and cannot modify? For instance, I understand that you cannot modify the KubeCluster constructor call, but you are able to set environment variables before it is called? And you cannot modify any YAML files or run kubectl commands?

Hey @guillaumeeb, thanks for the response.

I am able to modify the actual dask code being run on the cluster and I can set an environment variable before the dask-worker startup command.

I have gathered that dask.config.set(scheduler='processes') only configures the local scheduler, and I don't think there is an environment variable like DASK_DISTRIBUTED__WORKER__PROCESS__USE_THREADS to set.
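To illustrate what I mean by local-only, a small sketch (this runs tasks in subprocesses on my machine, but as far as I can tell it changes nothing about the cluster workers):

```python
import dask
import dask.bag as db

# Selects the local multiprocessing scheduler. This only applies to
# computations that don't go through a distributed Client; it has no
# effect on how dask-worker processes on the cluster execute tasks.
dask.config.set(scheduler="processes")

b = db.from_sequence(range(8), npartitions=4)
print(b.map(lambda x: x * x).sum().compute())  # runs in local subprocesses
```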

A last resort would be to modify the dask-worker (or dask-scheduler) startup command, but looking at the help for those commands doesn't make the right options obvious to me either.

Does this add a bit more clarity?

If you are able to modify the dask-worker startup command, then what you want are the --nworkers and --nthreads options.

I’m not sure if those can be defined using configuration or environment variables.
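If you can change the command, a sketch of what I mean (the scheduler address here is a placeholder):

```shell
# One single-threaded process per worker container; adjust --nworkers
# to however many processes you want per host.
dask-worker tcp://scheduler:8786 --nworkers 4 --nthreads 1
```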

There are two ways to avoid GIL contention.

The first is to have worker processes with a single thread each - typically more than one worker process per host. The dask-worker command line can do that (the --nworkers/--nthreads options mentioned above). The drawbacks are that data at rest is replicated in each process, and that each worker's network I/O will likely still be clogged by GIL contention.

The second is to run a ProcessPoolExecutor inside the worker. Have a look at the executor parameter of the Worker class. dask-worker doesn’t offer that out of the box, but whatever you’re using to start the workers might. The drawback of this solution is that you’ll pay for serialization, IPC transfer, and deserialization for all data in and out of each task.
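A minimal sketch of that second option, assuming you can start the worker programmatically (the scheduler address, thread count, and pool size are placeholders):

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

from distributed import Worker


async def run_worker(scheduler_address: str) -> None:
    # Replace the default thread pool with a process pool: tasks now run
    # in subprocesses, so each task's inputs and outputs are serialized
    # and shipped over IPC (the cost mentioned above).
    async with Worker(
        scheduler_address,
        nthreads=4,
        executor=ProcessPoolExecutor(max_workers=4),
    ) as worker:
        await worker.finished()


asyncio.run(run_worker("tcp://scheduler:8786"))
```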
