Using processes instead of threads for workers

After seeing GIL contention in the new GIL monitoring tab in the Dask dashboard, I've identified some work that might benefit from processes instead of threads.

Is there a way to configure Dask to use processes instead of threads for each task when using the dask-operator / dask-kubernetes / dask-helm, without modifying the startup command?

I am thinking of something like setting an environment variable before startup, for example:

"DASK_DISTRIBUTED__WORKER__PROCESS__USE_THREADS": "False"

Is this a possibility?
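For context, the mechanism I have in mind is Dask's usual translation of DASK_-prefixed environment variables into config keys (double underscores become dots, and values are parsed as YAML). A minimal sketch of that mechanism, using a key that I know exists:

```python
import os
import dask

# Dask turns DASK_-prefixed environment variables into config keys:
# double underscores become dots and the value is parsed as YAML.
os.environ["DASK_DISTRIBUTED__WORKER__DAEMON"] = "False"
dask.config.refresh()  # re-read config sources, including the environment

print(dask.config.get("distributed.worker.daemon"))  # -> False
```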

@jacobtomlinson maybe you have some insight here?

Hi @ljstrnadiii,

Could you be a little more precise about what you can and cannot modify? For instance, I understand that you cannot modify the KubeCluster constructor call, but you are able to set environment variables before it is called? And you cannot modify any YAML files or run kubectl commands?

Hey @guillaumeeb, thanks for the response.

I am able to modify the actual dask code being run on the cluster and I can set an environment variable before the dask-worker startup command.

I have gathered that dask.config.set(scheduler='processes') only configures the local scheduler, and I don't think there is an environment variable like DASK_DISTRIBUTED__WORKER__PROCESS__USE_THREADS to set.
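To illustrate what I mean by local-only, a small sketch (this runs tasks in subprocesses on my machine, but as far as I can tell it changes nothing about the cluster workers):

```python
import dask
import dask.bag as db

# Selects the local multiprocessing scheduler. This only applies to
# computations that don't go through a distributed Client; it has no
# effect on how dask-worker processes on the cluster execute tasks.
dask.config.set(scheduler="processes")

b = db.from_sequence(range(8), npartitions=4)
print(b.map(lambda x: x * x).sum().compute())  # runs in local subprocesses
```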

A last resort would be to modify the dask-worker (or dask-scheduler) startup command, but looking at the help for those commands doesn't make the right options obvious to me either.

Does this add a bit more clarity?

If you are able to modify the dask-worker startup command, then what you want are the --nworkers and --nthreads options.

I’m not sure if those can be defined using configuration or environment variables.
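If you can change the command, a sketch of what I mean (the scheduler address here is a placeholder):

```shell
# One single-threaded process per worker container; adjust --nworkers
# to however many processes you want per host.
dask-worker tcp://scheduler:8786 --nworkers 4 --nthreads 1
```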

There are two ways to avoid GIL contention.

The first is to have worker processes with a single thread each - typically more than one worker process per host. The dask-worker command line can do that (the --nworkers/--nthreads options mentioned above). The drawbacks are that data at rest is replicated in each process, and that each worker's network I/O will likely still be clogged by GIL contention.

The second is to run a ProcessPoolExecutor inside the worker. Have a look at the executor parameter of the Worker class. dask-worker doesn’t offer that out of the box, but whatever you’re using to start the workers might. The drawback of this solution is that you’ll pay for serialization, IPC transfer, and deserialization for all data in and out of each task.
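A minimal sketch of that second option, assuming you can start the worker programmatically (the scheduler address, thread count, and pool size are placeholders):

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

from distributed import Worker


async def run_worker(scheduler_address: str) -> None:
    # Replace the default thread pool with a process pool: tasks now run
    # in subprocesses, so each task's inputs and outputs are serialized
    # and shipped over IPC (the cost mentioned above).
    async with Worker(
        scheduler_address,
        nthreads=4,
        executor=ProcessPoolExecutor(max_workers=4),
    ) as worker:
        await worker.finished()


asyncio.run(run_worker("tcp://scheduler:8786"))
```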
