gRPC crashes dask worker

Fogapod · August 28, 2025, 3:22pm

I recently added a library that uses gRPC under the hood to my pipeline, and workers started dying randomly, presumably at compute() points, but it is hard to tell.

I use this library by creating a client in a preload step at worker startup and attaching it to the worker instance, then calling get_worker() to access it.
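For reference, the setup described above could look roughly like the sketch below. dask_setup is the hook Dask calls in worker preload scripts; make_client, the attribute name grpc_client, and use_client are hypothetical placeholders, not names from the library in question.

```python
# preload.py: passed to the worker with `dask worker ... --preload preload.py`


def make_client():
    """Hypothetical stand-in for constructing the gRPC-backed client."""
    return object()


def dask_setup(worker):
    # Called once in the worker process at startup; attach the client
    # to the worker instance so tasks can reach it later.
    worker.grpc_client = make_client()


def use_client():
    # Inside a task: look up the worker running this task and reuse its client.
    from distributed import get_worker  # imported lazily, only needed in tasks

    return get_worker().grpc_client
```

With this layout the client is created after the worker process has started, so any fork() that hits gRPC would have to happen later, from inside the worker itself.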

This is what all the crashes look like:

F0000 00:00:1755796341.585627     236 forkable.cc:58] Check failed: !std::exchange(is_forking_, true)
*** Check failure stack trace: ***
Unspecified Application Error
2025-08-21 17:12:21,637 - distributed.nanny - INFO - Worker process 79 exited with status 1
2025-08-21 17:12:21,655 - distributed.nanny - WARNING - Restarting worker

I also get a lot of these messages, although they seem harmless:

Other threads are currently calling into gRPC, skipping fork() handlers

I found a possible solution suggesting multiprocessing.set_start_method("forkserver"), but neither forkserver nor spawn mode worked (dask might be overwriting this somewhere; I replaced https://github.com/dask/dask/blob/main/dask/__main__.py and put set_start_method under if __name__ == "__main__":).
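A minimal sketch of that attempt, assuming a custom entrypoint script; configure_start_method is a hypothetical helper name. Note that set_start_method only changes the global default context. Code that builds its own context with multiprocessing.get_context(...) is not affected by it, which may be why the change appeared to have no effect.

```python
import multiprocessing


def configure_start_method(method="forkserver"):
    """Set the global start method if the platform supports it; return the active one."""
    if method in multiprocessing.get_all_start_methods():
        # force=True overrides a method that was already set earlier
        multiprocessing.set_start_method(method, force=True)
    return multiprocessing.get_start_method()


if __name__ == "__main__":
    # Must run before any process is started from this entrypoint.
    print("start method:", configure_start_method("forkserver"))
```

Printing the returned value from inside the entrypoint is a cheap way to confirm which method is actually in effect there.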

Has anyone seen something similar when using gRPC? What else is there to try to make it work in dask’s parallelized multiprocessing/threading environment?

guillaumeeb · August 28, 2025, 4:31pm

Hi @Fogapod,

If you are using a distributed Scheduler, you can set the process start method through the Dask configuration. The relevant key is distributed.worker.multiprocessing-method; see also the Configuration page of the Dask documentation.
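As a rough sketch, one way to set this key for workers started in containers is through a DASK_* environment variable. The helper dask_env_var below is hypothetical; it only illustrates how a config key maps onto the variable name, and the value "forkserver" is just an example.

```python
import os


def dask_env_var(key):
    """Convert a Dask config key to its environment-variable form (hypothetical helper)."""
    # Dask reads DASK_* variables; "__" marks nesting, and "-" / "_" are interchangeable.
    return "DASK_" + key.upper().replace(".", "__").replace("-", "_")


name = dask_env_var("distributed.worker.multiprocessing-method")
# Must be in the environment before the worker process imports dask.
os.environ[name] = "forkserver"
print(name)  # DASK_DISTRIBUTED__WORKER__MULTIPROCESSING_METHOD
```

From Python, dask.config.set({"distributed.worker.multiprocessing-method": "forkserver"}) should have the same effect, as long as it runs in the process that launches the worker before the worker is created.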

Did you try with a threaded Scheduler to see if you had problems?

Fogapod · August 28, 2025, 4:48pm

I will try this, thanks

Fogapod · August 28, 2025, 4:51pm

Although it seems to default to spawn, which should work, at least according to the Python signature in the docs: 'multiprocessing-method': 'spawn'

Fogapod · August 28, 2025, 7:22pm

This did not work. I made sure the setting is read correctly by entering invalid values too.

guillaumeeb · August 29, 2025, 12:49pm

Hmm, I don’t have many other suggestions. Do you have a more detailed error stack trace? Could you come up with a reproducer?

Fogapod · August 29, 2025, 1:23pm

No, this is all I have. It abruptly stops with this message.

Unfortunately no, at least not now. The workflow is quite complex and it fails randomly at different points.

I don't think I can do that either, because everything is configured for distributed deployment.

I filed an issue for that library too; I'll post back if I find something. Maybe I will try with grpcio too later.

guillaumeeb · August 29, 2025, 3:27pm

Which kind of cluster manager do you use (e.g. LocalCluster or something else)? One thing to try would be to disable the Nanny and see if that changes anything.
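A sketch of what disabling the Nanny could look like when the worker is started from the command line, for example as a container command. worker_command is a hypothetical helper, and the scheduler address and preload path are placeholders; --no-nanny and --preload are existing dask worker options.

```python
import shlex


def worker_command(scheduler, nanny=True, preload=None):
    """Build a `dask worker` command line (hypothetical helper)."""
    cmd = ["dask", "worker", scheduler]
    if not nanny:
        cmd.append("--no-nanny")  # run the worker directly, without a supervising Nanny
    if preload:
        cmd += ["--preload", preload]
    return cmd


# Placeholder scheduler address and preload path.
cmd = worker_command("tcp://scheduler:8786", nanny=False, preload="preload.py")
print(shlex.join(cmd))
# dask worker tcp://scheduler:8786 --no-nanny --preload preload.py
```

Without a Nanny the worker is not restarted automatically after a crash, so this is better suited to narrowing down the problem than to a permanent setup.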

Fogapod · August 29, 2025, 4:30pm

I deploy dask in Kubernetes. I have a custom image with a copy-pasted __main__.py entrypoint and custom preload files. Hope this answers the question.
