Troubleshooting intermittent hanging behavior with one worker stuck running

I am using xclim to do some bias-adjustment of climate data (zarr).

Things seem to be working until the very end, where processing halts near completion. Below is what the dashboard looks like when it’s stuck.
It also appears that one worker is stuck processing something that will not finish, as seen by the CPU utilization. (I cannot embed more than one image, so they are combined into one.)

Sometimes it works, sometimes it hangs. I can’t discern anything from the logs, but I’m also not sure I am looking in the right places, so I am just looking for some general advice on where to look and what to look for.
Dask hanging intermittently (key word) has been a pretty common occurrence for me; I even have shell scripts to run and restart things if they haven’t completed after some set time. I’m just getting to the point where I would really like to make things more robust! I am working on a condo-style HPC managed with Slurm, mostly with xarray + netCDF and more recently zarr. I have tried using dask-jobqueue and get the same behavior for this particular processing pipeline.
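For reference, the cluster setup is roughly like the sketch below (the partition name, resources, and number of worker jobs are simplified placeholders, not my exact configuration):

```python
from dask_jobqueue import SLURMCluster
from dask.distributed import Client

# Illustrative resources only; the real Slurm job requests differ.
cluster = SLURMCluster(
    queue="main",          # placeholder partition name
    cores=24,
    memory="96GB",
    walltime="04:00:00",
)
cluster.scale(jobs=4)      # request a handful of worker jobs
client = Client(cluster)
```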

Thank you!

Hi @Kyle_Redilla, welcome to Dask Discourse forum!

Okay, this will be difficult to investigate as is… If you don’t find anything in the Scheduler or Worker logs (especially those of the Worker handling the last task), then I’m not sure where else to look…
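If it helps, you can pull those logs and the current call stacks from the client side while the computation is hung; something along these lines (a rough sketch, adapt the scheduler address to your setup):

```python
from dask.distributed import Client

client = Client("tcp://<scheduler-address>:8786")  # or reuse the client you already have

# Recent scheduler and worker logs, without ssh-ing into the nodes
print(client.get_scheduler_logs(n=100))
worker_logs = client.get_worker_logs(n=100)

# Which tasks each worker currently thinks it is processing
print(client.processing())

# Python call stacks of whatever is running on the workers right now;
# this usually points at the function the stuck worker is blocked in
print(client.call_stack())
```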

I take it you are using a LocalCluster in this example?

The best would be if you could provide an MVCE, as simple as possible and with available data. At least try to identify a workflow where this is occurring; it looks like you are trying to write a result to disk?
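For example, even a stripped-down skeleton along these lines (the input data and the computation are placeholders standing in for your xclim adjustment) would already help narrow down whether the hang happens at the write-to-disk step:

```python
import xarray as xr
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=4, threads_per_worker=1)  # illustrative sizing
client = Client(cluster)

# Placeholder input; ideally a small, publicly available dataset
sim = xr.open_zarr("sim.zarr")

# Stand-in for the xclim bias-adjustment step
adjusted = sim - sim.mean("time")

# The write step where the hang seems to happen
adjusted.to_zarr("adjusted.zarr", mode="w")
```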