Troubleshooting intermittent hanging behavior with one worker stuck running

I am using xclim to do some bias-adjustment of climate data (zarr).

Things seem to be working until the very end, where processing halts just short of completion. Below is what the dashboard looks like when it's stuck.
It also appears that one worker is stuck processing something that will not finish, as seen in the CPU utilization. (I cannot embed more than one image, so they are combined into one.)

Sometimes it works, sometimes it hangs. I can't really discern anything from the logs, but I'm also not sure I'm looking in the right places, so I'm just looking for some general advice on where to look and what to look for.
Dask hanging intermittently (key word) has been pretty common for me; I even have shell scripts to run and restart things if they haven't completed after some set time. I'm just getting to the point where I would really like to make things more robust! I am working on a condo-style HPC managed with Slurm, mostly with xarray + netCDF and more recently zarr. I have tried using dask-jobqueue and get the same behavior for this particular processing pipeline I am working on.
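
For context, here is a stripped-down sketch of the kind of pipeline where this happens (the dataset names, resource numbers, and the specific xclim adjustment are illustrative placeholders, not my actual workflow):

```python
import xarray as xr
import xclim.sdba as sdba
from dask_jobqueue import SLURMCluster
from dask.distributed import Client

# Site-specific options (queue, account, interface, ...) omitted
cluster = SLURMCluster(cores=24, memory="96GB", walltime="04:00:00")
cluster.scale(jobs=4)
client = Client(cluster)

# Reference, historical, and simulated data stored as zarr
ref = xr.open_zarr("ref.zarr")["tasmax"]
hist = xr.open_zarr("hist.zarr")["tasmax"]
sim = xr.open_zarr("sim.zarr")["tasmax"]

# Train a bias-adjustment object and apply it lazily
qdm = sdba.QuantileDeltaMapping.train(ref, hist, nquantiles=50, group="time.month")
scen = qdm.adjust(sim)

# The hang shows up near the end of this write step
scen.to_dataset(name="tasmax").to_zarr("adjusted.zarr", mode="w")
```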

Thank you!

Hi @Kyle_Redilla, welcome to Dask Discourse forum!

Okay, this will be difficult to investigate as is… If you don't find anything in the Scheduler or Worker logs (especially those of the Worker handling the last task), then I'm not sure where else to look…

I take it you are using a LocalCluster in this example?

The best thing would be for you to provide an MCVE, as simple as possible and with available data. At least try to identify the workflow where this is occurring; it looks like you are trying to write a result to disk?

I don’t think I can provide a MCVE. I’ve since found that this issue seems to be filesystem-dependent.

I have been running Dask-based processing primarily on two different filesystems: one is Lustre and one is BeeGFS. I've noticed that I only see issues like the above when reading/writing on the BeeGFS system. Running the exact same code on the Lustre filesystem (different pipelines, datasets, etc.), I have yet to experience this hanging behavior. I notified our sysadmins of this, but nothing has been figured out yet. I am exploring it myself because it would be great to be able to use this filesystem (there is much more storage).

To answer your question: yes, in the example above I think I was using a LocalCluster, but this also happens when using a dask-jobqueue SLURMCluster. It hangs, usually near the end of the run (based on the task graph), and some workers seem to be at 100% (or more) CPU while most others are at 0-2%.

When I search the worker logs for the address of one of the "racing" workers, e.g. grep -rl "42535" /worker/logs/, I get one .err file, and there doesn't seem to be any info in it besides "starting worker" and "listening to". There is nothing in the .out files. Here is an example .err file, with the only two instances of that particular worker (port number) highlighted.

I have been trying to see if I can get more info written to the logs, but I'm having trouble doing so. If there is a quick tip on how to do this, that would be great to know.

(apologies for the long delay)

You should be able to customize logging by following Debug — Dask documentation, or How to debug — Dask-jobqueue documentation.
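
To make that a bit more concrete, here is a minimal sketch of raising the Dask log levels from Python. Note this only affects the process it runs in; workers started in separate processes or Slurm jobs pick up their logging settings from the Dask YAML configuration described in those docs, so that is the more reliable route on a cluster:

```python
import logging

# Bump the "distributed" loggers to DEBUG in the current process
# (the client and, for a LocalCluster, the scheduler running alongside it).
for name in ("distributed", "distributed.scheduler", "distributed.worker"):
    logging.getLogger(name).setLevel(logging.DEBUG)

# Recent distributed versions can also forward worker-side log records
# back to the client process; check that your version has this method:
# client.forward_logging(level=logging.DEBUG)
```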

You could also look at the other information on those pages. Ultimately, you might try enabling transition logs to understand what is happening to the tasks that hang near the end.
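
Here is a rough sketch of pulling that kind of information from a live client; the task key is a placeholder you would take from the dashboard or from client.processing():

```python
from dask.distributed import Client

client = Client()  # or Client("tcp://<scheduler-address>:8786") for an existing cluster

# Which keys does each worker think it is currently processing?
# The stuck worker should show up here with its long-running task.
print(client.processing())

# Live Python call stacks of the tasks running right now; often the
# quickest way to see where a 100%-CPU task is spinning.
print(client.call_stack())

# Ask the scheduler for the transition history ("story") of a suspect key.
# The transition log is bounded, so capture this soon after the hang.
def story(dask_scheduler, key):
    return dask_scheduler.story(key)

print(client.run_on_scheduler(story, key="<suspect-task-key>"))
```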

Considering what you said about BeeGFS vs. Lustre, I would suspect a race condition when writing some results to the file system, but it's hard to tell without your sysadmins' help…
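
If it is something on the write path, one cheap experiment (just a sketch; the paths are placeholders) is to separate the store/metadata setup from the actual chunk writes, which at least tells you which part of the final write step is hanging:

```python
import xarray as xr

ds = xr.open_zarr("input.zarr")  # placeholder path

# ... bias adjustment, etc. ...

# Set up the zarr store and metadata first, then trigger the chunk writes
# as a separate compute step. If the hang always happens in the .compute()
# below, it points at chunk writes on BeeGFS rather than at the
# computation itself.
delayed_write = ds.to_zarr("output.zarr", mode="w", compute=False)
delayed_write.compute()
```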