I have a simple filter job (keeping rows where a value is greater than 0.5) running on about 11 GiB of data, with no repartitioning/shuffling. I'm seeing stored bytes climb to 100 GB at times during the run, and the job gets stuck with the following error message after it has already finished the tasks that write to disk:
Event loop was unresponsive in Nanny for 3.09s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability
The machine has 1 TB of memory.
Also, the Dask dashboard often hangs and the UI stops loading. Wondering if folks might have pointers!
Any ideas? Thanks!