With distributed, you should be able to view the worker logs through the dashboard; do you have access to it? You would also see error logs if the computation crashed too many times and was interrupted, but that doesn't seem to be the case here.
The most common cause of a worker restart is a memory problem.
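For context, the nanny restarts a worker once its process memory crosses a terminate threshold, and distributed exposes these thresholds in its configuration. A sketch of the relevant section of `~/.config/dask/distributed.yaml`, showing what I believe are the default fractions of the worker's memory limit (check your installed version's defaults before relying on these numbers):

```yaml
distributed:
  worker:
    memory:
      target: 0.60     # start spilling managed data to disk
      spill: 0.70      # spill based on process memory usage
      pause: 0.80      # pause accepting new tasks
      terminate: 0.95  # nanny kills and restarts the worker
```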
Aside from that, it would be much easier to help with a reproducer; do you think you could build one?
Yes, I’m monitoring the workers’ logs from e.g. http://127.0.0.1:8787/info/logs/tcp%3A%2F%2F127.0.0.1%3A43603.html during training. But although the nanny says “restarted”, it’s not actually the same worker: the worker gets shut down and another worker spawns to replace it, with a different port and fresh logs. So as soon as a worker gets “restarted”, the URL for its logs that I just wrote down becomes unreachable.
Yes, but be careful: it must not be too big at this point!
The error might be triggered at the operating-system level, by the oom_killer mechanism, leaving no chance for Dask to capture it.
Maybe it’s still a chunk-size problem, just at something smaller than the 45 GiB I’m seeing above. Or it might mean it’s something totally different. The only way to know would be to have a reproducer with representative data sizes.
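As a sanity check when building such a reproducer, it helps to compute a chunk's in-memory footprint from its shape and dtype. A minimal sketch (the shape below is hypothetical, chosen only because it lands near the 45 GiB figure above, assuming float64 data):

```python
from math import prod

def chunk_nbytes(shape, itemsize=8):
    """Memory footprint of one chunk in bytes (itemsize=8 for float64)."""
    return prod(shape) * itemsize

# Hypothetical chunk shape that ends up close to 45 GiB:
nbytes = chunk_nbytes((30_000, 200_000))
print(f"{nbytes / 2**30:.1f} GiB")  # → 44.7 GiB
```

Scaling the shape down until this number is comfortably below a single worker's memory limit (with headroom for the terminate threshold) gives a representative but safe reproducer size.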