With distributed, you should be able to view the Worker logs through the dashboard; do you have access to it? You should also see error logs if the computation crashes too many times and is interrupted, but that does not seem to be the case here.
The most common cause of Worker restarts is a memory problem.
Aside from that, it would be much easier to help with a reproducer; do you think you could build one?
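If the dashboard pages are awkward to keep open, you can also pull the same logs through the client. This is just a sketch assuming a dask.distributed Client connected to your scheduler (the address below is a placeholder):

from dask.distributed import Client

client = Client("tcp://127.0.0.1:8786")  # placeholder scheduler address

# logs of every worker the scheduler currently knows about, keyed by address
for addr, entries in client.get_worker_logs().items():
    print(f"=== {addr} ===")
    for entry in entries:  # roughly a (level, message) pair in recent versions
        print(entry)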
Yes, I’m monitoring the workers’ logs from e.g. http://127.0.0.1:8787/info/logs/tcp%3A%2F%2F127.0.0.1%3A43603.html during training. But even though the nanny says “restarted”, it’s not actually the same worker: the worker gets shut down and another worker is spawned to replace it, with a different port and fresh logs. So as soon as a worker gets “restarted”, the worker-log URL I just wrote becomes unreachable.
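For now I’m thinking of working around that by snapshotting the logs to disk every so often from a second script connected to the same scheduler, so the last snapshot survives when a worker gets replaced. Just a rough sketch, the interval and output directory are arbitrary:

import pathlib
import time

from dask.distributed import Client

client = Client("tcp://127.0.0.1:8786")  # placeholder scheduler address
log_dir = pathlib.Path("worker_logs")    # arbitrary output directory
log_dir.mkdir(exist_ok=True)

while True:  # run for as long as the training script is running
    for addr, entries in client.get_worker_logs().items():
        # one file per worker address, overwritten with the latest snapshot
        fname = log_dir / (addr.replace("/", "_").replace(":", "_") + ".log")
        fname.write_text("\n".join(str(e) for e in entries))
    time.sleep(30)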
I also monitor the workers from the http://127.0.0.1:8787/workers tab, and memory doesn’t seem to get very high at all. You can see how they restarted (or rather were “replaced”) here:
And a reproducer might be difficult to create, because I read the data from Parquet files with a text column and a label column. But the resulting array looks like this:
Still, how could I have seen this without hunting for the problem myself? I mean, shouldn’t there be a log message like “this worker was killed because it tried to allocate too much memory” or something?
If this is because of memory, how does it run without any errors under with dask.config.set(scheduler='single-threaded'):? I have 16 GB of RAM.
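One thing I could try is making the per-worker memory budget explicit when creating the local cluster and checking what each worker actually gets, as well as the fraction at which the nanny terminates a worker. The numbers below are just an example, not my actual setup:

import dask
from dask.distributed import Client, LocalCluster

# example only: 4 workers on a 16 GB machine, each nanny given an explicit 3 GB budget
cluster = LocalCluster(n_workers=4, threads_per_worker=1, memory_limit="3GB")
client = Client(cluster)

# what limit each worker ended up with, as reported by the scheduler
for addr, info in client.scheduler_info()["workers"].items():
    print(addr, info["memory_limit"])

# fraction of that limit at which the nanny terminates the worker (0.95 by default)
print(dask.config.get("distributed.worker.memory.terminate"))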
I tried:
x = x_train.rechunk(block_size_limit=100e6)      # cap each chunk at ~100 MB
y = train['label'].to_dask_array(lengths=True)   # lengths=True so chunk sizes are known
y = y.rechunk((x.chunksize[0],))                 # align label chunks with the feature chunks
I read the data from partitioned Parquet files, and I repartitioned it so that each partition is 100 MB in size.
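For reference, the reading and repartitioning step can be expressed roughly like this; the path is a placeholder, and partition_size is one way of asking for ~100 MB partitions:

import dask.dataframe as dd

# placeholder path; the real dataset has a text column and a label column
train = dd.read_parquet("data/train.parquet")

# aim for partitions of roughly 100 MB before any further processing
train = train.repartition(partition_size="100MB")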
After the transformation (preprocessing, vectorization, etc.), it makes sense that the resulting vector array can have a different size (in bytes) than the original files.
Yes, but be careful: it must not be too big at this point!
The error might be triggered at the operating-system level by an oom_killer mechanism, leaving Dask no chance to capture it.
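If you want to check that hypothesis, the kernel log usually keeps a trace of oom_killer activity. A quick way to look for it from Python (Linux only, and reading dmesg may require elevated privileges on some systems) would be something like:

import subprocess

# search the kernel ring buffer for out-of-memory kills
out = subprocess.run(["dmesg", "-T"], capture_output=True, text=True).stdout
for line in out.splitlines():
    if "Out of memory" in line or "oom" in line.lower():
        print(line)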
Maybe it’s still a chunk-size problem, just with chunks smaller than the 45 GiB I’m seeing above. Or it might be something totally different. The only way to know would be to have a reproducer with representative data sizes.
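And a reproducer doesn’t need your real text data: random arrays with the same shape, dtype and chunking as your transformed array are usually enough to show the same memory behaviour. Something along these lines, where the shapes, chunks and final computation are placeholders to replace with yours:

import dask.array as da
from dask.distributed import Client

client = Client()  # same LocalCluster setup you use for training

# placeholders: use the shape, dtype and chunking of your real transformed array
x = da.random.random((1_000_000, 1000), chunks=(12_500, 1000))   # ~100 MB float64 chunks
y = da.random.randint(0, 2, size=(1_000_000,), chunks=(12_500,))

# replace this with the call that makes your workers die, e.g. the training step
result = (x * y[:, None]).sum().compute()
print(result)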