With distributed, you should be able to view the worker logs through the dashboard; do you have access to it? You would also see error logs if the computation crashed too many times and was interrupted, but that doesn't seem to be the case here.
The most common cause of a worker restart is a memory problem.
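For context, the nanny restarts a worker once its process memory crosses a terminate threshold, and distributed exposes these thresholds in its configuration. A sketch of the relevant section of `~/.config/dask/distributed.yaml`, showing what I believe are the default fractions of the worker's memory limit (check your installed version's defaults before relying on these numbers):

```yaml
distributed:
  worker:
    memory:
      target: 0.60     # start spilling managed data to disk
      spill: 0.70      # spill based on process memory usage
      pause: 0.80      # pause accepting new tasks
      terminate: 0.95  # nanny kills and restarts the worker
```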
Aside from that, it would be much easier to help with a reproducer; do you think you could build one?
Yes, I’m monitoring the workers’ logs from e.g. http://127.0.0.1:8787/info/logs/tcp%3A%2F%2F127.0.0.1%3A43603.html during training. But although the nanny says “restarted”, it’s not actually the same worker: the worker gets shut down and another worker spawns to replace it, with a different port and fresh logs. So as soon as a worker gets “restarted”, the URL for its logs that I just wrote down becomes unreachable.
Yes, but be careful: it must not be too big at this point!
The error might be triggered at the operating-system level, by the oom_killer mechanism, leaving no chance for Dask to capture it.
Maybe it’s still a chunk-size problem, just at something smaller than the 45 GiB I’m seeing above. Or it might mean it’s something totally different. The only way to know would be to have a reproducer with representative data sizes.
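As a sanity check when building such a reproducer, it helps to compute a chunk's in-memory footprint from its shape and dtype. A minimal sketch (the shape below is hypothetical, chosen only because it lands near the 45 GiB figure above, assuming float64 data):

```python
from math import prod

def chunk_nbytes(shape, itemsize=8):
    """Memory footprint of one chunk in bytes (itemsize=8 for float64)."""
    return prod(shape) * itemsize

# Hypothetical chunk shape that ends up close to 45 GiB:
nbytes = chunk_nbytes((30_000, 200_000))
print(f"{nbytes / 2**30:.1f} GiB")  # → 44.7 GiB
```

Scaling the shape down until this number is comfortably below a single worker's memory limit (with headroom for the terminate threshold) gives a representative but safe reproducer size.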