Dear Dask community,
I have a situation where several workers read data from disk. This data should then be processed by other, GPU-accelerated workers. The system works as intended for small datasets, but for large datasets it goes wrong. The data-reading workers are usually much faster than the GPU workers. This leads to many objects in memory waiting to be processed, which would not be a problem if the data-reading workers kept those objects in their own memory. Unfortunately, they immediately transfer the objects to the GPU workers, which cannot keep up, run into memory issues, and start spilling to disk (which I want to avoid).
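For context, here is a minimal sketch of the kind of pipeline I mean. The function bodies, file paths, scheduler address, and the "CPU"/"GPU" resource labels are placeholders for my actual setup (the workers are started with `--resources "CPU=1"` and `--resources "GPU=1"` respectively):

```python
from dask.distributed import Client


def read_chunk(path):
    # stand-in for my actual disk-reading function
    with open(path, "rb") as f:
        return f.read()


def process_on_gpu(data):
    # stand-in for my actual GPU-accelerated processing
    return len(data)


client = Client("tcp://scheduler:8786")  # placeholder scheduler address

paths = [f"/data/part-{i}.bin" for i in range(1000)]  # placeholder input files

# Reading tasks are restricted to the CPU workers via worker resources.
read_futures = [
    client.submit(read_chunk, path, resources={"CPU": 1}, pure=False)
    for path in paths
]

# Processing tasks may only run on the GPU workers. The read results are
# transferred to them as soon as they finish, which is where the memory
# pressure builds up.
result_futures = [
    client.submit(process_on_gpu, fut, resources={"GPU": 1})
    for fut in read_futures
]
```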
I have tried to avoid this by tinkering with the distributed.worker.memory config settings, for example by setting the pause threshold to 0.5. It seems like this did stop the transfers, but it also caused the GPU workers to stop processing the data…
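Concretely, this is roughly the change I made (shown here via dask.config.set; in my setup it lives in the YAML config):

```python
import dask

# Pause workers (stop executing new tasks) once they reach 50% of their
# memory limit, instead of the default 80%. As far as I understand, this
# needs to be in place before the workers start in order to take effect.
dask.config.set({"distributed.worker.memory.pause": 0.5})
```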
Any ideas on how I can tell the scheduler not to immediately transfer the data to the GPU workers?
Thanks!
Maxim