First off, I’m new here. I’m hyped about Dask and only recently started learning.
If I should post this question elsewhere or frame it differently to be more in line with the forum standards, let me know.
Sometimes I find a function cannot (easily) be partitioned.
When most of the graph is partitioned but the graph ‘collapses’ into a single node, this is bad for performance. I understand that. But let’s say this cannot be avoided and an operation simply cannot be partitioned. Then there is a second problem, and that, of course, is memory.
When using a distributed scheduler, I may have several workers, each with a certain memory limit. Say 10GB. When the data is partitioned, the cluster as a whole can hold far more than 10GB, since each worker contributes its own share. However, for the single-threaded operation, all of the data needs to fit into the RAM allocated to that one worker. The other workers are not really active until this bottleneck has been passed.
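For concreteness, here is a minimal sketch of the pattern I mean (the array sizes are made up for illustration): a rechunk that collapses the whole graph into one partition, so a single worker must hold everything.

```python
import dask.array as da

# Partitioned: ten 100x1000 blocks, which can be spread across workers.
x = da.ones((1000, 1000), chunks=(100, 1000))

# An operation that cannot (easily) be partitioned: rechunk everything
# into a single block, so one worker must hold the full array in RAM.
y = x.rechunk((1000, 1000))

print(y.numblocks)        # (1, 1) -- the graph has collapsed into one node
print(y.sum().compute())
```

In my real workload the single block is much larger than one worker's memory limit, which is where the spilling below kicks in.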
Since the data exceeds this 10GB threshold (but does fit on the machine), the worker starts spilling to disk. This is a waste of resources, since the other workers are not effectively using the memory allocated to them.
So my question is: is there a way to give the worker that is handling the single-threaded operation more working memory for the duration of that operation?
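For reference, today I start my workers with a static per-worker limit, roughly like this (the scheduler address is a placeholder):

```shell
# Each worker gets a fixed 10GB cap for the lifetime of the cluster.
# I haven't found a way to raise this for just one worker while the
# single-threaded operation is running.
dask-worker tcp://scheduler-host:8786 --nthreads 1 --memory-limit 10GB
```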