I have some Python code that can occasionally use a large amount of memory, though normally it does not. Estimating its memory footprint before running is not a straightforward heuristic or calculation, so that is currently not solved.
I would like to set up a few workers with higher memory limits than the regular workers, optimistically run the task anywhere on the cluster, and, on a MemoryError, reschedule the task to be reattempted on a worker with higher memory capacity.
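To make the desired behaviour concrete, here is a rough client-side sketch of what I have in mind. It is not a claim about how this should be done. It assumes the high-memory workers advertise a custom resource, which I've called `HIGHMEM` (a made-up name), and that the task's `MemoryError` is re-raised when calling `result()` on the future. The helper name `run_with_highmem_fallback` is also hypothetical.

```python
def run_with_highmem_fallback(client, func, *args, highmem_resources=None, **kwargs):
    """Run func anywhere on the cluster; on MemoryError retry once on a high-memory worker.

    `client` is expected to behave like a dask.distributed Client.
    The resource name "HIGHMEM" is an assumption made for illustration.
    """
    resources = highmem_resources or {"HIGHMEM": 1}

    # First attempt: no restrictions, so any worker may pick the task up.
    # pure=False gives each submission its own key, so the retry is a fresh task.
    future = client.submit(func, *args, pure=False, **kwargs)
    try:
        return future.result()
    except MemoryError:
        # Second attempt: restrict to workers that advertise the resource.
        retry = client.submit(func, *args, resources=resources, pure=False, **kwargs)
        return retry.result()
```

The high-memory workers would then be started with something like `dask-worker <scheduler-address> --resources "HIGHMEM=1"`. One caveat I'm aware of: this only covers the case where the task itself raises `MemoryError`. If a worker process is instead killed for exceeding its memory limit, I believe the client sees a different exception, which this sketch does not handle. It also blocks on each result, so it puts the retry logic in my client code rather than in the scheduler.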
I’ve read through the distributed documentation, under the “build understanding” section, so I’m aware there are a few options available for scheduling work. However, none of them stands out as giving me the kind of behaviour described above.
Could anyone suggest where this kind of logic might best sit?
I don’t want to just limit all runs using the worker resources feature, as most of the time memory isn’t an issue, and I want to use the full cluster.
Thanks for any suggestions.