Note: I’ve asked this previously on GitHub, but agree that this is likely better discussed (first?) on Discourse. Now that I’m here, I’d like to restate the question for discussion:
I’m wondering how best to handle reads from high-latency storage (e.g. tape), or other tasks which may take a long time without making any progress themselves (i.e. the task has to wait for something else). In this setting, I might have a task graph containing many parallel reads of different chunks, each of which may be only on tape or may already be available in a disk cache. If I start reading data which is still on tape, the read will encounter a large delay. That time could be put to better use if the read attempt triggered loading from tape while the dask scheduler continued running other tasks whose data is readily available. If none of the data is readily available, this would amount to a quick first pass that issues all the necessary tape reads, and a second pass that benefits from all the data being on disk.
So I’m wondering if it would be possible for an already running task to notify the scheduler that it wants to be rescheduled at a later time, e.g.:
```python
def some_read_function():
    ...
    if not_yet_available:
        trigger_tape_load()
        # e.g. seconds; could also be np.timedelta64(), datetime.timedelta, ...
        raise RetryLater(estimated_delay=5 * 60)
```
A scheduler could catch this exception and adjust its scheduling strategy accordingly.
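To make the idea concrete, here is a minimal, dask-free sketch of the mechanism I have in mind. `RetryLater` and the toy scheduler loop are hypothetical, not existing dask APIs; a real scheduler would run other ready tasks instead of sleeping:

```python
import heapq
import time


class RetryLater(Exception):
    """Hypothetical exception a task raises to ask to be rescheduled later."""
    def __init__(self, estimated_delay):
        self.estimated_delay = estimated_delay  # seconds


def run_tasks(tasks):
    """Toy scheduler loop: run each task; on RetryLater, requeue it to run
    no earlier than now + estimated_delay."""
    # Priority queue of (not_before, index, task); index breaks ties.
    queue = [(0.0, i, task) for i, task in enumerate(tasks)]
    heapq.heapify(queue)
    results = {}
    while queue:
        not_before, i, task = heapq.heappop(queue)
        now = time.monotonic()
        if now < not_before:
            # A real scheduler would run other ready work here instead.
            time.sleep(not_before - now)
        try:
            results[i] = task()
        except RetryLater as e:
            heapq.heappush(
                queue, (time.monotonic() + e.estimated_delay, i, task)
            )
    return results
```

The point of the sketch is that the aborted task keeps no state between attempts: only the (delay, task) entry lives in the queue until it is retried.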
Is something like this behaviour already possible? Are there better options? Would it be feasible to implement something like this using dask?
I’m tempted to think about something like an async task at this point, but that is probably a bad idea: a suspended async task would keep some memory alive while it waits, whereas an aborted and rescheduled task wouldn’t require any state to be kept.
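For contrast, this is roughly what the async formulation would look like (a plain asyncio sketch; `read_chunk` and the sleep stand-in for the tape load are made up). While the coroutine is suspended at the `await`, its whole frame (locals, stack) stays in memory until the data arrives, which is the cost I’d like to avoid:

```python
import asyncio


async def read_chunk(name, available):
    """Pretend chunk read: suspends if the data is 'still on tape'."""
    if not available:
        # trigger_tape_load() would go here; the coroutine then suspends,
        # keeping its frame alive until the stand-in delay has elapsed.
        await asyncio.sleep(0.01)
    return f"{name}: data"


async def main():
    # Both reads run concurrently; the on-disk one finishes immediately,
    # the on-tape one stays suspended until its delay passes.
    return await asyncio.gather(
        read_chunk("on-disk", True),
        read_chunk("on-tape", False),
    )


results = asyncio.run(main())
```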