Hello,
I have been having serious problems with the dask scheduler. I recently upgraded dask/distributed from version 2022.6.1-py3.8 to 2023.7.0-py3.9 (while solving "Workers constantly dying"). I had been using dask to call subprocesses for long-running exe simulations, which worked quite well. Since upgrading, it no longer behaves anything like it did before.
No matter what settings I change, I cannot keep all workers busy. The scheduler seems to put jobs on specific workers and leave the other workers completely idle until I resubmit my job workload. It starts off okay, but eventually most workers sit idle. I have tried tuning some of the scheduler configuration to various values with no luck. Some of these probably have zero effect, but these are the values I initially started varying:
import os

os.environ['DASK_DISTRIBUTED__SCHEDULER__ACTIVE_MEMORY_MANAGER__START'] = 'True'
os.environ['DASK_DISTRIBUTED__SCHEDULER__ACTIVE_MEMORY_MANAGER__INTERVAL'] = '1s'
os.environ['DASK_DISTRIBUTED__SCHEDULER__ACTIVE_MEMORY_MANAGER__MEASURE'] = 'optimistic'
os.environ['DASK_DISTRIBUTED__COMM__RETRY__COUNT'] = '0' # default == 0
os.environ['DASK_DISTRIBUTED__COMM__RETRY__DELAY__MIN'] = '1s' # default == '1s'
os.environ['DASK_DISTRIBUTED__DEPLOY__LOST-WORKER-TIMEOUT'] = '15s' # default == '15s'
os.environ['DASK_DISTRIBUTED__SCHEDULER__ALLOWED-FAILURES'] = '3' # default == 3
os.environ['DASK_DISTRIBUTED__SCHEDULER__WORKER-TTL'] = '5 minutes' # default == '5 minutes'
os.environ['DASK_DISTRIBUTED__SCHEDULER__WORK_STEALING'] = "True"
os.environ['DASK_DISTRIBUTED__SCHEDULER__DEFAULT_TASK_DURATIONS__SWEEP'] = '5min'
os.environ['DASK_DISTRIBUTED__SCHEDULER__WORKER_SATURATION'] = '1.1'
os.environ['DASK_DISTRIBUTED__SCHEDULER__UNKNOWN_TASK_DURATION'] = '1s'
os.environ['DASK_DISTRIBUTED__SCHEDULER__TRANSITION_LOG_LENGTH'] = '10000'
os.environ['DASK_DISTRIBUTED__SCHEDULER__EVENTS_LOG_LENGTH'] = '10000'
os.environ['DASK_DISTRIBUTED__SCHEDULER__DASHBOARD__STATUS__TASK_STREAM_LENGTH'] = '500'
os.environ['DASK_DISTRIBUTED__SCHEDULER__DASHBOARD__TASKS__TASK_STREAM_LENGTH'] = '50000'
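To sanity-check which configuration keys the variables above should end up on, here is a minimal stdlib-only sketch of the conversion rules described in the Dask configuration docs (strip the `DASK_` prefix, lowercase, turn double underscores into nesting, parse values with `ast.literal_eval` where possible). The helpers `env_to_dask_key` and `parse_env_value` are hypothetical names written for this post, not part of Dask, and the sketch only mirrors the documented behaviour rather than calling Dask itself.

```python
import ast


def env_to_dask_key(name):
    """Hypothetical helper: map a DASK_* environment variable name to a config key."""
    if not name.startswith("DASK_"):
        raise ValueError(f"not a Dask environment variable: {name}")
    # Drop the prefix, lowercase, and turn double underscores into nesting dots.
    key = name[len("DASK_"):].lower().replace("__", ".")
    # Dask treats hyphens and underscores in key names as interchangeable;
    # normalise to hyphens so the result matches the spelling used in the docs.
    return key.replace("_", "-")


def parse_env_value(raw):
    """Hypothetical helper: parse a value as a Python literal, else keep the string."""
    try:
        return ast.literal_eval(raw)
    except (SyntaxError, ValueError):
        return raw


# A few of the settings from this post.
settings = {
    "DASK_DISTRIBUTED__SCHEDULER__WORK_STEALING": "True",
    "DASK_DISTRIBUTED__SCHEDULER__WORKER_SATURATION": "1.1",
    "DASK_DISTRIBUTED__SCHEDULER__UNKNOWN_TASK_DURATION": "1s",
    "DASK_DISTRIBUTED__SCHEDULER__WORKER-TTL": "5 minutes",
}

resolved = {env_to_dask_key(k): parse_env_value(v) for k, v in settings.items()}

for key, value in resolved.items():
    print(f"{key} = {value!r}")
```

As far as I understand, Dask reads environment variables when `dask` is first imported, so these assignments need to run before the import (or be followed by `dask.config.refresh()`), and they only affect the process they are set in. If the scheduler runs in a separate process, it would need the same variables in its own environment. `dask.config.get("distributed.scheduler.worker-saturation")` can be used to confirm the value that was actually picked up.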
Currently
The past
I have come across several issues and documentation pages that might be relevant.
Similar to my problem, I think:
Came across these, but unsure if they are related:
https://distributed.dask.org/en/latest/work-stealing.html
https://distributed.dask.org/en/latest/scheduling-policies.html
Any tips would be greatly appreciated!