"Workers don't have promised key" error and delayed computation

I am attempting to create a fairly large Dask array (10k x 10k, which is ~1GB) many times in parallel using dask.delayed and the distributed scheduler. Once the number of iterations or the array size crosses a certain threshold, I get a “Workers don’t have promised key” error. I’ve tried several variations that do not cause this problem, and I would like to understand why the error occurs in this particular case. The error does not happen when: not using the distributed scheduler; creating NumPy arrays of the same size instead of Dask arrays; using fewer iterations with the same array size; or using smaller arrays with the same number of iterations. I am using Dask version 2022.05.0.

import dask.array as da
import dask
from dask.distributed import Client
import numpy as np

client = Client()

def repeat_func(nb_iters, arr_sz):
    def func():
        x = da.random.random((arr_sz, arr_sz)).compute()
        # The following works
        # x = np.random.random((arr_sz, arr_sz))
        del x

    results = [dask.delayed(func)() for _ in range(nb_iters)]
    return dask.compute(results)

# The following works.
# repeat_func(10, 10_000)
# The following works.
# repeat_func(80, 1_000)

# Fails with KeyError when nb_iters crosses the threshold from 40 to 80.
# This works when not using the distributed client.
repeat_func(80, 10_000)
Output
2022-08-02 16:06:50,581 - distributed.scheduler - ERROR - Couldn't gather keys {"('random_sample-75e2a3af25475b41ff19bdea11f02de1', 2, 0)": ['tcp://127.0.0.1:45339'], "('random_sample-75e2a3af25475b41ff19bdea11f02de1', 1, 3)": ['tcp://127.0.0.1:45339'], "('random_sample-75e2a3af25475b41ff19bdea11f02de1', 0, 1)": ['tcp://127.0.0.1:45339'], "('random_sample-75e2a3af25475b41ff19bdea11f02de1', 3, 0)": ['tcp://127.0.0.1:45339'], "('random_sample-75e2a3af25475b41ff19bdea11f02de1', 0, 0)": ['tcp://127.0.0.1:45339'], "('random_sample-75e2a3af25475b41ff19bdea11f02de1', 1, 2)": ['tcp://127.0.0.1:45339'], "('random_sample-75e2a3af25475b41ff19bdea11f02de1', 2, 2)": ['tcp://127.0.0.1:45339'], "('random_sample-75e2a3af25475b41ff19bdea11f02de1', 0, 2)": ['tcp://127.0.0.1:45339'], "('random_sample-75e2a3af25475b41ff19bdea11f02de1', 2, 1)": ['tcp://127.0.0.1:45339'], "('random_sample-75e2a3af25475b41ff19bdea11f02de1', 3, 3)": ['tcp://127.0.0.1:45339'], "('random_sample-75e2a3af25475b41ff19bdea11f02de1', 1, 1)": ['tcp://127.0.0.1:45339'], "('random_sample-75e2a3af25475b41ff19bdea11f02de1', 1, 0)": ['tcp://127.0.0.1:45339'], "('random_sample-75e2a3af25475b41ff19bdea11f02de1', 3, 2)": ['tcp://127.0.0.1:45339'], "('random_sample-75e2a3af25475b41ff19bdea11f02de1', 0, 3)": ['tcp://127.0.0.1:45339'], "('random_sample-75e2a3af25475b41ff19bdea11f02de1', 3, 1)": ['tcp://127.0.0.1:45339'], "('random_sample-75e2a3af25475b41ff19bdea11f02de1', 2, 3)": ['tcp://127.0.0.1:45339']} state: ['processing', 'processing', 'processing', 'processing', 'processing', 'processing', 'processing', 'processing', 'processing', 'processing', 'processing', 'processing', 'processing', 'processing', 'processing', 'processing'] workers: ['tcp://127.0.0.1:45339']
NoneType: None
2022-08-02 16:06:50,621 - distributed.scheduler - ERROR - Workers don't have promised key: ['tcp://127.0.0.1:45339'], ('random_sample-75e2a3af25475b41ff19bdea11f02de1', 2, 0)
NoneType: None
2022-08-02 16:06:50,623 - distributed.scheduler - ERROR - Workers don't have promised key: ['tcp://127.0.0.1:45339'], ('random_sample-75e2a3af25475b41ff19bdea11f02de1', 1, 3)
NoneType: None
2022-08-02 16:06:50,627 - distributed.scheduler - ERROR - Workers don't have promised key: ['tcp://127.0.0.1:45339'], ('random_sample-75e2a3af25475b41ff19bdea11f02de1', 0, 1)
NoneType: None
2022-08-02 16:06:50,629 - distributed.scheduler - ERROR - Workers don't have promised key: ['tcp://127.0.0.1:45339'], ('random_sample-75e2a3af25475b41ff19bdea11f02de1', 3, 0)
NoneType: None
2022-08-02 16:06:50,631 - distributed.scheduler - ERROR - Workers don't have promised key: ['tcp://127.0.0.1:45339'], ('random_sample-75e2a3af25475b41ff19bdea11f02de1', 0, 0)
NoneType: None
2022-08-02 16:06:50,632 - distributed.scheduler - ERROR - Workers don't have promised key: ['tcp://127.0.0.1:45339'], ('random_sample-75e2a3af25475b41ff19bdea11f02de1', 1, 2)
NoneType: None
2022-08-02 16:06:50,634 - distributed.scheduler - ERROR - Workers don't have promised key: ['tcp://127.0.0.1:45339'], ('random_sample-75e2a3af25475b41ff19bdea11f02de1', 2, 2)
NoneType: None
2022-08-02 16:06:50,636 - distributed.scheduler - ERROR - Workers don't have promised key: ['tcp://127.0.0.1:45339'], ('random_sample-75e2a3af25475b41ff19bdea11f02de1', 0, 2)
NoneType: None
2022-08-02 16:06:50,637 - distributed.scheduler - ERROR - Workers don't have promised key: ['tcp://127.0.0.1:45339'], ('random_sample-75e2a3af25475b41ff19bdea11f02de1', 2, 1)
NoneType: None
2022-08-02 16:06:50,637 - distributed.scheduler - ERROR - Workers don't have promised key: ['tcp://127.0.0.1:45339'], ('random_sample-75e2a3af25475b41ff19bdea11f02de1', 3, 3)
NoneType: None
2022-08-02 16:06:50,638 - distributed.scheduler - ERROR - Workers don't have promised key: ['tcp://127.0.0.1:45339'], ('random_sample-75e2a3af25475b41ff19bdea11f02de1', 1, 1)
NoneType: None
2022-08-02 16:06:50,638 - distributed.scheduler - ERROR - Workers don't have promised key: ['tcp://127.0.0.1:45339'], ('random_sample-75e2a3af25475b41ff19bdea11f02de1', 1, 0)
NoneType: None
2022-08-02 16:06:50,640 - distributed.scheduler - ERROR - Workers don't have promised key: ['tcp://127.0.0.1:45339'], ('random_sample-75e2a3af25475b41ff19bdea11f02de1', 3, 2)
NoneType: None
2022-08-02 16:06:50,640 - distributed.scheduler - ERROR - Workers don't have promised key: ['tcp://127.0.0.1:45339'], ('random_sample-75e2a3af25475b41ff19bdea11f02de1', 0, 3)
NoneType: None
2022-08-02 16:06:50,641 - distributed.scheduler - ERROR - Workers don't have promised key: ['tcp://127.0.0.1:45339'], ('random_sample-75e2a3af25475b41ff19bdea11f02de1', 3, 1)
NoneType: None
2022-08-02 16:06:50,649 - distributed.scheduler - ERROR - Workers don't have promised key: ['tcp://127.0.0.1:45339'], ('random_sample-75e2a3af25475b41ff19bdea11f02de1', 2, 3)
NoneType: None
2022-08-02 16:06:50,651 - distributed.nanny - WARNING - Restarting worker
2022-08-02 16:06:50,650 - distributed.client - WARNING - Couldn't gather 16 keys, rescheduling {"('random_sample-75e2a3af25475b41ff19bdea11f02de1', 2, 0)": ('tcp://127.0.0.1:45339',), "('random_sample-75e2a3af25475b41ff19bdea11f02de1', 1, 3)": ('tcp://127.0.0.1:45339',), "('random_sample-75e2a3af25475b41ff19bdea11f02de1', 0, 1)": ('tcp://127.0.0.1:45339',), "('random_sample-75e2a3af25475b41ff19bdea11f02de1', 3, 0)": ('tcp://127.0.0.1:45339',), "('random_sample-75e2a3af25475b41ff19bdea11f02de1', 0, 0)": ('tcp://127.0.0.1:45339',), "('random_sample-75e2a3af25475b41ff19bdea11f02de1', 1, 2)": ('tcp://127.0.0.1:45339',), "('random_sample-75e2a3af25475b41ff19bdea11f02de1', 2, 2)": ('tcp://127.0.0.1:45339',), "('random_sample-75e2a3af25475b41ff19bdea11f02de1', 0, 2)": ('tcp://127.0.0.1:45339',), "('random_sample-75e2a3af25475b41ff19bdea11f02de1', 2, 1)": ('tcp://127.0.0.1:45339',), "('random_sample-75e2a3af25475b41ff19bdea11f02de1', 3, 3)": ('tcp://127.0.0.1:45339',), "('random_sample-75e2a3af25475b41ff19bdea11f02de1', 1, 1)": ('tcp://127.0.0.1:45339',), "('random_sample-75e2a3af25475b41ff19bdea11f02de1', 1, 0)": ('tcp://127.0.0.1:45339',), "('random_sample-75e2a3af25475b41ff19bdea11f02de1', 3, 2)": ('tcp://127.0.0.1:45339',), "('random_sample-75e2a3af25475b41ff19bdea11f02de1', 0, 3)": ('tcp://127.0.0.1:45339',), "('random_sample-75e2a3af25475b41ff19bdea11f02de1', 3, 1)": ('tcp://127.0.0.1:45339',), "('random_sample-75e2a3af25475b41ff19bdea11f02de1', 2, 3)": ('tcp://127.0.0.1:45339',)}
2022-08-02 16:08:34,821 - distributed.worker_memory - WARNING - Worker exceeded 95% memory budget. Restarting
2022-08-02 16:08:35,005 - distributed.scheduler - ERROR - Couldn't gather keys {"('random_sample-ebb4cf4fc81866ce0245a5c80bb691a1', 1, 0)": [], "('random_sample-ebb4cf4fc81866ce0245a5c80bb691a1', 2, 3)": [], "('random_sample-ebb4cf4fc81866ce0245a5c80bb691a1', 1, 2)": [], "('random_sample-ebb4cf4fc81866ce0245a5c80bb691a1', 0, 2)": [], "('random_sample-ebb4cf4fc81866ce0245a5c80bb691a1', 1, 3)": [], "('random_sample-ebb4cf4fc81866ce0245a5c80bb691a1', 2, 0)": [], "('random_sample-ebb4cf4fc81866ce0245a5c80bb691a1', 3, 1)": [], "('random_sample-ebb4cf4fc81866ce0245a5c80bb691a1', 2, 2)": [], "('random_sample-ebb4cf4fc81866ce0245a5c80bb691a1', 1, 1)": [], "('random_sample-ebb4cf4fc81866ce0245a5c80bb691a1', 2, 1)": [], "('random_sample-ebb4cf4fc81866ce0245a5c80bb691a1', 3, 2)": [], "('random_sample-ebb4cf4fc81866ce0245a5c80bb691a1', 0, 3)": [], "('random_sample-ebb4cf4fc81866ce0245a5c80bb691a1', 0, 1)": [], "('random_sample-ebb4cf4fc81866ce0245a5c80bb691a1', 3, 0)": [], "('random_sample-ebb4cf4fc81866ce0245a5c80bb691a1', 0, 0)": []} state: ['processing', 'processing', 'processing', 'processing', 'processing', 'processing', 'processing', 'processing', 'processing', 'processing', 'processing', 'processing', 'processing', 'processing', 'processing'] workers: []
NoneType: None
2022-08-02 16:08:35,006 - distributed.scheduler - ERROR - Workers don't have promised key: [], ('random_sample-ebb4cf4fc81866ce0245a5c80bb691a1', 1, 0)
NoneType: None
2022-08-02 16:08:35,007 - distributed.scheduler - ERROR - Workers don't have promised key: [], ('random_sample-ebb4cf4fc81866ce0245a5c80bb691a1', 2, 3)
NoneType: None
2022-08-02 16:08:35,007 - distributed.scheduler - ERROR - Workers don't have promised key: [], ('random_sample-ebb4cf4fc81866ce0245a5c80bb691a1', 1, 2)
NoneType: None
2022-08-02 16:08:35,014 - distributed.scheduler - ERROR - Workers don't have promised key: [], ('random_sample-ebb4cf4fc81866ce0245a5c80bb691a1', 0, 2)
NoneType: None
2022-08-02 16:08:35,015 - distributed.scheduler - ERROR - Workers don't have promised key: [], ('random_sample-ebb4cf4fc81866ce0245a5c80bb691a1', 1, 3)
NoneType: None
2022-08-02 16:08:35,015 - distributed.scheduler - ERROR - Workers don't have promised key: [], ('random_sample-ebb4cf4fc81866ce0245a5c80bb691a1', 2, 0)
NoneType: None
2022-08-02 16:08:35,016 - distributed.scheduler - ERROR - Workers don't have promised key: [], ('random_sample-ebb4cf4fc81866ce0245a5c80bb691a1', 3, 1)
NoneType: None
2022-08-02 16:08:35,049 - distributed.scheduler - ERROR - Workers don't have promised key: [], ('random_sample-ebb4cf4fc81866ce0245a5c80bb691a1', 2, 2)
NoneType: None
2022-08-02 16:08:35,050 - distributed.scheduler - ERROR - Workers don't have promised key: [], ('random_sample-ebb4cf4fc81866ce0245a5c80bb691a1', 1, 1)
NoneType: None
2022-08-02 16:08:35,053 - distributed.scheduler - ERROR - Workers don't have promised key: [], ('random_sample-ebb4cf4fc81866ce0245a5c80bb691a1', 2, 1)
NoneType: None
2022-08-02 16:08:35,054 - distributed.scheduler - ERROR - Workers don't have promised key: [], ('random_sample-ebb4cf4fc81866ce0245a5c80bb691a1', 3, 2)
NoneType: None
2022-08-02 16:08:35,054 - distributed.scheduler - ERROR - Workers don't have promised key: [], ('random_sample-ebb4cf4fc81866ce0245a5c80bb691a1', 0, 3)
NoneType: None
2022-08-02 16:08:35,056 - distributed.scheduler - ERROR - Workers don't have promised key: [], ('random_sample-ebb4cf4fc81866ce0245a5c80bb691a1', 0, 1)
NoneType: None
2022-08-02 16:08:35,056 - distributed.scheduler - ERROR - Workers don't have promised key: [], ('random_sample-ebb4cf4fc81866ce0245a5c80bb691a1', 3, 0)
NoneType: None
2022-08-02 16:08:35,057 - distributed.scheduler - ERROR - Workers don't have promised key: [], ('random_sample-ebb4cf4fc81866ce0245a5c80bb691a1', 0, 0)
NoneType: None
2022-08-02 16:08:35,058 - distributed.client - WARNING - Couldn't gather 15 keys, rescheduling {"('random_sample-ebb4cf4fc81866ce0245a5c80bb691a1', 1, 0)": (), "('random_sample-ebb4cf4fc81866ce0245a5c80bb691a1', 2, 3)": (), "('random_sample-ebb4cf4fc81866ce0245a5c80bb691a1', 1, 2)": (), "('random_sample-ebb4cf4fc81866ce0245a5c80bb691a1', 0, 2)": (), "('random_sample-ebb4cf4fc81866ce0245a5c80bb691a1', 1, 3)": (), "('random_sample-ebb4cf4fc81866ce0245a5c80bb691a1', 2, 0)": (), "('random_sample-ebb4cf4fc81866ce0245a5c80bb691a1', 3, 1)": (), "('random_sample-ebb4cf4fc81866ce0245a5c80bb691a1', 2, 2)": (), "('random_sample-ebb4cf4fc81866ce0245a5c80bb691a1', 1, 1)": (), "('random_sample-ebb4cf4fc81866ce0245a5c80bb691a1', 2, 1)": (), "('random_sample-ebb4cf4fc81866ce0245a5c80bb691a1', 3, 2)": (), "('random_sample-ebb4cf4fc81866ce0245a5c80bb691a1', 0, 3)": (), "('random_sample-ebb4cf4fc81866ce0245a5c80bb691a1', 0, 1)": (), "('random_sample-ebb4cf4fc81866ce0245a5c80bb691a1', 3, 0)": (), "('random_sample-ebb4cf4fc81866ce0245a5c80bb691a1', 0, 0)": ()}
2022-08-02 16:08:35,063 - distributed.nanny - WARNING - Restarting worker
  1. You should not use Dask collections (arrays in this case) inside delayed functions. Why not just call dask.array from the main code? It already has a lazy API for all operations.
  2. Your function is not pure: each iteration gives a different result, and this is where Dask is getting confused. You would pass pure=False to dask.delayed.
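
A minimal sketch of both points, under some assumptions: the helper names build_means and impure_task are hypothetical, and the mean() reduction is only a stand-in for whatever you actually want to do with each array (the original func discards it). The idea is to build the lazy arrays in the main code, reduce them on the workers, and send only small results back to the client. As far as I know, pure=False is already the default unless the delayed_pure config option is set, so passing it mostly makes the intent explicit.

```python
import dask
import dask.array as da


def build_means(nb_iters, arr_sz, chunks="auto"):
    """Return nb_iters lazy scalars, each the mean of a fresh random array."""
    # da.random.random is already lazy, so no dask.delayed wrapper is needed.
    arrays = [
        da.random.random((arr_sz, arr_sz), chunks=chunks)
        for _ in range(nb_iters)
    ]
    # Hypothetical reduction: only one scalar per array travels back to the client.
    return [arr.mean() for arr in arrays]


def impure_task():
    """Stand-in for a function whose result differs on every call."""
    return object()


if __name__ == "__main__":
    # Small sizes here; scale nb_iters and arr_sz up once this runs cleanly.
    (means,) = dask.compute(build_means(4, 1_000))
    print(means)

    # With pure=False every call gets its own key, so identical calls
    # are never merged into a single task.
    tasks = [dask.delayed(impure_task, pure=False)() for _ in range(3)]
    print(len({t.key for t in tasks}))
```

If a Client has been created, dask.compute should pick up the distributed scheduler automatically, so the same code should work with or without it.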