How do I solve "distributed.scheduler - ERROR - Couldn't gather keys"?

Modin runs Dask unit tests that very frequently get stuck for several hours. I reproduced one such failure, interrupted the stuck test with Ctrl-C, and got some errors, the first of which was "distributed.scheduler - ERROR - Couldn't gather keys". This error appears after some workers crash with "Worker exceeded 95% memory budget. Restarting".

What’s going wrong? Is dask supposed to be able to recover from the worker failures?

I’m sorry I don’t have a minimal reproducible example, but I had to jump through a lot of hoops just to get this far in debugging the failure, which I can only reproduce in GitHub CI. It’s hard to pull the dask-specific parts out of Modin.

Error
modin/pandas/test/test_io.py::TestSql::test_read_sql_from_sql_server PASSED                                                      [ 98%]
modin/pandas/test/test_io.py::TestSql::test_read_sql_from_postgres ^CTokenization took: 0.02 ms
Type conversion took: 0.28 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.22 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.24 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.23 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.03 ms
Type conversion took: 0.25 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.19 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.22 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.20 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.20 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.19 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.17 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.19 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.17 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.19 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.18 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.31 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.27 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.42 ms
Type conversion took: 3.02 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.26 ms
Parser memory cleanup took: 0.01 ms
Tokenization took: 0.03 ms
Type conversion took: 0.35 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.30 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.32 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.30 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.33 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.29 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.28 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.29 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.28 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.29 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.28 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.24 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.24 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.25 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.26 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.24 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.21 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.22 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.20 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.03 ms
Type conversion took: 0.20 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.18 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.20 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.19 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.18 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.20 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.17 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.20 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.21 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.27 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.25 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.31 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.29 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.30 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.31 ms
Parser memory cleanup took: 0.01 ms
Tokenization took: 0.03 ms
Type conversion took: 0.33 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.41 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.34 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.29 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.31 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.29 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.28 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.30 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.28 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.30 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.29 ms
Parser memory cleanup took: 0.00 ms
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.scheduler - ERROR - Couldn't gather keys {'lambda-23e1c0c33ee703d05b68f71eec559e49': [], 'lambda-4c13d6c07826f9fdf0bb8fcf2bcd2ad5': []} state: ['waiting', 'waiting'] workers: []
NoneType: None
distributed.scheduler - ERROR - Workers don't have promised key: [], lambda-23e1c0c33ee703d05b68f71eec559e49
NoneType: None
distributed.scheduler - ERROR - Workers don't have promised key: [], lambda-4c13d6c07826f9fdf0bb8fcf2bcd2ad5
NoneType: None
distributed.client - WARNING - Couldn't gather 2 keys, rescheduling {'lambda-23e1c0c33ee703d05b68f71eec559e49': (), 'lambda-4c13d6c07826f9fdf0bb8fcf2bcd2ad5': ()}


---------- coverage: platform linux, python 3.8.13-final-0 -----------
Coverage XML written to file coverage.xml


!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! KeyboardInterrupt !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
/usr/share/miniconda3/envs/modin/lib/python3.8/threading.py:306: KeyboardInterrupt

More details in TEST: test_io.py on ubuntu + dask gets stuck most of the time in CI · Issue #4760 · modin-project/modin · GitHub

@mvashishtha Looking at the issue you’ve linked to, some thoughts:

  • the failures seem to have started somewhere between 9 and 4 days ago (based on CI runs)
  • since the Dask version is pinned, we think this might be related to some other package that changed between the passing and failing CI runs

Here’s a diff between this (passing) and this (failing); red is the failing version, green is the passing one:

Full diff:
diff --git a/modin-ubuntu-dask-engine-fail.txt b/modin-ubuntu-dask-engine-pass.txt
index 154ee88..2cf16fa 100644
--- a/modin-ubuntu-dask-engine-fail.txt
+++ b/modin-ubuntu-dask-engine-pass.txt
@@ -12,11 +12,11 @@ aiosignal                 1.2.0              pyhd8ed1ab_0    conda-forge
 alabaster                 0.7.12                   pypi_0    pypi
 argon2-cffi               21.3.0                   pypi_0    pypi
 argon2-cffi-bindings      21.2.0                   pypi_0    pypi
-arrow-cpp                 8.0.1           py38ha7276ea_0_cpu    conda-forge
+arrow-cpp                 8.0.0           py38ha7276ea_4_cpu    conda-forge
 asttokens                 2.0.5              pyhd8ed1ab_0    conda-forge
 asv                       0.5.dev1889+ef016e23          pypi_0    pypi
 async-timeout             4.0.2              pyhd8ed1ab_0    conda-forge
-attrs                     22.1.0             pyh71513ae_1    conda-forge
+attrs                     21.4.0             pyhd8ed1ab_0    conda-forge
 aws-c-cal                 0.5.11               h95a6274_0    conda-forge
 aws-c-common              0.6.2                h7f98852_0    conda-forge
 aws-c-event-stream        0.2.7               h3541f99_13    conda-forge
@@ -69,7 +69,7 @@ et_xmlfile                1.0.1                   py_1001    conda-forge
 execnet                   1.9.0              pyhd8ed1ab_0    conda-forge
 executing                 0.9.1              pyhd8ed1ab_0    conda-forge
 faker                     13.15.1                  pypi_0    pypi
-fastavro                  1.5.4            py38h0a891b7_0    conda-forge
+fastavro                  1.5.3            py38h0a891b7_0    conda-forge
 fastjsonschema            2.16.1                   pypi_0    pypi
 feather-format            0.4.1              pyh9f0ad1d_0    conda-forge
 filelock                  3.7.1                    pypi_0    pypi
@@ -77,7 +77,7 @@ flake8                    4.0.1                    pypi_0    pypi
 freetds                   1.1.15               h94af77a_1    conda-forge
 freetype                  2.10.4               h0708190_1    conda-forge
 frozenlist                1.3.0            py38h0a891b7_1    conda-forge
-fsspec                    2022.7.1           pyhd8ed1ab_0    conda-forge
+fsspec                    2022.5.0           pyhd8ed1ab_0    conda-forge
 fuzzydata                 0.0.6                    pypi_0    pypi
 gflags                    2.2.2             he1b5a44_1004    conda-forge
 giflib                    5.2.1                h36c2ea0_2    conda-forge
@@ -100,7 +100,7 @@ grpc-cpp                  1.45.2               h3b8df00_4    conda-forge
 grpcio                    1.43.0                   pypi_0    pypi
 hdf5                      1.12.1          nompi_h2386368_104    conda-forge
 heapdict                  1.0.1                      py_0    conda-forge
-icu                       67.1                 he1b5a44_0    conda-forge
+icu                       58.2              hf484d3e_1000    conda-forge
 idna                      3.3                pyhd8ed1ab_0    conda-forge
 imagesize                 1.4.1                    pypi_0    pypi
 importlib-metadata        4.12.0                   pypi_0    pypi
@@ -115,7 +115,7 @@ jinja2                    3.1.2              pyhd8ed1ab_1    conda-forge
 jmespath                  1.0.1              pyhd8ed1ab_0    conda-forge
 joblib                    1.1.0              pyhd8ed1ab_0    conda-forge
 jpeg                      9e                   h166bdaf_2    conda-forge
-jsonschema                4.9.0                    pypi_0    pypi
+jsonschema                4.7.2                    pypi_0    pypi
 jupyter                   1.0.0                    pypi_0    pypi
 jupyter-client            7.3.4                    pypi_0    pypi
 jupyter-console           6.4.4                    pypi_0    pypi
@@ -127,7 +127,7 @@ kiwisolver                1.4.4            py38h43d8883_0    conda-forge
 krb5                      1.19.3               h3790be6_0    conda-forge
 lcms2                     2.12                 hddcbb42_0    conda-forge
 ld_impl_linux-64          2.36.1               hea4e1c9_2    conda-forge
-lerc                      4.0.0                h27087fc_0    conda-forge
+lerc                      3.0                  h9c3ff4c_0    conda-forge
 libblas                   3.9.0           15_linux64_openblas    conda-forge
 libbrotlicommon           1.0.9                h166bdaf_7    conda-forge
 libbrotlidec              1.0.9                h166bdaf_7    conda-forge
@@ -158,29 +158,29 @@ libsodium                 1.0.18               h36c2ea0_1    conda-forge
 libssh2                   1.10.0               ha56f1ee_2    conda-forge
 libstdcxx-ng              12.1.0              ha89aaad_16    conda-forge
 libthrift                 0.16.0               h519c5ea_1    conda-forge
-libtiff                   4.4.0                h0d92c0b_2    conda-forge
+libtiff                   4.4.0                hc85c160_1    conda-forge
 libutf8proc               2.7.0                h7f98852_0    conda-forge
 libuuid                   2.32.1            h7f98852_1000    conda-forge
 libwebp                   1.2.3                h522a892_1    conda-forge
 libwebp-base              1.2.3                h166bdaf_2    conda-forge
 libxcb                    1.13              h7f98852_1004    conda-forge
-libxml2                   2.9.10               h68273f3_2    conda-forge
-libxslt                   1.1.33               hf705e74_1    conda-forge
+libxml2                   2.9.14               h74e7548_0
+libxslt                   1.1.35               h4e12654_0
 libzlib                   1.2.12               h166bdaf_2    conda-forge
 locket                    1.0.0              pyhd8ed1ab_0    conda-forge
-lxml                      4.8.0            py38h0a891b7_3    conda-forge
+lxml                      4.9.1            py38h0a891b7_0    conda-forge
 lz4-c                     1.9.3                h9c3ff4c_1    conda-forge
 lzo                       2.10              h516909a_1000    conda-forge
 markupsafe                2.1.1            py38h0a891b7_1    conda-forge
 matplotlib                3.2.2                         1    conda-forge
-matplotlib-base           3.2.2            py38h5d868c9_1    conda-forge
+matplotlib-base           3.2.2            py38hef1b27d_0
 matplotlib-inline         0.1.3              pyhd8ed1ab_0    conda-forge
 mccabe                    0.6.1                    pypi_0    pypi
 mistune                   0.8.4                    pypi_0    pypi
 modin-spreadsheet         0.1.2                    pypi_0    pypi
 msgpack-python            1.0.4            py38h43d8883_0    conda-forge
 multidict                 6.0.2            py38h0a891b7_1    conda-forge
-mypy                      0.971            py38h0a891b7_0    conda-forge
+mypy                      0.961            py38h0a891b7_0    conda-forge
 mypy_extensions           0.4.3            py38h578d9bd_5    conda-forge
 nbclient                  0.6.6                    pypi_0    pypi
 nbconvert                 6.5.0                    pypi_0    pypi
@@ -204,7 +204,7 @@ orc                       1.7.5                h6c59b99_0    conda-forge
 packaging                 21.3               pyhd8ed1ab_0    conda-forge
 pandas                    1.4.3            py38h47df419_0    conda-forge
 pandas-gbq                0.17.6             pyh6c4a22f_0    conda-forge
-pandas-stubs              1.4.3.220801       pyhd8ed1ab_0    conda-forge
+pandas-stubs              1.4.3.220724       pyhd8ed1ab_0    conda-forge
 pandocfilters             1.5.0                    pypi_0    pypi
 paramiko                  2.11.0             pyhd8ed1ab_0    conda-forge
 parquet-cpp               1.5.1                         2    conda-forge
@@ -214,8 +214,7 @@ pathspec                  0.9.0                    pypi_0    pypi
 pexpect                   4.8.0              pyh9f0ad1d_2    conda-forge
 pickleshare               0.7.5                   py_1003    conda-forge
 pillow                    9.2.0            py38h0ee0e06_0    conda-forge
-pip                       22.2.1             pyhd8ed1ab_0    conda-forge
-pkgutil-resolve-name      1.3.10                   pypi_0    pypi
+pip                       22.2               pyhd8ed1ab_0    conda-forge
 platformdirs              2.5.2                    pypi_0    pypi
 pluggy                    1.0.0            py38h578d9bd_3    conda-forge
 plumbum                   1.7.2              pyhd8ed1ab_0    conda-forge
@@ -231,7 +230,7 @@ pure_eval                 0.2.2              pyhd8ed1ab_0    conda-forge
 py                        1.11.0             pyh6c4a22f_0    conda-forge
 py-cpuinfo                8.0.0              pyhd8ed1ab_0    conda-forge
 py-spy                    0.3.12                   pypi_0    pypi
-pyarrow                   8.0.1           py38h9f6a473_0_cpu    conda-forge
+pyarrow                   8.0.0           py38h9f6a473_4_cpu    conda-forge
 pyasn1                    0.4.8                      py_0    conda-forge
 pyasn1-modules            0.2.7                      py_0    conda-forge
 pycodestyle               2.8.0                    pypi_0    pypi
@@ -249,7 +248,7 @@ pyrsistent                0.18.1                   pypi_0    pypi
 pysocks                   1.7.1            py38h578d9bd_5    conda-forge
 pytables                  3.7.0            py38hdb04529_0    conda-forge
 pytest                    7.1.2            py38h578d9bd_0    conda-forge
-pytest-benchmark          3.4.1              pyhd8ed1ab_1    conda-forge
+pytest-benchmark          3.4.1              pyhd8ed1ab_0    conda-forge
 pytest-cov                2.11.0             pyh44b312d_0    conda-forge
 pytest-forked             1.4.0              pyhd8ed1ab_0    conda-forge
 pytest-xdist              2.5.0              pyhd8ed1ab_0    conda-forge
@@ -271,10 +270,10 @@ requests-oauthlib         1.3.1              pyhd8ed1ab_0    conda-forge
 rpyc                      4.1.5              pyh9f0ad1d_1    conda-forge
 rsa                       4.9                pyhd8ed1ab_0    conda-forge
 s2n                       1.0.10               h9b69904_0    conda-forge
-s3fs                      2022.7.1           pyhd8ed1ab_0    conda-forge
+s3fs                      2022.5.0           pyhd8ed1ab_0    conda-forge
 s3transfer                0.5.2              pyhd8ed1ab_0    conda-forge
 scikit-learn              1.1.1            py38hf80bbf7_0    conda-forge
-scipy                     1.9.0            py38hea3f02b_0    conda-forge
+scipy                     1.8.1            py38hea3f02b_2    conda-forge
 send2trash                1.8.0                    pypi_0    pypi
 setuptools                59.8.0           py38h578d9bd_1    conda-forge
 six                       1.16.0             pyh6c4a22f_0    conda-forge
@@ -309,7 +308,7 @@ typing_extensions         4.3.0              pyha770c72_0    conda-forge
 typing_inspect            0.7.1              pyh6c4a22f_0    conda-forge
 unixodbc                  2.3.10               h583eb01_0    conda-forge
 urllib3                   1.26.11            pyhd8ed1ab_0    conda-forge
-virtualenv                20.16.2                  pypi_0    pypi
+virtualenv                20.16.1                  pypi_0    pypi
 wcwidth                   0.2.5              pyh9f0ad1d_2    conda-forge
 webencodings              0.5.1                    pypi_0    pypi
 wheel                     0.37.1             pyhd8ed1ab_0    conda-forge
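
For anyone repeating this comparison, the following is a minimal sketch of how two `conda list` outputs could be compared programmatically rather than by eye. The function names and the tiny `PASSING`/`FAILING` samples (rows copied from the diff above) are illustrative assumptions, not part of the Modin or Dask tooling.

```python
def parse_conda_list(text):
    """Map package name -> (version, build) from `conda list` style output."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# Name  Version ..." header comments.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 3:
            continue
        name, version, build = fields[:3]
        packages[name] = (version, build)
    return packages


def changed_packages(passing_text, failing_text):
    """Return {name: (passing_entry, failing_entry)} for every package that differs."""
    passing = parse_conda_list(passing_text)
    failing = parse_conda_list(failing_text)
    changes = {}
    for name in sorted(set(passing) | set(failing)):
        before = passing.get(name)  # None if the package is absent
        after = failing.get(name)
        if before != after:
            changes[name] = (before, after)
    return changes


# Hypothetical samples: a few rows taken from the diff above.
PASSING = """\
s3fs                      2022.5.0           pyhd8ed1ab_0    conda-forge
fsspec                    2022.5.0           pyhd8ed1ab_0    conda-forge
pandas                    1.4.3            py38h47df419_0    conda-forge
pip                       22.2               pyhd8ed1ab_0    conda-forge
"""
FAILING = """\
s3fs                      2022.7.1           pyhd8ed1ab_0    conda-forge
fsspec                    2022.7.1           pyhd8ed1ab_0    conda-forge
pandas                    1.4.3            py38h47df419_0    conda-forge
pip                       22.2.1             pyhd8ed1ab_0    conda-forge
pkgutil-resolve-name      1.3.10                   pypi_0    pypi
"""

if __name__ == "__main__":
    for name, (before, after) in changed_packages(PASSING, FAILING).items():
        print(name, before, "->", after)
```

In practice the two strings would come from saving the output of `conda list` in a passing and a failing CI run.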

Dask’s test suite was also stalling around the same time, see: Remove `werkzeug` pin in CI · Issue #9323 · dask/dask · GitHub (and relevant PR)

With this context, maybe the issue is in the s3fs version:

-s3fs                      2022.7.1           pyhd8ed1ab_0    conda-forge
+s3fs                      2022.5.0           pyhd8ed1ab_0    conda-forge

You could try pinning s3fs, or the same package that Dask pinned (werkzeug), and see if that helps.

@pavithraes it turns out I was able to fix the error by increasing the dask memory limit: TEST-#4760: Limit only ray memory to 1 GB. by mvashishtha · Pull Request #4768 · modin-project/modin · GitHub

It looks like dask was going to wait for eternity on two keys that it was never going to get because they were for values computed by workers that crashed.

It wasn’t at all clear to me that the way to unstick Dask was to increase its memory limit. Dask also should not hang forever when workers crash. Can the Dask team please prioritize fixing this bug?
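
For readers hitting the same hang, here is a rough sketch of the two mitigations discussed above when a Dask cluster is created directly: giving each worker a larger explicit memory limit, and bounding the wait on a result so a test fails instead of hanging. This is not how Modin configures Dask; the helper names and the 8 GiB / 4 worker figures are assumptions chosen for illustration.

```python
def worker_memory_limit(total_bytes, n_workers):
    """Split a total memory budget evenly into a per-worker limit in bytes."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    return total_bytes // n_workers


def run_with_memory_budget(total_bytes, n_workers, timeout_s=60):
    """Run a trivial task on a local cluster with an explicit memory budget."""
    # Imported lazily so worker_memory_limit works without distributed installed.
    from distributed import Client, LocalCluster

    limit = worker_memory_limit(total_bytes, n_workers)
    # LocalCluster treats memory_limit as a per-worker limit.
    with LocalCluster(
        n_workers=n_workers, threads_per_worker=1, memory_limit=limit
    ) as cluster:
        with Client(cluster) as client:
            future = client.submit(sum, range(10))
            # A timeout turns an indefinite wait into an error the test can report.
            return future.result(timeout=timeout_s)


if __name__ == "__main__":
    # Hypothetical budget: 8 GiB shared by 4 workers.
    print(run_with_memory_budget(total_bytes=8 * 2**30, n_workers=4))
```

The timeout does not fix the underlying scheduler behaviour; it only keeps a CI job from sitting idle for hours.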

Thanks for the details!

It looks like dask was going to wait for eternity on two keys that it was never going to get because they were for values computed by workers that crashed.

dask/distributed has been updated quite a lot since the version pinned in the Modin CI. Could you please confirm if the above issue persists with a more recent version of Dask?

@pavithraes

could you please confirm if the above issue persists with a more recent version of Dask?

On my Mac, even with dask 2022.7.1, if I run the same test with memory limited to 1 GB using MODIN_ENGINE=DASK MODIN_MEMORY=1000000000 pytest modin/pandas/test/test_io.py --full-trace, the test gets stuck for several minutes after 160 test cases. Once I remove the memory limit, the test completes without getting stuck anywhere.

So the bug seems to apply to dask 2022.7.1 as well.
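
To put the 1 GB figure in context, here is a small arithmetic sketch of the byte thresholds at which a worker would react, assuming the default `distributed.worker.memory` fractions (target 0.60, spill 0.70, pause 0.80, terminate 0.95) and assuming the limit applies per worker. Whether Modin passes `MODIN_MEMORY` as a per-worker or a total budget is not stated in this thread, so treat the input value as illustrative.

```python
# Assumed default fractions of the per-worker memory limit, as percentages.
# The 95% "terminate" value matches the nanny warning in the log above.
DEFAULT_THRESHOLDS_PERCENT = {
    "target": 60,     # start spilling least-recently-used data to disk
    "spill": 70,      # spill based on process memory
    "pause": 80,      # stop accepting new tasks
    "terminate": 95,  # nanny kills and restarts the worker
}


def memory_thresholds(limit_bytes, percents=DEFAULT_THRESHOLDS_PERCENT):
    """Return the byte count at which each threshold is crossed."""
    # Integer arithmetic avoids floating-point rounding surprises.
    return {name: limit_bytes * pct // 100 for name, pct in percents.items()}


if __name__ == "__main__":
    for name, n_bytes in memory_thresholds(1_000_000_000).items():
        print(f"{name:>9}: {n_bytes:,} bytes")
```

With a limit this small, a single large partition plus interpreter overhead can plausibly push a worker past the terminate threshold, which is consistent with the repeated restarts in the log.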

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): macOS Monterey
  • Computer model: MacBook Pro (16-inch, 2019)
  • Memory: 16 GB 2667 MHz DDR4
  • Processor: 2.3 GHz 8-Core Intel Core i9
  • Modin version (modin.__version__): 3f985ed6864cc1b5b587094d75ca5b2695e4139f
  • Python version: 3.10.4
  • Code we can use to reproduce: