Modin is running dask unit tests that very frequently get stuck for several hours. I reproduced one such failure, interrupted the stuck test with Ctrl-C, and got some errors, the first of which was "distributed.scheduler - ERROR - Couldn't gather keys". This error appears after some workers are restarted with the warning "Worker exceeded 95% memory budget. Restarting".
What’s going wrong? Is dask supposed to be able to recover from the worker failures?
I’m sorry I don’t have a minimal reproducible example, but I had to jump through a lot of hoops just to get this far in debugging the failure, which I can only reproduce in GitHub CI. It’s hard to pull the dask-specific parts out of Modin.
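For anyone trying to reproduce the hang, here is a minimal sketch of one way to see where a stuck test is blocked without waiting hours and pressing Ctrl-C. It only uses Python's standard faulthandler module. The helper names and the timeout value are hypothetical and are not part of Modin's CI or of dask.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s, stream=None):
    """Dump the tracebacks of all threads if the process is still running after timeout_s.

    `stream` must be a real file with a file descriptor (defaults to stderr).
    The process is not killed; only the tracebacks are written.
    """
    if stream is None:
        stream = sys.stderr
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=stream, exit=False)


def disarm_hang_watchdog():
    """Cancel a pending dump, e.g. once the test finished in time."""
    faulthandler.cancel_dump_traceback_later()


if __name__ == "__main__":
    # Hypothetical value: dump thread stacks if we are still here after 10 minutes.
    arm_hang_watchdog(600)
    try:
        pass  # the code suspected of hanging would run here
    finally:
        disarm_hang_watchdog()
```

This only shows where the client-side threads are waiting; it does not say why the workers ran out of memory.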
Error
modin/pandas/test/test_io.py::TestSql::test_read_sql_from_sql_server PASSED [ 98%]
modin/pandas/test/test_io.py::TestSql::test_read_sql_from_postgres ^CTokenization took: 0.02 ms
Type conversion took: 0.28 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.22 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.24 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.23 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.03 ms
Type conversion took: 0.25 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.19 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.22 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.20 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.20 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.19 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.17 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.19 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.17 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.19 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.18 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.31 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.27 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.42 ms
Type conversion took: 3.02 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.26 ms
Parser memory cleanup took: 0.01 ms
Tokenization took: 0.03 ms
Type conversion took: 0.35 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.30 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.32 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.30 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.33 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.29 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.28 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.29 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.28 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.29 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.28 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.24 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.24 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.25 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.26 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.24 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.21 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.22 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.20 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.03 ms
Type conversion took: 0.20 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.18 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.20 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.19 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.18 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.20 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.17 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.20 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.21 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.27 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.25 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.31 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.29 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.30 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.31 ms
Parser memory cleanup took: 0.01 ms
Tokenization took: 0.03 ms
Type conversion took: 0.33 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.41 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.34 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.29 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.31 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.29 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.28 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.30 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.28 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.30 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 0.02 ms
Type conversion took: 0.29 ms
Parser memory cleanup took: 0.00 ms
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.scheduler - ERROR - Couldn't gather keys {'lambda-23e1c0c33ee703d05b68f71eec559e49': [], 'lambda-4c13d6c07826f9fdf0bb8fcf2bcd2ad5': []} state: ['waiting', 'waiting'] workers: []
NoneType: None
distributed.scheduler - ERROR - Workers don't have promised key: [], lambda-23e1c0c33ee703d05b68f71eec559e49
NoneType: None
distributed.scheduler - ERROR - Workers don't have promised key: [], lambda-4c13d6c07826f9fdf0bb8fcf2bcd2ad5
NoneType: None
distributed.client - WARNING - Couldn't gather 2 keys, rescheduling {'lambda-23e1c0c33ee703d05b68f71eec559e49': (), 'lambda-4c13d6c07826f9fdf0bb8fcf2bcd2ad5': ()}
---------- coverage: platform linux, python 3.8.13-final-0 -----------
Coverage XML written to file coverage.xml
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! KeyboardInterrupt !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
/usr/share/miniconda3/envs/modin/lib/python3.8/threading.py:306: KeyboardInterrupt
More details are in Modin issue #4760, "TEST: test_io.py on ubuntu + dask gets stuck most of the time in CI" (modin-project/modin on GitHub).