AssertionError: Status.running

Hi, I am using Dask to create a small SGE cluster and run jobs there. Occasionally I see the following error when closing the client and the cluster. Does anyone have any idea what's going on here? Thanks.

  File "dask_utils.py", line 62, in dask_del
    cluster.close()
  File "/hpc/apps/pyhpc/dist/conda/x86_64/envs/cuda-11.0/lib/python3.8/site-packages/distributed/deploy/cluster.py", line 110, in close
    return self.sync(self._close, callback_timeout=timeout)
  File "/hpc/apps/pyhpc/dist/conda/x86_64/envs/cuda-11.0/lib/python3.8/site-packages/distributed/deploy/cluster.py", line 189, in sync
    return sync(self.loop, func, *args, **kwargs)
  File "/hpc/apps/pyhpc/dist/conda/x86_64/envs/cuda-11.0/lib/python3.8/site-packages/distributed/utils.py", line 351, in sync
    raise exc.with_traceback(tb)
  File "/hpc/apps/pyhpc/dist/conda/x86_64/envs/cuda-11.0/lib/python3.8/site-packages/distributed/utils.py", line 334, in f
    result[0] = yield future
  File "/hpc/apps/pyhpc/dist/conda/x86_64/envs/cuda-11.0/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/hpc/apps/pyhpc/dist/conda/x86_64/envs/cuda-11.0/lib/python3.8/site-packages/distributed/deploy/spec.py", line 418, in _close
    assert w.status == Status.closed, w.status
AssertionError: Status.running
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/hpc/apps/pyhpc/dist/conda/x86_64/envs/cuda-11.0/lib/python3.8/site-packages/distributed/deploy/spec.py", line 652, in close_clusters
    cluster.close(timeout=10)
  File "/hpc/apps/pyhpc/dist/conda/x86_64/envs/cuda-11.0/lib/python3.8/site-packages/distributed/deploy/cluster.py", line 110, in close
    return self.sync(self._close, callback_timeout=timeout)
  File "/hpc/apps/pyhpc/dist/conda/x86_64/envs/cuda-11.0/lib/python3.8/site-packages/distributed/deploy/cluster.py", line 189, in sync
    return sync(self.loop, func, *args, **kwargs)
  File "/hpc/apps/pyhpc/dist/conda/x86_64/envs/cuda-11.0/lib/python3.8/site-packages/distributed/utils.py", line 348, in sync
    e.wait(10)
  File "/hpc/apps/pyhpc/dist/conda/x86_64/envs/cuda-11.0/lib/python3.8/threading.py", line 558, in wait
    signaled = self._cond.wait(timeout)
  File "/hpc/apps/pyhpc/dist/conda/x86_64/envs/cuda-11.0/lib/python3.8/threading.py", line 306, in wait
    gotit = waiter.acquire(True, timeout)

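For context, the cluster and client are created roughly like the sketch below (the queue, cores, and memory values are placeholders, not my exact configuration):

from dask_jobqueue import SGECluster
from distributed import Client

# Placeholder resources; the real setup uses different values
cluster = SGECluster(queue="default.q", cores=4, memory="16GB")
cluster.scale(jobs=2)     # request two SGE jobs
client = Client(cluster)

# ... submit work through the client, then clean up with dask_del() ...
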
This is my handy cleanup routine:

import os
import shutil


def dask_del(cluster, client, odask):
    """Remove a Dask job as well as its tmp directory."""
    client.close()
    cluster.close()

    # Delete all Dask files in the tmp directory
    for file in os.listdir(odask):
        file_path = os.path.join(odask, file)
        try:
            if os.path.isfile(file_path):
                os.unlink(file_path)
            elif os.path.isdir(file_path):
                shutil.rmtree(file_path)
        except Exception as e:
            print(e)

@llodds thanks for the question and welcome to Discourse! I'm not quite sure what's going on here; I would expect calling cluster.close() on the SGE cluster to work. @guillaumeeb do you have any thoughts on this?


Hi @llodds,

I have to admit I've already seen exceptions when closing cluster objects, which is unfortunate. I'm not sure whether this can occur with all SpecCluster-like objects or only with dask-jobqueue. I usually use dask-jobqueue interactively through Jupyter notebooks, and I just stop/restart the underlying kernel when I want to release cluster resources, so I've never been bothered by this issue.

Anyway, any help in understanding this issue would be welcome. Maybe an issue in the dask-jobqueue GitHub repository would be a better place to discuss it? Looking at the stack trace you posted, it looks like a timeout issue: did the cluster take too long to close itself?
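
If it is indeed a timeout, one simple thing to try (the 60 seconds below is arbitrary) is to give close() more time than the default:

client.close()
cluster.close(timeout=60)  # allow more time for workers to reach Status.closed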


Thank you @scharlottej13 and @guillaumeeb!

I tried adding time.sleep(5) in dask_del() before client.close(), and I have not seen the error since. I guess Dask needs some time to fully ramp up before it can be closed.
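
In other words, the workaround is just a short pause at the top of dask_del() before closing (the 5 seconds is arbitrary; it is simply what worked here):

import time

def dask_del(cluster, client, odask):
    """Remove a Dask job as well as its tmp directory."""
    time.sleep(5)      # give the cluster a moment to finish starting up
    client.close()
    cluster.close()
    # ... delete the tmp directory contents as before ...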