Dask dashboard crashes but client is still running

This doesn’t happen often, maybe once per 30 jobs, but when it does it is a bit annoying. Is there a way to restart the dashboard without restarting the client?

Many thanks!

Hi @liberabaci,

Which type of Distributed cluster are you using? I’ve never heard of the Dashboard crashing before. How are you producing this behavior? Is your workflow/workload complex?

The Dashboard is started by the Scheduler, but I’m not sure there is a way to check its state or simply restart it…
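One thing that can at least be observed from the outside is whether the dashboard still answers HTTP requests. Below is a minimal sketch using only the standard library. The helper name `probe_dashboard` is made up for illustration, and it only tells you what the server replies; it does not restart anything.

```python
import urllib.error
import urllib.request
from typing import Optional


def probe_dashboard(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status code for `url`, or None if nothing answers."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 500.
        return exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, unresolvable host, and similar.
        return None
```

With a live client this could be called as `probe_dashboard(client.dashboard_link)`. A result of 200 would suggest the page is being served, 500 that the server is up but the page handler is failing, and `None` that nothing is listening at that address.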

Hi @guillaumeeb,
I’m using SLURMCluster. I think I figured out the issue. I’m starting the cluster with a walltime. How does client.restart affect the walltime? Will it be reset to zero?

client.restart only restarts the Workers, but it does not modify any walltime.
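To make the arithmetic concrete, here is a small sketch of how much walltime is left at a given moment, assuming the clock starts when the batch job starts and keeps running across a restart. The function names and timestamps are hypothetical, and only the plain "HH:MM:SS" form is handled.

```python
from datetime import datetime, timedelta


def parse_walltime(walltime: str) -> timedelta:
    """Convert an "HH:MM:SS" walltime string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in walltime.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def remaining_walltime(job_start: datetime, walltime: str, now: datetime) -> timedelta:
    """Time left before a job started at `job_start` hits its walltime."""
    deadline = job_start + parse_walltime(walltime)
    # Never report a negative duration once the deadline has passed.
    return max(deadline - now, timedelta(0))


# Hypothetical timestamps, for illustration only.
started = datetime(2025, 2, 5, 8, 0, 0)
restarted = datetime(2025, 2, 5, 11, 0, 0)  # moment client.restart() was called
print(remaining_walltime(started, "16:10:00", now=restarted))  # 13:10:00
```

So under that assumption, a restart three hours into a 16:10:00 job still leaves 13:10:00, not the full allocation.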

Could you explain what you mean by starting the cluster with a walltime? Where are you creating the SLURMCluster object?

This is my typical initial startup script:

def start_dask(num_workers=100):
    import os
    from dask_jobqueue import SLURMCluster
    import dask
    from dask.distributed import Client

    user = os.getlogin()
    # Disable the scheduler's worker time-to-live check.
    dask.config.set({"distributed.scheduler.worker-ttl": None})
    # Serve the dashboard on a fixed port.
    scheduler_options = {
        "dashboard_address": ":8099",
    }
    # One single-core, 50 GB worker process per SLURM job.
    cluster = SLURMCluster(
        cores=1, memory='50GB', processes=1, n_workers=num_workers,
        queue="reg", scheduler_options=scheduler_options,
        log_directory='/foo/%s/.logs' % user, local_directory='/foo/%s/.dask' % user,
        walltime="16:10:00",
    )
    client = Client(cluster)

    return client

Hi @guillaumeeb ,

Just got the error again. When the server crashes this is the traceback I’m seeing in the logs:

AttributeError: 'str' object has no attribute 'text'
2025.02.05 11:19:15.946 ERROR   tornado.application                      log_exception.1871 Uncaught exception GET /status ()
HTTPServerRequest(protocol='http', host=':8099', method='GET', uri='/status', version='HTTP/1.1', remote_ip='')
Traceback (most recent call last):
  File "/venv/lib64/python3.8/site-packages/tornado/web.py", line 1786, in _execute
    result = await result
  File "/venv/lib64/python3.8/site-packages/bokeh/server/views/doc_handler.py", line 54, in get
    session = await self.get_session()
  File "/venv/lib64/python3.8/site-packages/bokeh/server/views/session_handler.py", line 145, in get_session
    session = await self.application_context.create_session_if_needed(session_id, self.request, token)
  File "/venv/lib64/python3.8/site-packages/bokeh/server/contexts.py", line 242, in create_session_if_needed
    self._application.initialize_document(doc)
  File "/venv/lib64/python3.8/site-packages/bokeh/application/application.py", line 192, in initialize_document
    h.modify_document(doc)
  File "/venv/lib64/python3.8/site-packages/bokeh/application/handlers/function.py", line 142, in modify_document
    self._func(doc)
  File "/venv/lib64/python3.8/site-packages/distributed/utils.py", line 760, in wrapper
    return func(*args, **kwargs)
  File "/venv/lib64/python3.8/site-packages/distributed/dashboard/components/scheduler.py", line 4265, in status_doc
    cluster_memory.update()
  File "/venv/lib64/python3.8/site-packages/bokeh/core/property/validation.py", line 95, in func
    return input_function(*args, **kwargs)
  File "/venv/lib64/python3.8/site-packages/distributed/utils.py", line 760, in wrapper
    return func(*args, **kwargs)
  File "/venv/lib64/python3.8/site-packages/distributed/dashboard/components/scheduler.py", line 431, in update
    self.root.title.text = title
AttributeError: 'str' object has no attribute 'text'

I assume you start this on the login node, or equivalent, so without any walltime?

Do you see anything else? What happens then when you try to reload the dashboard?

I see you are using Python 3.8, which is quite old. What is your Python environment?
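If it helps, something like the following would collect the versions most relevant to the traceback. It is only a sketch: the function name `environment_report` and the default package list are my own choices, and it relies on `importlib.metadata`, which is in the standard library from Python 3.8.

```python
import platform
from importlib import metadata  # standard library from Python 3.8 onwards


def environment_report(packages=("dask", "distributed", "dask-jobqueue", "bokeh", "tornado")):
    """Map "python" and each package name to its installed version, or None if absent."""
    report = {"python": platform.python_version()}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is not installed in this environment.
            report[name] = None
    return report


for name, version in environment_report().items():
    print(f"{name}: {version}")
```

Running it in the environment that starts the cluster shows what the Scheduler, and therefore the dashboard, is using. With a connected client, `client.get_versions(check=True)` additionally compares versions between the client, scheduler, and workers.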

Yes, it is run on the login node without walltime.
When I try to reload the dashboard I get:
500: Internal Server Error

The venv is rather large. I’ll try to reproduce with a smaller env. I will also try upgrading to Python 3.10. Thanks for the help.