Best practice for shutting down a cluster

I’ve got a LocalCluster working well except for teardown. I’ve tried a number of functions, including:

  • client.close()
  • client.shutdown()
  • client.scheduler.shutdown()
  • client.retire_workers()
  • doing nothing

But I’ve found no combination that avoids a variety of errors on exit. All the examples I’ve found are interactive ipynb files that don’t ever terminate the cluster. So basic question:

What is the best practice for cleanly terminating a client / cluster?

Thanks.

Hi @rjplevin,

This is also something that I’ve experienced, and I don’t have much good advice. I’d say the best one to use is client.shutdown(), but it’s true that with some workloads you can get random errors… In general you can safely ignore them, but I agree this isn’t really clean.
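That said, the pattern that exits most cleanly for me is to use the cluster and client as context managers, so teardown happens in a well-defined order when the block exits. A minimal sketch (the submitted work is just a placeholder):

from dask.distributed import Client, LocalCluster

# Both objects support the context-manager protocol: on exit the client
# closes first, then the cluster shuts down its workers and scheduler.
with LocalCluster() as cluster, Client(cluster) as client:
    result = client.submit(sum, [1, 2, 3]).result()  # placeholder work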

Is there any way to suppress these errors or redirect them somewhere else?
I’ve tried:
- Redirecting stderr and stdout before calling client.shutdown();
- Setting logging.distributed to the error level in the dask.config configuration dictionary (sketched below);
- Using client.register_worker_callbacks() to call a function that silences logging every time a worker is created.
However, I can’t seem to ever silence the WARNING messages I get when I shut down the client. I am building an application on top of Dask, and I would like users not to see these “alarming” messages if they don’t have to.
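For reference, the dask.config attempt from the second bullet looked roughly like this (a minimal sketch):

import dask

# Attempted: lower the level of the distributed loggers via dask's config.
# distributed seems to apply the `logging` section when it first sets up
# logging, so changing it after import may be a no-op, which could be why
# it didn't silence the shutdown warnings for me.
dask.config.set({"logging": {"distributed": "error"}})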

Hi @byrom771, welcome here!

Could you give us the errors you are seeing?

Hello @guillaumeeb,
I am getting messages like this:
2024-03-04 18:02:56,695 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('sample_refine_output-8f067a36-aa35-4f72-885e-dac1f39df7a9')" coro=<Worker.execute() done, defined at pathtoenv/lib/python3.10/site-packages/distributed/worker_state_machine.py:3615>> ended with CancelledError
2024-03-04 18:02:59,891 - distributed.nanny - WARNING - Worker process still alive after 3.1999992370605472 seconds, killing
I get several of each type after I try to shut down my client by calling client.cancel(futures) (where futures is the list of futures I have running) followed by client.shutdown().
So far the only way I have found to suppress these messages is to set the logging level to CRITICAL (with logging.getLogger('distributed').setLevel(logging.CRITICAL)), both in the workers (by using client.register_worker_callbacks and calling a function that sets the logging level) and in the client process (by setting the logging level to CRITICAL before I call client.cancel and client.shutdown).
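Concretely, the workaround looks roughly like this (a sketch; client and futures are the objects described above, and _silence_distributed is just my helper name):

import logging

def _silence_distributed():
    # Only CRITICAL messages from the distributed logger tree get through.
    logging.getLogger("distributed").setLevel(logging.CRITICAL)

client.register_worker_callbacks(_silence_distributed)  # runs on every worker
_silence_distributed()  # also silence the client process
client.cancel(futures)
client.shutdown()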
Is there any way to redirect these logging messages from the distributed logger to a file?
Is it possible to simply remove the console handler and add a file handler to the distributed logger manually? I have tried with no success; maybe I just don’t understand how the Python logging module works.
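In case it helps to see it, the handler swap I attempted looked something like this (a sketch; distributed.log is an arbitrary file name):

import logging

logger = logging.getLogger("distributed")
for handler in list(logger.handlers):
    logger.removeHandler(handler)  # drop the existing console handler
logger.addHandler(logging.FileHandler("distributed.log"))
# This only affects the client process, which may be why it didn't work.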
Should this be done through the dask config instead?
I appreciate the assistance.

Those errors look normal in your case: the first is a task being cancelled mid-execution, and the second is the nanny killing a worker process that didn’t exit within its grace period.

If you want to hide them then yes, you should configure logging; this can be done using the dask config: Debug — Dask documentation. Keep in mind that the worker and nanny messages are emitted by separate processes, so redirecting handlers in the client process alone won’t catch them; the configuration has to reach the worker processes as well.
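As a starting point, something along these lines in a dask configuration file should lower the level of the distributed loggers (a sketch; the keys under logging are ordinary dotted logger names):

logging:
  distributed: error
  distributed.worker: error
  distributed.nanny: error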

Do not hesitate to share your configuration if you come up with something, this could help other users.

I tried creating a config.yaml file in {sys.prefix}/etc/dask/ and calling dask.config.refresh(), but that didn’t work for certain warnings, so in the end I simply did this:

import logging
from distributed.utils import silence_logging_cmgr

with silence_logging_cmgr(logging.CRITICAL):
    client.cancel(futures)
    client.shutdown()

This resulted in a successful suppression of messages below the CRITICAL logging level when shutting down the client.
