Client does not return workers, Job dies quickly

Hi,
I was trying to use Dask-Jobqueue to launch a Dask cluster using PBS.

The PBS in my cluster uses -P to denote the project. So I changed -A in dask_jobqueue/pbs.py to use -P.

The PBS on my cluster also requires ncpus and mem directly as PBS directives; it does not allow `select`. So I used a workaround:

cluster = PBSCluster(
    processes=2,
    cores=48,
    memory="192GB",
    local_directory='$TMPDIR',
    project='vp91',
    queue='normal',
    resource_spec='ncpus=48 \n#PBS -l mem=192GB',
    walltime='02:00:00',
)

When I check the job script, it seems to generate what I expect:

In [4]: print(cluster.job_script())
/scratch/vp91/Training/environment/lib/python3.11/site-packages/dask_jobqueue/pbs.py:82: FutureWarning: project has been renamed to account as this kwarg was used wit -A option. You are still using it (please also check config files). If you did not set account yet, project will be respected for now, but it will be removed in a future release. If you already set account, project is ignored and you can remove it.
  warnings.warn(warn, FutureWarning)
#!/usr/bin/env bash

#PBS -N dask-worker
#PBS -q normal
#PBS -P vp91
#PBS -l ncpus=48
#PBS -l mem=192GB
#PBS -l walltime=02:00:00

/scratch/vp91/Training/environment/bin/python -m distributed.cli.dask_worker tcp://203.0.19.135:43161 --nthreads 21 --nworkers 2 --memory-limit 89.41GiB --name dummy-name --nanny --death-timeout 60 --local-directory $TMPDIR

When I check the job queue, it shows that the job has been launched correctly:

Job id                 Name             User              Time Use S Queue
---------------------  ---------------- ----------------  -------- - -----
90829702.gadi-pbs      dask-worker      jj8451            00:00:00 R normal-exec

But when I check the client, this is returned:

In [6]: client = Client(cluster)

In [7]: client
Out[7]: <Client: 'tcp://203.0.19.135:43161' processes=0 threads=0, memory=0 B>

Moreover, the job dies off quickly. Any suggestions as to what I am doing wrong?

Hi @josephjohnjj, welcome to Dask community!

The first thing you should do is retrieve the output logs of the PBS job running the worker. If it dies quickly, either there is an error during its startup, or it cannot connect to the Scheduler and shuts itself down.
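Assuming default PBS behaviour (no `-o`/`-e` overrides in the job script), the worker job's stdout and stderr land in the directory you submitted from, named after the job name and the numeric part of the job id. A sketch of where to look, using the job id from your `qstat` listing:

```shell
# PBS writes a job's output next to where you submitted it, by default as
# <jobname>.o<numeric-jobid> and <jobname>.e<numeric-jobid>.
jobid="90829702.gadi-pbs"            # as shown by qstat
out="dask-worker.o${jobid%%.*}"      # strip the server suffix after the dot
err="dask-worker.e${jobid%%.*}"
echo "$out $err"
# cat "$out" "$err"                  # inspect the worker's startup output
```

While the job is still running, `qstat -f <jobid>` also reports `Output_Path` and `Error_Path` if you are unsure where they will land.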

Hi @guillaumeeb,

Thanks for getting back to me. The log from PBS shows the following

2023-07-20 17:31:12,918 - distributed.nanny - INFO - Closing Nanny at 'tcp://10.6.66.63:42371'. Reason: nanny-close
2023-07-20 17:31:12,918 - distributed.nanny - INFO - Closing Nanny at 'tcp://10.6.66.63:41911'. Reason: nanny-close
2023-07-20 17:31:12,919 - distributed.dask_worker - INFO - End worker
Traceback (most recent call last):
  File "/home/659/jj8451/.local/lib/python3.11/site-packages/distributed/utils.py", line 1873, in wait_for
    return await fut
           ^^^^^^^^^
  File "/home/659/jj8451/.local/lib/python3.11/site-packages/distributed/comm/tcp.py", line 491, in connect
    stream = await self.client.connect(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/659/jj8451/.local/lib/python3.11/site-packages/tornado/tcpclient.py", line 279, in connect
    af, addr, stream = await connector.start(connect_timeout=timeout)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/659/jj8451/.local/lib/python3.11/site-packages/distributed/comm/core.py", line 336, in connect
    comm = await wait_for(
           ^^^^^^^^^^^^^^^
  File "/home/659/jj8451/.local/lib/python3.11/site-packages/distributed/utils.py", line 1872, in wait_for
    async with asyncio.timeout(timeout):
  File "/scratch/vp91/Training/environment/lib/python3.11/asyncio/timeouts.py", line 111, in __aexit__
    raise TimeoutError from exc_val
TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/659/jj8451/.local/lib/python3.11/site-packages/distributed/core.py", line 626, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
  File "/home/659/jj8451/.local/lib/python3.11/site-packages/distributed/utils.py", line 1873, in wait_for
    return await fut
           ^^^^^^^^^
  File "/home/659/jj8451/.local/lib/python3.11/site-packages/distributed/nanny.py", line 353, in start_unsafe
    msg = await self.scheduler.register_nanny()
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/659/jj8451/.local/lib/python3.11/site-packages/distributed/core.py", line 1362, in send_recv_from_rpc
    comm = await self.pool.connect(self.addr)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/659/jj8451/.local/lib/python3.11/site-packages/distributed/core.py", line 1606, in connect
    return await connect_attempt
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/659/jj8451/.local/lib/python3.11/site-packages/distributed/core.py", line 1527, in _connect
    comm = await connect(
           ^^^^^^^^^^^^^^
  File "/home/659/jj8451/.local/lib/python3.11/site-packages/distributed/comm/core.py", line 362, in connect
    raise OSError(
OSError: Timed out trying to connect to tcp://203.0.19.135:43161 after 30 s

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/659/jj8451/.local/lib/python3.11/site-packages/distributed/cli/dask_worker.py", line 540, in <module>
    main()  # pragma: no cover
    ^^^^^^
  File "/home/659/jj8451/.local/lib/python3.11/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/659/jj8451/.local/lib/python3.11/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/home/659/jj8451/.local/lib/python3.11/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/659/jj8451/.local/lib/python3.11/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/659/jj8451/.local/lib/python3.11/site-packages/distributed/cli/dask_worker.py", line 446, in main
    asyncio.run(run())
  File "/scratch/vp91/Training/environment/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/scratch/vp91/Training/environment/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/vp91/Training/environment/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/home/659/jj8451/.local/lib/python3.11/site-packages/distributed/cli/dask_worker.py", line 443, in run
    [task.result() for task in done]
  File "/home/659/jj8451/.local/lib/python3.11/site-packages/distributed/cli/dask_worker.py", line 443, in <listcomp>
    [task.result() for task in done]
     ^^^^^^^^^^^^^
  File "/home/659/jj8451/.local/lib/python3.11/site-packages/distributed/cli/dask_worker.py", line 418, in wait_for_nannies_to_finish
    await asyncio.gather(*nannies)
  File "/scratch/vp91/Training/environment/lib/python3.11/asyncio/tasks.py", line 684, in _wrap_awaitable
    return (yield from awaitable.__await__())
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/659/jj8451/.local/lib/python3.11/site-packages/distributed/core.py", line 634, in start
    raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Nanny failed to start.

Regards,
Joseph

I also tried starting a local cluster using

LocalCluster (cluster = LocalCluster(n_workers=48, processes=True , threads_per_worker=1))

but the result is the same:

/home/659/jj8451/.local/lib/python3.11/site-packages/distributed/node.py:182: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 36919 instead
  warnings.warn(
2023-07-22 13:36:21,288 - distributed.nanny - ERROR - Failed to start process
Traceback (most recent call last):
  File "/home/659/jj8451/.local/lib/python3.11/site-packages/distributed/nanny.py", line 432, in instantiate
    result = await self.process.start()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/659/jj8451/.local/lib/python3.11/site-packages/distributed/nanny.py", line 720, in start
    await self.process.start()
  File "/home/659/jj8451/.local/lib/python3.11/site-packages/distributed/process.py", line 55, in _call_and_set_future
    res = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "/home/659/jj8451/.local/lib/python3.11/site-packages/distributed/process.py", line 215, in _start
    process.start()
  File "/scratch/vp91/Training/environment/lib/python3.11/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
                  ^^^^^^^^^^^^^^^^^
  File "/scratch/vp91/Training/environment/lib/python3.11/multiprocessing/context.py", line 288, in _Popen
    return Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^
  File "/scratch/vp91/Training/environment/lib/python3.11/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/scratch/vp91/Training/environment/lib/python3.11/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/scratch/vp91/Training/environment/lib/python3.11/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/scratch/vp91/Training/environment/lib/python3.11/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: cannot pickle 'generator' object
2023-07-22 13:36:21,355 - distributed.nanny - ERROR - Failed to start process
Traceback (most recent call last):
  File "/home/659/jj8451/.local/lib/python3.11/site-packages/distributed/nanny.py", line 432, in instantiate
    result = await self.process.start()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/659/jj8451/.local/lib/python3.11/site-packages/distributed/nanny.py", line 720, in start
    await self.process.start()
  File "/home/659/jj8451/.local/lib/python3.11/site-packages/distributed/process.py", line 55, in _call_and_set_future
    res = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "/home/659/jj8451/.local/lib/python3.11/site-packages/distributed/process.py", line 215, in _start
    process.start()
  File "/scratch/vp91/Training/environment/lib/python3.11/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
                  ^^^^^^^^^^^^^^^^^
  File "/scratch/vp91/Training/environment/lib/python3.11/multiprocessing/context.py", line 288, in _Popen
    return Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^
  File "/scratch/vp91/Training/environment/lib/python3.11/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/scratch/vp91/Training/environment/lib/python3.11/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/scratch/vp91/Training/environment/lib/python3.11/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/scratch/vp91/Training/environment/lib/python3.11/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: cannot pickle 'generator' object
2023-07-22 13:36:21,359 - distributed.nanny - ERROR - Failed to start process
Traceback (most recent call last):
  File "/home/659/jj8451/.local/lib/python3.11/site-packages/distributed/nanny.py", line 432, in instantiate
    result = await self.process.start()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/659/jj8451/.local/lib/python3.11/site-packages/distributed/nanny.py", line 720, in start
    await self.process.start()
  File "/home/659/jj8451/.local/lib/python3.11/site-packages/distributed/process.py", line 55, in _call_and_set_future
    res = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "/home/659/jj8451/.local/lib/python3.11/site-packages/distributed/process.py", line 215, in _start
    process.start()
  File "/scratch/vp91/Training/environment/lib/python3.11/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
                  ^^^^^^^^^^^^^^^^^
  File "/scratch/vp91/Training/environment/lib/python3.11/multiprocessing/context.py", line 288, in _Popen
    return Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^
  File "/scratch/vp91/Training/environment/lib/python3.11/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/scratch/vp91/Training/environment/lib/python3.11/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/scratch/vp91/Training/environment/lib/python3.11/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/scratch/vp91/Training/environment/lib/python3.11/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: cannot pickle 'generator' object
2023-07-22 13:36:21,363 - distributed.nanny - ERROR - Failed to start process
Traceback (most recent call last):
  File "/home/659/jj8451/.local/lib/python3.11/site-packages/distributed/nanny.py", line 432, in instantiate
    result = await self.process.start()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/659/jj8451/.local/lib/python3.11/site-packages/distributed/nanny.py", line 720, in start
    await self.process.start()
  File "/home/659/jj8451/.local/lib/python3.11/site-packages/distributed/process.py", line 55, in _call_and_set_future
    res = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "/home/659/jj8451/.local/lib/python3.11/site-packages/distributed/process.py", line 215, in _start
    process.start()
  File "/scratch/vp91/Training/environment/lib/python3.11/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
                  ^^^^^^^^^^^^^^^^^
  File "/scratch/vp91/Training/environment/lib/python3.11/multiprocessing/context.py", line 288, in _Popen
    return Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^
  File "/scratch/vp91/Training/environment/lib/python3.11/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/scratch/vp91/Training/environment/lib/python3.11/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/scratch/vp91/Training/environment/lib/python3.11/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/scratch/vp91/Training/environment/lib/python3.11/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: cannot pickle 'generator' object

We have two different errors here.

When using PBSCluster, it looks like a networking problem. Your workers are not able to connect to the Scheduler URL, so probably the Scheduler is not listening on the correct network interface. You’ll want either to use the interface kwarg to specify an interface common to the node where your Scheduler is started and the nodes where the Workers are running, or to just specify the correct host address for the Scheduler.
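To pick a value for interface, you can list the interfaces visible on a node; run this on both the node where the Scheduler starts and a compute node, and choose a name common to both (on an HPC system that is typically an Infiniband interface like `ib0` — that name is an assumption, check your own system):

```python
import socket

# Enumerate the network interfaces visible on this node.
# Run on both the scheduler node and a worker node, then pick a
# name that appears on both.
interfaces = [name for _, name in socket.if_nameindex()]
print(interfaces)

# Hypothetical usage -- "ib0" is an assumption, substitute an
# interface from the list above:
# cluster = PBSCluster(..., interface="ib0")
# or pin only the Scheduler to a given interface:
# cluster = PBSCluster(..., scheduler_options={"interface": "ib0"})
```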

The second one with LocalCluster is more surprising. I would guess it is some environment error. But wait, are you really using two nested LocalCluster constructor calls? Shouldn’t you just use:
cluster = LocalCluster(n_workers=48, processes=True , threads_per_worker=1)
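For context on that LocalCluster traceback: with the spawn start method, everything handed to a worker process must be picklable, and generators are not. A minimal reproduction of that exact TypeError:

```python
import pickle

def numbers():
    # Any generator function will do for this demonstration.
    yield 1

gen = numbers()
try:
    # Spawn-based multiprocessing serializes process state the same way.
    pickle.dumps(gen)
except TypeError as exc:
    print(exc)  # cannot pickle 'generator' object
```

So something at the call site, or in the surrounding environment, is apparently capturing a generator in the cluster setup, which prevents the nanny from spawning its worker process.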