dask_jobqueue.PBSCluster Scale() PBS Script qsub error

I am trying to run a Dask PBSCluster on an HPC system, and when I call cluster.scale(10) it errors out with the following:

Task exception was never retrieved
future: <Task finished name='Task-27' coro=<_wrap_awaitable() done, defined at /user_path/.conda/envs/my_proj/lib/python3.8/asyncio/tasks.py:688> exception=RuntimeError('Command exited with non-zero exit code.\nExit code: 32\nCommand:\nqsub /user_temp_dir/dask_temp/tmp1a30buic.sh\nstdout:\n\nstderr:\nqsub: Error: select statement must be lower case\n\n')>
Traceback (most recent call last):
  File "/user_path/.conda/envs/my_proj/lib/python3.8/asyncio/tasks.py", line 695, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/user_path/.conda/envs/my_proj/lib/python3.8/site-packages/distributed/deploy/spec.py", line 59, in _
    await self.start()
  File "/user_path/.local/lib/python3.8/site-packages/dask_jobqueue/core.py", line 325, in start
    out = await self._submit_job(fn)
  File "/user_path/.local/lib/python3.8/site-packages/dask_jobqueue/core.py", line 308, in _submit_job
    return self._call(shlex.split(self.submit_command) + [script_filename])
  File "/user_path/.local/lib/python3.8/site-packages/dask_jobqueue/core.py", line 403, in _call
    raise RuntimeError(
RuntimeError: Command exited with non-zero exit code.
Exit code: 32
Command:
qsub /dask_temp/tmp1a30buic.sh
stdout:

stderr:
qsub: Error: select statement must be lower case

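For reference, this is roughly how I create the cluster. I am reconstructing the call from the generated script below, so treat the exact argument values as approximate:

from dask_jobqueue import PBSCluster

# Approximate reconstruction -- queue, account and walltime are taken from the
# generated script below; the real call may differ slightly.
cluster = PBSCluster(queue='HIE', cores=44, processes=11, memory='100GB',
                     project='xxxxxxxxx', walltime='23:59:59')
cluster.scale(10)   # this is the call that triggers the qsub failure above
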
Here is one of the PBS script files (tmp1a30buic.sh) created by PBSCluster().scale():

#!/usr/bin/env bash

#PBS -N dask-worker
#PBS -q HIE
#PBS -A xxxxxxxxx
#PBS -l select=1:ncpus=44:mem=100GB
#PBS -l walltime=23:59:59

/user_path/.conda/envs/wwsoil/bin/python3.8 -m distributed.cli.dask_worker tcp://my_ip:43421 --nthreads 4 --nprocs 11 --memory-limit 10GiB --name dummy-name --nanny --death-timeout 60 

When I run qsub /dask_temp/tmp1a30buic.sh I get the error above.

Any idea what is causing the error? There are no helpful messages to debug it; any help is appreciated!

Well, I found out that the HPC system I am using needs mpiprocs=44 in the select statement:

from dask_jobqueue import PBSCluster

cluster = PBSCluster(queue='standard', cores=44, processes=10, memory='100GB',
                     project='xxxxxxxxx', walltime='1:00:00', nanny=True,
                     resource_spec='select=1:ncpus=44:mpiprocs=44:mem=100gb')

which generates the correct PBS directives:

#PBS -q standard
#PBS -l select=1:ncpus=44:mpiprocs=44:mem=100gb
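With that, cluster.scale() submits cleanly. A minimal sketch of what I run afterwards (Client comes from distributed, which is installed alongside dask_jobqueue):

from dask.distributed import Client

cluster.scale(10)            # 10 workers; dask_jobqueue submits ceil(10 / processes) PBS jobs
client = Client(cluster)     # connect to the scheduler that PBSCluster started
client.wait_for_workers(10)  # optional: block until the PBS jobs actually come up
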

But that error message wasn't helpful for debugging at all; there was no obvious lower-case problem in the select statement.
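For anyone else debugging this kind of failure: you can print the script dask_jobqueue is about to submit without digging through temp files, since job_script() is available on the cluster object, and then try submitting it by hand with qsub.

# Print the job script PBSCluster will hand to qsub
print(cluster.job_script())
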


Interesting, and thanks for the solution.

Unfortunately, HPC systems often have small site-specific requirements like this. I will say that requiring mpiprocs in the select statement for a non-MPI job is unusual!
