Dask cluster stuck in pending status and shutting itself down with Dask Gateway on a Slurm HPC cluster

Hi everyone!

When I try to create a cluster via Dask Gateway, I get an error like the one below. Even when the cluster is created successfully, I think it gets stuck in the pending status and shuts itself down automatically.

Logs

My Code

from dask_gateway import Gateway
from dask_gateway import BasicAuth

auth = BasicAuth(username="dask", password="password")

gateway = Gateway("http://10.100.3.99:8000", auth=auth)

print(gateway.list_clusters())

cluster = gateway.new_cluster()
print(gateway.list_clusters())
gateway.close()

dask_gateway_config.py


c.DaskGateway.backend_class = (
    "dask_gateway_server.backends.jobqueue.slurm.SlurmBackend"
)

c.DaskGateway.authenticator_class = "dask_gateway_server.auth.SimpleAuthenticator"
c.SimpleAuthenticator.password = "password"
#c.SimpleAuthenticator.username = "dask"
c.DaskGateway.log_level = 'DEBUG'
#c.DaskGateway.show_config = True
c.SlurmClusterConfig.scheduler_cores = 1
c.SlurmClusterConfig.scheduler_memory = '500 M'
c.SlurmClusterConfig.staging_directory = '{home}/.dask-gateway/'
c.SlurmClusterConfig.worker_cores = 1
c.SlurmClusterConfig.worker_memory = '500 M'
c.SlurmBackend.backoff_base_delay = 0.1
c.SlurmBackend.backoff_max_delay = 300
#c.SlurmBackend.check_timeouts_period = 0.0
c.SlurmBackend.cluster_config_class = 'dask_gateway_server.backends.jobqueue.slurm.SlurmClusterConfig'
c.SlurmBackend.cluster_heartbeat_period = 15
c.SlurmBackend.cluster_start_timeout = 60
c.SlurmBackend.cluster_status_period = 30
c.SlurmBackend.dask_gateway_jobqueue_launcher = '/opt/dask-gateway/miniconda/bin/dask-gateway-jobqueue-launcher'

c.SlurmClusterConfig.adaptive_period = 3
c.SlurmClusterConfig.partition = 'computenodes'

scontrol show job output

Has anyone here tried Dask Gateway with a Slurm cluster?

Hi @menendes,

Sorry, never tried dask-gateway on HPC systems.

From the look of your logs, it seems the job starting the Scheduler did not succeed. Either it took too long to start, or it failed, I'm not sure. You should look at the standard output and error of the Scheduler job.

Maybe you should try the dask-gateway issue tracker directly?

1 Like

Hi @guillaumeeb,

Actually, when I use the Slurm cluster on its own I can submit a job via the sbatch command and run it on the cluster. But when I try to create a cluster via dask-gateway, it creates a Slurm job in the background; I can see the job, but its status is failed, as in the screenshot above. I opened an issue at this link. Thanks for the suggestion. If I resolve it, I will also write here how to use dask-gateway with a Slurm HPC cluster.

2 Likes

Just note that dask-gateway uses its own mechanism for submitting jobs to Slurm; it does not rely on dask-jobqueue. Anyway, to debug the problem, see the tips listed here: How to debug — Dask-jobqueue documentation.

1 Like

I thought that dask-gateway relied on dask-jobqueue, but after examining the dask-gateway source code, you may be right: it runs Slurm commands under the hood. When I use dask-jobqueue alone I can run computations on the Slurm cluster, but in my case I need to create more than one Dask cluster for different purposes, so I need to keep the schedulers (the dask-gateway component) alive; Dask Gateway lets me connect to a scheduler at any time.
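For reference, this is the workflow Dask Gateway enables: clusters keep running on the gateway side, and a client can reattach to them later by name. A minimal sketch, reusing the gateway address and credentials from the first post (cluster names are whatever list_clusters() reports):

from dask_gateway import Gateway, BasicAuth

# Same gateway address and credentials as in the first post.
auth = BasicAuth(username="dask", password="password")
gateway = Gateway("http://10.100.3.99:8000", auth=auth)

# Clusters started earlier keep running server-side; list them and reattach by name.
reports = gateway.list_clusters()
print(reports)

if reports:
    cluster = gateway.connect(reports[0].name)  # reattach to an existing cluster
    cluster.scale(2)                            # request two workers
    client = cluster.get_client()               # Client bound to that scheduler
    print(client)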

Hi again @guillaumeeb,
When I check the slurmd daemon running on the worker node, I notice some error logs, shown below. Do you have any comments about these error messages?

Şub 17 09:03:40 testslurmworker1 slurmd-testslurmworker1[905]: Launching batch job 67 for UID 1001
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug:  AcctGatherEnergy NONE plugin loaded
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug:  AcctGatherProfile NONE plugin loaded
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug:  AcctGatherInterconnect NONE plugin loaded
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug:  AcctGatherFilesystem NONE plugin loaded
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug:  switch NONE plugin loaded
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug:  Job accounting gather LINUX plugin loaded
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug:  cont_id hasn't been set yet not running poll
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug:  laying out the 1 tasks on 1 hosts testslurmworker1 dist 2
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug:  Message thread started pid = 41666
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: task affinity plugin loaded with CPU mask 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000>
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug:  Checkpoint plugin loaded: checkpoint/none
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: Munge credential signature plugin loaded
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug:  job_container none plugin loaded
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug:  spank: opening plugin stack /etc/slurm-llnl/plugstack.conf
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug:  /etc/slurm-llnl/plugstack.conf: 1: include "/etc/slurm-llnl/plugstack.conf.d/*.conf"
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: error: Could not open stdout file /home/dask/.dask-gateway/2428b456f82a44fdb3c8e57576662e8f/dask-scheduler-2428b456f82a44fdb3c8e57576662e8f.log: >
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: error: IO setup failed: No such file or directory
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug:  step_terminate_monitor_stop signaling condition
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: job 67 completed with slurm_rc = 0, job_rc = 256
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:256
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug:  Message thread exited
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: done with job
Şub 17 09:03:40 testslurmworker1 slurmd-testslurmworker1[905]: debug:  _rpc_terminate_job, uid = 64030
Şub 17 09:03:40 testslurmworker1 slurmd-testslurmworker1[905]: debug:  task_p_slurmd_release_resources: affinity jobid 67
Şub 17 09:03:40 testslurmworker1 slurmd-testslurmworker1[905]: debug:  credential for job 67 revoked
Şub 17 09:03:40 testslurmworker1 slurmd-testslurmworker1[905]: debug:  Waiting for job 67's prolog to complete
Şub 17 09:03:40 testslurmworker1 slurmd-testslurmworker1[905]: debug:  Finished wait for job 67's prolog to complete
Şub 17 09:03:40 testslurmworker1 slurmd-testslurmworker1[905]: debug:  Calling /usr/sbin/slurmstepd spank epilog
Şub 17 09:03:40 testslurmworker1 spank-epilog[41673]: debug:  Reading slurm.conf file: /etc/slurm-llnl/slurm.conf
Şub 17 09:03:40 testslurmworker1 spank-epilog[41673]: debug:  Running spank/epilog for jobid [67] uid [1001]
Şub 17 09:03:40 testslurmworker1 spank-epilog[41673]: debug:  spank: opening plugin stack /etc/slurm-llnl/plugstack.conf
Şub 17 09:03:40 testslurmworker1 spank-epilog[41673]: debug:  /etc/slurm-llnl/plugstack.conf: 1: include "/etc/slurm-llnl/plugstack.conf.d/*.conf"
Şub 17 09:03:40 testslurmworker1 slurmd-testslurmworker1[905]: debug:  completed epilog for jobid 67
Şub 17 09:03:40 testslurmworker1 slurmd-testslurmworker1[905]: debug:  Job 67: sent epilog complete msg: rc = 0

Just be careful here: I'm almost certain that dask-gateway on an HPC environment starts Schedulers as jobs, so they'll have a limited walltime. At some point, your cluster will be destroyed by Slurm. I'm not sure where this walltime is configured, though.

I'm not saying that you should do this, but you could also manage several dask_jobqueue.SLURMCluster objects inside a plain Python script running on a login node, and switch between them with different Clients, as in the sketch below.
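A rough sketch of that alternative, assuming dask-jobqueue is installed on the login node; the partition name is taken from the config above, while the resource sizes and walltime are placeholders:

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Two independent clusters for two different workloads.
# Resource sizes and walltime here are placeholders, not recommendations.
cluster_a = SLURMCluster(queue="computenodes", cores=1, memory="500MB", walltime="01:00:00")
cluster_b = SLURMCluster(queue="computenodes", cores=1, memory="500MB", walltime="01:00:00")

cluster_a.scale(2)  # submits 2 worker jobs via sbatch
cluster_b.scale(2)

# Switch between clusters by pointing a Client at each scheduler in turn.
with Client(cluster_a) as client:
    print(client.submit(sum, [1, 2, 3]).result())

with Client(cluster_b) as client:
    print(client.submit(max, [1, 2, 3]).result())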

As per the error:

error: Could not open stdout file /home/dask/.dask-gateway/2428b456f82a44fdb3c8e57576662e8f/dask-scheduler-2428b456f82a44fdb3c8e57576662e8f.log

This is saying that the job running the scheduler wants to write its output to this path, but the path is not visible from the node where the job is running. Your home directory is probably not mounted (or does not exist?) on the node running the jobs. You should be able to configure this path, for instance via the staging_directory setting that already appears in your dask_gateway_config.py.
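For illustration, pointing the staging directory at a filesystem that is mounted on both the gateway host and the compute nodes could look like this in dask_gateway_config.py; /shared/dask-gateway is a hypothetical shared path:

# Hypothetical example: use a directory that exists and is writable on every
# compute node (e.g. a shared filesystem) instead of the user's home.
c.SlurmClusterConfig.staging_directory = '/shared/dask-gateway/'

Alternatively, keeping the default '{home}/.dask-gateway/' but making sure /home/dask is actually mounted on the compute nodes would achieve the same thing.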

1 Like

@guillaumeeb Thank you very much for your responses. I will consider your suggestion. Regarding the error: when I check the job, it is started by the dask user. The related directory, /home/dask/.dask-gateway, also exists, but I cannot understand why the job could not create the related directory, since it belongs to the dask user. I will investigate it more deeply.

Does the directory exist both on the cluster login node and on the node where the job is executed? Could you manually submit a job as the dask user and see if you find the directory?
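A minimal sketch of such a check, assuming Python is available on the compute nodes: save it as, say, check_path.py and submit it as the dask user with something like sbatch --wrap 'python check_path.py':

import os
import socket

# Report which node the job landed on and whether the staging directory is visible there.
path = "/home/dask/.dask-gateway"
print("host:", socket.gethostname())
print("path exists:", os.path.isdir(path))
print("path writable:", os.access(path, os.W_OK))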

Hi Everyone,

Just to share with the community: the problem was not related to the path or directory. If you use a Slurm cluster, you must also install dask-gateway on each compute node, and the user who runs the job must have access to the related environment (the conda environment).
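To illustrate, a quick sanity check of that environment is to run a tiny script as a Slurm job (submitted with sbatch, using the same interpreter the gateway launcher points at) and confirm it can import the required packages:

import sys

# Confirm the job runs the expected interpreter and can import the Dask Gateway pieces.
print("python:", sys.executable)

import dask_gateway   # client package needed inside the cluster jobs
import distributed    # scheduler/worker runtime
print("dask_gateway:", dask_gateway.__version__)
print("distributed:", distributed.__version__)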

1 Like

Hi @menendes. I am also facing the same issue. Can you please tell me how I can install it on the compute nodes? I am working on CycleCloud and I managed to get the scheduler logs; they say FileNotFoundError in ssl.py. This Stack Overflow page says to paste the cert, but I don't know how to do that in Gateway.

Hi @selvavm, could you open a new topic? Your setup is not the same as @menendes'; using CycleCloud is a different matter.