Failed to create Slurm cluster by dask-gateway

Hi,

When I try to create a cluster with dask-gateway, I get the error "GatewayClusterError: Cluster 'a4c9a885b4e947f399f33b0441851784' failed to start, see logs for more information". However, I cannot find any log files related to this error. Here is my configuration file for dask-gateway-server:

c.DaskGateway.backend_class = (
    "dask_gateway_server.backends.jobqueue.slurm.SlurmBackend"
)

#c.DaskGateway.authenticator_class = "dask_gateway_server.auth.SimpleAuthenticator"
c.SimpleAuthenticator.password = "password"
#c.SimpleAuthenticator.username = "dask"
c.DaskGateway.log_level = 'DEBUG'
#c.DaskGateway.show_config = True
c.SlurmClusterConfig.scheduler_cores = 6
c.SlurmClusterConfig.scheduler_memory = '2048 M'
c.SlurmClusterConfig.staging_directory = '{home}/.dask-gateway/'
c.SlurmClusterConfig.worker_cores = 2
c.SlurmClusterConfig.worker_memory = '4096 G'
c.SlurmBackend.backoff_base_delay = 0.1
c.SlurmBackend.backoff_max_delay = 300
#c.SlurmBackend.check_timeouts_period = 0.0
c.SlurmBackend.cluster_config_class = 'dask_gateway_server.backends.jobqueue.slurm.SlurmClusterConfig'
c.SlurmBackend.cluster_heartbeat_period = 15
c.SlurmBackend.cluster_start_timeout = 600
c.SlurmBackend.worker_start_timeout = 600
c.SlurmBackend.cluster_status_period = 30
c.SlurmBackend.dask_gateway_jobqueue_launcher = '/opt/dask-gateway/anaconda/bin/dask-gateway-jobqueue-launcher'

c.JobQueueClusterConfig.scheduler_setup = 'source /opt/dask-gateway/anaconda/bin/activate /opt/dask'
c.JobQueueClusterConfig.worker_setup = 'source /opt/dask-gateway/anaconda/bin/activate /opt/dask'

c.SlurmClusterConfig.adaptive_period = 3
#c.SlurmClusterConfig.partition = 'compute'

c.ClusterConfig.worker_cores = 2
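
One thing worth checking independently of the gateway is whether the `scheduler_setup`/`worker_setup` activation line actually succeeds on the compute nodes, since a failing `source ... activate` would kill the batch script before the scheduler ever starts. A minimal sketch (the setup string is taken from the config above; run it on a compute node, e.g. via `srun`, not only on the login node):

```python
import subprocess
import sys

def check_setup(setup_cmd, probe):
    """Run a setup command followed by a probe command under bash,
    mimicking what the generated Slurm batch script does.
    Returns the shell's exit code."""
    result = subprocess.run(
        ["bash", "-c", f"{setup_cmd} && {probe}"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        # Surface stderr so a broken activation is visible immediately
        print(result.stderr, file=sys.stderr)
    return result.returncode

# Example: the setup line from the config above, probed with a dask import
rc = check_setup(
    "source /opt/dask-gateway/anaconda/bin/activate /opt/dask",
    "python -c 'import dask.distributed'",
)
print("setup OK" if rc == 0 else f"setup failed with exit code {rc}")
```

If this fails on the node that ran the job (`local2` in the output below), the environment rather than the gateway configuration is the problem.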

When I call gateway.new_cluster(), the console output from dask-gateway-server is as follows:

sudo -iu dask /opt/dask-gateway/start-dask-gateway
[I 2023-05-25 18:30:15.681 DaskGateway] Starting dask-gateway-server - version 2023.1.1
[I 2023-05-25 18:30:15.840 DaskGateway] Authenticator: 'dask_gateway_server.auth.SimpleAuthenticator'
[I 2023-05-25 18:30:15.840 DaskGateway] Backend: 'dask_gateway_server.backends.jobqueue.slurm.SlurmBackend'
[I 2023-05-25 18:30:15.843 DaskGateway] Generating new api token for proxy
[I 2023-05-25 18:30:15.843 DaskGateway] Starting the Dask gateway proxy...
[I 2023-05-25 18:30:15.844 DaskGateway] Dask gateway proxy started
[I 2023-05-25 18:30:15.845 DaskGateway] - HTTP routes listening at http://:8000
[I 2023-05-25 18:30:15.845 DaskGateway] - Scheduler routes listening at gateway://:8000
[W 2023-05-25 18:30:15.850 Proxy] Unexpected failure fetching routing table, retrying in 0.5s: Get "http://127.0.0.1:51509/api/v1/routes": dial tcp 127.0.0.1:51509: connect: connection refused
[I 2023-05-25 18:30:15.860 DaskGateway] Backend started, clusters will contact api server at http://local3:8000/api
[D 2023-05-25 18:30:15.861 DaskGateway] Removed 0 expired clusters from the database
[I 2023-05-25 18:30:15.862 DaskGateway] Dask-Gateway server started
[I 2023-05-25 18:30:15.862 DaskGateway] - Private API server listening at http://127.0.0.1:51509
[W 2023-05-25 18:30:27.163 DaskGateway] 401 POST /api/v1/clusters/ 0.674ms
[I 2023-05-25 18:30:27.329 DaskGateway] Created cluster a4c9a885b4e947f399f33b0441851784 for user dask
[I 2023-05-25 18:30:27.329 DaskGateway] 201 POST /api/v1/clusters/ 163.219ms
[D 2023-05-25 18:30:27.329 DaskGateway] Reconciling cluster a4c9a885b4e947f399f33b0441851784, CREATED -> RUNNING
[I 2023-05-25 18:30:27.329 DaskGateway] Submitting cluster a4c9a885b4e947f399f33b0441851784...
[I 2023-05-25 18:30:27.634 DaskGateway] Job 165 submitted for cluster a4c9a885b4e947f399f33b0441851784
[D 2023-05-25 18:30:27.634 DaskGateway] State update for cluster a4c9a885b4e947f399f33b0441851784
[I 2023-05-25 18:30:27.638 DaskGateway] Cluster a4c9a885b4e947f399f33b0441851784 submitted
[D 2023-05-25 18:30:30.865 DaskGateway] Checking for timed out clusters/workers
[D 2023-05-25 18:30:45.876 DaskGateway] Checking pending cluster statuses
[D 2023-05-25 18:30:45.877 DaskGateway] Checking status of 1 jobs
[D 2023-05-25 18:30:45.878 DaskGateway] Checking pending worker statuses
[D 2023-05-25 18:30:45.879 DaskGateway] Checking for timed out clusters/workers
[I 2023-05-25 18:30:45.889 DaskGateway] Cluster a4c9a885b4e947f399f33b0441851784 failed during startup
[D 2023-05-25 18:30:45.889 DaskGateway] Reconciling cluster a4c9a885b4e947f399f33b0441851784, SUBMITTED -> FAILED
[D 2023-05-25 18:30:45.890 DaskGateway] Preparing to stop cluster a4c9a885b4e947f399f33b0441851784
[D 2023-05-25 18:30:45.891 DaskGateway] Reconciling cluster a4c9a885b4e947f399f33b0441851784, CLOSING -> FAILED
[I 2023-05-25 18:30:45.891 DaskGateway] Stopping cluster a4c9a885b4e947f399f33b0441851784...
[I 2023-05-25 18:30:45.893 DaskGateway] 200 GET /api/v1/clusters/a4c9a885b4e947f399f33b0441851784?wait 18560.218ms
[I 2023-05-25 18:30:46.174 DaskGateway] Cluster a4c9a885b4e947f399f33b0441851784 stopped
[I 2023-05-25 18:30:46.514 DaskGateway] 200 GET /api/v1/clusters/a4c9a885b4e947f399f33b0441851784?wait 0.940ms
[I 2023-05-25 18:30:46.521 DaskGateway] 204 DELETE /api/v1/clusters/a4c9a885b4e947f399f33b0441851784 0.470ms
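
Since the gateway log only shows SUBMITTED -> FAILED with no traceback, the quickest clue is usually Slurm's own accounting for the job ID from the gateway log (165 in the run above), e.g. `sacct -j 165 --format=JobID,State,ExitCode --parsable2`. A small helper to run and parse that output (a sketch; it assumes `sacct` is on PATH and Slurm accounting is enabled):

```python
import subprocess

def parse_sacct(output):
    """Parse `sacct --parsable2` output (pipe-delimited, header line
    first) into a list of dicts, one per job step."""
    lines = [ln for ln in output.strip().splitlines() if ln]
    if not lines:
        return []
    header = lines[0].split("|")
    return [dict(zip(header, ln.split("|"))) for ln in lines[1:]]

def job_status(job_id):
    """Query Slurm accounting for a job's state and exit code."""
    out = subprocess.run(
        ["sacct", "-j", str(job_id),
         "--format=JobID,State,ExitCode", "--parsable2"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_sacct(out)

# Example (requires a Slurm cluster with accounting enabled):
# for row in job_status(165):
#     print(row["JobID"], row["State"], row["ExitCode"])
```

An `ExitCode` of `1:0` on the `.batch` step (matching the scontrol output below) points at the batch script itself failing, rather than Slurm killing the job.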

And the Slurm job information is below:

scontrol show job
JobId=166 JobName=dask-gateway
   UserId=dask(1000) GroupId=dask(1000) MCS_label=N/A
   Priority=4294901750 Nice=0 Account=(null) QOS=(null)
   JobState=FAILED Reason=NonZeroExitCode Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=1:0
   RunTime=00:00:01 TimeLimit=365-00:00:00 TimeMin=N/A
   SubmitTime=2023-05-25T18:43:20 EligibleTime=2023-05-25T18:43:20
   AccrueTime=2023-05-25T18:43:20
   StartTime=2023-05-25T18:43:20 EndTime=2023-05-25T18:43:21 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-05-25T18:43:20
   Partition=compute AllocNode:Sid=local3:4597
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=local2
   BatchHost=local2
   NumNodes=1 NumCPUs=12 NumTasks=1 CPUs/Task=6 ReqB:S:C:T=0:0:*:*
   TRES=cpu=12,node=1,billing=12
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=6 MinMemoryNode=2G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/home/dask/.dask-gateway/c32a20a1176f44fd972465b9228920e0
   StdErr=/home/dask/.dask-gateway/c32a20a1176f44fd972465b9228920e0/dask-scheduler-c32a20a1176f44fd972465b9228920e0.log
   StdIn=/dev/null
   StdOut=/home/dask/.dask-gateway/c32a20a1176f44fd972465b9228920e0/dask-scheduler-c32a20a1176f44fd972465b9228920e0.log
   Power=
   NtasksPerTRES:0
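
The StdOut/StdErr path above is where the scheduler's actual traceback should land. If those files survive the gateway's cleanup, a quick way to surface whatever remains under the staging directory (a sketch; the path matches `c.SlurmClusterConfig.staging_directory` from the config above):

```python
from pathlib import Path

def tail_scheduler_logs(staging_dir, n=20):
    """Return the last `n` lines of every dask-scheduler log found
    under the dask-gateway staging directory, keyed by file path."""
    tails = {}
    for log in Path(staging_dir).glob("*/dask-scheduler-*.log"):
        lines = log.read_text().splitlines()
        tails[str(log)] = lines[-n:]
    return tails

# Example: inspect the home-directory staging area used in the config above
for path, lines in tail_scheduler_logs(Path.home() / ".dask-gateway").items():
    print(f"=== {path} ===")
    print("\n".join(lines))
```

If this finds nothing, the staging directory is being removed when the cluster stops, and the log would have to be grabbed while the job is still pending or running.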

Can anyone offer suggestions? I have spent a lot of time on this issue and I don't want to give up, but without any logs for this error I have no clues for troubleshooting the problem. I would appreciate any advice and help.

Hi @woestler,

Could you also share the stderr and stdout of the Slurm job? There may be a clue there about why the cluster failed to start.