When I try to create a cluster via Dask Gateway, I get an error like the one below. Even when the cluster is created successfully, I think it gets stuck in pending status and then shuts itself down automatically.
From the looks of your logs, the job starting the Scheduler did not succeed. Either it took too long to start or it failed, I'm not sure. You should look at the standard output and error of the Scheduler job.
Maybe you should try the dask-gateway issue tracker directly?
Actually, when I use the Slurm cluster on its own I can submit a job via the sbatch command and run it on the cluster. But when I try to create a cluster via dask-gateway, it creates a Slurm job in the background; I can see that job, but its status is failed, as in the screenshot above. I opened an issue at this link . Thanks for the suggestion. If I resolve it, I will also write here how to use dask-gateway with a Slurm HPC cluster.
I thought that dask-gateway relied on dask-jobqueue, but after examining the dask-gateway source code, you may be right: it runs Slurm commands under the hood. When I use dask-jobqueue alone I can run computations on the Slurm cluster, but in my case I need to create more than one Dask cluster for different purposes, so I need to keep the schedulers (a dask-gateway component) alive; for that purpose, dask-gateway lets me reconnect to a scheduler at any time.
Hi again @guillaumeeb ,
When I checked slurmd running on the worker node, I noticed some error logs there. Do you have any comment about these error messages?
Just be careful here: I'm almost certain that dask-gateway on an HPC environment starts Schedulers as jobs, so they'll have a limited walltime. At some point, your cluster will be destroyed by Slurm. I'm not sure where this walltime is configured, though.
I'm not saying that you should do this, but you could also manage several jobqueue SlurmCluster instances inside a plain Python script running on a login node, and switch between them with different Clients.
As per the error:
error: Could not open stdout file /home/dask/.dask-gateway/2428b456f82a44fdb3c8e57576662e8f/dask-scheduler-2428b456f82a44fdb3c8e57576662e8f.log
This is saying that the job running the scheduler wants to write its output to this path, but the path is not visible from the node where the job is running. Your home directory is probably not mounted (or does not exist?) on the node running the jobs. You should be able to configure this path, see
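If I remember correctly, the jobqueue backend of dask-gateway-server exposes a staging directory option that controls where these files are written. A sketch of the server config, assuming the Slurm backend's `staging_directory` traitlet; verify the exact name against the docs for your installed version:

```python
# dask_gateway_config.py -- configuration sketch, not verified against
# every dask-gateway-server version. Point the staging directory at a
# filesystem that is mounted on both the login and compute nodes.
c.SlurmClusterConfig.staging_directory = "/shared/scratch/dask-gateway-staging/"
```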
@guillaumeeb Thank you very much for your responses; I will consider your suggestion. As for the error: when I check the job, it is started by the dask user. The related directory, /home/dask/.dask-gateway, also exists, but I could not understand why the job could not create the related directory, because it belongs to the dask user. I will investigate it more deeply.
Does the directory exist both on the cluster login node and on the node where the job is executed? Could you manually submit a job as the dask user and see if you find the directory?
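One way to run that manual check is a tiny batch script submitted as the dask user (job name, output path, and commands are illustrative):

```shell
#!/bin/bash
#SBATCH --job-name=staging-check
#SBATCH --output=/home/dask/.dask-gateway/staging-check.log

# Submit with: sbatch staging_check.sh (as the dask user).
# If staging-check.log never appears, the path is not visible or not
# writable from the compute node -- the same failure the scheduler job hits.
ls -ld /home/dask/.dask-gateway
touch /home/dask/.dask-gateway/write-test && echo "write OK"
```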
Just to share with the community: the problem was not related to the path or directory. If you use a Slurm cluster, you must also install dask-gateway on each compute node, and the user who runs the job must have access to the related environment (the conda environment).
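For example, if the worker jobs run in a conda environment, something like the following on each compute node (or once into an environment on a shared filesystem visible to all nodes); the environment name here is hypothetical:

```shell
# On each compute node, or into a shared conda prefix:
conda activate dask-env                 # hypothetical environment name
pip install dask-gateway                # client package the worker jobs import
python -c "import dask_gateway; print(dask_gateway.__version__)"  # sanity check
```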
Hi @menendes. I am also facing the same issue. Can you please tell me how you installed it on the compute nodes? I am working on CycleCloud and I managed to get the scheduler logs; they show a FileNotFoundError in ssl.py. This Stack Overflow page says to paste the cert, but I don't know how to do that in Gateway.