Cannot create a cluster with Dask Gateway on a Slurm HPC system

Hi everyone!

I am trying to create a cluster with Dask Gateway on a Slurm HPC cluster. I followed the dask-gateway docs, but when I try to create a cluster I get the error below. Could you help me figure out what I am missing?

Hi @menendes, thanks for the question! The code snippet you provided is helpful, and making it copy-paste-able instead of a screenshot would make it even easier to help with your issue :). I have some preliminary troubleshooting questions:

  1. It seems there is a permissions error. Have you already confirmed that the “dask-gateway” user has the correct permissions per these guidelines?
  2. It’s also hard to tell from your screenshot how the server address is configured. Does ip:port come from a separate configuration file (per these docs)?

Thanks!
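
For reference, a copy-pasteable cluster request could look roughly like the sketch below. This is not the original poster's code (that was only shared as a screenshot); it assumes the dask-gateway client package is installed, the helper name `start_cluster` is made up for illustration, and `http://<ip>:<port>` is a placeholder for wherever the gateway server is listening.

```python
def start_cluster(address: str, n_workers: int = 2):
    """Connect to a dask-gateway server and request a new cluster.

    `address` is the gateway's address, e.g. "http://<ip>:<port>" (placeholder).
    """
    # Imported lazily so this function can be defined even on a machine
    # where dask-gateway is not installed.
    from dask_gateway import Gateway

    gateway = Gateway(address)       # client object pointing at the gateway server
    cluster = gateway.new_cluster()  # waits for the cluster to start, raises if startup fails
    cluster.scale(n_workers)         # request a fixed number of workers
    return cluster


# Example call; replace the placeholder with the real gateway address:
# cluster = start_cluster("http://<ip>:<port>")
# client = cluster.get_client()
```

Posting the request as text like this, together with the full error message, makes it much easier for others to reproduce the problem.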

Dear @scharlottej13, thanks for your response! Sorry about the screenshot :smile: Yes, you are right, I think it was a permissions error. I had created the dask user while following the instructions in the docs, so when I use the dask user that error is resolved. Now the problem has changed: I get the error below when I request a new cluster. I think dask-gateway-jobqueue-launcher cannot find the dask_gateway_server.managers package, even though all the related packages are already installed. Do you have any suggestions about this issue? Thanks in advance :slight_smile:

[W 2022-02-01 09:06:30.693 DaskGateway] Failed to submit cluster 144673f93ddb47a2aeddca17065bb111
Traceback (most recent call last):
  File "/opt/dask-gateway/miniconda/lib/python3.9/site-packages/dask_gateway_server/backends/db_base.py", line 1248, in _cluster_to_submitted
    async for state in self.do_start_cluster(cluster):
  File "/opt/dask-gateway/miniconda/lib/python3.9/site-packages/dask_gateway_server/backends/jobqueue/base.py", line 178, in do_start_cluster
    job_id = await self.start_job(
  File "/opt/dask-gateway/miniconda/lib/python3.9/site-packages/dask_gateway_server/backends/jobqueue/base.py", line 118, in start_job
    code, stdout, stderr = await self.do_as_user(
  File "/opt/dask-gateway/miniconda/lib/python3.9/site-packages/dask_gateway_server/backends/jobqueue/base.py", line 106, in do_as_user
    raise Exception(
Exception: Error running dask-gateway-jobqueue-launcher
returncode: 1
stdout:
stderr: Traceback (most recent call last):
  File "/opt/dask-gateway/miniconda/bin/dask-gateway-jobqueue-launcher", line 7, in <module>
    from dask_gateway_server.managers.jobqueue.launcher import main
ModuleNotFoundError: No module named 'dask_gateway_server.managers'
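
One way to narrow this down is to check, from the same Python environment that runs the gateway (here apparently `/opt/dask-gateway/miniconda`), which of the module paths in the traceback can actually be located. The following is only a rough sketch using the standard library; the helper name `module_available` is made up for illustration, and the module paths are copied from the traceback above.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if the module `name` can be located in this environment."""
    try:
        return importlib.util.find_spec(name) is not None
    except ImportError:
        # find_spec imports parent packages first; a missing parent
        # raises ModuleNotFoundError instead of returning None.
        return False


for candidate in (
    "dask_gateway_server",
    "dask_gateway_server.backends.jobqueue.base",      # module that appears in the traceback
    "dask_gateway_server.managers.jobqueue.launcher",  # module the launcher script tries to import
):
    print(candidate, "->", module_available(candidate))
```

If the first two are found but the last one is not, the installed package and the launcher script disagree about where the launcher code lives, which would point to a packaging problem rather than a missing dependency.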

@menendes As per this GitHub issue, the conda install seems to be having problems. Maybe you can try installing via pip and let us know if it works?

@pavithraes Thank you for your response. I had also found this GitHub issue, but when I try to install the dask-gateway-server-jobqueue package, it is unfortunately not found on PyPI. I am not sure which packages should be installed via pip.

Hi everyone again!
I got one step further. I added the solution to the related GitHub issue. Right now, when I try to create a cluster, everything seems okay at first, but the dask-gateway logs show that the cluster is created successfully, a job is submitted, and then the cluster is deleted. The related logs are below.

[I 2022-02-01 16:15:23.636 DaskGateway] Created cluster daf2f09a82a545f3a7f2d72b4f8baecc for user dask
[I 2022-02-01 16:15:23.637 DaskGateway] 201 POST /api/v1/clusters/ 75.438ms
[I 2022-02-01 16:15:23.637 DaskGateway] Submitting cluster daf2f09a82a545f3a7f2d72b4f8baecc...
[I 2022-02-01 16:15:23.846 DaskGateway] Job 7 submitted for cluster daf2f09a82a545f3a7f2d72b4f8baecc
[I 2022-02-01 16:15:23.851 DaskGateway] Cluster daf2f09a82a545f3a7f2d72b4f8baecc submitted
[I 2022-02-01 16:15:40.213 DaskGateway] Cluster daf2f09a82a545f3a7f2d72b4f8baecc failed during startup
[I 2022-02-01 16:15:40.215 DaskGateway] Stopping cluster daf2f09a82a545f3a7f2d72b4f8baecc...
[I 2022-02-01 16:15:40.221 DaskGateway] 200 GET /api/v1/clusters/daf2f09a82a545f3a7f2d72b4f8baecc?wait 16578.348ms
[I 2022-02-01 16:15:40.420 DaskGateway] Cluster daf2f09a82a545f3a7f2d72b4f8baecc stopped
[I 2022-02-01 16:15:40.735 DaskGateway] 200 GET /api/v1/clusters/daf2f09a82a545f3a7f2d72b4f8baecc?wait 1.577ms
[I 2022-02-01 16:15:40.742 DaskGateway] 204 DELETE /api/v1/clusters/daf2f09a82a545f3a7f2d72b4f8baecc 0.674ms

@menendes

I added the solution to the related GitHub issue.

Thank you, that’s really helpful!

Right now, when I try to create a cluster, everything seems okay at first, but the dask-gateway logs show that the cluster is created successfully, a job is submitted, and then the cluster is deleted.

Interesting, I’m not able to reproduce this error, but I’ll keep looking into it. I’m wondering if this might be related to the “cluster scale process” detailed in these slides that Jim Crist (creator of dask-gateway) presented recently.

@pavithraes
I opened another post to track this issue. You can find the details of the latest error at this link. Thank you also for the presentation, I will take a look at it :slight_smile:
