I’ve got a web app that allows users to run containers, and I use Dask to manage execution. I’d like to be able to leverage some HPC resources, but I’m not sure how to keep the scheduler on the web app while having workers run on the HPC infrastructure. Because of the HPC firewall, it isn’t really feasible for me to just run dask_jobqueue.SLURMCluster on the cluster and have clients on the web app connect to it. I can set up an SSH tunnel on the cluster so that workers can connect to the scheduler on the web app, but that means manually starting jobs that just sit idle until they are needed.
So right now my workaround is to manually start jobs which create an SSH tunnel and then spawn a worker that connects to the scheduler running on the web app. What I would like to do is create an instance of dask_jobqueue.SLURMCluster on the HPC cluster that can then pick up jobs from the scheduler running on the web app.
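For context, each manually submitted job runs something along these lines (a rough sketch of my workaround; the hostname, ports, and thread count are placeholders for my actual setup):

```python
# worker_job.py -- rough sketch of what each manually submitted SLURM job runs.
# "user@webapp.example.com" and the ports are placeholders.
import asyncio
import subprocess
import time

from dask.distributed import Worker

# Forward a local port on the compute node to the scheduler on the web app host.
tunnel = subprocess.Popen(
    ["ssh", "-N", "-L", "8786:localhost:8786", "user@webapp.example.com"]
)
time.sleep(5)  # give the tunnel a moment to come up


async def main():
    # The worker connects to the web app's scheduler through the tunnel,
    # then sits idle until tasks are submitted.
    async with Worker("tcp://localhost:8786", nthreads=4) as worker:
        await worker.finished()


try:
    asyncio.run(main())
finally:
    tunnel.terminate()
```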
Hi @mike, thanks for the question! I can certainly empathize with coming up with creative ways to navigate HPC firewalls. I’m not quite sure about the best solution for your use case, but we’ll dig into this a bit and get back to you!
Hello again! It sounds like using Dask Gateway might be one option for you, but this would require some admin-level access for launching SLURM jobs and you’d still need to have network-level access for communicating between the web app and the gateway server. Let us know if you have more questions!
This sounds like a really complex setup to put in place. I’m afraid there won’t be any simple solution here. As mentioned by @scharlottej13, Dask Gateway might help, but I’m not sure it will cover all your needs, as it sounds like you want to be able to submit jobs both from the web server and on the HPC cluster, is that the case?
If not, then I see several other options (but Dask Gateway might still be the better choice):
- Run SLURMCluster inside your web app; this needs either:
  - an authorized Slurm client on your server and a suitable network configuration, or
  - tweaking dask-jobqueue to do SSH tunneling for each submission: I’m not sure this is feasible!
- Run SLURMCluster in another location that already has access to the Slurm cluster, and connect the client from the web app to it. This is a bit like Dask Gateway actually; you’ll need some dedicated node or VM to run the service (see the sketch just below).
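Something like this is what I’m picturing for that second option. Completely untested, and the hostname, ports, and resource numbers are just placeholders:

```python
# On the dedicated node/VM that can already reach Slurm (untested sketch).
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    cores=8,
    memory="16GB",
    # Make the scheduler reachable from the web app (firewall permitting).
    scheduler_options={"port": 8786, "dashboard_address": ":8787"},
)
cluster.adapt(minimum=0, maximum=10)  # grow/shrink Slurm jobs with demand

# Keep this process alive as a long-running service. On the web app side,
# the client would then just point at that scheduler:
#   from dask.distributed import Client
#   client = Client("tcp://dedicated-node.example.com:8786")
```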
Thank you both for the ideas!
I looked at Dask Gateway, but since I don’t have root on the HPC cluster, it isn’t something I can stand up. I thought about running the gateway on the web server and the proxy on the HPC cluster, but after playing around with the config I realized it is meant more for HPC admins to set up so that end users can run Dask on the cluster.
Doing SSH tunneling on submission should work, but the issue is I would also need to set up a tunnel so that the SLURMCluster is something the web app can talk to in order to submit jobs.
I was looking at the source code for SLURMCluster, and it looks like it creates its own scheduler to use. Would it be feasible for it to instead take the address of an already running scheduler?
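Something like this is what I have in mind (purely illustrative; the `external_scheduler_address` keyword does not exist in dask-jobqueue today):

```python
from dask_jobqueue import SLURMCluster

# Hypothetical: a SLURMCluster that only submits worker jobs and points
# them at the scheduler already running on the web app.
cluster = SLURMCluster(
    cores=4,
    memory="8GB",
    external_scheduler_address="tcp://webapp.example.com:8786",  # does not exist today
)
cluster.scale(jobs=2)
```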
Another thought I had is to create a script that checks the web server for any queued jobs, serializes the jobs somehow, submits them to a SLURMCluster on the HPC cluster, and then marks the jobs as done on the scheduler. Nothing needs to be sent back to the web server, since the Dask tasks have DB access and can record their own results; they just need to know to fire off and start.
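Roughly, I’m picturing a polling loop like this (just a sketch; `fetch_queued_jobs`, `mark_submitted`, and `run_container` are hypothetical helpers that would talk to the web app’s database/API, and the resources and interval are placeholders):

```python
# Polling sketch: run on the HPC login node, hypothetical helpers below.
import time

from dask.distributed import Client
from dask_jobqueue import SLURMCluster


def fetch_queued_jobs():
    """Ask the web app which jobs are waiting (hypothetical helper)."""
    return []  # e.g. query the web app's DB or REST API


def mark_submitted(job_id):
    """Tell the web app the job has been handed off to Slurm (hypothetical helper)."""
    pass


def run_container(job_spec):
    """The actual task; it records its own results in the DB (hypothetical)."""
    pass


if __name__ == "__main__":
    cluster = SLURMCluster(cores=4, memory="8GB")  # placeholder resources
    cluster.adapt(minimum=0, maximum=20)           # scale Slurm jobs with demand
    client = Client(cluster)

    while True:
        for job in fetch_queued_jobs():
            client.submit(run_container, job, pure=False)
            mark_submitted(job["id"])
        time.sleep(30)  # placeholder polling interval
```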
But no real solution to your problem right now, I’m afraid… you might well have to build your own. If you can contribute anything back to dask-jobqueue along the way, please do!