Heterogeneous clusters (Kubernetes + HPC workers) with Dask Gateway?

I have implemented a heterogeneous cluster mode for Gateway that combines the Kubernetes backend with HTCondor (a batch system for High Throughput Distributed Computing). When our users create a cluster, the server launches a Dask scheduler pod on Kubernetes. When scaling or adapting, the user can choose to request Kubernetes-based workers (pods) or HTCondor-based workers (jobs).

I wrote a library that provides a ‘tweaked’ client capable of doing batch submissions. Unlike the current job queue backends, this mode of operation delegates the responsibility for submitting jobs and authenticating the user with HTCondor to the client (not the server).

Once our HTCondor job reaches a slot, it launches a pre-configured dask-worker container (Singularity) to join the scheduler via the tls:// endpoint exposed by Traefik. This effectively allows us to expand Gateway analyses into our batch farms, which have considerably more capacity than our Kubernetes cluster.
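To make the flow above concrete, here is a minimal sketch of how a client could build the HTCondor submit description for such worker jobs. It is an illustration only, not the code from our library: the function name `build_worker_submit`, the image name, the certificate file names, the Singularity path and the scheduler address are all hypothetical, and it assumes none of the arguments contain spaces.

```python
def build_worker_submit(scheduler_address, image, n_workers=1, cores=1,
                        memory_mb=2048,
                        tls_files=("ca.pem", "cert.pem", "key.pem")):
    """Return an HTCondor submit description (as text) for dask-worker jobs.

    Hypothetical helper: each job runs `singularity exec <image> dask-worker ...`
    and connects to the scheduler over TLS.
    """
    ca, cert, key = tls_files
    # Command line executed inside the slot; the dask-worker flags are the
    # standard TLS / resource options of the dask-worker CLI.
    arguments = " ".join([
        "exec", image,
        "dask-worker", scheduler_address,
        "--tls-ca-file", ca,
        "--tls-cert", cert,
        "--tls-key", key,
        "--nthreads", str(cores),
        "--memory-limit", f"{memory_mb}MB",
    ])
    lines = [
        "universe = vanilla",
        # Assumed location of Singularity on the execute nodes.
        "executable = /usr/bin/singularity",
        "transfer_executable = False",
        f"arguments = {arguments}",
        # Ship the TLS credentials along with the job.
        "should_transfer_files = YES",
        f"transfer_input_files = {', '.join(tls_files)}",
        f"request_cpus = {cores}",
        f"request_memory = {memory_mb}",  # HTCondor interprets this as MB
        f"queue {n_workers}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder address standing in for the tls:// endpoint behind Traefik.
    print(build_worker_submit("tls://gateway.example.org:8786",
                              "dask-worker.sif", n_workers=3, cores=2,
                              memory_mb=4096))
```

The resulting text could be written to a file and handed to `condor_submit`, which is where the client-side authentication with HTCondor would happen.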

Is this something of interest to Dask/Gateway developers? We’d be happy to collaborate but so far have not found a clear path towards that. Hoping this post reaches the right people!

Maria P. Acosta
(Fermilab)

@mapsacosta Welcome to Discourse, and thank you for sharing and for your interest in collaborating!

Is this something of interest to Dask/Gateway developers?

I think the dask-gateway maintainers may not be interested at the moment, because:

  • the developers are focusing on maintenance and adding new features is a low priority right now, and
  • this feature might be too specific to your use case to be upstreamed

However, I’ll keep this in mind in case there is an opportunity in the future! And, again, thank you for reaching out, we appreciate it! :smile: