I have implemented a heterogeneous cluster mode for Dask Gateway that combines the Kubernetes backend with HTCondor (a batch system for high-throughput computing). When our users create a cluster, the server launches a scheduler pod on Kubernetes. When scaling or adapting, the user can choose to request Kubernetes-based workers (pods) or HTCondor-based workers (jobs).
I wrote a library that provides a ‘tweaked’ client capable of doing batch submissions. Unlike the current job queue backends, this mode of operation delegates the responsibility for submitting jobs and authenticating the user with HTCondor to the client (not the user).
Once our HTCondor job reaches a slot, it launches a pre-configured dask-worker container (Singularity) to join the scheduler via the tls:// endpoint exposed by Traefik. This effectively allows us to expand Gateway analyses into our batch farms, which have considerably more capacity than our Kubernetes cluster.
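To make the worker side a bit more concrete, here is a minimal sketch of how a client could render an HTCondor submit description for such a job. It is not the actual code from our library: the function name `render_worker_submit`, the executable path, the image path, and the scheduler address are all hypothetical, and the `+SingularityImage` attribute is a site-specific convention that may be named differently (or not exist) on other pools. Delivery of the TLS credentials to the worker is left out entirely.

```python
from textwrap import dedent


def render_worker_submit(scheduler_address, image, n_workers=1, cpus=1,
                         memory_mb=2048,
                         executable="/usr/local/bin/dask-worker"):
    """Render an HTCondor submit description for Dask workers.

    Illustrative only: `executable` is an assumed path inside the
    container image, and `+SingularityImage` is a site-specific
    attribute, not something every HTCondor pool understands.
    """
    if not scheduler_address.startswith("tls://"):
        raise ValueError("expected a tls:// scheduler address")

    # Arguments passed to dask-worker: the scheduler address first,
    # then resource limits matching what the job requests from HTCondor.
    worker_args = (
        f"{scheduler_address} --nthreads {cpus} --memory-limit {memory_mb}MB"
    )

    return dedent(f"""\
        universe       = vanilla
        executable     = {executable}
        transfer_executable = false
        arguments      = {worker_args}
        request_cpus   = {cpus}
        request_memory = {memory_mb}
        +SingularityImage = "{image}"
        queue {n_workers}
        """)


if __name__ == "__main__":
    # Hypothetical endpoint and image; substitute your own values.
    print(render_worker_submit(
        "tls://gateway.example.org:8786",
        "/images/dask-worker.sif",
        n_workers=3,
        cpus=2,
        memory_mb=4096,
    ))
```

The resulting text could then be handed to `condor_submit` by the client, which is where the user's own HTCondor credentials come into play.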
Is this something of interest to Dask/Gateway developers? We’d be happy to collaborate but so far have not found a clear path towards that. Hoping this post reaches the right people!
Maria P. Acosta
(Fermilab)