Our use case:
https://docs.mlerp.cloud.edu.au/
We have been intending to use Dask as the interface between a notebook environment and a SLURM cluster, allowing for interactive job submission. The goal is to create a middle ground that has the interactivity of a notebook and the power of an HPC environment, while sharing valuable resources with other users whenever code isn’t being executed - leading to faster turnarounds and more efficient use of hardware.
Users would be provided with a CPU-only notebook environment exposed to a SLURM cluster with access to A100 GPUs and other specialised hardware. Rather than having to write small test scripts for submission, they would be able to spin up a Dask cluster with whatever compute requirements they need and release those resources when they are done.
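To make this concrete, here is a minimal sketch of the workflow we have in mind, using dask_jobqueue. The partition name and resource figures are illustrative, and the keyword spellings depend on the dask_jobqueue version (job_extra_directives was previously called job_extra):

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Request GPU workers from the SLURM cluster directly from the notebook.
cluster = SLURMCluster(
    queue="gpu",                             # illustrative partition name
    cores=8,
    memory="64GB",
    job_extra_directives=["--gres=gpu:1"],   # e.g. one A100 (or MIG slice) per worker
)
cluster.scale(jobs=4)        # spin workers up only when they are needed
client = Client(cluster)

# ... run interactive work against the cluster from the notebook ...

cluster.scale(jobs=0)        # release the hardware when done
```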
Our interpretation:
In your documentation you say that “Dask doesn’t need to know” what the code is doing since it “just runs Python functions”. We hoped that this library would allow our researchers to start their work in a small notebook environment, then scale out to an HPC cluster that provides whatever acceleration is needed.
The blog post:
Your latest blog post, however, signals that you intend to move away from this model, as you are now requiring that, if you use value-add hardware such as GPUs on the client and workers, you’ll need to ensure your scheduler has one as well.
Doesn’t this break your aim to allow users to scale their code “no matter what infrastructure [they] use”?
New limitations:
The new restrictions on similarity between the scheduler and the client/workers make what we are trying to accomplish a lot more difficult for scientific workloads. For example, some of our users may want many small CPU-only workers for highly parallelisable workloads such as preprocessing, while others will want a small number of GPU workers for workloads such as training.
Requiring similarity in the hardware of the scheduler and the client/workers would mean that a GPU would have to be allocated to each notebook instance in order to work with GPU acceleration. Since each GPU can only be broken into 7 MIG slices, this significantly limits the number of users the platform can support.
It would also make CPU-only work much more difficult: if the scheduler has a GPU, each CPU node would need GPU compute as well.
While we could provide multiple flavours of notebook, one per use case, this would require our users to switch environments whenever they want to run a different kind of test, which is a poor user experience. Previously, users would have been able to simply change the requested resources of their Dask cluster within a notebook cell, as sketched below.
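For example, switching between the two shapes described above would previously have looked something like this within a single notebook (values are illustrative):

```python
from dask_jobqueue import SLURMCluster

# One cell: many small CPU-only workers for a parallel preprocessing pass.
cpu_cluster = SLURMCluster(queue="cpu", cores=2, memory="8GB")
cpu_cluster.scale(jobs=32)

# A later cell in the same notebook: a handful of GPU workers for training.
gpu_cluster = SLURMCluster(
    queue="gpu",
    cores=8,
    memory="64GB",
    job_extra_directives=["--gres=gpu:1"],
)
gpu_cluster.scale(jobs=2)
```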
This would also limit our ability to offer whatever xPU support the field needs in the future.
The question:
Is what we are trying to accomplish still going to be supported by this library, or is our use case being dropped? Is there a workaround that you would recommend?
Alternatively, is it possible for a workaround to be implemented in the library? For example, a flag that turns off this sanity check, similar to the ‘--no-nanny’ option that allows for daemonic processes, could go a long way.
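As a point of reference for what we mean, this is roughly how we already pass the existing ‘--no-nanny’ flag through dask_jobqueue today (the keyword spelling varies by version; worker_extra_args was previously called extra). An analogous opt-out for the scheduler hardware check is what we are hoping for:

```python
from dask_jobqueue import SLURMCluster

# '--no-nanny' is an existing flag on the Dask worker CLI; running workers
# without a nanny is what allows daemonic processes in user code.
cluster = SLURMCluster(
    cores=2,
    memory="8GB",
    worker_extra_args=["--no-nanny"],
)
```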