Tight coupling between workers and schedulers

Hi,

Our team is evaluating dask-distributed on kubernetes.

Our ML jobs are essentially containerized Python code, with each container implementing various models and external libraries (and various Python versions).

We have about 60 images, each bundling specific libraries:

  • registry/compute-times-series-prophet:0.1.1
  • registry/compute-times-series-gluonts:0.1.1
    …more

From reading the docs, we understand that we can spin up jobs with KubeCluster() using make_pod_spec to target a specific image, e.g.:

pod_spec = make_pod_spec(image='registry/compute-times-series-gluonts:0.1.1',
                         memory_limit='4G', memory_request='4G',
                         cpu_limit=1, cpu_request=1)

cluster = KubeCluster(pod_spec)
...

After installing Dask on Minikube, we run into serialization errors with the Client().

Questions:

  1. How are pods created with KubeCluster(pod_spec) coupled to the scheduler in terms of Python environments?
  2. Must the scheduler share the whole/exact/complete worker environment, including third-party libraries such as tensorflow, pytorch or gluonts, or just a subset (e.g. Python version, pickle, msgpack, …)?
  3. Given this coupling, should we build custom scheduler images mirroring the workers => one specific scheduler per worker image?
  4. Is our approach compatible with dask-kubernetes (images → pod → client.submit())?
  5. Is there something we don’t understand here?

Thanks !

Hi @qant.io,

Here is the beginning of an answer, but @jacobtomlinson might want to chime in to give more details.

In terms of mechanics, you can perfectly well decouple the Scheduler and Worker environments if you want, but as you can guess, this is not advisable. In fact, as mentioned here, you should have a similar environment across Client, Scheduler and Workers, even though each might run on a different machine or pod.
With KubeCluster, you either launch the Scheduler as a pod like the workers (now the default), or you launch the Scheduler as part of the local (Client) environment. In the first case, if you specify just one pod_spec, you are fine: the Scheduler and Workers will share the same environment.
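To make that concrete, here is a sketch of what I mean, reusing the image name from your question and assuming the classic KubeCluster's `deploy_mode` keyword (check your installed version's docs for the exact signature):

```python
from dask_kubernetes import KubeCluster, make_pod_spec

# One pod spec reused for both the scheduler and the workers,
# so their environments match by construction.
pod_spec = make_pod_spec(
    image='registry/compute-times-series-gluonts:0.1.1',
    memory_limit='4G', memory_request='4G',
    cpu_limit=1, cpu_request=1,
)

# deploy_mode="remote" runs the scheduler in its own pod from the same
# pod_spec; deploy_mode="local" runs it inside the Client process, in
# which case the Client environment must also match the workers.
cluster = KubeCluster(pod_spec, deploy_mode="remote")
```

This is a deployment sketch; it needs a reachable Kubernetes cluster to actually run.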

It’s not mandatory, but it’s strongly recommended. That said, I’d guess the most important match is between the Client and the Workers.
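The reason the Client/Worker match matters most: client.submit() serializes your function and arguments on the Client and deserializes them on the Worker, while the Scheduler mostly shuttles the bytes around. Dask actually uses cloudpickle and msgpack for this, but a plain stdlib pickle sketch (with a hypothetical `forecast` task) illustrates the idea:

```python
import pickle

def forecast(x):
    # A real task would import a heavy library (e.g. gluonts) in here;
    # that import is resolved on the worker when the task runs,
    # not on the scheduler.
    return x * 2

# Roughly what the Client does for client.submit(forecast, 21):
# serialize the function and its arguments into bytes.
payload = pickle.dumps((forecast, (21,)))

# Roughly what a Worker does on receipt: deserialize and execute.
# This only succeeds if the worker can import the same modules,
# at compatible versions, as the Client.
func, args = pickle.loads(payload)
result = func(*args)
```

If the Worker image lacks a library the task imports, or pins an incompatible version, the failure shows up at deserialization or execution time on the Worker, which is consistent with the serialization errors you saw.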

I’d recommend using the same image for the Scheduler and the Workers (does something prevent you from doing this?), and being careful with your Client…

I think so, though I have trouble seeing where your client.submit() fits in.

I agree with everything @guillaumeeb says. I just want to point you to the new operator, which will be replacing the current implementation of KubeCluster. If you are evaluating things, I recommend you evaluate this solution.

https://kubernetes.dask.org/en/latest/operator.html

Everything Guillaume said about environments holds true there too.
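With the operator, the entry point takes an image name directly instead of a pod spec, and the scheduler and workers use that image by default. A sketch, assuming the operator's `name`/`image`/`n_workers` keywords (see the linked docs for the current API):

```python
from dask_kubernetes.operator import KubeCluster

# One cluster per image: scheduler and workers share this image,
# so their environments match automatically.
cluster = KubeCluster(
    name="gluonts",
    image="registry/compute-times-series-gluonts:0.1.1",
    n_workers=2,
)
```

Like the classic API, this needs a Kubernetes cluster (with the operator installed) to actually run.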
