Dask-operator lack of resiliency

Hello, after deploying the dask-operator and creating a sample dask cluster I realized that it creates simple pods, which makes it quite vulnerable to real life day-2 operations like k8s upgrades, node failures, etc…, for example:

  • If for any event a dask worker pod is killed the operator doesn’t recreate it
  • If for any event a dask scheduler pod is killed the operator doesn’t recreate it leaving orphaned dask worker pods wasting resources.

As I am familiar with kopf (framework used to develop the operator) I was wondering why you don’t use deployment specs instead of pod ones, with this simple change k8s control-plane will be able to recover the dask cluster from any eventual pod failure with little impact in current code.
Am I missing something?

Hi @davidp1404, welcome to this forum.

This discussion goes beyond my knowledge, but hopefully @jacobtomlinson will be able to give you some answer.

The DaskCluster resource is intended to be the equivalent to a Deployment but with more communication with the Dask scheduler.

If we used a Deployment and the Pod was lost the deployment would create an identical Pod which the scheduler would see as identical to the old one. However Dask workers are stateful while work is executing and so would not have the state the scheduler expects.

It would be better for the Dask Kubernetes Controller to notice that the expected scale size was not what was expected and to scale up to account for this. That way he scheduler would see the replacement Pod as a new worker with blank state.

We have an open issue tracking this work Worker group doesn't recover after pod deletion · Issue #603 · dask/dask-kubernetes · GitHub

1 Like

Hello @jacobtomlinson, I agree with your reasoning but when using Dask with orchestrators (like Prefect), in my view makes sense to recover the Dask cluster (=using deployments) and let the orcherstrator to manage the impact of failures.

I agree that it makes sense to recover the Dask Cluster. My point is that Deployment’s cannot do that correctly due to the semi-stateful nature of Dask Workers, so we’ve implemented our own controller to handle this.

Right now we have a DaskWorkerGroup resource which behaves similarly to a Deployment but with intelligent scale up/down, however it manages Pods directly. One thing we are considering is wrapping each worker Pod in a ReplicaSet so that the Kubernetes controller handles things like Pod recreation after eviction instead of us having to do it ourselves.