Kubernetes Operator + AutoScaler Losing Tasks / Workers

Hi everyone, we are trying out the new Kubernetes Operator-based KubeCluster for a workload that needs adaptive scaling. So far it is amazing; however, we have an issue with adaptive scaling that causes tasks to get lost. Here is how we provision our cluster and connect to it:

import distributed
from dask_kubernetes.operator import KubeCluster

cluster = KubeCluster(
    name="my-dask-cluster",
    image="ghcr.io/dask/dask:2022.12.0-py3.10",
    n_workers=4,
)

cluster.adapt(minimum=4, maximum=32)  # adaptive scaling between 4 and 32 workers
client = distributed.Client(cluster)

which creates the following K8s objects as expected:
[screenshot of the resulting Kubernetes objects]

I run a simple Dask Array workflow that generates pseudo-random numbers and pulls a slice back to my client, like this:

import dask.array as da

arr = da.random.normal(size=(4096, 16384, 4096), chunks=(128, 512, 512)).astype('float32')
arr_slice = arr[0].compute()  # pull the first slice back to the client as a NumPy array
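
For scale, that array works out to exactly 1 TiB of float32 data in 128 MiB chunks (8192 chunks in total); a quick sanity check with plain Dask, no cluster needed:

import dask.array as da

arr = da.random.normal(size=(4096, 16384, 4096), chunks=(128, 512, 512)).astype('float32')
print(arr.nbytes / 2**40)                  # 1.0 TiB in total
print(arr.numblocks)                       # (32, 32, 8) chunk grid, 8192 chunks
print(arr.blocks[0, 0, 0].nbytes / 2**20)  # 128 MiB per chunk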

When we watch the dashboard it all looks fine until we start losing workers or tasks:

  1. Tasks start running on the 4 base workers.
  2. More workers spin up and get added to the cluster.
  3. As more workers come up, some of them start dying and their tasks get re-assigned.
  4. Eventually the workers either finish the job, if we are lucky, or we error out with this:
KilledWorker: Attempted to run task ('astype-normal-getitem-60d7c97bdab5f20901029ce4f6d7bc5c', 15, 5) on 3 different workers, but all those workers died while running it. The last worker that attempt to run the task was tcp://10.19.33.132:46341. Inspecting worker logs is often a good next step to diagnose what went wrong. For more information see https://distributed.dask.org/en/stable/killed.html.

We followed the link, but it doesn’t apply to any of the issues we are seeing. We can’t recover any information about why or how the workers die.
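
For context, KilledWorker is raised once a task has been on too many dying workers; the threshold is the distributed.scheduler.allowed-failures config value (default 3). A hedged sketch of raising that threshold so the job survives more worker churn while debugging; this assumes the env passed to KubeCluster reaches the scheduler pod, which may not hold for every operator version:

from dask_kubernetes.operator import KubeCluster

cluster = KubeCluster(
    name="my-dask-cluster",
    image="ghcr.io/dask/dask:2022.12.0-py3.10",
    n_workers=4,
    # The scheduler reads this at start-up; it cannot be changed from the
    # client once the scheduler pod is running. If env does not reach the
    # scheduler pod, set the variable on the scheduler container instead.
    env={"DASK_DISTRIBUTED__SCHEDULER__ALLOWED_FAILURES": "10"},
)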

Given how trivial this example is, we are puzzled as to why this is happening. We checked the Dask Kubernetes and dashboard logs, and typically they just say “Removing worker …” followed by new workers being added, etc.

In some cases, an exception is thrown saying the tasks could not be retrieved from the workers. Are we doing something wrong here, or is there a bug?

We have also tried the recreate_task_locally() function, but it doesn’t give us any error information; it just returns the computed values as a NumPy array. Also confusing :slight_smile:
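
That behaviour is consistent with what those helpers do: they re-run the blamed task in the local process. A minimal sketch of the error-path variant (standard distributed.Client API, using the same arr as above):

future = client.compute(arr[0])

try:
    future.result()
except Exception:
    # Re-runs the blamed task in this process; if the task's own code is
    # at fault, the same exception is raised here with a usable traceback.
    client.recreate_error_locally(future)

If nothing is raised there, and recreate_task_locally() hands the computed chunk straight back as it did for us, that points at the workers being killed externally (memory pressure, pod eviction, scale-down) rather than at the task code.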

A little more information about our setup:

  • Google Cloud
  • Google Kubernetes Engine
  • We turned on Autopilot
  • Running behind a VPC
  • Using the default ClusterIP service or an internal LoadBalancer yields the same results.
  • If we don’t turn on adaptive scaling and instead scale by hand to many (hundreds of) pods, everything works fine (see the sketch after this list).
  • We can scale up or down manually without any issues.
  • Operator, Dask, and Distributed are all at 2022.12.0.
  • The client, scheduler, and workers are all on the same version.
  • LocalCluster works fine too.
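
The manual-scaling path mentioned above is just the standard scale() call on the same cluster object; roughly:

# Instead of cluster.adapt(minimum=4, maximum=32):
cluster.scale(32)             # fixed pool of 32 workers, no adaptive scaling
arr_slice = arr[0].compute()
cluster.scale(4)              # shrink back down by hand afterwards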

We think Autopilot may be the culprit here, but in theory it shouldn’t conflict with anything Dask is doing; it just manages the node pools that Dask asks for. However, we are about to try a new deployment without Autopilot to test this.

Thanks in advance!
Altay

We can’t recover any information about why or how the workers die.

Can you watch the scheduler and worker pod logs on your Kubernetes cluster? This will be the best place to figure out what is going on.

Hi @jacobtomlinson,

We have been trying, but it is hard to get to the worker logs after the pods have been terminated.
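
In the meantime, logs can also be pulled over the Dask comm while the workers are still registered; these are standard distributed.Client calls, nothing operator-specific:

scheduler_logs = client.get_scheduler_logs()
worker_logs = client.get_worker_logs()   # dict: worker address -> [(level, message), ...]

for addr, records in worker_logs.items():
    print(addr)
    for level, message in records:
        print("  ", level, message)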

However, on the scheduler this is the error:

2022-12-08 17:56:15,183 - distributed.scheduler - ERROR - Shut down workers that don't have promised key: [], ('transpose-37d5309538e5918c286a45e5863fa247', 6, 23)

Sounds like the scheduler is killing them because the workers lose the keys somehow? This only happens on newly added worker pods with the DaskAutoscaler (i.e. .adapt). If I manually scale without the autoscaler, this is not an issue.
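
For anyone following along, the standard client calls for checking where keys actually live during scale-up look roughly like this:

print(client.who_has())                          # key -> workers currently holding it
print(client.has_what())                         # worker address -> keys it holds
print(list(client.scheduler_info()["workers"]))  # workers the scheduler currently knows about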

I will post more updates if we can get to the worker logs :slight_smile:

@jacobtomlinson

Another interesting thing: the classic KubeCluster works fine, but the operator-based one doesn’t.

We are still investigating.

I wonder if anyone in the community can reproduce this with the operator.

I can reproduce this with minikube on a local Kubernetes cluster just by using the demo from the documentation. I don’t think it is related to our GCP deployment.
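
Roughly, the reproducer boils down to the snippet from the top of this thread:

import distributed
import dask.array as da
from dask_kubernetes.operator import KubeCluster

cluster = KubeCluster(
    name="my-dask-cluster",
    image="ghcr.io/dask/dask:2022.12.0-py3.10",
    n_workers=4,
)
cluster.adapt(minimum=4, maximum=32)
client = distributed.Client(cluster)

arr = da.random.normal(size=(4096, 16384, 4096), chunks=(128, 512, 512)).astype('float32')
arr_slice = arr[0].compute()   # KilledWorker once adaptive scale-up starts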

I will open an issue in the repo.