Missing runner in Dask job

Hi, we use Flyte to create a Dask task that deploys a Dask job on k8s. It generally works fine, but sometimes the Dask scheduler and workers are created and sit idle while the runner is never found. Any idea what could cause this? Thanks

The only error we found was reported by the Dask k8s operator:

Handler 'daskjob_create_components/status.jobStatus' failed with an exception. Will retry.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/kr8s/_api.py", line 168, in call_api
    response.raise_for_status()
  File "/usr/local/lib/python3.10/site-packages/httpx/_models.py", line 763, in raise_for_status
    raise HTTPStatusError(message, request=request, response=self)
httpx.HTTPStatusError: Client error '409 Conflict' for url 'https://10.0.0.1/apis/kubernetes.dask.org/v1...", line 774, in daskjob_create_components
    await cluster.create()
  File "/usr/local/lib/python3.10/site-packages/kr8s/_objects.py", line 320, in create
    async with self.api.call_api(
  File "/usr/local/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/usr/local/lib/python3.10/site-packages/kr8s/_api.py", line 186, in call_api
    raise ServerError(
kr8s._exceptions.ServerError: daskclusters.kubernetes.dask.org "fn2oaqa4432x5o-n0-0-dn7-0" already exists

This looks like a bug. Could you open an issue on dask-kubernetes with the full error message?


Hi Jacob, I created an issue: No exception handle for cluster 'already exists' exception in dask job creation · Issue #940 · dask/dask-kubernetes · GitHub


I opened a PR to close this: Skip Job cluster creation if already exists by jacobtomlinson · Pull Request #941 · dask/dask-kubernetes · GitHub. I’m going to mark this thread as resolved.
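
For anyone landing here later, the general idea of skipping creation when the object already exists can be sketched as below. This is only an illustration of the pattern, not the code in the PR: `AlreadyExistsError`, `FakeApi`, `create_if_absent` and `handler` are hypothetical names, and the in-memory `FakeApi` stands in for the real Kubernetes API. It also assumes that the missing runner comes from the retried handler failing at cluster creation before it reaches the later steps, which the traceback suggests but does not prove.

```python
import asyncio


class AlreadyExistsError(Exception):
    """Hypothetical stand-in for a 409 Conflict 'already exists' error."""


class FakeApi:
    """Hypothetical in-memory stand-in for the Kubernetes API."""

    def __init__(self):
        self.objects = set()

    async def create(self, name):
        # Mimic the server rejecting a duplicate create.
        if name in self.objects:
            raise AlreadyExistsError(f'"{name}" already exists')
        self.objects.add(name)


async def create_if_absent(api, name):
    """Create the object, treating 'already exists' as success.

    Returns True if this call created it, False if it was already there.
    """
    try:
        await api.create(name)
    except AlreadyExistsError:
        return False
    return True


async def handler(api, name):
    # On a retry the cluster may already exist from an earlier attempt.
    created = await create_if_absent(api, name)
    # Because the conflict is swallowed, the later steps still run.
    await create_if_absent(api, f"{name}-runner")
    return created
```

Calling `handler` twice with the same name succeeds both times: the second call finds the cluster already present and carries on instead of raising.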