KubeCluster provisions pod but times out before returning cluster object

I’m trying to use KubeCluster to start a Dask cluster on a remote K3S cluster (i.e. not minikube/kind on localhost). I’m following the example code from the docs, which does successfully provision a worker pod. However, creating the cluster with cluster = KubeCluster('worker-spec.yml') times out before returning the cluster object.

This is the timeout error… Should it be using localhost? Am I on the right track thinking that’s the culprit?

OSError: Timed out during handshake while connecting to tcp://localhost:52729 after 10 s

This is the test code I’m starting with from the KubeCluster docs. As you can see, I’ve tried altering some of the timeout parameters mentioned in a post about a similar issue, but that didn’t help.

import dask
from dask_kubernetes import KubeCluster, KubeConfig, make_pod_spec

auth = KubeConfig(config_file="~/.kube/remote")
# dask.config.set({"kubernetes.scheduler-service-wait-timeout": 300})
# dask.config.set({"distributed.comm.timeouts.connect": 300})

cluster = KubeCluster('worker-spec.yml')
cluster.scale(3)

Note: I’m using the worker-spec.yml from the example in the docs (KubeCluster — Dask Kubernetes 2021.03.0+100.g4a69e3a documentation) so I could update it with the resource limits and requests that match my nodes’ CPU and RAM and define my conda environment in the future.

Some version/setup details:

  • Running Python code in a conda environment on Windows
  • Conda environment using conda-forge:
    • python=3.8.8=h7840368_0_cpython
    • dask=2021.3.0=pyhd8ed1ab_0
    • dask-core=2021.3.0=pyhd8ed1ab_0
    • dask-kubernetes=2022.7.0=pyhd8ed1ab_0
    • kubernetes=1.23.6=h57928b3_0
    • kubernetes-client=1.23.6=h57928b3_0
    • kubernetes-node=1.23.6=h57928b3_0
    • kubernetes_asyncio=24.2.2=pyhd8ed1ab_0
  • K3S cluster:
    • 3 nodes w/ 2CPUs 6GB RAM
    • v1.23.8+k3s2
    • provisioned with Rancher

Yeah, it should be using localhost. Since you are outside the K3S cluster, it will be trying to port-forward the scheduler locally, so that’s what you’re seeing.

The first pod it provisions is the scheduler. Could you share the logs from that pod?

Thanks, logs are below… This looks similar to the GitHub issue you were commenting on: KubeCluster(scheduler_service_wait_timeout) parameter is not working. · Issue #367 · dask/dask-kubernetes · GitHub

Also, I deleted the whole cluster and started from scratch, so it’s interesting to know that whatever is going on is reproducible.

+ '[' '' ']'
+ '[' '' == true ']'
+ CONDA_BIN=/opt/conda/bin/conda
+ '[' -e /opt/app/environment.yml ']'
+ echo 'no environment.yml'
+ '[' '' ']'
+ '[' git+https://github.com/dask/distributed ']'
+ echo 'EXTRA_PIP_PACKAGES environment variable found.  Installing.'
+ /opt/conda/bin/pip install git+https://github.com/dask/distributed
no environment.yml
EXTRA_PIP_PACKAGES environment variable found.  Installing.
Collecting git+https://github.com/dask/distributed
  Cloning https://github.com/dask/distributed to /tmp/pip-req-build-rqojm0ed
  Running command git clone --filter=blob:none --quiet https://github.com/dask/distributed /tmp/pip-req-build-rqojm0ed
  Resolved https://github.com/dask/distributed to commit c15a10e87ca5d03e62f0ad4f38adb63163522979
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Requirement already satisfied: click>=6.6 in /opt/conda/lib/python3.8/site-packages (from distributed==2022.8.1+6.gc15a10e8) (8.1.3)
Requirement already satisfied: cloudpickle>=1.5.0 in /opt/conda/lib/python3.8/site-packages (from distributed==2022.8.1+6.gc15a10e8) (2.1.0)
Requirement already satisfied: dask==2022.8.1 in /opt/conda/lib/python3.8/site-packages (from distributed==2022.8.1+6.gc15a10e8) (2022.8.1)
Requirement already satisfied: jinja2 in /opt/conda/lib/python3.8/site-packages (from distributed==2022.8.1+6.gc15a10e8) (3.1.2)
Requirement already satisfied: locket>=1.0.0 in /opt/conda/lib/python3.8/site-packages (from distributed==2022.8.1+6.gc15a10e8) (1.0.0)
Requirement already satisfied: msgpack>=0.6.0 in /opt/conda/lib/python3.8/site-packages (from distributed==2022.8.1+6.gc15a10e8) (1.0.4)
Requirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.8/site-packages (from distributed==2022.8.1+6.gc15a10e8) (21.3)
Requirement already satisfied: psutil>=5.0 in /opt/conda/lib/python3.8/site-packages (from distributed==2022.8.1+6.gc15a10e8) (5.9.1)
Requirement already satisfied: sortedcontainers!=2.0.0,!=2.0.1 in /opt/conda/lib/python3.8/site-packages (from distributed==2022.8.1+6.gc15a10e8) (2.4.0)
Requirement already satisfied: tblib>=1.6.0 in /opt/conda/lib/python3.8/site-packages (from distributed==2022.8.1+6.gc15a10e8) (1.7.0)
Requirement already satisfied: toolz>=0.8.2 in /opt/conda/lib/python3.8/site-packages (from distributed==2022.8.1+6.gc15a10e8) (0.12.0)
Requirement already satisfied: tornado<6.2,>=6.0.3 in /opt/conda/lib/python3.8/site-packages (from distributed==2022.8.1+6.gc15a10e8) (6.1)
Requirement already satisfied: urllib3 in /opt/conda/lib/python3.8/site-packages (from distributed==2022.8.1+6.gc15a10e8) (1.26.11)
Requirement already satisfied: zict>=0.1.3 in /opt/conda/lib/python3.8/site-packages (from distributed==2022.8.1+6.gc15a10e8) (2.2.0)
Requirement already satisfied: pyyaml in /opt/conda/lib/python3.8/site-packages (from distributed==2022.8.1+6.gc15a10e8) (6.0)
Requirement already satisfied: fsspec>=0.6.0 in /opt/conda/lib/python3.8/site-packages (from dask==2022.8.1->distributed==2022.8.1+6.gc15a10e8) (2022.7.1)
Requirement already satisfied: partd>=0.3.10 in /opt/conda/lib/python3.8/site-packages (from dask==2022.8.1->distributed==2022.8.1+6.gc15a10e8) (1.3.0)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /opt/conda/lib/python3.8/site-packages (from packaging>=20.0->distributed==2022.8.1+6.gc15a10e8) (3.0.9)
Requirement already satisfied: heapdict in /opt/conda/lib/python3.8/site-packages (from zict>=0.1.3->distributed==2022.8.1+6.gc15a10e8) (1.0.1)
Requirement already satisfied: MarkupSafe>=2.0 in /opt/conda/lib/python3.8/site-packages (from jinja2->distributed==2022.8.1+6.gc15a10e8) (2.1.1)
Building wheels for collected packages: distributed
  Building wheel for distributed (setup.py): started
  Building wheel for distributed (setup.py): finished with status 'done'
  Created wheel for distributed: filename=distributed-2022.8.1+6.gc15a10e8-py3-none-any.whl size=894391 sha256=75a2b1a78cde55b342baebf55fcf4d974cd16217b7a218030b5bc0ac8cb04ac4
  Stored in directory: /tmp/pip-ephem-wheel-cache-i55cmqc9/wheels/07/f6/f2/6ea4e4452246d640aa82fd1151ee2053e71463f6386b31d6b1
Successfully built distributed
Installing collected packages: distributed
  Attempting uninstall: distributed
    Found existing installation: distributed 2022.8.1
    Uninstalling distributed-2022.8.1:
      Successfully uninstalled distributed-2022.8.1
Successfully installed distributed-2022.8.1+6.gc15a10e8
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
+ exec dask-scheduler
2022-08-23 19:33:28,738 - distributed.scheduler - INFO - -----------------------------------------------
2022-08-23 19:33:30,041 - distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
2022-08-23 19:33:30,068 - distributed.scheduler - INFO - State start
2022-08-23 19:33:30,070 - distributed.scheduler - INFO - -----------------------------------------------
2022-08-23 19:33:30,070 - distributed.scheduler - INFO - Clear task state
2022-08-23 19:33:30,070 - distributed.scheduler - INFO -   Scheduler at:      tcp://10.42.1.4:8786
2022-08-23 19:33:30,070 - distributed.scheduler - INFO -   dashboard at:                     :8787
2022-08-23 19:33:43,241 - distributed.comm.tcp - INFO - Connection from tcp://127.0.0.1:35732 closed before handshake completed
2022-08-23 19:33:43,468 - distributed.comm.tcp - INFO - Connection from tcp://127.0.0.1:35734 closed before handshake completed

The scheduler looks like it is running OK. The Connection from ... closed before handshake completed messages can happen for a number of reasons and generally aren’t the problem.

My guess is that the port forward isn’t being set up correctly. Are you able to do kubectl get svc and then do a kubectl port-forward ... manually to see if you can connect to the scheduler?
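Once the manual kubectl port-forward is running, a quick stdlib check like the sketch below can confirm the forwarded port is actually reachable from Python (the demo probes a throwaway local listener so it is self-contained; to test the scheduler, point it at whatever local port your port-forward used, e.g. `tcp_reachable("localhost", 8786)`):

```python
import socket

def tcp_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Self-contained demo: bind a throwaway listener on an ephemeral port
# and probe it, then close it and probe again.
listener = socket.socket()
listener.bind(("localhost", 0))
listener.listen(1)
_, port = listener.getsockname()
print(tcp_reachable("localhost", port))  # → True
listener.close()
print(tcp_reachable("localhost", port))  # → False
```

This only tests raw TCP reachability, not the Dask handshake, but it separates “the forward isn’t up at all” from “the forward is up and the handshake itself is failing”.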

That looks to be the case: I’m able to port-forward manually and reach the cluster’s scheduler service.

For a moment, I thought my worker-spec.yml was missing the argument $(DASK_SCHEDULER_ADDRESS). It’s missing in the “Quickstart” section of the docs but used in the “GPUs” section (KubeCluster — Dask Kubernetes 2021.03.0+101.g8bd5083 documentation). However, adding the scheduler address environment variable to my worker spec args didn’t help.

I also tried changing the default config to use NodePort or LoadBalancer. No luck there either; even after increasing the timeout to 120 s I still get a timeout:

import dask
from dask_kubernetes import KubeCluster, KubeConfig

auth = KubeConfig(config_file="~/.kube/config")
dask.config.set({"kubernetes.scheduler-service-wait-timeout": 120})
dask.config.set({"distributed.comm.timeouts.connect": 120})
dask.config.set({"kubernetes.scheduler-service-type": "LoadBalancer"})

cluster = KubeCluster('worker-spec.yml', auth=auth, deploy_mode='remote')
cluster.scale(3)
TimeoutError: Timed out waiting for Load Balancer to be provisioned.

Note: this error seems accurate; when I check the provisioning status of the LoadBalancer, it stays in the provisioning state indefinitely.

One thing that’s odd: when making the service NodePort, the Python code tries to reach port 8786 instead of using the randomly assigned NodePort of 32367.

Services in Rancher:

Python error:
OSError: Timed out trying to connect to tcp://10.20.0.241:8786 after 10 s

Thanks for the info.

So when using ClusterIP it should call kubectl port-forward for you. I’m a little confused why you are able to do this manually but it is failing here.

When using LoadBalancer it expects an LB to be provisioned. If your cluster doesn’t have a provisioning service configured then it will time out. To resolve this you would need to add something like MetalLB to your cluster. But this probably isn’t the road you need to go down.

When using NodePort it should expose the scheduler on a random high port of every node. If the connection is trying to use 8786 that sounds like a bug. Would you mind opening an issue for this on GitHub?
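To illustrate the mismatch described above, here’s a hedged sketch: the dict mimics the shape of `kubectl get svc <name> -o json` output for a NodePort service (the port name is made up for the example). A client outside the cluster has to use `spec.ports[].nodePort`, the random high port; the `port` value (8786) is only routable inside the cluster, which is why connecting to `<node-ip>:8786` times out.

```python
# Minimal sketch of a NodePort service manifest, mimicking
# `kubectl get svc -o json` output. The "tcp-comm" name is hypothetical.
svc = {
    "spec": {
        "type": "NodePort",
        "ports": [{"name": "tcp-comm", "port": 8786, "nodePort": 32367}],
    }
}

# An external client must pick the nodePort, not the internal port.
comm = next(p for p in svc["spec"]["ports"] if p["port"] == 8786)
print(f"external address: tcp://<node-ip>:{comm['nodePort']}")
# → external address: tcp://<node-ip>:32367
```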

Sure, I’d be glad to open a GH issue. In the meantime, is there a way I can force KubeCluster to use a port of my choice for the NodePort?

P.S. My cluster has Klipper LB, so LoadBalancer should work. I’m able to manually create a LoadBalancer service.

Another interesting observation… When I let it use ClusterIP, it does seem to be creating a port-forward with kubectl, because when I point a browser at the URL in the error message I get a response.

OSError: Timed out during handshake while connecting to tcp://localhost:53044 after 10 s

I get this response from localhost:53044 using a browser. I realize this isn’t the dashboard but the scheduler API; I just wanted to see if it was responsive.

=��������������-�������ƒ«compression£lz4¦python“
¯pickle-protocol

It’s interesting that it is timing out but you are able to connect manually. Perhaps the kubernetes.scheduler-service-wait-timeout is not long enough?

I tried increasing that before without luck. I just re-tested with the timeouts increased to 60 s and I still get the timeout. The scheduler pod starts in 2 s and the ClusterIP service starts 8 s after that. What’s interesting is that the exception is raised 10 s after the service starts (20 s total elapsed) instead of the 60 s I configured. So it looks like there’s another timeout config I should be considering.

import dask
from dask_kubernetes import KubeCluster, KubeConfig

auth = KubeConfig(config_file="~/.kube/remote")
dask.config.set({"kubernetes.scheduler-service-wait-timeout": 60})
dask.config.set({"distributed.comm.timeouts.connect": 60})
# dask.config.set({"kubernetes.scheduler-service-type": "NodePort"})

cluster = KubeCluster('worker-spec.yml', auth=auth, deploy_mode='remote')

Exception:

Creating scheduler pod on cluster. This may take some time.
distributed.protocol.core - CRITICAL - Failed to deserialize
Traceback (most recent call last):
  File "c:\tools\miniconda3\envs\daa_sim\lib\site-packages\distributed\protocol\core.py", line 107, in loads
    small_payload = frames.pop()
IndexError: pop from empty list
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
c:\tools\miniconda3\envs\daa_sim\lib\site-packages\distributed\comm\core.py in connect(addr, timeout, deserialize, handshake_overrides, **connection_args)
    318         # write, handshake = await asyncio.gather(comm.write(local_info), comm.read())
--> 319         handshake = await asyncio.wait_for(comm.read(), time_left())
    320         await asyncio.wait_for(comm.write(local_info), time_left())

c:\tools\miniconda3\envs\daa_sim\lib\asyncio\tasks.py in wait_for(fut, timeout, loop)
    493         if fut.done():
--> 494             return fut.result()
    495         else:

c:\tools\miniconda3\envs\daa_sim\lib\site-packages\distributed\comm\tcp.py in read(self, deserializers)
    216 
--> 217                 msg = await from_frames(
    218                     frames,

c:\tools\miniconda3\envs\daa_sim\lib\site-packages\distributed\comm\utils.py in from_frames(frames, deserialize, deserializers, allow_offload)
     79     else:
---> 80         res = _from_frames()
     81 

c:\tools\miniconda3\envs\daa_sim\lib\site-packages\distributed\comm\utils.py in _from_frames()
     62         try:
---> 63             return protocol.loads(
...
--> 324         raise IOError(
    325             f"Timed out during handshake while connecting to {addr} after {timeout} s"
    326         ) from exc

OSError: Timed out during handshake while connecting to tcp://localhost:49996 after 60s
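One more hedged thing to try, given that some runs timed out after 10 s rather than the configured value (suggesting the in-process dask.config.set calls aren’t always picked up): set the values via environment variables before dask is first imported. Dask maps config keys to DASK_-prefixed variables with double underscores separating nested keys; the exact spellings below follow that convention but should be double-checked against your dask version.

```python
import os

# Assumed env-var spellings (DASK_ prefix, "__" for nesting, "-" → "_");
# these must be set before dask / dask_kubernetes are imported so the
# config module reads them at initialization time.
os.environ["DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT"] = "60s"
os.environ["DASK_KUBERNETES__SCHEDULER_SERVICE_WAIT_TIMEOUT"] = "60"

# ...then, in the same process:
# import dask
# from dask_kubernetes import KubeCluster
print(os.environ["DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT"])  # → 60s
```

This sidesteps any ordering issues between dask.config.set and the event loop that KubeCluster starts.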