I’m trying to use KubeCluster to start a Dask cluster on a remote K3S cluster (i.e. not minikube/kind on localhost). I’m following the example code from the docs, which does manage to provision a worker pod. However, creating the cluster with cluster = KubeCluster('worker-spec.yml') times out before returning the cluster object.
This is the timeout error… Should it be using localhost? Am I on the right track thinking that’s the culprit?
OSError: Timed out during handshake while connecting to tcp://localhost:52729 after 10 s
This is the test code I’m starting with, from the KubeCluster docs. As you can see, I’ve tried altering some of the timeout parameters found in a post with a similar issue, but that didn’t help.
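Roughly this (a minimal sketch of what I’m running, based on the quickstart example plus the timeout settings from that post; which config keys are actually relevant is my assumption):

import dask
from dask_kubernetes import KubeCluster

# Timeout settings I tried raising (assumption: these are the relevant keys)
dask.config.set({
    "distributed.comm.timeouts.connect": "60s",
    "kubernetes.scheduler-service-wait-timeout": 60,
})

cluster = KubeCluster('worker-spec.yml')  # this is the call that times out
cluster.scale(1)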
Note: I’m using the worker-spec.yml from the example in the docs (KubeCluster — Dask Kubernetes 2021.03.0+100.g4a69e3a documentation) so that I can later update it with resource limits and requests that match my nodes’ CPU and RAM, and define my conda environment.
Some version/setup details:
Running Python code in a conda environment on Windows
Yeah, it should be using localhost. Since you’re outside the k3s cluster, it will be trying to port-forward the scheduler locally, so that’s what you’re seeing.
The first pod it provisions is the scheduler. Could you share the logs from that pod?
The scheduler looks like it is running OK. The Connection from ... closed before handshake completed message can happen for a number of reasons and generally isn’t the problem.
My guess is that the port forward isn’t being set up correctly. Are you able to do kubectl get svc and then do a kubectl port-forward ... manually to see if you can connect to the scheduler?
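For example (dask-example is just a placeholder; substitute the service name that kubectl get svc shows for your cluster):

kubectl get svc
kubectl port-forward service/dask-example 8786:8786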
That looks like the case: I’m able to port-forward manually and reach the cluster’s scheduler service.
For a moment, I thought my worker-spec.yml was missing the argument $(DASK_SCHEDULER_ADDRESS). It’s missing in the “Quickstart” section of the docs but used in the “GPUs” section (KubeCluster — Dask Kubernetes 2021.03.0+101.g8bd5083 documentation). However, adding the scheduler address environment variable to my worker spec args didn’t help.
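For reference, after that change the container section of my worker spec looked roughly like this (adapted from the GPUs example; image and remaining args are from the quickstart, so treat the details as approximate):

spec:
  containers:
    - image: daskdev/dask:latest
      args: [dask-worker, $(DASK_SCHEDULER_ADDRESS), --nthreads, '2', --memory-limit, 6GB, --death-timeout, '60']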
I also tried changing the default config to use NodePort or LoadBalancer. No luck there either; even after increasing the timeout to 120 sec I still get a timeout.
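That change was just in the dask config, something like this (assuming kubernetes.scheduler-service-type is the right key):

import dask
from dask_kubernetes import KubeCluster

dask.config.set({"kubernetes.scheduler-service-type": "NodePort"})  # also tried "LoadBalancer"
cluster = KubeCluster('worker-spec.yml')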
One thing that’s odd: when the service is a NodePort, the Python code still tries to reach port 8786 instead of the service’s random NodePort, 32367.
So when using ClusterIP it should call kubectl port-forward for you. I’m a little confused why you are able to do this manually but it is failing here.
When using LoadBalancer it expects an LB to be provisioned. If your cluster doesn’t have a provisioning service configured then it will time out. To resolve this you would need to add something like MetalLB to your cluster. But this probably isn’t the road you need to go down.
When using NodePort it should expose the scheduler on a random high port of every node. If the connection is trying to use 8786 that sounds like a bug. Would you mind opening an issue for this on GitHub?
Sure, I’d be glad to open a GH issue. In the meantime, is there a way I can force KubeCluster to use a port of my choice for NodePort?
P.S. My cluster has Klipper LB, so it should work. I’m able to manually create a load balancer service.
Another interesting observation… when I let it use ClusterIP, it does seem to be creating a port-forward with kubectl, because when I point a browser at the URL in the error message I get a response.
OSError: Timed out during handshake while connecting to tcp://localhost:53044 after 10 s
I get this response… from localhost:53044 using a browser. I realize this isn’t the dashboard but the scheduler API; I just wanted to see if it was responsive.
I tried increasing that before without luck. I just re-tested with the timeouts increased to 60 sec and I still get the timeout. The scheduler pod starts in 2 sec and the ClusterIP service starts 8 sec after that. What’s interesting is that the exception is raised 10 sec after the service starts (20 sec total elapsed) instead of after the 60 sec I configured. So it looks like there’s another timeout config I should be considering (see the config check after the traceback below).
Creating scheduler pod on cluster. This may take some time.
distributed.protocol.core - CRITICAL - Failed to deserialize
Traceback (most recent call last):
File "c:\tools\miniconda3\envs\daa_sim\lib\site-packages\distributed\protocol\core.py", line 107, in loads
small_payload = frames.pop()
IndexError: pop from empty list
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
c:\tools\miniconda3\envs\daa_sim\lib\site-packages\distributed\comm\core.py in connect(addr, timeout, deserialize, handshake_overrides, **connection_args)
318 # write, handshake = await asyncio.gather(comm.write(local_info), comm.read())
--> 319 handshake = await asyncio.wait_for(comm.read(), time_left())
320 await asyncio.wait_for(comm.write(local_info), time_left())
c:\tools\miniconda3\envs\daa_sim\lib\asyncio\tasks.py in wait_for(fut, timeout, loop)
493 if fut.done():
--> 494 return fut.result()
495 else:
c:\tools\miniconda3\envs\daa_sim\lib\site-packages\distributed\comm\tcp.py in read(self, deserializers)
216
--> 217 msg = await from_frames(
218 frames,
c:\tools\miniconda3\envs\daa_sim\lib\site-packages\distributed\comm\utils.py in from_frames(frames, deserialize, deserializers, allow_offload)
79 else:
---> 80 res = _from_frames()
81
c:\tools\miniconda3\envs\daa_sim\lib\site-packages\distributed\comm\utils.py in _from_frames()
62 try:
---> 63 return protocol.loads(
...
--> 324 raise IOError(
325 f"Timed out during handshake while connecting to {addr} after {timeout} s"
326 ) from exc
OSError: Timed out during handshake while connecting to tcp://localhost:49996 after 60s
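For what it’s worth, here’s how I’m double-checking what the timeout config actually resolves to (assuming these are the keys that matter):

import dask
import dask_kubernetes  # imported so the kubernetes config namespace is registered

print(dask.config.get("distributed.comm.timeouts.connect"))          # comm connect/handshake timeout
print(dask.config.get("kubernetes.scheduler-service-wait-timeout"))  # how long to wait for the scheduler service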