Hi,
I manage Kubernetes deployments on OpenStack (XSEDE Jetstream), and we do not have load balancers.
In 2020 I managed to set up Dask Gateway 0.8.0 exposed as a JupyterHub service, and it worked fine. I wrote a tutorial about it for fellow XSEDE users:
https://zonca.dev/2020/08/dask-gateway-jupyterhub.html
Now I am trying to update the tutorial to the latest JupyterHub and Dask Gateway 0.9.0, but I cannot get it to work.
In the JupyterHub configuration I have:

    hub:
      services:
        dask-gateway:
          # This makes the gateway available at ${HUB_URL}/services/dask-gateway
          url: http://traefik-dask-gateway

In fact, if I check https://js-xxx-xxx.jetstream-cloud.org/services/dask-gateway/api/health, I get {status:pass}.
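The same health check can be scripted from a notebook pod and run against both the internal address (http://traefik-dask-gateway/services/dask-gateway/) and the public one. This is a minimal sketch: the helper names are hypothetical, and it assumes the endpoint returns a JSON object with a "status" key, which is what the {status:pass} output suggests.

```python
import json
import urllib.request


def health_url(base_url: str) -> str:
    """Append the health-check path to a Dask Gateway base address."""
    return base_url.rstrip("/") + "/api/health"


def parse_health(body: bytes) -> bool:
    """Return True if the JSON body reports status "pass" (assumed response shape)."""
    return json.loads(body.decode("utf-8")).get("status") == "pass"


def gateway_is_healthy(base_url: str, timeout: float = 5.0) -> bool:
    """Fetch the health endpoint over HTTP and report whether it says "pass"."""
    with urllib.request.urlopen(health_url(base_url), timeout=timeout) as resp:
        return parse_health(resp.read())
```

If gateway_is_healthy returns True for both addresses, that would suggest the HTTP routes (through the JupyterHub proxy and through Traefik directly) are fine, and that the problem is limited to the scheduler connection.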
I access the gateway with:

    from dask_gateway import Gateway

    gateway = Gateway(
        address="http://traefik-dask-gateway/services/dask-gateway/",
        public_address="https://js-xxx-xxx.jetstream-cloud.org/services/dask-gateway/",
        auth="jupyterhub",
    )

I can:
- create a new cluster
- scale it up
- access the dashboard
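The working steps above, plus the failing get_client call, can be condensed into a short reproduction script. It reuses the addresses from the Gateway call in this post; reproduce is a hypothetical helper name, and dask_gateway is imported inside it so the file can be loaded even where dask-gateway is not installed.

```python
def reproduce(n_workers=2):
    """Create a cluster, scale it, print its addresses, then try to get a client."""
    # Imported here so defining this function does not require dask-gateway.
    from dask_gateway import Gateway

    gateway = Gateway(
        address="http://traefik-dask-gateway/services/dask-gateway/",
        public_address="https://js-xxx-xxx.jetstream-cloud.org/services/dask-gateway/",
        auth="jupyterhub",
    )
    cluster = gateway.new_cluster()   # works
    cluster.scale(n_workers)          # works
    print(cluster.dashboard_link)     # works, built from public_address
    print(cluster.scheduler_address)  # gateway://<proxy host>:<port>/<cluster name>
    try:
        client = cluster.get_client()  # this is the step that times out
    except OSError as exc:
        print("get_client failed:", exc)
        client = None
    return gateway, cluster, client
```

The cluster is returned even when the client fails, so it can be cleaned up afterwards with cluster.shutdown().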
But it fails when I try to get the client:

    >>> cluster.get_client()
    ---------------------------------------------------------------------------
    error Traceback (most recent call last)

    /srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/comm/core.py in connect(addr, timeout, deserialize, handshake_overrides, **connection_args)
    319 # write, handshake = await asyncio.gather(comm.write(local_info), comm.read())
    --> 320 handshake = await asyncio.wait_for(comm.read(), time_left())
    321 await asyncio.wait_for(comm.write(local_info), time_left())

    /srv/conda/envs/notebook/lib/python3.8/asyncio/tasks.py in wait_for(fut, timeout, loop)
    493 if fut.done():
    --> 494 return fut.result()
    495 else:

    /srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/comm/tcp.py in read(self, deserializers)
    215 try:
    --> 216 frames = unpack_frames(frames)
    217

    /srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/protocol/utils.py in unpack_frames(b)
    69
    ---> 70 (n_frames,) = struct.unpack_from(fmt, b)
    71 lengths = struct.unpack_from(f"{n_frames}{fmt}", b, fmt_size)

    error: unpack_from requires a buffer of at least 8 bytes for unpacking 8 bytes at offset 0 (actual buffer size is 2)

    The above exception was the direct cause of the following exception:

    OSError Traceback (most recent call last)

    <ipython-input-12-affca45186d3> in <module>
    ----> 1 client = cluster.get_client()

    /srv/conda/envs/notebook/lib/python3.8/site-packages/dask_gateway/client.py in get_client(self, set_as_default)
    1076 client : dask.distributed.Client
    1077 """
    -> 1078 client = Client(
    1079 self,
    1080 security=self.security,

    /srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/client.py in __init__(self, address, loop, timeout, set_as_default, scheduler_file, security, asynchronous, name, heartbeat_interval, serializers, deserializers, extensions, direct_to_workers, connection_limit, **kwargs)
    752 ext(self)
    753
    --> 754 self.start(timeout=timeout)
    755 Client._instances.add(self)
    756

    /srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/client.py in start(self, **kwargs)
    965 self._started = asyncio.ensure_future(self._start(**kwargs))
    966 else:
    --> 967 sync(self.loop, self._start, **kwargs)
    968
    969 def __await__(self):

    /srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
    352 if error[0]:
    353 typ, exc, tb = error[0]
    --> 354 raise exc.with_traceback(tb)
    355 else:
    356 return result[0]

    /srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/utils.py in f()
    335 if callback_timeout is not None:
    336 future = asyncio.wait_for(future, callback_timeout)
    --> 337 result[0] = yield future
    338 except Exception as exc:
    339 error[0] = sys.exc_info()

    /srv/conda/envs/notebook/lib/python3.8/site-packages/tornado/gen.py in run(self)
    760
    761 try:
    --> 762 value = future.result()
    763 except Exception:
    764 exc_info = sys.exc_info()

    /srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/client.py in _start(self, timeout, **kwargs)
    1055
    1056 try:
    -> 1057 await self._ensure_connected(timeout=timeout)
    1058 except (OSError, ImportError):
    1059 await self._close()

    /srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/client.py in _ensure_connected(self, timeout)
    1112
    1113 try:
    -> 1114 comm = await connect(
    1115 self.scheduler.address, timeout=timeout, **self.connection_args
    1116 )

    /srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/comm/core.py in connect(addr, timeout, deserialize, handshake_overrides, **connection_args)
    323 with suppress(Exception):
    324 await comm.close()
    --> 325 raise IOError(
    326 f"Timed out during handshake while connecting to {addr} after {timeout} s"
    327 ) from exc

    OSError: Timed out during handshake while connecting to gateway://traefik-dask-gateway:80/jhub.72633e218e6a42d2830183f9535efc10 after 10 s

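The root error in the traceback comes from unpack_frames: it expects the handshake message to start with an 8-byte frame count, but only 2 bytes arrived. To illustrate that layout, here is a simplified, standalone sketch of length-prefixed framing modelled on the two unpack_from calls visible in the traceback (an 8-byte count, then one 8-byte length per frame, then the payloads). It is not distributed's actual implementation; the names only mirror the traceback for readability.

```python
import struct

FMT = "Q"                        # unsigned 64-bit integer: 8 bytes, as in the error message
FMT_SIZE = struct.calcsize(FMT)  # 8


def pack_frames(frames):
    """Prefix the frames with their count and their lengths, then concatenate them."""
    header = struct.pack(f"{FMT}{len(frames)}{FMT}", len(frames), *map(len, frames))
    return header + b"".join(frames)


def unpack_frames(b):
    """Inverse of pack_frames: read the count, then the lengths, then slice the payloads."""
    (n_frames,) = struct.unpack_from(FMT, b)  # needs at least 8 bytes
    lengths = struct.unpack_from(f"{n_frames}{FMT}", b, FMT_SIZE)
    start = FMT_SIZE * (1 + n_frames)
    frames = []
    for length in lengths:
        frames.append(b[start:start + length])
        start += length
    return frames
```

A round trip such as unpack_frames(pack_frames([b"abc", b"hello"])) gives the frames back, while unpack_frames(b"\x00\x00") raises the same struct.error as in the traceback. That is consistent with whatever answered on traefik-dask-gateway:80 not sending a normal Dask handshake message, though it does not say what it sent instead.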
The value of cluster.scheduler_address is 'gateway://traefik-dask-gateway:80/jhub.72633e218e6a42d2830183f9535efc10'.
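That address has the form gateway://<host>:<port>/<cluster name>. A hypothetical standard-library helper that splits it makes explicit that the client dials traefik-dask-gateway on port 80, the same Traefik service and port that already serve the HTTP API behind address=http://traefik-dask-gateway/....

```python
from urllib.parse import urlsplit


def split_gateway_address(address: str) -> tuple:
    """Split 'gateway://host:port/cluster-name' into (host, port, cluster_name)."""
    parts = urlsplit(address)
    if parts.scheme != "gateway":
        raise ValueError(f"not a gateway:// address: {address!r}")
    # hostname and port come from the netloc; the path holds the cluster name
    return parts.hostname, parts.port, parts.path.lstrip("/")
```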
Maybe I need to provide a proxy_address to the Gateway class? Or is there a better way to achieve this? Any suggestions on how to better understand the issue are also much appreciated.
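As I read the Gateway docstring, proxy_address defaults to the host of address, can be an int (used as the port, with the host taken from address), or can be a full address when the scheduler proxy lives elsewhere. The hypothetical helper below encodes those rules to show why, with no proxy_address, the scheduler address comes out as gateway://traefik-dask-gateway:80/.... It is a sketch, not dask-gateway's code; in particular the default port for an address without an explicit port (80 for http, 443 for https) is my assumption.

```python
from urllib.parse import urlsplit


def resolve_proxy_address(address, proxy_address=None):
    """Hypothetical sketch of the proxy_address rules described in the Gateway docstring.

    - None -> host and port of `address` (80/443 from the scheme if no port; assumed)
    - int  -> that port, with the host taken from `address`
    - str  -> the host:port of the given address
    """
    parsed = urlsplit(address)
    if proxy_address is None:
        port = parsed.port or {"http": 80, "https": 443}.get(parsed.scheme, 80)
        netloc = f"{parsed.hostname}:{port}"
    elif isinstance(proxy_address, int):
        netloc = f"{parsed.hostname}:{proxy_address}"
    else:
        # accept both "scheme://host:port" and a bare "host:port"
        netloc = urlsplit(proxy_address).netloc or proxy_address
    return f"gateway://{netloc}"
```

Under these rules, passing an int such as proxy_address=8786 (an illustrative value, not one taken from this deployment) would make the client dial traefik-dask-gateway:8786 instead of port 80.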
There are also some errors in the Traefik logs; I am not sure whether they are related:

    time="2022-01-10T05:55:08Z" level=error msg="Cannot create service: subset not found" ingress=dask-767979700711489ebed89b627848c82c servicePort=8786 providerName=kubernetescrd serviceName=dask-767979700711489ebed89b627848c82c namespace=jhub
    time="2022-01-10T05:55:10Z" level=error msg="subset not found for jhub/dask-767979700711489ebed89b627848c82c" providerName=kubernetescrd namespace=jhub ingress=dask-767979700711489ebed89b627848c82c
    time="2022-01-10T05:55:10Z" level=error msg="Cannot create service: subset not found" providerName=kubernetescrd servicePort=8786 ingress=dask-767979700711489ebed89b627848c82c namespace=jhub serviceName=dask-767979700711489ebed89b627848c82c
    time="2022-01-10T05:55:12Z" level=error msg="subset not found for jhub/dask-767979700711489ebed89b627848c82c" providerName=kubernetescrd ingress=dask-767979700711489ebed89b627848c82c namespace=jhub
    time="2022-01-10T05:55:12Z" level=error msg="Cannot create service: subset not found" servicePort=8786 providerName=kubernetescrd ingress=dask-767979700711489ebed89b627848c82c namespace=jhub serviceName=dask-767979700711489ebed89b627848c82c
    time="2022-01-10T05:55:14Z" level=error msg="subset not found for jhub/dask-767979700711489ebed89b627848c82c" namespace=jhub providerName=kubernetescrd ingress=dask-767979700711489ebed89b627848c82c
    time="2022-01-10T05:55:14Z" level=error msg="Cannot create service: subset not found" servicePort=8786 ingress=dask-767979700711489ebed89b627848c82c namespace=jhub providerName=kubernetescrd serviceName=dask-767979700711489ebed89b627848c82c
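These Traefik lines are in logfmt (key=value pairs, with quoted values where needed). The hypothetical helpers below parse them with shlex and count errors per namespace, target, and message, which makes it easier to see which Kubernetes Service Traefik is failing to route to; here that is dask-767979700711489ebed89b627848c82c on port 8786 in namespace jhub.

```python
import shlex
from collections import Counter


def parse_logfmt(line: str) -> dict:
    """Parse one logfmt line into a dict; shlex handles the quoted values."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # skip tokens that are not key=value
            fields[key] = value
    return fields


def count_errors(lines):
    """Count error lines per (namespace, serviceName or ingress, msg)."""
    counts = Counter()
    for line in lines:
        fields = parse_logfmt(line)
        if fields.get("level") == "error":
            target = fields.get("serviceName") or fields.get("ingress")
            counts[(fields.get("namespace"), target, fields.get("msg"))] += 1
    return counts
```

Grouping by service also makes it easy to compare the ID in the log with the cluster name in cluster.scheduler_address. In the excerpt above they differ (767979700711... versus 72633e218e...), so these particular errors may concern a different cluster than the one that failed in the traceback.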