Deploy Dask Gateway on Kubernetes as a JupyterHub service

Hi,
I manage Kubernetes deployments on OpenStack (XSEDE Jetstream), where we do not have load balancers.
In 2020 I managed to set up Dask Gateway 0.8.0 exposed as a JupyterHub service and it worked fine. I wrote a tutorial about it for fellow XSEDE users:

https://zonca.dev/2020/08/dask-gateway-jupyterhub.html

Now I am trying to update the tutorial to the latest JupyterHub and to Dask Gateway 0.9.0, but I cannot get it to work.

In the JupyterHub configuration I have:

hub:
  services:
    dask-gateway:
      # This makes the gateway available at ${HUB_URL}/services/dask-gateway
      url: http://traefik-dask-gateway

In fact if I check:

https://js-xxx-xxx.jetstream-cloud.org/services/dask-gateway/api/health I get {status:pass}.

I access the gateway with:

gateway = Gateway(
    address="http://traefik-dask-gateway/services/dask-gateway/",
    public_address="https://js-xxx-xxx.jetstream-cloud.org/services/dask-gateway/",
    auth="jupyterhub")

I can:

  • create a new cluster
  • scale it up
  • access the dashboard

But it fails when I try to get the client:

>>> cluster.get_client()

---------------------------------------------------------------------------
error                                     Traceback (most recent call last)
/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/comm/core.py in connect(addr, timeout, deserialize, handshake_overrides, **connection_args)
    319         # write, handshake = await asyncio.gather(comm.write(local_info), comm.read())
--> 320         handshake = await asyncio.wait_for(comm.read(), time_left())
    321         await asyncio.wait_for(comm.write(local_info), time_left())

/srv/conda/envs/notebook/lib/python3.8/asyncio/tasks.py in wait_for(fut, timeout, loop)
    493         if fut.done():
--> 494             return fut.result()
    495         else:

/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/comm/tcp.py in read(self, deserializers)
    215             try:
--> 216                 frames = unpack_frames(frames)
    217 

/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/protocol/utils.py in unpack_frames(b)
     69 
---> 70     (n_frames,) = struct.unpack_from(fmt, b)
     71     lengths = struct.unpack_from(f"{n_frames}{fmt}", b, fmt_size)

error: unpack_from requires a buffer of at least 8 bytes for unpacking 8 bytes at offset 0 (actual buffer size is 2)

The above exception was the direct cause of the following exception:

OSError                                   Traceback (most recent call last)
<ipython-input-12-affca45186d3> in <module>
----> 1 client = cluster.get_client()

/srv/conda/envs/notebook/lib/python3.8/site-packages/dask_gateway/client.py in get_client(self, set_as_default)
   1076         client : dask.distributed.Client
   1077         """
-> 1078         client = Client(
   1079             self,
   1080             security=self.security,

/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/client.py in __init__(self, address, loop, timeout, set_as_default, scheduler_file, security, asynchronous, name, heartbeat_interval, serializers, deserializers, extensions, direct_to_workers, connection_limit, **kwargs)
    752             ext(self)
    753 
--> 754         self.start(timeout=timeout)
    755         Client._instances.add(self)
    756 

/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/client.py in start(self, **kwargs)
    965             self._started = asyncio.ensure_future(self._start(**kwargs))
    966         else:
--> 967             sync(self.loop, self._start, **kwargs)
    968 
    969     def __await__(self):

/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
    352     if error[0]:
    353         typ, exc, tb = error[0]
--> 354         raise exc.with_traceback(tb)
    355     else:
    356         return result[0]

/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/utils.py in f()
    335             if callback_timeout is not None:
    336                 future = asyncio.wait_for(future, callback_timeout)
--> 337             result[0] = yield future
    338         except Exception as exc:
    339             error[0] = sys.exc_info()

/srv/conda/envs/notebook/lib/python3.8/site-packages/tornado/gen.py in run(self)
    760 
    761                     try:
--> 762                         value = future.result()
    763                     except Exception:
    764                         exc_info = sys.exc_info()

/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/client.py in _start(self, timeout, **kwargs)
   1055 
   1056         try:
-> 1057             await self._ensure_connected(timeout=timeout)
   1058         except (OSError, ImportError):
   1059             await self._close()

/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/client.py in _ensure_connected(self, timeout)
   1112 
   1113         try:
-> 1114             comm = await connect(
   1115                 self.scheduler.address, timeout=timeout, **self.connection_args
   1116             )

/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/comm/core.py in connect(addr, timeout, deserialize, handshake_overrides, **connection_args)
    323         with suppress(Exception):
    324             await comm.close()
--> 325         raise IOError(
    326             f"Timed out during handshake while connecting to {addr} after {timeout} s"
    327         ) from exc

OSError: Timed out during handshake while connecting to gateway://traefik-dask-gateway:80/jhub.72633e218e6a42d2830183f9535efc10 after 10 s

cluster.scheduler_address is 'gateway://traefik-dask-gateway:80/jhub.72633e218e6a42d2830183f9535efc10'
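
To make that address easier to read, here is a minimal sketch that splits it into its parts with the standard library. The helper name `split_scheduler_address` is hypothetical and not part of dask-gateway; the only input is the address string quoted above.

```python
from urllib.parse import urlsplit


def split_scheduler_address(address):
    """Split a gateway:// scheduler address into scheme, host, port and cluster name."""
    parts = urlsplit(address)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,
        # The path component carries the cluster name, e.g. "jhub.<id>".
        "cluster": parts.path.lstrip("/"),
    }


address = "gateway://traefik-dask-gateway:80/jhub.72633e218e6a42d2830183f9535efc10"
for key, value in split_scheduler_address(address).items():
    print(f"{key}: {value}")
```

This shows that the client is dialing the in-cluster service name `traefik-dask-gateway` on port 80, not the public hostname.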

Maybe I need to provide a proxy_address to the Gateway class? Or is there a better way to achieve this? Suggestions on how to better understand the issue are also much appreciated.
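
As one way to read the first error in the traceback, the sketch below reproduces it outside of Dask. It assumes the frame-count header is a single unsigned 64-bit integer (format `"Q"`), which is consistent with the "8 bytes" in the message; the helper names are hypothetical. It only shows that the client received 2 bytes where it expected at least an 8-byte header, not why the other side sent so little.

```python
import struct

# Assumption: the header is one unsigned 64-bit integer ("Q"), matching the
# "8 bytes" mentioned in the error message.
HEADER_FMT = "Q"
HEADER_SIZE = struct.calcsize(HEADER_FMT)


def read_frame_count(payload: bytes) -> int:
    """Return the frame count stored at the start of the payload."""
    (n_frames,) = struct.unpack_from(HEADER_FMT, payload)
    return n_frames


def describe_payload(payload: bytes) -> str:
    """Report either the frame count or how short the payload was."""
    try:
        return f"ok: {read_frame_count(payload)} frame(s)"
    except struct.error as exc:
        return f"short read ({len(payload)} of {HEADER_SIZE} bytes): {exc}"


print(describe_payload(struct.pack(HEADER_FMT, 3)))  # a well-formed header
print(describe_payload(b"\x00\x00"))                 # a 2-byte reply, as in the traceback
```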

There are also some errors in the Traefik logs, not sure if related:

time="2022-01-10T05:55:08Z" level=error msg="Cannot create service: subset not found" ingress=dask-767979700711489ebed89b627848c82c servicePort=8786 providerName=kubernetescrd serviceName=dask-767979700711489ebed89b627848c82c namespace=jhub
time="2022-01-10T05:55:10Z" level=error msg="subset not found for jhub/dask-767979700711489ebed89b627848c82c" providerName=kubernetescrd namespace=jhub ingress=dask-767979700711489ebed89b627848c82c
time="2022-01-10T05:55:10Z" level=error msg="Cannot create service: subset not found" providerName=kubernetescrd servicePort=8786 ingress=dask-767979700711489ebed89b627848c82c namespace=jhub serviceName=dask-767979700711489ebed89b627848c82c
time="2022-01-10T05:55:12Z" level=error msg="subset not found for jhub/dask-767979700711489ebed89b627848c82c" providerName=kubernetescrd ingress=dask-767979700711489ebed89b627848c82c namespace=jhub
time="2022-01-10T05:55:12Z" level=error msg="Cannot create service: subset not found" servicePort=8786 providerName=kubernetescrd ingress=dask-767979700711489ebed89b627848c82c namespace=jhub serviceName=dask-767979700711489ebed89b627848c82c
time="2022-01-10T05:55:14Z" level=error msg="subset not found for jhub/dask-767979700711489ebed89b627848c82c" namespace=jhub providerName=kubernetescrd ingress=dask-767979700711489ebed89b627848c82c
time="2022-01-10T05:55:14Z" level=error msg="Cannot create service: subset not found" servicePort=8786 ingress=dask-767979700711489ebed89b627848c82c namespace=jhub providerName=kubernetescrd serviceName=dask-767979700711489ebed89b627848c82c
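
The log lines above repeat the same two messages for a single ingress. A small sketch for summarizing such logs, assuming they are in the `key=value` format shown (the function names `parse_logfmt` and `count_errors` are hypothetical):

```python
import shlex
from collections import Counter


def parse_logfmt(line):
    """Parse one key=value log line into a dict, honoring double-quoted values."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def count_errors(lines):
    """Count error-level messages per (namespace, ingress, msg)."""
    counts = Counter()
    for line in lines:
        fields = parse_logfmt(line)
        if fields.get("level") == "error":
            key = (fields.get("namespace"), fields.get("ingress"), fields.get("msg"))
            counts[key] += 1
    return counts


# Two lines copied from the Traefik output above.
SAMPLE = [
    'time="2022-01-10T05:55:08Z" level=error msg="Cannot create service: subset not found" ingress=dask-767979700711489ebed89b627848c82c servicePort=8786 providerName=kubernetescrd serviceName=dask-767979700711489ebed89b627848c82c namespace=jhub',
    'time="2022-01-10T05:55:10Z" level=error msg="subset not found for jhub/dask-767979700711489ebed89b627848c82c" providerName=kubernetescrd namespace=jhub ingress=dask-767979700711489ebed89b627848c82c',
]

for (namespace, ingress, msg), n in count_errors(SAMPLE).items():
    print(f"{n}x {namespace}/{ingress}: {msg}")
```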

Hi @zonca and welcome! I wonder if @guillaumeeb, you might be able to help with this?

Hm, this is too deep inside dask-gateway or your site-specific Kubernetes configuration for me to help here… I can only suggest a few things (since Jim Crist isn’t on this forum):

  • Check the versions of Dask in your notebook/client and make sure they are compatible with dask-gateway 0.9.
  • Open this issue directly on the dask-gateway GitHub repository.
  • Check the changelog, or the code differences directly, between dask-gateway 0.8 and 0.9 to identify the major changes. But this could also come from JupyterHub or your Kubernetes instance…

cc @jacobtomlinson who knows much more than me on Kubernetes.

Thanks,
I am using the Pangeo image that daskhub uses for the singleuser servers (pangeo/base-notebook:2021.06.05, taken from values.yaml on the main branch of the dask/helm-chart GitHub repository), so the versions should be fine:

>>> dask_gateway.__version__
'0.9.0'
>>> dask.__version__
'2021.06.0'
>>> distributed.__version__
'2021.06.0'

Hi @zonca, the issue is related to the Dask version in the singleuser server. You need to downgrade it, probably to the same version found in the gateway server image.

Do you happen to have k9s installed and have access to the cluster?

I’ve also recently deployed JupyterHub and Dask Gateway using the most recent DaskHub Chart with Helm and bumped into the very same problem. Be aware that hopefully there will soon be a new release of dask-gateway as discussed here.

This should do the trick:

pip install dask==2.30.0 distributed==2.30.0

If you install these versions directly from a Notebook with !pip ..., don’t forget to restart the kernel!
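
After restarting the kernel, a quick check like the following sketch can confirm that the pinned versions are the ones actually installed. The pins are the ones from the `pip install` line above; the helper names and the `lookup` parameter are hypothetical and exist only for illustration.

```python
from importlib import metadata

# Versions taken from the pip command above.
PINS = {"dask": "2.30.0", "distributed": "2.30.0"}


def installed_version(package):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Return {package: (found, wanted)} for every package that differs from its pin."""
    mismatches = {}
    for package, wanted in pins.items():
        found = lookup(package)
        if found != wanted:
            mismatches[package] = (found, wanted)
    return mismatches


for package, (found, wanted) in find_mismatches(PINS).items():
    print(f"{package}: installed {found}, expected {wanted}")
```

No output means every package matches its pin.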

That fixed it, thanks @filippo82 !

Glad to hear it solved your issue 💪