Dask Gateway worker pods always get CrashLoopBackOff status

Hello, dear colleagues!
We have Dask Gateway 2023.9.0 installed on a Kubernetes cluster (EKS) with IPv6. When I tried to create a cluster, all worker pods went into CrashLoopBackOff status, and I saw output like this in the logs:

/home/dask/.local/lib/python3.11/site-packages/distributed/cli/dask_worker.py:266: FutureWarning: dask-worker is deprecated and will be removed in a future release; use `dask worker` instead
  warnings.warn(
/home/dask/.local/lib/python3.11/site-packages/distributed/utils.py:165: RuntimeWarning: Couldn't detect a suitable IP address for reaching 'dask-2884f65ecbc44103ac47e7c620232833.dask', defaulting to hostname: [Errno -5] No address associated with hostname
  warnings.warn(
2023-09-26 08:07:32,383 - distributed.dask_worker - INFO - End worker
Traceback (most recent call last):
  File "/home/dask/.local/lib/python3.11/site-packages/toolz/functoolz.py", line 457, in memof
    return cache[k]
           ~~~~~^^^
KeyError: ('dask-2884f65ecbc44103ac47e7c620232833.dask', 80)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/dask/.local/lib/python3.11/site-packages/distributed/utils.py", line 161, in _get_ip
    sock.connect((host, port))
socket.gaierror: [Errno -5] No address associated with hostname

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/dask/.local/bin/dask-worker", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/dask/.local/lib/python3.11/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dask/.local/lib/python3.11/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/home/dask/.local/lib/python3.11/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dask/.local/lib/python3.11/site-packages/distributed/cli/dask_worker.py", line 447, in main
    asyncio.run(run())
  File "/usr/local/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/home/dask/.local/lib/python3.11/site-packages/distributed/cli/dask_worker.py", line 397, in run
    nannies = [
              ^
  File "/home/dask/.local/lib/python3.11/site-packages/distributed/cli/dask_worker.py", line 398, in <listcomp>
    t(
  File "/home/dask/.local/lib/python3.11/site-packages/distributed/nanny.py", line 281, in __init__
    host = get_ip(get_address_host(self.scheduler.address))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dask/.local/lib/python3.11/site-packages/distributed/utils.py", line 185, in get_ip
    return _get_ip(host, port, family=socket.AF_INET)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dask/.local/lib/python3.11/site-packages/toolz/functoolz.py", line 461, in memof
    cache[k] = result = func(*args, **kwargs)
                        ^^^^^^^^^^^^^^^^^^^^^
  File "/home/dask/.local/lib/python3.11/site-packages/distributed/utils.py", line 170, in _get_ip
    addr_info = socket.getaddrinfo(
                ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/socket.py", line 962, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
socket.gaierror: [Errno -2] Name or service not known

The Dask scheduler logs:

2023-09-27 09:07:26,912 - distributed.scheduler - INFO - State start
2023-09-27 09:07:26,915 - distributed.scheduler - INFO - -----------------------------------------------
2023-09-27 09:07:26,917 - distributed.scheduler - INFO -   Scheduler at: tls://169.254.175.125:8786
2023-09-27 09:07:26,917 - distributed.scheduler - INFO -   dashboard at:  http://169.254.175.125:8787/status
2023-09-27 09:07:26,917 - distributed.preloading - INFO - Run preload setup: dask_gateway.scheduler_preload

I’m not sure, but it seems that the scheduler is not listening on IPv6, so the workers can’t connect to it. If I’m right, how can I configure the Dask Gateway Helm chart to fix it?
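
One way to sanity-check this reading of the logs is to classify the addresses involved with Python's standard ipaddress module. This is only a minimal sketch: the helper name describe_address is made up for illustration, the first literal is copied from the scheduler log above, and the second is the address the scheduler service resolves to in the ping output later in this thread.

```python
import ipaddress


def describe_address(text):
    """Return (IP version, is_link_local) for an address literal."""
    addr = ipaddress.ip_address(text)
    return addr.version, addr.is_link_local


# Address the scheduler reported in its log.
print(describe_address("169.254.175.125"))
# Address the scheduler service name resolves to over IPv6.
print(describe_address("2600:1f16:30d:8204:f436::6"))
```

If the scheduler's address comes back as version 4 and link-local, the scheduler has advertised an IPv4 address that other pods on an IPv6 network most likely cannot route to, which would be consistent with the workers failing to reach it.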

Hi @nbarinov, welcome to the Dask Discourse forum!

So you want to change the scheduler's default listening interface. I browsed the dask-gateway Helm chart configuration options a bit, and I think your best chance is to modify scheduler_cmd in KubeClusterConfig through extraConfig.

I’m not sure if this is the best way, and maybe you’ll have to configure things on the Kubernetes or Docker side too…

cc @jacobtomlinson.

Thank you!
But in the description I see
c.ClusterConfig.scheduler_cmd = Command()
No help string is provided.
How can I define this option correctly? Are there any similar examples that could help me?

The default for this configuration option is dask-scheduler: https://github.com/dask/dask-gateway/blob/9fda6da1fd5d117038d876f2ed6890a99b0e813e/dask-gateway-server/dask_gateway_server/backends/base.py#L243.

We can see it is used here, where a default_host is set to 0.0.0.0. I’m not familiar with IPv6, but this does look like listening on all IPv4 interfaces. There doesn’t seem to be a way to change this default host, but you could try setting scheduler_cmd to dask-scheduler --host "::", or maybe an empty string?

Thank you! I will try to use your advice.

@guillaumeeb We’ve tried to apply these config changes in the Helm values.yaml file:

  extraConfig: #{}
    ClusterConfig: | # options for IPv6
      c.ClusterConfig.scheduler_cmd = ["dask-scheduler", "--host", "[::]"]

After that, the cluster did not start :sweat_smile:
Do you have any ideas on how to write this config properly?

I’m really not sure. Do you have a stack trace?

cc @jacobtomlinson.

The only error was a message in the client console saying that the cluster could not start.
By the way, I’ve checked my Dask Gateway config on EKS without IPv6 and did not get this issue. I would be happy if we could start the scheduler pod with IPv6.
I appreciate your help!

I’ve investigated further and found some new information. The function that raises the exception is documented here: distributed.utils — Dask.distributed 2023.10.0+15.gb4eee3f documentation

def _get_ip(host, port, family):
    # By using a UDP socket, we don't actually try to connect but
    # simply select the local address through which *host* is reachable.
    sock = socket.socket(family, socket.SOCK_DGRAM)
    try:
        sock.connect((host, port))
        ip = sock.getsockname()[0]
        return ip
    except OSError as e:
        warnings.warn(
            "Couldn't detect a suitable IP address for "
            "reaching %r, defaulting to hostname: %s" % (host, e),
            RuntimeWarning,
        )
        addr_info = socket.getaddrinfo(
            socket.gethostname(), port, family, socket.SOCK_DGRAM, socket.IPPROTO_UDP
        )[0]
        return addr_info[4][0]
    finally:
        sock.close()
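
To see how the address family affects this UDP-connect trick, here is a minimal standalone sketch of the same idea with the family passed in explicitly. The name local_ip_for is hypothetical and not part of distributed; it only mirrors the first branch of the function above, without the hostname fallback.

```python
import socket


def local_ip_for(host, port, family):
    """Return the local address the OS would use to reach host:port.

    Connecting a UDP socket sends no packets; it only asks the kernel
    to pick a route and a source address for the given family.
    """
    with socket.socket(family, socket.SOCK_DGRAM) as sock:
        sock.connect((host, port))
        return sock.getsockname()[0]


# Loopback works on any machine and shows the mechanism.
print(local_ip_for("127.0.0.1", 80, socket.AF_INET))
```

The worker traceback above shows get_ip calling _get_ip with family=socket.AF_INET. Running this sketch from a worker pod with the scheduler service name and socket.AF_INET6 instead should show whether the lookup succeeds once the family matches the cluster network.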

I’ve created the cluster again and used a debugging container to check access to the address that appeared in the worker’s error logs. When I pinged the address over IPv6, I did not see an error:

root@swiss-army-knife:~# ping dask-10f13cec40f94f92abc133c7991c9260
PING dask-10f13cec40f94f92abc133c7991c9260.dask(2600-1f16-30d-8204-f436--6.dask-10f13cec40f94f92abc133c7991c9260.dask.svc.cluster.local (2600:1f16:30d:8204:f436::6)) 56 data bytes
64 bytes from 2600-1f16-30d-8204-f436--6.dask-10f13cec40f94f92abc133c7991c9260.dask.svc.cluster.local (2600:1f16:30d:8204:f436::6): icmp_seq=1 ttl=62 time=0.708 ms
64 bytes from 2600-1f16-30d-8204-f436--6.dask-10f13cec40f94f92abc133c7991c9260.dask.svc.cluster.local (2600:1f16:30d:8204:f436::6): icmp_seq=2 ttl=62 time=0.713 ms
64 bytes from 2600-1f16-30d-8204-f436--6.dask-10f13cec40f94f92abc133c7991c9260.dask.svc.cluster.local (2600:1f16:30d:8204:f436::6): icmp_seq=3 ttl=62 time=0.731 ms

But when I tried the same with IPv4, I got an error:

root@swiss-army-knife:~# ping -4 dask-10f13cec40f94f92abc133c7991c9260
ping: dask-10f13cec40f94f92abc133c7991c9260: No address associated with hostname
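
The same comparison can be made from Python with socket.getaddrinfo, which is the call that fails at the bottom of the worker traceback. This is a rough sketch: resolve_by_family is a made-up helper name, the default port 8786 is taken from the scheduler log, and the loopback literal at the end only demonstrates the output shape.

```python
import socket


def resolve_by_family(host, port=8786):
    """Map 'IPv4'/'IPv6' to the sorted addresses found, or to the error text."""
    results = {}
    for label, family in (("IPv4", socket.AF_INET), ("IPv6", socket.AF_INET6)):
        try:
            infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
            results[label] = sorted({info[4][0] for info in infos})
        except socket.gaierror as exc:
            # Keep the resolver's message so both families can be compared.
            results[label] = f"error: {exc}"
    return results


print(resolve_by_family("127.0.0.1"))
```

Run inside the debugging container with the scheduler service name in place of the loopback literal, this would be expected to mirror the two ping results: addresses under "IPv6" and an error string under "IPv4".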

This definitely looks like some problem with setting up IPv6 on the Dask side, but I really don’t know how to help. I’m hoping others will chime in.
