Dask gateway workers pods always get CrashLoopBackOff status

nbarinov · September 27, 2023, 12:09pm

Hello, dear colleagues!
We have Dask Gateway 2023.9.0 installed on a Kubernetes cluster (EKS) with IPv6. When I tried to create a cluster, all worker pods got the status CrashLoopBackOff, and in the logs I saw text like this:

/home/dask/.local/lib/python3.11/site-packages/distributed/cli/dask_worker.py:266: FutureWarning: dask-worker is deprecated and will be removed in a future release; use `dask worker` instead
  warnings.warn(
/home/dask/.local/lib/python3.11/site-packages/distributed/utils.py:165: RuntimeWarning: Couldn't detect a suitable IP address for reaching 'dask-2884f65ecbc44103ac47e7c620232833.dask', defaulting to hostname: [Errno -5] No address associated with hostname
  warnings.warn(
2023-09-26 08:07:32,383 - distributed.dask_worker - INFO - End worker
Traceback (most recent call last):
  File "/home/dask/.local/lib/python3.11/site-packages/toolz/functoolz.py", line 457, in memof
    return cache[k]
           ~~~~~^^^
KeyError: ('dask-2884f65ecbc44103ac47e7c620232833.dask', 80)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/dask/.local/lib/python3.11/site-packages/distributed/utils.py", line 161, in _get_ip
    sock.connect((host, port))
socket.gaierror: [Errno -5] No address associated with hostname

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/dask/.local/bin/dask-worker", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/dask/.local/lib/python3.11/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dask/.local/lib/python3.11/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/home/dask/.local/lib/python3.11/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dask/.local/lib/python3.11/site-packages/distributed/cli/dask_worker.py", line 447, in main
    asyncio.run(run())
  File "/usr/local/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/home/dask/.local/lib/python3.11/site-packages/distributed/cli/dask_worker.py", line 397, in run
    nannies = [
              ^
  File "/home/dask/.local/lib/python3.11/site-packages/distributed/cli/dask_worker.py", line 398, in <listcomp>
    t(
  File "/home/dask/.local/lib/python3.11/site-packages/distributed/nanny.py", line 281, in __init__
    host = get_ip(get_address_host(self.scheduler.address))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dask/.local/lib/python3.11/site-packages/distributed/utils.py", line 185, in get_ip
    return _get_ip(host, port, family=socket.AF_INET)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dask/.local/lib/python3.11/site-packages/toolz/functoolz.py", line 461, in memof
    cache[k] = result = func(*args, **kwargs)
                        ^^^^^^^^^^^^^^^^^^^^^
  File "/home/dask/.local/lib/python3.11/site-packages/distributed/utils.py", line 170, in _get_ip
    addr_info = socket.getaddrinfo(
                ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/socket.py", line 962, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
socket.gaierror: [Errno -2] Name or service not known

The Dask scheduler logs:

2023-09-27 09:07:26,912 - distributed.scheduler - INFO - State start
2023-09-27 09:07:26,915 - distributed.scheduler - INFO - -----------------------------------------------
2023-09-27 09:07:26,917 - distributed.scheduler - INFO -   Scheduler at: tls://169.254.175.125:8786
2023-09-27 09:07:26,917 - distributed.scheduler - INFO -   dashboard at:  http://169.254.175.125:8787/status
2023-09-27 09:07:26,917 - distributed.preloading - INFO - Run preload setup: dask_gateway.scheduler_preload

I’m not sure, but it seems that the scheduler does not listen on IPv6, so the workers can’t connect to it. If I’m right, how can I configure the Dask Gateway Helm chart to fix it?

guillaumeeb · September 28, 2023, 8:22am

Hi @nbarinov, welcome to the Dask Discourse forum!

So you want to change the scheduler’s default listening interface. I browsed the dask-gateway Helm chart configuration options a bit, and I think your best chance is to modify scheduler_cmd in KubeClusterConfig through extraConfig.

I’m not sure if this is the best way, and maybe you’ll have to configure things on the Kubernetes or Docker side too…

cc @jacobtomlinson.

nbarinov · September 28, 2023, 9:21am

Thank you!
But in the description I see:
c.ClusterConfig.scheduler_cmd = Command()*
No help string is provided.
How can I define this option correctly? Are there any similar examples that could help me?

guillaumeeb · October 4, 2023, 1:37pm

The default for this configuration option is dask-scheduler: https://github.com/dask/dask-gateway/blob/9fda6da1fd5d117038d876f2ed6890a99b0e813e/dask-gateway-server/dask_gateway_server/backends/base.py#L243.

We can see it is used here, where a default_host is set to 0.0.0.0. I’m not familiar with IPv6, but this looks like listening on all interfaces in IPv4. There doesn’t seem to be a way to change this default host, but you could try setting scheduler_cmd to dask-scheduler --host "::", or maybe an empty string?
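
A small sketch of how that command line maps onto a list-style value, assuming scheduler_cmd accepts a list of argument strings. Whether dask-gateway then honours the extra --host flag is not confirmed here; this only shows the splitting.

```python
import shlex

# The shell-style command suggested above; the quotes around :: are shell quoting only.
command_line = 'dask-scheduler --host "::"'

# Split it the way a POSIX shell would, giving one list element per argument.
scheduler_cmd = shlex.split(command_line)
print(scheduler_cmd)
```

The host argument comes out as the bare string ::, without quotes or brackets.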

nbarinov · October 5, 2023, 12:57pm

Thank you! I will try to use your advice.

nbarinov · October 17, 2023, 9:14am

@guillaumeeb We’ve tried to apply these config changes in the Helm values.yaml file:

  extraConfig: #{}
    ClusterConfig: | # options for IPv6
      c.ClusterConfig.scheduler_cmd = ["dask-scheduler", "--host", "[::]"]

After that, the cluster did not start.
Do you have any ideas on how to write this config properly?

guillaumeeb · October 18, 2023, 7:38pm

I’m really not sure. Do you have a stack trace?

cc @jacobtomlinson.

nbarinov · October 20, 2023, 12:00pm

The only error was a message in the client console saying that the cluster could not start.
By the way, I’ve tested my Dask Gateway config on EKS without IPv6 and did not hit this issue. I would be happy if we could start the scheduler pod with IPv6.
I appreciate your help!

nbarinov · October 23, 2023, 2:33pm

I’ve investigated further and found some new information. I found the function that raises the exception on the distributed.utils page of the Dask.distributed 2023.10.0+15.gb4eee3f documentation:

def _get_ip(host, port, family):
    # By using a UDP socket, we don't actually try to connect but
    # simply select the local address through which *host* is reachable.
    sock = socket.socket(family, socket.SOCK_DGRAM)
    try:
        sock.connect((host, port))
        ip = sock.getsockname()[0]
        return ip
    except OSError as e:
        warnings.warn(
            "Couldn't detect a suitable IP address for "
            "reaching %r, defaulting to hostname: %s" % (host, e),
            RuntimeWarning,
        )
        addr_info = socket.getaddrinfo(
            socket.gethostname(), port, family, socket.SOCK_DGRAM, socket.IPPROTO_UDP
        )[0]
        return addr_info[4][0]
    finally:
        sock.close()
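
For illustration only, here is a variant of the same UDP-socket technique that tries IPv4 first and then falls back to IPv6. The function name local_ip_for and the families parameter are hypothetical and not part of distributed; the traceback earlier in this thread shows get_ip calling _get_ip with family=socket.AF_INET only.

```python
import socket


def local_ip_for(host, port=80, families=(socket.AF_INET, socket.AF_INET6)):
    """Return the local address used to reach *host*, trying each family in turn."""
    last_error = None
    for family in families:
        try:
            # A UDP connect sends no packets; it only selects a route and local address.
            with socket.socket(family, socket.SOCK_DGRAM) as sock:
                sock.connect((host, port))
                return sock.getsockname()[0]
        except OSError as exc:
            # socket.gaierror is a subclass of OSError, so a failed lookup
            # in this family lands here and the next family is tried.
            last_error = exc
    raise OSError(f"no usable address family for {host!r}") from last_error


if __name__ == "__main__":
    print(local_ip_for("127.0.0.1"))
```

With a hostname that only resolves over IPv6, the AF_INET attempt should fail with a gaierror and the AF_INET6 attempt would then be used.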

I created the cluster again and used a debugging container to check access to the address from the worker’s error logs. When I pinged the address over IPv6, I did not see an error:

root@swiss-army-knife:~# ping dask-10f13cec40f94f92abc133c7991c9260
PING dask-10f13cec40f94f92abc133c7991c9260.dask(2600-1f16-30d-8204-f436--6.dask-10f13cec40f94f92abc133c7991c9260.dask.svc.cluster.local (2600:1f16:30d:8204:f436::6)) 56 data bytes
64 bytes from 2600-1f16-30d-8204-f436--6.dask-10f13cec40f94f92abc133c7991c9260.dask.svc.cluster.local (2600:1f16:30d:8204:f436::6): icmp_seq=1 ttl=62 time=0.708 ms
64 bytes from 2600-1f16-30d-8204-f436--6.dask-10f13cec40f94f92abc133c7991c9260.dask.svc.cluster.local (2600:1f16:30d:8204:f436::6): icmp_seq=2 ttl=62 time=0.713 ms
64 bytes from 2600-1f16-30d-8204-f436--6.dask-10f13cec40f94f92abc133c7991c9260.dask.svc.cluster.local (2600:1f16:30d:8204:f436::6): icmp_seq=3 ttl=62 time=0.731 ms

But when I tried the same with IPv4, I got an error:

root@swiss-army-knife:~# ping -4 dask-10f13cec40f94f92abc133c7991c9260
ping: dask-10f13cec40f94f92abc133c7991c9260: No address associated with hostname
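
The same check can be reproduced from Python, which is closer to what the worker does in getaddrinfo. This is a minimal sketch, assuming it is run from a pod that can see the scheduler service; resolvable_families is a hypothetical helper, and 8786 is the scheduler port from the scheduler log.

```python
import socket


def resolvable_families(host, port=8786):
    """Return the names of the address families for which *host* resolves."""
    found = []
    for family in (socket.AF_INET, socket.AF_INET6):
        try:
            socket.getaddrinfo(host, port, family, socket.SOCK_DGRAM)
        except socket.gaierror:
            # No address of this family is associated with the hostname.
            continue
        found.append(family.name)
    return found


if __name__ == "__main__":
    # Hostname taken from the ping session above.
    print(resolvable_families("dask-10f13cec40f94f92abc133c7991c9260"))
```

If the service only has an IPv6 address, I would expect this to report AF_INET6 alone, matching the ping results.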

guillaumeeb · October 23, 2023, 6:21pm

This definitely looks like a problem with setting up IPv6 on the Dask side, but I really don’t know how to help. I’m hoping others will chime in.
