Dask-operator on EKS stuck in CrashLoopBackOff

After deploying to EKS following the basic Helm instructions here (Installing — Dask Kubernetes documentation), my dask-operator pod is stuck in a CrashLoopBackOff. It looks like kopf authenticates successfully but then can't talk to the apiserver. It feels like I'm missing something simple, like a role or binding. Wanted to ask here in case I'm doing something wrong before turning this into a bug report.

Cluster = EKS 1.25
Using latest dask/dask-kubernetes-operator
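
For reference, this is roughly what I ran (per the docs; the namespace and the auto-generated release name will vary):

helm repo add dask https://helm.dask.org
helm repo update
# install the operator chart into its own namespace with a generated release name
helm install --create-namespace -n dask-operator --generate-name dask/dask-kubernetes-operator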

Pod Log

<snip>
[2023-12-20 01:00:36,564] kopf._core.engines.a [INFO    ] Initial authentication has been initiated.
[2023-12-20 01:00:36,566] kopf.activities.auth [INFO    ] Activity 'login_via_pykube' succeeded.
[2023-12-20 01:00:36,567] kopf.activities.auth [INFO    ] Activity 'login_via_client' succeeded.
[2023-12-20 01:00:36,567] kopf._core.engines.a [INFO    ] Initial authentication has finished.
[2023-12-20 01:00:36,683] kopf._core.reactor.o [ERROR   ] Request attempt #1/9 failed; will retry: GET https://172.20.0.1:443/apis/metrics.k8s.io/v1beta1 -> APIServerError(None, None)
[2023-12-20 01:00:37,707] kopf._core.reactor.o [ERROR   ] Request attempt #2/9 failed; will retry: GET https://172.20.0.1:443/apis/metrics.k8s.io/v1beta1 -> APIServerError(None, None)
[2023-12-20 01:00:38,722] kopf._core.reactor.o [ERROR   ] Request attempt #3/9 failed; will retry: GET https://172.20.0.1:443/apis/metrics.k8s.io/v1beta1 -> APIServerError(None, None)
[2023-12-20 01:00:40,752] kopf._core.reactor.o [ERROR   ] Request attempt #4/9 failed; will retry: GET https://172.20.0.1:443/apis/metrics.k8s.io/v1beta1 -> APIServerError(None, None)
[2023-12-20 01:00:43,772] kopf._core.reactor.o [ERROR   ] Request attempt #5/9 failed; will retry: GET https://172.20.0.1:443/apis/metrics.k8s.io/v1beta1 -> APIServerError(None, None)
[2023-12-20 01:00:44,838] kopf.activities.prob [INFO    ] Activity 'now' succeeded.
[2023-12-20 01:00:44,838] aiohttp.access       [INFO    ] 10.0.1.250 [20/Dec/2023:01:00:44 +0000] "GET /healthz HTTP/1.1" 200 214 "-" "kube-probe/1.25+"
[2023-12-20 01:00:48,794] kopf._core.reactor.o [ERROR   ] Request attempt #6/9 failed; will retry: GET https://172.20.0.1:443/apis/metrics.k8s.io/v1beta1 -> APIServerError(None, None)
[2023-12-20 01:00:54,837] aiohttp.access       [INFO    ] 10.0.1.250 [20/Dec/2023:01:00:54 +0000] "GET /healthz HTTP/1.1" 200 214 "-" "kube-probe/1.25+"
[2023-12-20 01:00:56,817] kopf._core.reactor.o [ERROR   ] Request attempt #7/9 failed; will retry: GET https://172.20.0.1:443/apis/metrics.k8s.io/v1beta1 -> APIServerError(None, None)
[2023-12-20 01:01:04,838] kopf.activities.prob [INFO    ] Activity 'now' succeeded.
[2023-12-20 01:01:04,838] aiohttp.access       [INFO    ] 10.0.1.250 [20/Dec/2023:01:01:04 +0000] "GET /healthz HTTP/1.1" 200 214 "-" "kube-probe/1.25+"
[2023-12-20 01:01:09,845] kopf._core.reactor.o [ERROR   ] Request attempt #8/9 failed; will retry: GET https://172.20.0.1:443/apis/metrics.k8s.io/v1beta1 -> APIServerError(None, None)
[2023-12-20 01:01:14,837] aiohttp.access       [INFO    ] 10.0.1.250 [20/Dec/2023:01:01:14 +0000] "GET /healthz HTTP/1.1" 200 214 "-" "kube-probe/1.25+"
[2023-12-20 01:01:24,837] kopf.activities.prob [INFO    ] Activity 'now' succeeded.
[2023-12-20 01:01:24,837] aiohttp.access       [INFO    ] 10.0.1.250 [20/Dec/2023:01:01:24 +0000] "GET /healthz HTTP/1.1" 200 214 "-" "kube-probe/1.25+"
[2023-12-20 01:01:30,878] kopf._core.reactor.o [ERROR   ] Request attempt #9/9 failed; escalating: GET https://172.20.0.1:443/apis/metrics.k8s.io/v1beta1 -> APIServerError(None, None)
[2023-12-20 01:01:30,879] kopf._core.reactor.r [ERROR   ] Resource observer has failed: (None, None)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/kopf/_cogs/clients/errors.py", line 148, in check_response
    response.raise_for_status()
  File "/usr/local/lib/python3.10/site-packages/aiohttp/client_reqrep.py", line 1011, in raise_for_status
    raise ClientResponseError(
aiohttp.client_exceptions.ClientResponseError: 500, message='Internal Server Error', url=URL('https://172.20.0.1:443/apis/metrics.k8s.io/v1beta1')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/kopf/_cogs/aiokits/aiotasks.py", line 108, in guard
    await coro
  File "/usr/local/lib/python3.10/site-packages/kopf/_core/reactor/observation.py", line 113, in resource_observer
    resources = await scanning.scan_resources(groups=group_filter, settings=settings, logger=logger)
  File "/usr/local/lib/python3.10/site-packages/kopf/_cogs/clients/scanning.py", line 31, in scan_resources
    resources.update(await coro)
  File "/usr/local/lib/python3.10/asyncio/tasks.py", line 571, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/usr/local/lib/python3.10/site-packages/kopf/_cogs/clients/scanning.py", line 83, in _read_new_apis
    resources.update(await coro)
  File "/usr/local/lib/python3.10/asyncio/tasks.py", line 571, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/usr/local/lib/python3.10/site-packages/kopf/_cogs/clients/scanning.py", line 97, in _read_version
    rsp = await api.get(url, settings=settings, logger=logger)
  File "/usr/local/lib/python3.10/site-packages/kopf/_cogs/clients/api.py", line 111, in get
    response = await request(
  File "/usr/local/lib/python3.10/site-packages/kopf/_cogs/clients/auth.py", line 45, in wrapper
    return await fn(*args, **kwargs, context=context)
  File "/usr/local/lib/python3.10/site-packages/kopf/_cogs/clients/api.py", line 85, in request
    await errors.check_response(response)  # but do not parse it!
  File "/usr/local/lib/python3.10/site-packages/kopf/_cogs/clients/errors.py", line 150, in check_response
    raise cls(payload, status=response.status) from e
kopf._cogs.clients.errors.APIServerError: (None, None)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/kopf/_cogs/clients/errors.py", line 148, in check_response
    response.raise_for_status()
  File "/usr/local/lib/python3.10/site-packages/aiohttp/client_reqrep.py", line 1011, in raise_for_status
    raise ClientResponseError(
aiohttp.client_exceptions.ClientResponseError: 500, message='Internal Server Error', url=URL('https://172.20.0.1:443/apis/metrics.k8s.io/v1beta1')
<snip>

And here is a matching audit log entry from CloudWatch for the API server. Note that the authorization decision is "allow" but the response code is 500.

{
    "kind": "Event",
    "apiVersion": "audit.k8s.io/v1",
    "level": "Metadata",
    "auditID": "1a64f0b8-0f75-4993-8f4e-f2c38deb1820",
    "stage": "ResponseComplete",
    "requestURI": "/apis/metrics.k8s.io/v1beta1",
    "verb": "get",
    "user": {
        "username": "system:serviceaccount:dask-operator:dask-kubernetes-operator-1703103335",
        "uid": "eeff8d29-413b-4588-b5e7-a21857b8f3e5",
        "groups": [
            "system:serviceaccounts",
            "system:serviceaccounts:dask-operator",
            "system:authenticated"
        ],
        "extra": {
            "authentication.kubernetes.io/pod-name": [
                "dask-kubernetes-operator-1703103335-54fbd9988d-v8qbj"
            ],
            "authentication.kubernetes.io/pod-uid": [
                "d148d216-0305-4512-870b-98d4d6596489"
            ]
        }
    },
    "sourceIPs": [
        "10.0.1.75"
    ],
    "userAgent": "kopf/1.36.2",
    "responseStatus": {
        "metadata": {},
        "code": 500
    },
    "requestReceivedTimestamp": "2023-12-21T00:12:01.865140Z",
    "stageTimestamp": "2023-12-21T00:12:01.895668Z",
    "annotations": {
        "authorization.k8s.io/decision": "allow",
        "authorization.k8s.io/reason": "RBAC: allowed by ClusterRoleBinding \"system:discovery\" of ClusterRole \"system:discovery\" to Group \"system:authenticated\""
    }
}

Hi @bespin, welcome to the Dask community!

Well, if you just ran the helm command from the documentation page, I don't think anything is missing on the Dask side. How did you deploy the EKS cluster? Do you have any security rules in place? Can you see the API server logs?
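
If control-plane logging isn't turned on yet, you can send the API server and audit logs to CloudWatch with something like this (cluster name and region here are placeholders):

aws eks update-cluster-config \
  --region us-east-1 \
  --name my-cluster \
  --logging '{"clusterLogging":[{"types":["api","audit"],"enabled":true}]}'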

cc @jacobtomlinson

It looks to me like your EKS cluster is not healthy. You’re getting internal server errors from the Kubernetes metrics service.
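
A quick way to confirm that is to check the metrics APIService registration and the metrics-server deployment itself, e.g. something like this (assuming metrics-server runs in kube-system, which is the usual default):

kubectl get apiservice v1beta1.metrics.k8s.io        # AVAILABLE should be True
kubectl -n kube-system get deploy metrics-server
kubectl -n kube-system logs deploy/metrics-server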

It turns out we had a very old version of metrics-server running that wasn't compatible with our Kubernetes version. Upgrading metrics-server from 0.3.7 to 0.6.4 resolved the issue.
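
If it helps anyone else, one way to apply that upgrade (details will vary depending on whether metrics-server is installed from the upstream manifest, via Helm, or as an EKS add-on) is:

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/download/v0.6.4/components.yaml
kubectl -n kube-system rollout status deploy/metrics-server

After that, the discovery calls to /apis/metrics.k8s.io/v1beta1 stopped failing and the operator pod started cleanly.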
