After deploying to EKS, following the basic helm instructions here (Installing — Dask Kubernetes documentation), my dask-operator pod is stuck in a CrashLoopBackOff. It looks like kopf authenticates successfully but then can't talk to the apiserver. It feels like I'm missing something simple, like a role or binding. Wanted to ask here in case I'm doing something wrong, before turning this into a bug report.
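For reference, the install command I ran was essentially the quickstart one from the docs; I'm reconstructing it from memory, so treat the exact flags as approximate:

```sh
# Install the operator chart from the Dask helm repo into its own namespace,
# per the docs' quickstart (--generate-name produces the timestamped release
# name seen in the service account below).
helm install \
  --repo https://helm.dask.org \
  --create-namespace \
  --namespace dask-operator \
  --generate-name \
  dask-kubernetes-operator
```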
Cluster = EKS 1.25
Using latest dask/dask-kubernetes-operator
Pod Log
<snip>
[2023-12-20 01:00:36,564] kopf._core.engines.a [INFO ] Initial authentication has been initiated.
[2023-12-20 01:00:36,566] kopf.activities.auth [INFO ] Activity 'login_via_pykube' succeeded.
[2023-12-20 01:00:36,567] kopf.activities.auth [INFO ] Activity 'login_via_client' succeeded.
[2023-12-20 01:00:36,567] kopf._core.engines.a [INFO ] Initial authentication has finished.
[2023-12-20 01:00:36,683] kopf._core.reactor.o [ERROR ] Request attempt #1/9 failed; will retry: GET https://172.20.0.1:443/apis/metrics.k8s.io/v1beta1 -> APIServerError(None, None)
[2023-12-20 01:00:37,707] kopf._core.reactor.o [ERROR ] Request attempt #2/9 failed; will retry: GET https://172.20.0.1:443/apis/metrics.k8s.io/v1beta1 -> APIServerError(None, None)
[2023-12-20 01:00:38,722] kopf._core.reactor.o [ERROR ] Request attempt #3/9 failed; will retry: GET https://172.20.0.1:443/apis/metrics.k8s.io/v1beta1 -> APIServerError(None, None)
[2023-12-20 01:00:40,752] kopf._core.reactor.o [ERROR ] Request attempt #4/9 failed; will retry: GET https://172.20.0.1:443/apis/metrics.k8s.io/v1beta1 -> APIServerError(None, None)
[2023-12-20 01:00:43,772] kopf._core.reactor.o [ERROR ] Request attempt #5/9 failed; will retry: GET https://172.20.0.1:443/apis/metrics.k8s.io/v1beta1 -> APIServerError(None, None)
[2023-12-20 01:00:44,838] kopf.activities.prob [INFO ] Activity 'now' succeeded.
[2023-12-20 01:00:44,838] aiohttp.access [INFO ] 10.0.1.250 [20/Dec/2023:01:00:44 +0000] "GET /healthz HTTP/1.1" 200 214 "-" "kube-probe/1.25+"
[2023-12-20 01:00:48,794] kopf._core.reactor.o [ERROR ] Request attempt #6/9 failed; will retry: GET https://172.20.0.1:443/apis/metrics.k8s.io/v1beta1 -> APIServerError(None, None)
[2023-12-20 01:00:54,837] aiohttp.access [INFO ] 10.0.1.250 [20/Dec/2023:01:00:54 +0000] "GET /healthz HTTP/1.1" 200 214 "-" "kube-probe/1.25+"
[2023-12-20 01:00:56,817] kopf._core.reactor.o [ERROR ] Request attempt #7/9 failed; will retry: GET https://172.20.0.1:443/apis/metrics.k8s.io/v1beta1 -> APIServerError(None, None)
[2023-12-20 01:01:04,838] kopf.activities.prob [INFO ] Activity 'now' succeeded.
[2023-12-20 01:01:04,838] aiohttp.access [INFO ] 10.0.1.250 [20/Dec/2023:01:01:04 +0000] "GET /healthz HTTP/1.1" 200 214 "-" "kube-probe/1.25+"
[2023-12-20 01:01:09,845] kopf._core.reactor.o [ERROR ] Request attempt #8/9 failed; will retry: GET https://172.20.0.1:443/apis/metrics.k8s.io/v1beta1 -> APIServerError(None, None)
[2023-12-20 01:01:14,837] aiohttp.access [INFO ] 10.0.1.250 [20/Dec/2023:01:01:14 +0000] "GET /healthz HTTP/1.1" 200 214 "-" "kube-probe/1.25+"
[2023-12-20 01:01:24,837] kopf.activities.prob [INFO ] Activity 'now' succeeded.
[2023-12-20 01:01:24,837] aiohttp.access [INFO ] 10.0.1.250 [20/Dec/2023:01:01:24 +0000] "GET /healthz HTTP/1.1" 200 214 "-" "kube-probe/1.25+"
[2023-12-20 01:01:30,878] kopf._core.reactor.o [ERROR ] Request attempt #9/9 failed; escalating: GET https://172.20.0.1:443/apis/metrics.k8s.io/v1beta1 -> APIServerError(None, None)
[2023-12-20 01:01:30,879] kopf._core.reactor.r [ERROR ] Resource observer has failed: (None, None)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/kopf/_cogs/clients/errors.py", line 148, in check_response
    response.raise_for_status()
  File "/usr/local/lib/python3.10/site-packages/aiohttp/client_reqrep.py", line 1011, in raise_for_status
    raise ClientResponseError(
aiohttp.client_exceptions.ClientResponseError: 500, message='Internal Server Error', url=URL('https://172.20.0.1:443/apis/metrics.k8s.io/v1beta1')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/kopf/_cogs/aiokits/aiotasks.py", line 108, in guard
    await coro
  File "/usr/local/lib/python3.10/site-packages/kopf/_core/reactor/observation.py", line 113, in resource_observer
    resources = await scanning.scan_resources(groups=group_filter, settings=settings, logger=logger)
  File "/usr/local/lib/python3.10/site-packages/kopf/_cogs/clients/scanning.py", line 31, in scan_resources
    resources.update(await coro)
  File "/usr/local/lib/python3.10/asyncio/tasks.py", line 571, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/usr/local/lib/python3.10/site-packages/kopf/_cogs/clients/scanning.py", line 83, in _read_new_apis
    resources.update(await coro)
  File "/usr/local/lib/python3.10/asyncio/tasks.py", line 571, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/usr/local/lib/python3.10/site-packages/kopf/_cogs/clients/scanning.py", line 97, in _read_version
    rsp = await api.get(url, settings=settings, logger=logger)
  File "/usr/local/lib/python3.10/site-packages/kopf/_cogs/clients/api.py", line 111, in get
    response = await request(
  File "/usr/local/lib/python3.10/site-packages/kopf/_cogs/clients/auth.py", line 45, in wrapper
    return await fn(*args, **kwargs, context=context)
  File "/usr/local/lib/python3.10/site-packages/kopf/_cogs/clients/api.py", line 85, in request
    await errors.check_response(response)  # but do not parse it!
  File "/usr/local/lib/python3.10/site-packages/kopf/_cogs/clients/errors.py", line 150, in check_response
    raise cls(payload, status=response.status) from e
kopf._cogs.clients.errors.APIServerError: (None, None)

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/kopf/_cogs/clients/errors.py", line 148, in check_response
    response.raise_for_status()
  File "/usr/local/lib/python3.10/site-packages/aiohttp/client_reqrep.py", line 1011, in raise_for_status
    raise ClientResponseError(
aiohttp.client_exceptions.ClientResponseError: 500, message='Internal Server Error', url=URL('https://172.20.0.1:443/apis/metrics.k8s.io/v1beta1')
<snip>
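All the 500s are against the aggregated metrics API, so before blaming RBAC it seems worth checking whether that API itself is healthy. Something like this should show it (assuming metrics-server is deployed with its default name and labels; adjust if yours differs):

```sh
# Check whether the aggregated metrics API is registered and Available;
# Available=False (e.g. FailedDiscoveryCheck) would explain the 500s kopf sees.
kubectl get apiservice v1beta1.metrics.k8s.io

# Confirm the metrics-server backing that APIService is actually running.
kubectl -n kube-system get pods -l k8s-app=metrics-server
```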
And here is the matching audit log entry from CloudWatch for the apiserver.
{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "level": "Metadata",
  "auditID": "1a64f0b8-0f75-4993-8f4e-f2c38deb1820",
  "stage": "ResponseComplete",
  "requestURI": "/apis/metrics.k8s.io/v1beta1",
  "verb": "get",
  "user": {
    "username": "system:serviceaccount:dask-operator:dask-kubernetes-operator-1703103335",
    "uid": "eeff8d29-413b-4588-b5e7-a21857b8f3e5",
    "groups": [
      "system:serviceaccounts",
      "system:serviceaccounts:dask-operator",
      "system:authenticated"
    ],
    "extra": {
      "authentication.kubernetes.io/pod-name": [
        "dask-kubernetes-operator-1703103335-54fbd9988d-v8qbj"
      ],
      "authentication.kubernetes.io/pod-uid": [
        "d148d216-0305-4512-870b-98d4d6596489"
      ]
    }
  },
  "sourceIPs": [
    "10.0.1.75"
  ],
  "userAgent": "kopf/1.36.2",
  "responseStatus": {
    "metadata": {},
    "code": 500
  },
  "requestReceivedTimestamp": "2023-12-21T00:12:01.865140Z",
  "stageTimestamp": "2023-12-21T00:12:01.895668Z",
  "annotations": {
    "authorization.k8s.io/decision": "allow",
    "authorization.k8s.io/reason": "RBAC: allowed by ClusterRoleBinding \"system:discovery\" of ClusterRole \"system:discovery\" to Group \"system:authenticated\""
  }
}
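Since the audit log says the request was allowed by RBAC, it doesn't look like a missing role or binding for the operator's service account. For completeness, something like this should confirm that from outside the pod (using the service account name from the log above):

```sh
# Verify the operator's service account may GET the discovery URL;
# the audit log already recorded "allow", so this should print "yes",
# which would point at the metrics API backend rather than RBAC.
kubectl auth can-i get /apis/metrics.k8s.io/v1beta1 \
  --as=system:serviceaccount:dask-operator:dask-kubernetes-operator-1703103335
```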