Dask-cuda-worker not registering with scheduler

Hello,

I am using dask_kubernetes.KubeCluster to set up temporary clusters for my Prefect v2 workload.

When I create a cluster with “normal” pods – no GPU request – karpenter creates the nodes, and they register with the dask scheduler as expected:

2022-08-16 06:10:09,970 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://192.168.98.215:45345', name: 0, status: init, memory: 0, processing: 0>
2022-08-16 06:10:09,971 - distributed.scheduler - INFO - Starting worker compute stream, tcp://192.168.98.215:45345
2022-08-16 06:10:09,971 - distributed.core - INFO - Starting established connection

When I create a cluster with a pod template that requests a GPU, however, the nodes are created but they don’t register with the scheduler. Here’s the output from the dask-root daemonset pod on the worker:

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.download.nvidia.com/licenses/NVIDIA_Deep_Learning_Container_License.pdf

[I 2022-08-16 06:14:44.574 ServerApp] dask_labextension | extension was successfully linked.
[I 2022-08-16 06:14:44.574 ServerApp] jupyter_server_proxy | extension was successfully linked.
[W 2022-08-16 06:14:44.578 LabApp] 'token' has moved from NotebookApp to ServerApp. This config will be passed to ServerApp. Be sure to update your config before our next release.
[W 2022-08-16 06:14:44.578 LabApp] 'allow_origin' has moved from NotebookApp to ServerApp. This config will be passed to ServerApp. Be sure to update your config before our next release.
[W 2022-08-16 06:14:44.578 LabApp] 'base_url' has moved from NotebookApp to ServerApp. This config will be passed to ServerApp. Be sure to update your config before our next release.
[I 2022-08-16 06:14:44.585 ServerApp] jupyterlab | extension was successfully linked.
[I 2022-08-16 06:14:44.585 ServerApp] jupyterlab_nvdashboard | extension was successfully linked.
[I 2022-08-16 06:14:44.594 ServerApp] nbclassic | extension was successfully linked.
[I 2022-08-16 06:14:44.595 ServerApp] Writing Jupyter server cookie secret to /root/.local/share/jupyter/runtime/jupyter_cookie_secret
[I 2022-08-16 06:14:44.745 ServerApp] notebook_shim | extension was successfully linked.
[W 2022-08-16 06:14:44.786 ServerApp] All authentication is disabled.  Anyone who can connect to this server will be able to run code.
[I 2022-08-16 06:14:44.787 ServerApp] notebook_shim | extension was successfully loaded.
[I 2022-08-16 06:14:44.788 ServerApp] dask_labextension | extension was successfully loaded.
[I 2022-08-16 06:14:45.197 ServerApp] jupyter_server_proxy | extension was successfully loaded.
[I 2022-08-16 06:14:45.198 LabApp] JupyterLab extension loaded from /opt/conda/envs/rapids/lib/python3.8/site-packages/jupyterlab
[I 2022-08-16 06:14:45.198 LabApp] JupyterLab application directory is /opt/conda/envs/rapids/share/jupyter/lab
[I 2022-08-16 06:14:45.201 ServerApp] jupyterlab | extension was successfully loaded.
[W 2022-08-16 06:14:45.202 ServerApp] jupyterlab_nvdashboard | extension failed loading with message: 'NoneType' object is not callable
[I 2022-08-16 06:14:45.205 ServerApp] nbclassic | extension was successfully loaded.
[I 2022-08-16 06:14:45.206 ServerApp] Serving notebooks from local directory: /rapids/notebooks
[I 2022-08-16 06:14:45.206 ServerApp] Jupyter Server 1.17.1 is running at:
[I 2022-08-16 06:14:45.206 ServerApp] http://dask-root-ec7e072e-7hw927:8888/lab
[I 2022-08-16 06:14:45.206 ServerApp]  or http://127.0.0.1:8888/lab
[I 2022-08-16 06:14:45.206 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).

I’m not sure what “good” looks like, but I’m pretty sure this isn’t it :grimacing:

I’ll include the pod template I’m using to target these GPU-accelerated nodes below, if it helps.

Is there a way for me to debug what’s going on with these nodes that aren’t registering with the scheduler? The end result is that the Prefect workload never executes, because the Dask cluster is never ready to run the GPU work.

Pod template:

# Taken from https://kubernetes.dask.org/en/latest/kubecluster.html#gpus
# I have tried with the $DASK_SCHEDULER_ADDRESS argument, to no avail
kind: Pod
metadata:
spec:
  restartPolicy: Never
  containers:
    - image: rapidsai/rapidsai:cuda11.0-runtime-ubuntu18.04-py3.8
      imagePullPolicy: IfNotPresent
      args: [dask-cuda-worker]
      name: dask-cuda
      resources:
        limits:
          cpu: "1"
          memory: 2G
          nvidia.com/gpu: 1
        requests:
          cpu: "1"
          memory: 2G
          nvidia.com/gpu: 1
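
(For reference, the variant I tried with the scheduler address passed explicitly only changed the args line – roughly the following, where $(DASK_SCHEDULER_ADDRESS) relies on Kubernetes expanding an environment variable of that name into the container args:)

      args: [dask-cuda-worker, $(DASK_SCHEDULER_ADDRESS)]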

Some progress: I exec’d into the Dask worker pod that was having problems and ran dask-cuda-worker manually, passing the scheduler URL:

(rapids) root@dask-root-ec7e072e-7hw927:/rapids/notebooks# dask-cuda-worker tcp://192.168.19.125:8786
2022-08-16 11:42:27,965 - distributed.nanny - INFO -         Start Nanny at: 'tcp://192.168.89.15:46615'
2022-08-16 11:42:29,277 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-08-16 11:42:30,238 - distributed.preloading - INFO - Run preload setup click command: dask_cuda.initialize
2022-08-16 11:42:30,238 - distributed.worker - INFO -       Start worker at:  tcp://192.168.89.15:42289
2022-08-16 11:42:30,238 - distributed.worker - INFO -          Listening to:  tcp://192.168.89.15:42289
2022-08-16 11:42:30,238 - distributed.worker - INFO -          dashboard at:        192.168.89.15:36265
2022-08-16 11:42:30,238 - distributed.worker - INFO - Waiting to connect to:  tcp://192.168.19.125:8786
2022-08-16 11:42:30,238 - distributed.worker - INFO - -------------------------------------------------
2022-08-16 11:42:30,238 - distributed.worker - INFO -               Threads:                          1
2022-08-16 11:42:30,238 - distributed.worker - INFO -                Memory:                   1.86 GiB
2022-08-16 11:42:30,238 - distributed.worker - INFO -       Local Directory: /rapids/notebooks/dask-worker-space/worker-gmnhwyo3
2022-08-16 11:42:30,239 - distributed.worker - INFO - Starting Worker plugin RMMSetup-36c62a33-cf86-41b8-88ca-16d6e91ca7a2
2022-08-16 11:42:30,239 - distributed.worker - INFO - Starting Worker plugin CPUAffinity-db29f50e-67ac-45a3-95db-ad15fe428cae
2022-08-16 11:42:30,239 - distributed.worker - INFO - Starting Worker plugin PreImport-dbcce83a-bf63-452d-84ea-c9376b43608d
2022-08-16 11:42:30,239 - distributed.worker - INFO - -------------------------------------------------

:raised_hands:

However, this throws up two follow-on questions:

  1. How should I determine the scheduler URL from the worker? There are lots of DASK_ROOT_* env vars on the worker, but not the DASK_SCHEDULER_ADDRESS env var present on the other, functional worker.
  2. On the scheduler side of things, I get this error when I manually connect the worker:
    2022-08-16 11:42:30,299 - distributed.core - ERROR - Scheduler.add_worker() missing 1 required keyword-only argument: 'server_id'
    Traceback (most recent call last):
      File "/code/.venv/lib/python3.10/site-packages/distributed/utils.py", line 799, in wrapper
        return await func(*args, **kwargs)
    TypeError: Scheduler.add_worker() missing 1 required keyword-only argument: 'server_id'
    2022-08-16 11:42:30,300 - distributed.core - ERROR - Exception while handling op register-worker
    Traceback (most recent call last):
      File "/code/.venv/lib/python3.10/site-packages/distributed/core.py", line 770, in _handle_comm
        result = await result
      File "/code/.venv/lib/python3.10/site-packages/distributed/utils.py", line 799, in wrapper
        return await func(*args, **kwargs)
    TypeError: Scheduler.add_worker() missing 1 required keyword-only argument: 'server_id'
    

Does anyone have any advice for how to proceed from here?
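
(Going back to question 1 – this is roughly how I’ve been comparing the two pods from inside a Python shell; on the functional worker DASK_SCHEDULER_ADDRESS shows up, on the GPU one it doesn’t:)

import os

# Print every environment variable that mentions DASK, to compare the two pods
for key, value in sorted(os.environ.items()):
    if "DASK" in key:
        print(f"{key}={value}")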

PS: For some reason, I’m not seeing the behaviour described in the docs:

These worker pods are configured to shutdown if they are unable to connect to the scheduler for 60 seconds.

Looking at the logs it seems the RAPIDS container you are using is never getting as far as running dask-cuda-worker. This may be because Jupyter is hijacking the session. Can you try setting the DISABLE_JUPYTER environment variable to true in your pod template?
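
Something like this under the worker container in your pod template should do it (a sketch – assuming the rapidsai image you’re using checks DISABLE_JUPYTER on startup, which recent images do):

      env:
        - name: DISABLE_JUPYTER
          value: "true"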

Also, the missing 1 required keyword-only argument: 'server_id' error suggests your scheduler and workers have mismatched Dask versions.
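
A quick way to confirm is to compare versions from the client – something like this (get_versions is part of the distributed Client API; with check=True, mismatches in critical packages are raised as an error rather than just reported):

from dask.distributed import Client

client = Client("tcp://192.168.19.125:8786")  # your scheduler address from above
# Raises if the scheduler, workers and client disagree on key package versions
client.get_versions(check=True)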


Thank you @jacobtomlinson! That guidance helped move me forward.

I think I had an underlying misunderstanding about what was happening here.

As quick background: I am using Prefect v2 workflows, so it’s that framework which is triggering the creation of the Dask KubeClusters.

I have one Prefect workflow which can run on non-GPU nodes, and one which requires GPUs.

My (mis)understanding was that there would be exactly one Dask scheduler shared across both groups of nodes – but am I right that there is actually one scheduler per KubeCluster?

I think that’s right because the pod I was trying to debug as a worker actually has the dask.org/component=scheduler label.
(Screenshot from 2022-08-18: the pod I was debugging, labelled dask.org/component=scheduler.)

So! This makes me think that the above scheduler pod is somehow misconfigured: we’re not actually even getting to the point of creating a worker pod on the GPU cluster.

I used the image and configuration from the KubeCluster docs (https://kubernetes.dask.org/en/latest/kubecluster.html#gpus) – is there a different setup I could try?

PS: We don’t actually need e.g. cuDF for this work. Just CUDA and PyTorch – in case that makes it easier to pick a base image to work off!

Without seeing your Prefect configuration I wouldn’t be able to comment, and I’m not an expert in Prefect by any stretch.

You’re right that you’re looking at the scheduler pod, so you’ll need to get that into a good state. It would be great if you could share the pod logs so we can see what is going on.

Thanks for linking to the docs – that image is a slightly older RAPIDS version, so we should probably update that. But if you’re not using RAPIDS then maybe you just want to use the PyTorch images instead? Just bear in mind that you’ll need Dask in the image, and the same image should be used for the scheduler and workers.


I scoped things down to exclude Prefect – here’s what I did:

  • Created a pod from ghcr.io/dask/dask:latest in my k8s cluster (using this spec)
  • On that pod, created a KubeCluster and ran a function (using this code and config)
  • Got the result() of the function

What happened was:

  • The scheduler pod was created
  • karpenter created a node for it
  • The scheduler started up
  • A worker pod was never started :frowning:
  • The call to .result() hangs

Here are the logs from the scheduler:

2022-08-19 17:41:19,063 - distributed.scheduler - INFO - -----------------------------------------------
2022-08-19 17:41:19,078 - distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
2022-08-19 17:41:19,084 - distributed.scheduler - INFO - State start
2022-08-19 17:41:19,087 - distributed.scheduler - INFO - -----------------------------------------------
2022-08-19 17:41:19,087 - distributed.scheduler - INFO - Clear task state
2022-08-19 17:41:19,088 - distributed.scheduler - INFO -   Scheduler at: tcp://192.168.27.190:8786
2022-08-19 17:41:19,088 - distributed.scheduler - INFO -   dashboard at:                     :8787
2022-08-19 17:41:21,530 - distributed.scheduler - INFO - Receive client connection: Client-22b91545-1fe6-11ed-8033-261c5023a76b
2022-08-19 17:41:21,542 - distributed.core - INFO - Starting established connection

Here’s what I saw on the first pod I created:

>>> from dask_kubernetes import KubeCluster
>>> from dask.distributed import Client
>>> 
>>> cluster = KubeCluster(
...     pod_template="worker.yaml", scheduler_pod_template="scheduler.yaml"
... )
Creating scheduler pod on cluster. This may take some time.
>>> client = Client(cluster)
/opt/conda/lib/python3.8/site-packages/distributed/client.py:1309: VersionMismatchWarning: Mismatched versions found

+---------+--------+-----------+---------+
| Package | client | scheduler | workers |
+---------+--------+-----------+---------+
| lz4     | 4.0.0  | None      | None    |
+---------+--------+-----------+---------+
  warnings.warn(version_module.VersionMismatchWarning(msg[0]["warning"]))
>>> 
>>> 
>>> def inc(x):
...     return x + 1
... 
>>> 
>>> x = client.submit(inc, 10)
>>> print(x)
<Future: pending, key: inc-f7d2f85ccaf5928477b0e14b484494ad>
>>> print(x.result()) # this hangs

Thanks so much for the help so far! It’s good to be making progress.

Thanks for the minimal example, that makes things much easier.

Can you try setting n_workers to 1 or more in KubeCluster?
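
For example (a minimal sketch based on your snippet above – without n_workers, or a later scale()/adapt() call, the cluster won’t create any worker pods):

from dask_kubernetes import KubeCluster

cluster = KubeCluster(
    pod_template="worker.yaml",
    scheduler_pod_template="scheduler.yaml",
    n_workers=1,  # request one worker pod up front
)
# ...or, on an existing cluster object:
cluster.scale(1)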


That worked @jacobtomlinson!

Thanks so much for your help :+1:
