Hello,
I am using dask_kubernetes.KubeCluster to set up temporary clusters for my Prefect v2 workload.
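For context, the wiring is roughly this (simplified sketch; the flow and the spec file name here are illustrative, and the real flow does more):

from prefect import flow
from prefect_dask import DaskTaskRunner

# "gpu-worker-spec.yaml" is a placeholder name; it contains the pod template
# shown at the bottom of this post. cluster_kwargs are passed straight through
# to dask_kubernetes.KubeCluster.
@flow(
    task_runner=DaskTaskRunner(
        cluster_class="dask_kubernetes.KubeCluster",
        cluster_kwargs={"pod_template": "gpu-worker-spec.yaml", "n_workers": 1},
    )
)
def gpu_flow():
    ...  # tasks submitted here never start, because no worker ever registers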
When I create a cluster with “normal” pods (no GPU request), Karpenter creates the nodes and they register with the Dask scheduler as expected:
2022-08-16 06:10:09,970 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://192.168.98.215:45345', name: 0, status: init, memory: 0, processing: 0>
2022-08-16 06:10:09,971 - distributed.scheduler - INFO - Starting worker compute stream, tcp://192.168.98.215:45345
2022-08-16 06:10:09,971 - distributed.core - INFO - Starting established connection
When I create a cluster with a pod template that requests a GPU, however, the nodes are created but the workers never register with the scheduler. Here’s the output from the dask-root worker pod running on the new node:
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.download.nvidia.com/licenses/NVIDIA_Deep_Learning_Container_License.pdf
[I 2022-08-16 06:14:44.574 ServerApp] dask_labextension | extension was successfully linked.
[I 2022-08-16 06:14:44.574 ServerApp] jupyter_server_proxy | extension was successfully linked.
[W 2022-08-16 06:14:44.578 LabApp] 'token' has moved from NotebookApp to ServerApp. This config will be passed to ServerApp. Be sure to update your config before our next release.
[W 2022-08-16 06:14:44.578 LabApp] 'allow_origin' has moved from NotebookApp to ServerApp. This config will be passed to ServerApp. Be sure to update your config before our next release.
[W 2022-08-16 06:14:44.578 LabApp] 'base_url' has moved from NotebookApp to ServerApp. This config will be passed to ServerApp. Be sure to update your config before our next release.
[I 2022-08-16 06:14:44.585 ServerApp] jupyterlab | extension was successfully linked.
[I 2022-08-16 06:14:44.585 ServerApp] jupyterlab_nvdashboard | extension was successfully linked.
[I 2022-08-16 06:14:44.594 ServerApp] nbclassic | extension was successfully linked.
[I 2022-08-16 06:14:44.595 ServerApp] Writing Jupyter server cookie secret to /root/.local/share/jupyter/runtime/jupyter_cookie_secret
[I 2022-08-16 06:14:44.745 ServerApp] notebook_shim | extension was successfully linked.
[W 2022-08-16 06:14:44.786 ServerApp] All authentication is disabled. Anyone who can connect to this server will be able to run code.
[I 2022-08-16 06:14:44.787 ServerApp] notebook_shim | extension was successfully loaded.
[I 2022-08-16 06:14:44.788 ServerApp] dask_labextension | extension was successfully loaded.
[I 2022-08-16 06:14:45.197 ServerApp] jupyter_server_proxy | extension was successfully loaded.
[I 2022-08-16 06:14:45.198 LabApp] JupyterLab extension loaded from /opt/conda/envs/rapids/lib/python3.8/site-packages/jupyterlab
[I 2022-08-16 06:14:45.198 LabApp] JupyterLab application directory is /opt/conda/envs/rapids/share/jupyter/lab
[I 2022-08-16 06:14:45.201 ServerApp] jupyterlab | extension was successfully loaded.
[W 2022-08-16 06:14:45.202 ServerApp] jupyterlab_nvdashboard | extension failed loading with message: 'NoneType' object is not callable
[I 2022-08-16 06:14:45.205 ServerApp] nbclassic | extension was successfully loaded.
[I 2022-08-16 06:14:45.206 ServerApp] Serving notebooks from local directory: /rapids/notebooks
[I 2022-08-16 06:14:45.206 ServerApp] Jupyter Server 1.17.1 is running at:
[I 2022-08-16 06:14:45.206 ServerApp] http://dask-root-ec7e072e-7hw927:8888/lab
[I 2022-08-16 06:14:45.206 ServerApp] or http://127.0.0.1:8888/lab
[I 2022-08-16 06:14:45.206 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
I’m not sure what “good” looks like, but I’m pretty sure this isn’t it?
I’ll include the pod template I’m using to target these GPU-accelerated nodes below, if it helps.
Is there a way for me to debug what’s going on on these nodes whose workers aren’t registering with the scheduler? The end result is that the Prefect workload never runs, because the Dask cluster never becomes ready to execute the GPU work.
Pod template:
# Taken from https://kubernetes.dask.org/en/latest/kubecluster.html#gpus
# I have tried with the $DASK_SCHEDULER_ADDRESS argument, to no avail
kind: Pod
metadata:
spec:
  restartPolicy: Never
  containers:
  - image: rapidsai/rapidsai:cuda11.0-runtime-ubuntu18.04-py3.8
    imagePullPolicy: IfNotPresent
    args: [dask-cuda-worker]
    name: dask-cuda
    resources:
      limits:
        cpu: "1"
        memory: 2G
        nvidia.com/gpu: 1
      requests:
        cpu: "1"
        memory: 2G
        nvidia.com/gpu: 1
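If it helps to take Prefect out of the picture, the Dask side on its own is essentially the following (illustrative sketch; the file name and worker count are placeholders):

from dask.distributed import Client
from dask_kubernetes import KubeCluster

# "gpu-worker-spec.yaml" holds the pod template above
cluster = KubeCluster("gpu-worker-spec.yaml")
cluster.scale(1)             # Karpenter provisions a GPU node for the pending worker pod

client = Client(cluster)
client.wait_for_workers(1)   # with the GPU template, this is where things appear to hang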