Hello!
I am using Dask from Prefect 2 via dask_kubernetes.KubeCluster. My cluster runs on EKS, and I have Karpenter running to handle auto-scaling.
I followed these instructions to ask dask to schedule my pod on a GPU-accelerated node. When I run my task, the pod is created and karpenter starts up a new GPU-accelerated node – so far so good!
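As an illustration of what requesting a GPU for a worker pod typically looks like (a minimal sketch, not the exact spec from this setup), the GPU is requested through the container's resource limits under the `nvidia.com/gpu` key. The helper name `gpu_worker_pod_spec`, the image name, and the worker arguments are hypothetical; how the resulting dict is handed to KubeCluster depends on the dask_kubernetes version in use.

```python
def gpu_worker_pod_spec(image, gpus=1):
    """Build a plain-dict Kubernetes pod spec that requests `gpus` NVIDIA GPUs.

    Hypothetical helper for illustration only; the container name and args
    are assumptions, not taken from the original setup.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "dask-worker",
                    "image": image,
                    "args": ["dask-worker", "--nthreads", "1"],
                    # Kubernetes expects GPUs to be declared under limits.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


# Hypothetical image name, used only to show the shape of the result.
spec = gpu_worker_pod_spec("example.com/my-dask-gpu-image:latest")
```

A pod built this way stays Pending until some node advertises at least one allocatable `nvidia.com/gpu`.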
However, the created node doesn’t include the right labels / annotations, and the Dask pod remains unschedulable.
The pod reports: “0/4 nodes are available: 1 Insufficient memory, 2 Insufficient cpu, 4 Insufficient nvidia.com/gpu”; I’ll include the node labels below.
I realise this might be more of a Karpenter question (I’ll ask there too), but does anyone have guidance on how to use KubeCluster and Karpenter together for GPU-enabled nodes?
Thanks!
PS: the node Karpenter starts has these labels:
beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=g4dn.xlarge
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=us-east-1
failure-domain.beta.kubernetes.io/zone=us-east-1a
k8s.io/cloud-provider-aws=9f7e64553201daa2383d7a15eb9f3857
karpenter.k8s.aws/instance-category=g
karpenter.k8s.aws/instance-cpu=4
karpenter.k8s.aws/instance-family=g4dn
karpenter.k8s.aws/instance-generation=4
karpenter.k8s.aws/instance-gpu-count=1
karpenter.k8s.aws/instance-gpu-manufacturer=nvidia
karpenter.k8s.aws/instance-gpu-memory=16384
karpenter.k8s.aws/instance-gpu-name=t4
karpenter.k8s.aws/instance-hypervisor=nitro
karpenter.k8s.aws/instance-local-nvme=125
karpenter.k8s.aws/instance-memory=16384
karpenter.k8s.aws/instance-pods=29
karpenter.k8s.aws/instance-size=xlarge
karpenter.sh/capacity-type=spot
karpenter.sh/provisioner-name=default
kubernetes.io/arch=amd64
kubernetes.io/hostname=ip-192-168-10-195.ec2.internal
kubernetes.io/os=linux
node.kubernetes.io/instance-type=g4dn.xlarge
topology.kubernetes.io/region=us-east-1
topology.kubernetes.io/zone=us-east-1a
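
One thing that may be worth checking alongside the labels: an "Insufficient nvidia.com/gpu" message from the scheduler refers to the node's allocatable resources rather than its labels, and `nvidia.com/gpu` only appears there once a device plugin on the node advertises it. The sketch below assumes the node object has been fetched as JSON (for example with `kubectl get node <node-name> -o json`) and parsed into a dict; the function name `allocatable_gpus` and the sample node dicts are hypothetical.

```python
def allocatable_gpus(node):
    """Return how many nvidia.com/gpu the node reports as allocatable.

    `node` is a Kubernetes Node object parsed from JSON into a dict.
    A missing key is treated as zero GPUs.
    """
    allocatable = node.get("status", {}).get("allocatable", {})
    return int(allocatable.get("nvidia.com/gpu", 0))


# Hypothetical node objects, trimmed to the fields the function reads.
node_without_gpu_resource = {
    "metadata": {"labels": {"karpenter.k8s.aws/instance-gpu-count": "1"}},
    "status": {"allocatable": {"cpu": "4", "pods": "29"}},
}
node_with_gpu_resource = {
    "status": {"allocatable": {"cpu": "4", "nvidia.com/gpu": "1"}},
}

print(allocatable_gpus(node_without_gpu_resource))
print(allocatable_gpus(node_with_gpu_resource))
```

If this reports zero for the g4dn.xlarge node even though the `karpenter.k8s.aws/instance-gpu-count` label says 1, the pod would stay unschedulable regardless of the labels.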