Multiple processes per worker while using gateway

Hi, I’m having a hard time figuring out why, when I configure my Dask clusters (created via gateway) to use multiple processes per worker (instead of threads), I only end up with one process that isn’t killed. The logs indicate that multiple processes were started, but only one actually stays registered. I’d greatly appreciate any help debugging this.

Running command: ['/home/dask/dask_worker.runfiles/__main__/dask/dask_worker.py', '--nthreads', '1', '--no-dashboard', '--death-timeout', '90', '--memory-limit', '0', '--nprocs', '10', 'tls://dask-2fb8c4a2a57e49eca6bd276258effa04.daskgateway:8786']
distributed.nanny - INFO -         Start Nanny at: 'tls://xx.xx.xx.x:38085'
distributed.nanny - INFO -         Start Nanny at: 'tls://xx.xx.xx.x:45103'
distributed.nanny - INFO -         Start Nanny at: 'tls://xx.xx.xx.x:33385'
distributed.nanny - INFO -         Start Nanny at: 'tls://xx.xx.xx.x:33421'
distributed.nanny - INFO -         Start Nanny at: 'tls://xx.xx.xx.x:41145'
distributed.nanny - INFO -         Start Nanny at: 'tls://xx.xx.xx.x:38879'
distributed.nanny - INFO -         Start Nanny at: 'tls://xx.xx.xx.x:39257'
distributed.nanny - INFO -         Start Nanny at: 'tls://xx.xx.xx.x:34977'
distributed.nanny - INFO -         Start Nanny at: 'tls://xx.xx.xx.x:41521'
distributed.nanny - INFO -         Start Nanny at: 'tls://xx.xx.xx.x:36827'
distributed.worker - INFO -       Start worker at:   tls://xx.xx.xx.x:33443
distributed.worker - INFO -          Listening to:   tls://xx.xx.xx.x:33443
distributed.worker - INFO -          dashboard at:         xx.xx.xx.x:32939
distributed.worker - INFO - Waiting to connect to: tls://dask-2fb8c4a2a57e49eca6bd276258effa04.daskgateway:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -       Local Directory: /home/dask/dask_worker.runfiles/__main__/dask-worker-space/worker-ue4pyb6w
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -       Start worker at:   tls://xx.xx.xx.x:33929
distributed.worker - INFO -          Listening to:   tls://xx.xx.xx.x:33929
distributed.worker - INFO -          dashboard at:         xx.xx.xx.x:41447
distributed.worker - INFO - Waiting to connect to: tls://dask-2fb8c4a2a57e49eca6bd276258effa04.daskgateway:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -       Local Directory: /home/dask/dask_worker.runfiles/__main__/dask-worker-space/worker-7eq8azho
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tls://dask-2fb8c4a2a57e49eca6bd276258effa04.daskgateway:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.worker - INFO -       Start worker at:   tls://xx.xx.xx.x:36491
distributed.worker - INFO -          Listening to:   tls://xx.xx.xx.x:36491
distributed.worker - INFO -          dashboard at:         xx.xx.xx.x:42831
distributed.worker - INFO - Waiting to connect to: tls://dask-2fb8c4a2a57e49eca6bd276258effa04.daskgateway:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -       Local Directory: /home/dask/dask_worker.runfiles/__main__/dask-worker-space/worker-od46okqy
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tls://dask-2fb8c4a2a57e49eca6bd276258effa04.daskgateway:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.worker - INFO -         Registered to: tls://dask-2fb8c4a2a57e49eca6bd276258effa04.daskgateway:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.worker - INFO -       Start worker at:   tls://xx.xx.xx.x:39177
distributed.worker - INFO -          Listening to:   tls://xx.xx.xx.x:39177
distributed.worker - INFO -          dashboard at:         xx.xx.xx.x:33535
distributed.worker - INFO - Waiting to connect to: tls://dask-2fb8c4a2a57e49eca6bd276258effa04.daskgateway:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -       Local Directory: /home/dask/dask_worker.runfiles/__main__/dask-worker-space/worker-9k21u5mu
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -       Start worker at:   tls://xx.xx.xx.x:39071
distributed.worker - INFO -          Listening to:   tls://xx.xx.xx.x:39071
distributed.worker - INFO -          dashboard at:         xx.xx.xx.x:46617
distributed.worker - INFO - Waiting to connect to: tls://dask-2fb8c4a2a57e49eca6bd276258effa04.daskgateway:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -       Local Directory: /home/dask/dask_worker.runfiles/__main__/dask-worker-space/worker-tp9q9sov
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tls://dask-2fb8c4a2a57e49eca6bd276258effa04.daskgateway:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.worker - INFO -         Registered to: tls://dask-2fb8c4a2a57e49eca6bd276258effa04.daskgateway:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.worker - INFO -       Start worker at:   tls://xx.xx.xx.x:35977
distributed.worker - INFO -          Listening to:   tls://xx.xx.xx.x:35977
distributed.worker - INFO -          dashboard at:         xx.xx.xx.x:36893
distributed.worker - INFO - Waiting to connect to: tls://dask-2fb8c4a2a57e49eca6bd276258effa04.daskgateway:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -       Local Directory: /home/dask/dask_worker.runfiles/__main__/dask-worker-space/worker-d1_sjtuv
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tls://dask-2fb8c4a2a57e49eca6bd276258effa04.daskgateway:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.worker - INFO -       Start worker at:   tls://xx.xx.xx.x:40385
distributed.worker - INFO -          Listening to:   tls://xx.xx.xx.x:40385
distributed.worker - INFO -          dashboard at:         xx.xx.xx.x:44713
distributed.worker - INFO - Waiting to connect to: tls://dask-2fb8c4a2a57e49eca6bd276258effa04.daskgateway:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -       Local Directory: /home/dask/dask_worker.runfiles/__main__/dask-worker-space/worker-ewxj01cd
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tls://dask-2fb8c4a2a57e49eca6bd276258effa04.daskgateway:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.worker - INFO -       Start worker at:   tls://xx.xx.xx.x:42931
distributed.worker - INFO -          Listening to:   tls://xx.xx.xx.x:42931
distributed.worker - INFO -          dashboard at:         xx.xx.xx.x:42455
distributed.worker - INFO - Waiting to connect to: tls://dask-2fb8c4a2a57e49eca6bd276258effa04.daskgateway:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -       Local Directory: /home/dask/dask_worker.runfiles/__main__/dask-worker-space/worker-r70fh3nz
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -       Start worker at:   tls://xx.xx.xx.x:37297
distributed.worker - INFO -          Listening to:   tls://xx.xx.xx.x:37297
distributed.worker - INFO -          dashboard at:         xx.xx.xx.x:32853
distributed.worker - INFO - Waiting to connect to: tls://dask-2fb8c4a2a57e49eca6bd276258effa04.daskgateway:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -       Local Directory: /home/dask/dask_worker.runfiles/__main__/dask-worker-space/worker-foi0bi1v
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -       Start worker at:   tls://xx.xx.xx.x:38107
distributed.worker - INFO -          Listening to:   tls://xx.xx.xx.x:38107
distributed.worker - INFO -          dashboard at:         xx.xx.xx.x:33155
distributed.worker - INFO - Waiting to connect to: tls://dask-2fb8c4a2a57e49eca6bd276258effa04.daskgateway:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -       Local Directory: /home/dask/dask_worker.runfiles/__main__/dask-worker-space/worker-48022gbb
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tls://dask-2fb8c4a2a57e49eca6bd276258effa04.daskgateway:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.worker - INFO -         Registered to: tls://dask-2fb8c4a2a57e49eca6bd276258effa04.daskgateway:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.worker - INFO -         Registered to: tls://dask-2fb8c4a2a57e49eca6bd276258effa04.daskgateway:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.worker - INFO - Stopping worker at tls://xx.xx.xx.x:42931
distributed.worker - INFO - Stopping worker at tls://xx.xx.xx.x:35977
distributed.worker - INFO - Stopping worker at tls://xx.xx.xx.x:39177
distributed.worker - INFO - Stopping worker at tls://xx.xx.xx.x:36491
distributed.worker - INFO - Stopping worker at tls://xx.xx.xx.x:37297
distributed.worker - INFO - Stopping worker at tls://xx.xx.xx.x:39071
distributed.worker - INFO - Stopping worker at tls://xx.xx.xx.x:38107
distributed.worker - INFO - Stopping worker at tls://xx.xx.xx.x:33929
distributed.worker - INFO - Stopping worker at tls://xx.xx.xx.x:40385
distributed.nanny - INFO - Worker closed
distributed.nanny - INFO - Worker closed
distributed.nanny - INFO - Worker closed
distributed.nanny - INFO - Worker closed
distributed.nanny - INFO - Worker closed
distributed.nanny - INFO - Worker closed
distributed.nanny - INFO - Worker closed
distributed.nanny - INFO - Worker closed
distributed.nanny - INFO - Worker closed
distributed.nanny - INFO - Closing Nanny at 'tls://xx.xx.xx.x:38879'
distributed.nanny - INFO - Closing Nanny at 'tls://xx.xx.xx.x:36827'
distributed.nanny - INFO - Closing Nanny at 'tls://xx.xx.xx.x:41145'
distributed.nanny - INFO - Closing Nanny at 'tls://xx.xx.xx.x:41521'
distributed.nanny - INFO - Closing Nanny at 'tls://xx.xx.xx.x:39257'
distributed.nanny - INFO - Closing Nanny at 'tls://xx.xx.xx.x:34977'
distributed.nanny - INFO - Closing Nanny at 'tls://xx.xx.xx.x:33421'
distributed.nanny - INFO - Closing Nanny at 'tls://xx.xx.xx.x:38085'
distributed.nanny - INFO - Closing Nanny at 'tls://xx.xx.xx.x:45103'

Is there something in Dask Gateway that is killing these processes, or some gateway config I should be looking at? Initially I thought it might be a timeout, hence the death-timeout option.

Hi @bakht,

Could you provide the code and configuration you used for creating this cluster using dask-gateway?

Hi @guillaumeeb
Thank you for getting back to me. I think the issue is with the logic here: dask-gateway/scheduler_preload.py at main · dask/dask-gateway · GitHub

Having multiple processes per worker pod seems to work fine when I use KubeCluster without the gateway. I’ll attach my Helm values file for the gateway and the code I use to connect to it below; I can provide that later today.

Hi @guillaumeeb ,

Sorry for the delay. Here is the Helm values file for the gateway deployment, and also how I’m creating the cluster via gateway:

from dask_gateway import Gateway

# Connect to the gateway
gateway = Gateway(
    address="http://daskgateway.com"
)

# Create a cluster and let it scale adaptively between 1 and 6 workers
cluster = gateway.new_cluster()
cluster.adapt(minimum=1, maximum=6)

cluster

gateway:
  # Number of instances of the gateway-server to run
  replicas: 1

  # Annotations to apply to the gateway-server pods.
  annotations: {}

  # Resource requests/limits for the gateway-server pod.
  resources: {}

  # Path prefix to serve dask-gateway api requests under
  # This prefix will be added to all routes the gateway manages
  # in the traefik proxy.
  prefix: /

  # The gateway server log level
  loglevel: INFO

  # The image to use for the gateway-server pod.
  image:
    name: daskgateway/dask-gateway-server
    tag: 0.9.0
    pullPolicy: IfNotPresent

  # Image pull secrets for gateway-server pod
  imagePullSecrets: []

  # Configuration for the gateway-server service
  service:
    annotations: {}

  auth:
    # The auth type to use. One of {simple, kerberos, jupyterhub, custom}.
    type: simple

    simple:
      # A shared password to use for all users.
      password: null

    kerberos:
      # Path to the HTTP keytab for this node.
      keytab: null

    jupyterhub:
      # A JupyterHub api token for dask-gateway to use. See
      # https://gateway.dask.org/install-kube.html#authenticating-with-jupyterhub.
      apiToken: null

      # JupyterHub's api url. Inferred from JupyterHub's service name if running
      # in the same namespace.
      apiUrl: null

    custom:
      # The full authenticator class name.
      class: null

      # Configuration fields to set on the authenticator class.
      options: {}

  livenessProbe:
    # Enables the livenessProbe.
    enabled: true
    # Configures the livenessProbe.
    initialDelaySeconds: 5
    timeoutSeconds: 2
    periodSeconds: 10
    failureThreshold: 6
  readinessProbe:
    # Enables the readinessProbe.
    enabled: true
    # Configures the readinessProbe.
    initialDelaySeconds: 5
    timeoutSeconds: 2
    periodSeconds: 10
    failureThreshold: 3

  backend:
    # The image to use for both schedulers and workers.
    image:
      name: dask/dask_worker_example
      tag: am-1649864662
      pullPolicy: IfNotPresent

    # The namespace to launch dask clusters in. If not specified, defaults to
    # the same namespace the gateway is running in.
    namespace: null

    # A mapping of environment variables to set for both schedulers and workers.
    environment: null
    idleTimeout: 3600.0
    cluster:
      cores:
        limit: null
      memory:
        limit: null
      workers:
        limit: null

    scheduler:
      # Any extra configuration for the scheduler pod. Sets
      # `c.KubeClusterConfig.scheduler_extra_pod_config`.
      extraPodConfig: {}

      # Any extra configuration for the scheduler container.
      # Sets `c.KubeClusterConfig.scheduler_extra_container_config`.
      extraContainerConfig: {}

      # Cores request/limit for the scheduler.
      cores:
        request: 1
        limit: 1

      # Memory request/limit for the scheduler.
      memory:
        request: 10G
        limit: 10G

    worker:
      # Any extra configuration for the worker pod. Sets
      # `c.KubeClusterConfig.worker_extra_pod_config`.
      extraPodConfig:
        serviceAccountName: dask

      # Any extra configuration for the worker container. Sets
      # `c.KubeClusterConfig.worker_extra_container_config`.
      extraContainerConfig: {}

      # Cores request/limit for each worker.
      cores:
        request: 10
        limit: 10

      # Memory request/limit for each worker.
      memory:
        request: 7G
        limit: 7G

  # Settings for nodeSelector, affinity, and tolerations for the gateway pods
  nodeSelector: {}
  affinity: {}
  tolerations: []

  # Any extra configuration code to append to the generated `dask_gateway_config.py`
  # file. Can be either a single code-block, or a map of key -> code-block
  # (code-blocks are run in alphabetical order by key, the key value itself is
  # meaningless). The map version is useful as it supports merging multiple
  # `values.yaml` files, but is unnecessary in other cases.
  extraConfig:
    clusteroptions: |
      def options_handler(options):
        return {
            "image": options.image
        }
      from dask_gateway_server.options import Options, Integer, Float, String
      c.Backend.cluster_options = Options(
          String("image",default="dask/dask_worker_example:am-1649864662"),
          handler=options_handler,
      )

      #c.DaskGateway.backend_class = "mykubebackend.MyKubeBackend"

    worker_config: |
      c.KubeClusterConfig.worker_cmd = ["dask-worker", "--no-dashboard", "--memory-limit", "0", "--nprocs", "10"]
      c.ClusterConfig.worker_threads = 1

# Configuration for the gateway controller
controller:
  # Whether the controller should be deployed. Disabling the controller allows
  # running it locally for development/debugging purposes.
  enabled: true

  # Any annotations to add to the controller pod
  annotations: {}

  # Resource requests/limits for the controller pod
  resources: {}

  # Image pull secrets for controller pod
  imagePullSecrets: []

  # The controller log level
  loglevel: DEBUG

  # Max time (in seconds) to keep around records of completed clusters.
  # Default is 24 hours.
  completedClusterMaxAge: 86400

  # Time (in seconds) between cleanup tasks removing records of completed
  # clusters. Default is 5 minutes.
  completedClusterCleanupPeriod: 600

  # Base delay (in seconds) for backoff when retrying after failures.
  backoffBaseDelay: 0.1

  # Max delay (in seconds) for backoff when retrying after failures.
  backoffMaxDelay: 300

  # Limit on the average number of k8s api calls per second.
  k8sApiRateLimit: 50

  # Limit on the maximum number of k8s api calls per second.
  k8sApiRateLimitBurst: 100

  # The image to use for the controller pod.
  image:
    name: daskgateway/dask-gateway-server
    tag: 0.9.0
    pullPolicy: IfNotPresent

  # Settings for nodeSelector, affinity, and tolerations for the controller pods
  nodeSelector: {}
  affinity: {}
  tolerations: []

# Configuration for the traefik proxy
traefik:
  # Number of instances of the proxy to run
  replicas: 1

  # Any annotations to add to the proxy pods
  annotations: {}

  # Resource requests/limits for the proxy pods
  resources: {}

  # The image to use for the proxy pod
  image:
    name: traefik
    tag: 2.1.3

  # Any additional arguments to forward to traefik
  additionalArguments: []

  # The proxy log level
  loglevel: WARN

  # Whether to expose the dashboard on port 9000 (enable for debugging only!)
  dashboard: false

  # Additional configuration for the traefik service
  service:
    type: LoadBalancer
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-internal: 'true'
      service.beta.kubernetes.io/aws-load-balancer-type: 'NLRB'
      external-dns.alpha.kubernetes.io/hostname: 'daskgateway.com'
    spec: {}
    ports:
      web:
        # The port HTTP(s) requests will be served on
        port: 80
        nodePort: null
      tcp:
        # The port TCP requests will be served on. Set to `web` to share the
        # web service port
        port: web
        nodePort: null

  # Settings for nodeSelector, affinity, and tolerations for the traefik pods
  nodeSelector: {}
  affinity: {}
  tolerations: []
ambassador:
  cluster: scratch
rbac:
  # Whether to enable RBAC.
  enabled: true

  # Existing names to use if ClusterRoles, ClusterRoleBindings, and
  # ServiceAccounts have already been created by other means (leave set to
  # `null` to create all required roles at install time)
  controller:
    serviceAccountName: null

  gateway:
    serviceAccountName: null

  traefik:
    serviceAccountName: null


OK,

I believe (though I’m not entirely sure) that the problem is that the argument to the adapt or scale methods is the number of worker processes. There is actually no such thing in Dask as a worker with multiple processes: when you start a pod with 10 processes, you actually start a pod with 10 workers in it.

So when you call adapt there, a pod is started with all the options you asked for, because dask-gateway does not know how to do otherwise. But it then compares the 10 workers it got with the 1 you asked for, and stops the other 9.

So either you scale in multiples of ten in your setup (though I don’t know how dask-gateway handles this: will it start only one pod if you ask for 10 processes, or is there a mismatch in the code somewhere?), or you’ll need pods that run only one worker process. A sketch of the first option is below.
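
If you want to try the first option, here is an untested sketch, assuming the worker command keeps --nprocs 10 and that dask-gateway really does count individual worker processes (I have not verified that mapping):

# Only request worker counts in multiples of the 10 processes started per pod,
# so no surplus workers get culled after they register.
cluster.scale(10)                      # would correspond to a single 10-process pod
cluster.adapt(minimum=10, maximum=60)  # 1 to 6 pods, if that mapping actually holds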

In the end, one pod with 10 processes or 10 pods with one process each should give you approximately the same performance.


@guillaumeeb thanks for your input.
Re: There is actually no such thing in Dask as a worker with multiple processes: when you start a pod with 10 processes, you actually start a pod with 10 workers in it.
Do you mind pointing me to the source code for this?

Re: In the end, one pod with 10 processes or 10 pods with one process each should give you approximately the same performance.

One pod with 10 processes seems less resource-hungry to me than 10 pods with one process each. Does it make sense to raise this with the Dask Gateway project? This behaviour seems counter-intuitive given that we wanted to use adapt() to get both flexibility and cost-efficiency when managing Dask clusters.


You can see here that n_workers (the nprocs value) Nannies are created, each with its own process and network address.
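
To illustrate the idea (this is only a simplified sketch, not the actual distributed source code): dask-worker builds one Nanny per requested process, and each Nanny spawns and supervises its own worker, which registers with the scheduler under its own address.

import asyncio
from distributed import Nanny

# Simplified sketch of what "dask-worker --nprocs N" does internally: one Nanny
# per process, each supervising its own worker and registering it separately.
async def start_workers(scheduler_address, nprocs=10):
    nannies = [
        Nanny(scheduler_address, nthreads=1, memory_limit=0)
        for _ in range(nprocs)
    ]
    await asyncio.gather(*(nanny.start() for nanny in nannies))
    return nannies  # the scheduler now sees nprocs separate workers

# asyncio.run(start_workers("tls://scheduler-address:8786", nprocs=10))  # illustrative only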

If you match the resources requested by the workers with the resources requested by the pods in Kubernetes, I believe it comes out the same. The per-pod or per-container overhead should be close to zero.
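
As a rough sanity check on the totals, splitting the current 10-core / 7G worker pod ten ways (the per-pod numbers below are hypothetical):

# Hypothetical split of one 10-core / 7G worker pod into ten single-worker
# pods with the same total resource footprint.
per_pod = {"cores": 1, "memory_mb": 700}
total = {key: 10 * value for key, value in per_pod.items()}
assert total == {"cores": 10, "memory_mb": 7000}  # same totals as the one 10-process pod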

You should probably look first at how the scale method is actually implemented in dask-gateway. Does it really not distinguish between individual worker processes and pods when you call scale? I don’t know. It should at least be clear whether we are scaling pods or processes.
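
One way to check empirically (an untested sketch, reusing the gateway address from earlier in this thread): ask for a couple of workers and compare with what the scheduler actually reports.

from dask_gateway import Gateway

# Request 2 workers and see how many the scheduler ends up with, to find out
# whether dask-gateway counts pods or individual worker processes.
gateway = Gateway(address="http://daskgateway.com")
cluster = gateway.new_cluster()
cluster.scale(2)

client = cluster.get_client()
client.wait_for_workers(2)  # returns once at least 2 workers have registered
print("scheduler sees", len(client.scheduler_info()["workers"]), "workers")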