Stuck at "Waiting for scheduler to run"

Hello,

I’m trying to deploy a cluster on Google Cloud with dask_cloudprovider.gcp.
I’ve stripped out everything related to my project, so I’m basically just creating the cluster.
I can see that the scheduler gets created, but then the script hangs and nothing happens.

Can anybody point me in a direction to debug this?


Hi @dragospopa420, welcome here!

Could you be a bit more precise about what you are doing? Which script is getting stuck, and where?

Hi,

I am having a similar problem when trying to run the example from the Google Cloud Platform — Dask Cloud Provider 2021.6.0+48.gf1965ad documentation. This is my code:

import dask.array as da
from dask_cloudprovider.gcp import GCPCluster
from distributed import Client
with GCPCluster(n_workers=1, zone='europe-west4-a', projectid='my_working_project_id', asynchronous=False, debug=True, silence_logs=False, source_image='projects/ubuntu-os-cloud/global/images/ubuntu-minimal-1804-bionic-v20230502') as cluster:
    with Client(cluster) as client:
        print(da.random.random((1000, 1000), chunks=(100, 100)).mean().compute())

The code runs fine in the sense that the scheduler instance is created successfully. It then prints ‘Waiting for scheduler to run at xx.xxx.x.xx:8786’ and hangs, so the second with statement is never reached.

Hi @msignore, welcome here!

Do you also see a Worker instance created? Could you share all the logs you got on your instances?

Hi @guillaumeeb, I got the same problem.
There is no Worker instance created, just one Scheduler instance.
Is it possible that the Service Account doesn’t have enough permissions?
Which roles/permissions should we give to the Service Account? I can’t find this anywhere in the Dask documentation.

Log of the Scheduler Instance:

[
  {
    "protoPayload": {
      "@type": "type.googleapis.com/google.cloud.audit.AuditLog",
      "authenticationInfo": {
        "principalEmail": "*******-**@**-****-****.iam.gserviceaccount.com",
        "serviceAccountKeyName": "//iam.googleapis.com/projects/**-****-****/serviceAccounts/*******-**@**-****-****.iam.gserviceaccount.com/keys/*********************************",
        "principalSubject": "serviceAccount:*******-**@**-****-****.iam.gserviceaccount.com"
      },
      "requestMetadata": {
        "callerIp": "**.**.***.***",
        "callerSuppliedUserAgent": "(gzip),gzip(gfe)",
        "requestAttributes": {
          "time": "2023-05-08T10:12:08.999051Z",
          "auth": {}
        },
        "destinationAttributes": {}
      },
      "serviceName": "compute.googleapis.com",
      "methodName": "v1.compute.instances.insert",
      "authorizationInfo": [
        {
          "permission": "compute.instances.create",
          "granted": true,
          "resourceAttributes": {
            "service": "compute",
            "name": "projects/**-****-****/zones/us-east1-c/instances/dask-8ef7b12a-scheduler",
            "type": "compute.instances"
          }
        },
        {
          "permission": "compute.disks.create",
          "granted": true,
          "resourceAttributes": {
            "service": "compute",
            "name": "projects/**-****-****/zones/us-east1-c/disks/dask-8ef7b12a-scheduler",
            "type": "compute.disks"
          }
        },
        {
          "permission": "compute.subnetworks.use",
          "granted": true,
          "resourceAttributes": {
            "service": "compute",
            "name": "projects/**-****-****/regions/us-east1/subnetworks/default",
            "type": "compute.subnetworks"
          }
        },
        {
          "permission": "compute.subnetworks.useExternalIp",
          "granted": true,
          "resourceAttributes": {
            "service": "compute",
            "name": "projects/**-****-****/regions/us-east1/subnetworks/default",
            "type": "compute.subnetworks"
          }
        },
        {
          "permission": "compute.instances.setMetadata",
          "granted": true,
          "resourceAttributes": {
            "service": "compute",
            "name": "projects/**-****-****/zones/us-east1-c/instances/dask-8ef7b12a-scheduler",
            "type": "compute.instances"
          }
        },
        {
          "permission": "compute.instances.setTags",
          "granted": true,
          "resourceAttributes": {
            "service": "compute",
            "name": "projects/**-****-****/zones/us-east1-c/instances/dask-8ef7b12a-scheduler",
            "type": "compute.instances"
          }
        },
        {
          "permission": "compute.instances.setLabels",
          "granted": true,
          "resourceAttributes": {
            "service": "compute",
            "name": "projects/**-****-****/zones/us-east1-c/instances/dask-8ef7b12a-scheduler",
            "type": "compute.instances"
          }
        },
        {
          "permission": "compute.instances.setServiceAccount",
          "granted": true,
          "resourceAttributes": {
            "service": "compute",
            "name": "projects/**-****-****/zones/us-east1-c/instances/dask-8ef7b12a-scheduler",
            "type": "compute.instances"
          }
        }
      ],
      "resourceName": "projects/**-****-****/zones/us-east1-c/instances/dask-8ef7b12a-scheduler",
      "request": {
        "name": "dask-8ef7b12a-scheduler",
        "tags": {
          "tags": [
            "http-server",
            "https-server"
          ]
        },
        "machineType": "zones/us-east1-c/machineTypes/n2-standard-2",
        "canIpForward": false,
        "networkInterfaces": [
          {
            "accessConfigs": [
              {
                "type": "ONE_TO_ONE_NAT",
                "name": "External NAT",
                "networkTier": "PREMIUM"
              }
            ],
            "subnetwork": "projects/**-****-****/regions/us-east1/subnetworks/default"
          }
        ],
        "disks": [
          {
            "type": "PERSISTENT",
            "mode": "READ_WRITE",
            "deviceName": "dask-8ef7b12a-scheduler",
            "boot": true,
            "initializeParams": {
              "sourceImage": "projects/ubuntu-os-cloud/global/images/ubuntu-minimal-1804-bionic-v20201014",
              "diskSizeGb": "50",
              "diskType": "projects/**-****-****/zones/us-east1-c/diskTypes/pd-standard"
            },
            "autoDelete": true
          }
        ],
        "serviceAccounts": [
          {
            "email": "default",
            "scopes": [
              "https://www.googleapis.com/auth/devstorage.read_write",
              "https://www.googleapis.com/auth/logging.write",
              "https://www.googleapis.com/auth/monitoring.write"
            ]
          }
        ],
        "scheduling": {
          "onHostMaintenance": "TERMINATE",
          "automaticRestart": true,
          "preemptible": false
        },
        "labels": [
          {
            "key": "container_vm",
            "value": "dask-cloudprovider"
          }
        ],
        "deletionProtection": false,
        "reservationAffinity": {
          "consumeReservationType": "ANY_ALLOCATION"
        },
        "displayDevice": {
          "enableDisplay": false
        },
        "shieldedInstanceConfig": {
          "enableSecureBoot": false,
          "enableVtpm": true,
          "enableIntegrityMonitoring": true
        },
        "@type": "type.googleapis.com/compute.instances.insert"
      },
      "response": {
        "id": "6369220010983126039",
        "name": "operation-1683540728152-5fb2bdf110052-8d9223d7-901da2c1",
        "zone": "https://www.googleapis.com/compute/v1/projects/**-****-****/zones/us-east1-c",
        "operationType": "insert",
        "targetLink": "https://www.googleapis.com/compute/v1/projects/**-****-****/zones/us-east1-c/instances/dask-8ef7b12a-scheduler",
        "targetId": "7446392160937931799",
        "status": "RUNNING",
        "user": "*******-**@**-****-****.iam.gserviceaccount.com",
        "progress": "0",
        "insertTime": "2023-05-08T03:12:08.909-07:00",
        "startTime": "2023-05-08T03:12:08.910-07:00",
        "selfLink": "https://www.googleapis.com/compute/v1/projects/**-****-****/zones/us-east1-c/operations/operation-1683540728152-5fb2bdf110052-8d9223d7-901da2c1",
        "selfLinkWithId": "https://www.googleapis.com/compute/v1/projects/**-****-****/zones/us-east1-c/operations/6369220010983126039",
        "@type": "type.googleapis.com/operation"
      },
      "resourceLocation": {
        "currentLocations": [
          "us-east1-c"
        ]
      }
    },
    "insertId": "-g2r0xuejrjyu",
    "resource": {
      "type": "gce_instance",
      "labels": {
        "zone": "us-east1-c",
        "project_id": "**-****-****",
        "instance_id": "7446392160937931799"
      }
    },
    "timestamp": "2023-05-08T10:12:08.181048Z",
    "severity": "NOTICE",
    "logName": "projects/**-****-****/logs/cloudaudit.googleapis.com%2Factivity",
    "operation": {
      "id": "operation-1683540728152-5fb2bdf110052-8d9223d7-901da2c1",
      "producer": "compute.googleapis.com",
      "first": true
    },
    "receiveTimestamp": "2023-05-08T10:12:09.510022825Z"
  },
  {
    "insertId": "0",
    "jsonPayload": {
      "@type": "type.googleapis.com/cloud_integrity.IntegrityEvent",
      "startupEvent": {},
      "bootCounter": "1"
    },
    "resource": {
      "type": "gce_instance",
      "labels": {
        "project_id": "**-****-****",
        "zone": "us-east1-c",
        "instance_id": "7446392160937931799"
      }
    },
    "timestamp": "2023-05-08T10:12:12.705225299Z",
    "severity": "NOTICE",
    "logName": "projects/**-****-****/logs/compute.googleapis.com%2Fshielded_vm_integrity",
    "receiveTimestamp": "2023-05-08T10:12:14.718245835Z"
  },
  {
    "protoPayload": {
      "@type": "type.googleapis.com/google.cloud.audit.AuditLog",
      "authenticationInfo": {
        "principalEmail": "*******-**@**-****-****.iam.gserviceaccount.com",
        "serviceAccountKeyName": "//iam.googleapis.com/projects/**-****-****/serviceAccounts/*******-**@**-****-****.iam.gserviceaccount.com/keys/*********************************",
        "principalSubject": "serviceAccount:*******-**@**-****-****.iam.gserviceaccount.com"
      },
      "requestMetadata": {
        "callerIp": "**.**.***.***",
        "callerSuppliedUserAgent": "(gzip),gzip(gfe)",
        "requestAttributes": {},
        "destinationAttributes": {}
      },
      "serviceName": "compute.googleapis.com",
      "methodName": "v1.compute.instances.insert",
      "resourceName": "projects/**-****-****/zones/us-east1-c/instances/dask-8ef7b12a-scheduler",
      "request": {
        "@type": "type.googleapis.com/compute.instances.insert"
      }
    },
    "insertId": "-crcxord8p9k",
    "resource": {
      "type": "gce_instance",
      "labels": {
        "zone": "us-east1-c",
        "project_id": "**-****-****",
        "instance_id": "7446392160937931799"
      }
    },
    "timestamp": "2023-05-08T10:12:13.459567Z",
    "severity": "NOTICE",
    "logName": "projects/**-****-****/logs/cloudaudit.googleapis.com%2Factivity",
    "operation": {
      "id": "operation-1683540728152-5fb2bdf110052-8d9223d7-901da2c1",
      "producer": "compute.googleapis.com",
      "last": true
    },
    "receiveTimestamp": "2023-05-08T10:12:13.681857331Z"
  },
  {
    "insertId": "1",
    "jsonPayload": {
      "@type": "type.googleapis.com/cloud_integrity.IntegrityEvent",
      "bootCounter": "1",
      "earlyBootReportEvent": {
        "policyMeasurements": [
          {
            "value": "0KUbR/mg3+8bwW0xjQtbxWh/1gI=",
            "hashAlgo": "SHA1",
            "pcrNum": "PCR_0"
          },
          {
            "pcrNum": "PCR_7",
            "hashAlgo": "SHA1",
            "value": "jwk4ZGvqD/g7cbCA762EALidNFw="
          }
        ],
        "policyEvaluationPassed": true,
        "actualMeasurements": [
          {
            "hashAlgo": "SHA1",
            "pcrNum": "PCR_0",
            "value": "0KUbR/mg3+8bwW0xjQtbxWh/1gI="
          },
          {
            "pcrNum": "PCR_1",
            "hashAlgo": "SHA1",
            "value": "KtcGVWNIdo9pYPgCGcLsoGrtI6Q="
          },
          {
            "hashAlgo": "SHA1",
            "pcrNum": "PCR_2",
            "value": "sqg7Dr8vg3Qpmlsr38MeqVWtcjY="
          },
          {
            "pcrNum": "PCR_3",
            "hashAlgo": "SHA1",
            "value": "sqg7Dr8vg3Qpmlsr38MeqVWtcjY="
          },
          {
            "value": "oP1c2wBIal+pGhF9lzBYFCXpPfM=",
            "hashAlgo": "SHA1",
            "pcrNum": "PCR_4"
          },
          {
            "value": "AuI+039CzYHi8qp4KyhiWOouLcs=",
            "hashAlgo": "SHA1",
            "pcrNum": "PCR_5"
          },
          {
            "hashAlgo": "SHA1",
            "value": "sqg7Dr8vg3Qpmlsr38MeqVWtcjY=",
            "pcrNum": "PCR_6"
          },
          {
            "hashAlgo": "SHA1",
            "pcrNum": "PCR_7",
            "value": "jwk4ZGvqD/g7cbCA762EALidNFw="
          }
        ]
      }
    },
    "resource": {
      "type": "gce_instance",
      "labels": {
        "instance_id": "7446392160937931799",
        "zone": "us-east1-c",
        "project_id": "**-****-****"
      }
    },
    "timestamp": "2023-05-08T10:12:14.078796466Z",
    "severity": "NOTICE",
    "logName": "projects/**-****-****/logs/compute.googleapis.com%2Fshielded_vm_integrity",
    "receiveTimestamp": "2023-05-08T10:12:14.718245835Z"
  },
  {
    "insertId": "2",
    "jsonPayload": {
      "lateBootReportEvent": {
        "policyEvaluationPassed": true,
        "policyMeasurements": [
          {
            "value": "0KUbR/mg3+8bwW0xjQtbxWh/1gI=",
            "pcrNum": "PCR_0",
            "hashAlgo": "SHA1"
          },
          {
            "pcrNum": "PCR_4",
            "value": "vWn2UuE90y1JkDVGJ0AftI2wUbY=",
            "hashAlgo": "SHA1"
          },
          {
            "value": "jwk4ZGvqD/g7cbCA762EALidNFw=",
            "hashAlgo": "SHA1",
            "pcrNum": "PCR_7"
          }
        ],
        "actualMeasurements": [
          {
            "value": "0KUbR/mg3+8bwW0xjQtbxWh/1gI=",
            "hashAlgo": "SHA1",
            "pcrNum": "PCR_0"
          },
          {
            "pcrNum": "PCR_1",
            "value": "KtcGVWNIdo9pYPgCGcLsoGrtI6Q=",
            "hashAlgo": "SHA1"
          },
          {
            "hashAlgo": "SHA1",
            "pcrNum": "PCR_2",
            "value": "sqg7Dr8vg3Qpmlsr38MeqVWtcjY="
          },
          {
            "value": "sqg7Dr8vg3Qpmlsr38MeqVWtcjY=",
            "hashAlgo": "SHA1",
            "pcrNum": "PCR_3"
          },
          {
            "pcrNum": "PCR_4",
            "value": "vWn2UuE90y1JkDVGJ0AftI2wUbY=",
            "hashAlgo": "SHA1"
          },
          {
            "hashAlgo": "SHA1",
            "pcrNum": "PCR_5",
            "value": "B7aW7ovPgQJoFsfmRrWtFUGU17I="
          },
          {
            "value": "sqg7Dr8vg3Qpmlsr38MeqVWtcjY=",
            "pcrNum": "PCR_6",
            "hashAlgo": "SHA1"
          },
          {
            "hashAlgo": "SHA1",
            "pcrNum": "PCR_7",
            "value": "jwk4ZGvqD/g7cbCA762EALidNFw="
          }
        ]
      },
      "@type": "type.googleapis.com/cloud_integrity.IntegrityEvent",
      "bootCounter": "1"
    },
    "resource": {
      "type": "gce_instance",
      "labels": {
        "zone": "us-east1-c",
        "project_id": "**-****-****",
        "instance_id": "7446392160937931799"
      }
    },
    "timestamp": "2023-05-08T10:12:29.032293358Z",
    "severity": "NOTICE",
    "logName": "projects/**-****-****/logs/compute.googleapis.com%2Fshielded_vm_integrity",
    "receiveTimestamp": "2023-05-08T10:12:31.041739950Z"
  }
]

Hi @toanlekafi, welcome here.

I found this link: Google Cloud Platform — Dask Cloud Provider 2021.6.0+48.gf1965ad documentation. But it’s true that it is not very detailed. Do you see any authorization rejections in your logs?

I’m not sure exactly how dask-cloudprovider works, but maybe it waits for the Scheduler to be fully started before creating Worker instances. So if the Scheduler is not accessible, nothing else will happen.

cc @jacobtomlinson.

Yes, it waits for the scheduler to be accessible before launching the workers. This is a limitation of the SpecCluster class in distributed. The problem is usually firewall rules not allowing access to the scheduler.

Once your cluster is hung in that state, can you check whether you can access the dashboard on the scheduler VM?
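
For instance, a quick check from the machine running your client (just a sketch; the address is the placeholder IP your cluster prints) could be:

import urllib.request

scheduler_ip = "xx.xxx.x.xx"  # placeholder: the IP from "Waiting for scheduler to run at <ip>:8786"
try:
    with urllib.request.urlopen(f"http://{scheduler_ip}:8787/status", timeout=10) as resp:
        print("Dashboard reachable, HTTP status:", resp.status)
except Exception as exc:
    print("Dashboard not reachable:", exc)

If that times out, the firewall rules in front of the scheduler VM are the first thing to look at.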

Thanks @jacobtomlinson and @guillaumeeb for the replies.

I added these four VPC rules and got some progress:

  • egress 0.0.0.0/0 on all ports for downloading docker images and general data access
  • ingress 10.0.0.0/8 on all ports for internal communication of workers
  • ingress 0.0.0.0/0 on 8786-8787 for external accessibility of the dashboard/scheduler
  • (optional) ingress 0.0.0.0/0 on 22 for ssh access

The Worker instance is now created.

But then I got an asyncio.exceptions.CancelledError.

Trace:

CancelledError                            Traceback (most recent call last)
Cell In[26], line 6
      1 with GCPCluster(n_workers=1, zone='asia-northeast1-c', machine_type='n2-standard-2') as cluster:
      2     with Client(cluster) as client:
----> 3         print(da.random.random((1000, 1000), chunks=(100, 100)).mean().compute())

File ~/miniconda3/envs/data_env/lib/python3.9/site-packages/dask/base.py:315, in DaskMethodsMixin.compute(self, **kwargs)
    291 def compute(self, **kwargs):
    292     """Compute this dask collection
    293 
    294     This turns a lazy Dask collection into its in-memory equivalent.
   (...)
    313     dask.base.compute
    314     """
--> 315     (result,) = compute(self, traverse=False, **kwargs)
    316     return result

File ~/miniconda3/envs/data_env/lib/python3.9/site-packages/dask/base.py:603, in compute(traverse, optimize_graph, scheduler, get, *args, **kwargs)
    600     keys.append(x.__dask_keys__())
    601     postcomputes.append(x.__dask_postcompute__())
--> 603 results = schedule(dsk, keys, **kwargs)
    604 return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])

File ~/miniconda3/envs/data_env/lib/python3.9/site-packages/distributed/client.py:3000, in Client.get(self, dsk, keys, workers, allow_other_workers, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, actors, **kwargs)
   2998         should_rejoin = False
   2999 try:
-> 3000     results = self.gather(packed, asynchronous=asynchronous, direct=direct)
   3001 finally:
   3002     for f in futures.values():

File ~/miniconda3/envs/data_env/lib/python3.9/site-packages/distributed/client.py:2174, in Client.gather(self, futures, errors, direct, asynchronous)
   2172 else:
   2173     local_worker = None
-> 2174 return self.sync(
   2175     self._gather,
   2176     futures,
   2177     errors=errors,
   2178     direct=direct,
   2179     local_worker=local_worker,
   2180     asynchronous=asynchronous,
   2181 )

File ~/miniconda3/envs/data_env/lib/python3.9/site-packages/distributed/utils.py:338, in SyncMethodMixin.sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    336     return future
    337 else:
--> 338     return sync(
    339         self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
    340     )

File ~/miniconda3/envs/data_env/lib/python3.9/site-packages/distributed/utils.py:405, in sync(loop, func, callback_timeout, *args, **kwargs)
    403 if error:
    404     typ, exc, tb = error
--> 405     raise exc.with_traceback(tb)
    406 else:
    407     return result

File ~/miniconda3/envs/data_env/lib/python3.9/site-packages/distributed/utils.py:378, in sync.<locals>.f()
    376         future = asyncio.wait_for(future, callback_timeout)
    377     future = asyncio.ensure_future(future)
--> 378     result = yield future
    379 except Exception:
    380     error = sys.exc_info()

File ~/miniconda3/envs/data_env/lib/python3.9/site-packages/tornado/gen.py:762, in Runner.run(self)
    759 exc_info = None
    761 try:
--> 762     value = future.result()
    763 except Exception:
    764     exc_info = sys.exc_info()

File ~/miniconda3/envs/data_env/lib/python3.9/site-packages/distributed/client.py:2038, in Client._gather(self, futures, errors, direct, local_worker)
   2036     else:
   2037         raise exception.with_traceback(traceback)
-> 2038     raise exc
   2039 if errors == "skip":
   2040     bad_keys.add(key)

CancelledError: ('mean_agg-aggregate-f86fb7189c2aca41759d4514c3ffbb30',)

The code works fine using a LocalCluster, e.g.:

from distributed import LocalCluster, Client
import dask.array as da

with LocalCluster() as cluster:
    with Client(cluster) as client:
        print(da.random.random((1000, 1000), chunks=(100, 100)).mean().compute())

Does the error come right after the Dask cluster is spawned? Can you open the Dashboard to get some insight into what might be happening?
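
If the dashboard is hard to reach, one option (just a sketch wrapped around your existing snippet) is to dump the scheduler and worker logs through the client right after the failure, while the instances still exist:

import dask.array as da
from dask_cloudprovider.gcp import GCPCluster
from distributed import Client

with GCPCluster(n_workers=1, zone='asia-northeast1-c', machine_type='n2-standard-2') as cluster:
    with Client(cluster) as client:
        try:
            print(da.random.random((1000, 1000), chunks=(100, 100)).mean().compute())
        except Exception:
            # Grab the logs before the cluster is torn down, then re-raise
            print(client.get_scheduler_logs())
            print(client.get_worker_logs())
            raise

Those logs often show whether the workers ever managed to connect back to the scheduler.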

I am facing similar issues with EC2Cluster, but my workers are never created, the scheduler is terminated, and the process hangs:

import dask.array as da
from dask_cloudprovider.aws import EC2Cluster
from distributed import Client

with EC2Cluster(env_vars=get_aws_credentials(), ami='an-ami-id', vpc='a-vpc-id', bootstrap=True,
                security_groups=['a-security-group'], security=False, debug=True,
                n_workers=2) as cluster:
    with Client(cluster) as client:
        print(da.random.random((1000, 1000), chunks=(100, 100)).mean().compute())

The security group has these rules:

Inbound
Custom TCP    TCP   8786 - 878   0.0.0.0/0	
All TCP       TCP   0 - 65535    10.0.0.0/8

Outbound
All traffic	All	All	0.0.0.0/0

It creates the scheduler with no problem, but no workers. Any advice would be appreciated.

Hi @jshleap, did you try what @jacobtomlinson suggested above? Can you access the scheduler dashboard?

Hey @guillaumeeb, unfortunately I cannot access the dashboard even though the security group is set for 8787 (that got cropped in the info I put above). Am I missing any other permissions?
Interestingly enough, the scheduler is terminated in AWS after some time, but the call just hangs in the notebook.

Is there a way to control the AMI for the worker nodes? Would they use the same AMI as the scheduler? The only thing I can think of is that the scheduler doesn’t have permission to create worker nodes, either through missing credentials or a wrong AMI (the account restricts which AMIs can be used). Any suggestions?

All the Dask cluster nodes will use the same AMI.

The Scheduler doesn’t create the workers; it’s the Python process in which you are building the EC2Cluster that does it.

According to @jacobtomlinson’s previous answer:

it waits for the scheduler to be accessible before launching the workers. This is a limitation of the SpecCluster class in distributed.

If your Scheduler is not accessible, then no workers will be created.
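
A quick way to test that from the machine where you run EC2Cluster (just a sketch; the address is the placeholder IP your cluster prints in “Waiting for scheduler to run at <ip>:8786”) is a plain TCP connection to port 8786:

import socket

scheduler_ip = "xx.xxx.x.xx"  # placeholder: the scheduler instance's public IP
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.settimeout(10)
try:
    sock.connect((scheduler_ip, 8786))
    print("Port 8786 is reachable")
except OSError as exc:
    print("Cannot reach the scheduler port:", exc)
finally:
    sock.close()

If this fails, the issue is network/firewall configuration rather than Dask itself.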

@guillaumeeb thanks for the answer, but then why wouldn’t it be accessible? All firewalls and permissions are set as expected, no? I can create the scheduler, so it’s not about credentials. All required ports are open, so it should not be permissions. Any idea how I can proceed?

Edit: I even tried to allow all inbound and outbound access in the security group, with the same outcome.

This is probably related to some missing setup, but honestly I’m not an expert in setting these things up, so it’s hard for me to help. Are there any rules on the Client side, i.e. the Python process in which you execute the code? Are you able to create a LocalCluster and access its dashboard?

Yup, LocalCluster works fine; it’s just dask-cloudprovider that fails.

I’m getting the same issue - one thing I’ve noticed is that this only happens when I initiate the cluster locally. In that case it seems to hang indefinitely. If I run the exact same repo on a VM it works fine. I’m using the same service key locally as on the VM.

So there must be some configuration missing locally?

What do you mean by locally? On your laptop? Is the network configuration allowing it to talk to the Scheduler?