Stuck at "Waiting for scheduler to run"

Hello,

I’m trying to deploy a cluster on Google Cloud with dask_cloudprovider.gcp.
I’ve removed everything related to my project and I’m basically just creating the cluster.
I see that the scheduler gets created, but then the script gets stuck and nothing happens.

Can anybody point me in a direction to debug this situation?


Hi @dragospopa420, welcome here!

Could you be a bit more precise about what you are doing? Which script is getting stuck, and where?

Hi,

I am having a similar problem when trying to run the test at Google Cloud Platform — Dask Cloud Provider 2021.6.0+48.gf1965ad documentation. This is my code:

import dask.array as da
from dask_cloudprovider.gcp import GCPCluster
from distributed import Client

with GCPCluster(
    n_workers=1,
    zone='europe-west4-a',
    projectid='my_working_project_id',
    asynchronous=False,
    debug=True,
    silence_logs=False,
    source_image='projects/ubuntu-os-cloud/global/images/ubuntu-minimal-1804-bionic-v20230502',
) as cluster:
    with Client(cluster) as client:
        print(da.random.random((1000, 1000), chunks=(100, 100)).mean().compute())

The code runs fine in the sense that I see the scheduler instance is created successfully. It then prints ‘Waiting for scheduler to run at xx.xxx.x.xx:8786’ and hangs, so the second with statement is never reached.

Hi @msignore, welcome here!

Do you also see a Worker instance created? Could you share all the logs you got on your instances?

Hi @guillaumeeb, I got the same problem.
There is no Worker instance created. Just one Scheduler Instance.
Is it possible that the Service Account doesn’t have enough permissions?
Which roles/permissions should we grant to the Service Account? I can’t find this anywhere in the Dask documentation.

Log of the Scheduler Instance:

[
  {
    "protoPayload": {
      "@type": "type.googleapis.com/google.cloud.audit.AuditLog",
      "authenticationInfo": {
        "principalEmail": "*******-**@**-****-****.iam.gserviceaccount.com",
        "serviceAccountKeyName": "//iam.googleapis.com/projects/**-****-****/serviceAccounts/*******-**@**-****-****.iam.gserviceaccount.com/keys/*********************************",
        "principalSubject": "serviceAccount:*******-**@**-****-****.iam.gserviceaccount.com"
      },
      "requestMetadata": {
        "callerIp": "**.**.***.***",
        "callerSuppliedUserAgent": "(gzip),gzip(gfe)",
        "requestAttributes": {
          "time": "2023-05-08T10:12:08.999051Z",
          "auth": {}
        },
        "destinationAttributes": {}
      },
      "serviceName": "compute.googleapis.com",
      "methodName": "v1.compute.instances.insert",
      "authorizationInfo": [
        {
          "permission": "compute.instances.create",
          "granted": true,
          "resourceAttributes": {
            "service": "compute",
            "name": "projects/**-****-****/zones/us-east1-c/instances/dask-8ef7b12a-scheduler",
            "type": "compute.instances"
          }
        },
        {
          "permission": "compute.disks.create",
          "granted": true,
          "resourceAttributes": {
            "service": "compute",
            "name": "projects/**-****-****/zones/us-east1-c/disks/dask-8ef7b12a-scheduler",
            "type": "compute.disks"
          }
        },
        {
          "permission": "compute.subnetworks.use",
          "granted": true,
          "resourceAttributes": {
            "service": "compute",
            "name": "projects/**-****-****/regions/us-east1/subnetworks/default",
            "type": "compute.subnetworks"
          }
        },
        {
          "permission": "compute.subnetworks.useExternalIp",
          "granted": true,
          "resourceAttributes": {
            "service": "compute",
            "name": "projects/**-****-****/regions/us-east1/subnetworks/default",
            "type": "compute.subnetworks"
          }
        },
        {
          "permission": "compute.instances.setMetadata",
          "granted": true,
          "resourceAttributes": {
            "service": "compute",
            "name": "projects/**-****-****/zones/us-east1-c/instances/dask-8ef7b12a-scheduler",
            "type": "compute.instances"
          }
        },
        {
          "permission": "compute.instances.setTags",
          "granted": true,
          "resourceAttributes": {
            "service": "compute",
            "name": "projects/**-****-****/zones/us-east1-c/instances/dask-8ef7b12a-scheduler",
            "type": "compute.instances"
          }
        },
        {
          "permission": "compute.instances.setLabels",
          "granted": true,
          "resourceAttributes": {
            "service": "compute",
            "name": "projects/**-****-****/zones/us-east1-c/instances/dask-8ef7b12a-scheduler",
            "type": "compute.instances"
          }
        },
        {
          "permission": "compute.instances.setServiceAccount",
          "granted": true,
          "resourceAttributes": {
            "service": "compute",
            "name": "projects/**-****-****/zones/us-east1-c/instances/dask-8ef7b12a-scheduler",
            "type": "compute.instances"
          }
        }
      ],
      "resourceName": "projects/**-****-****/zones/us-east1-c/instances/dask-8ef7b12a-scheduler",
      "request": {
        "name": "dask-8ef7b12a-scheduler",
        "tags": {
          "tags": [
            "http-server",
            "https-server"
          ]
        },
        "machineType": "zones/us-east1-c/machineTypes/n2-standard-2",
        "canIpForward": false,
        "networkInterfaces": [
          {
            "accessConfigs": [
              {
                "type": "ONE_TO_ONE_NAT",
                "name": "External NAT",
                "networkTier": "PREMIUM"
              }
            ],
            "subnetwork": "projects/**-****-****/regions/us-east1/subnetworks/default"
          }
        ],
        "disks": [
          {
            "type": "PERSISTENT",
            "mode": "READ_WRITE",
            "deviceName": "dask-8ef7b12a-scheduler",
            "boot": true,
            "initializeParams": {
              "sourceImage": "projects/ubuntu-os-cloud/global/images/ubuntu-minimal-1804-bionic-v20201014",
              "diskSizeGb": "50",
              "diskType": "projects/**-****-****/zones/us-east1-c/diskTypes/pd-standard"
            },
            "autoDelete": true
          }
        ],
        "serviceAccounts": [
          {
            "email": "default",
            "scopes": [
              "https://www.googleapis.com/auth/devstorage.read_write",
              "https://www.googleapis.com/auth/logging.write",
              "https://www.googleapis.com/auth/monitoring.write"
            ]
          }
        ],
        "scheduling": {
          "onHostMaintenance": "TERMINATE",
          "automaticRestart": true,
          "preemptible": false
        },
        "labels": [
          {
            "key": "container_vm",
            "value": "dask-cloudprovider"
          }
        ],
        "deletionProtection": false,
        "reservationAffinity": {
          "consumeReservationType": "ANY_ALLOCATION"
        },
        "displayDevice": {
          "enableDisplay": false
        },
        "shieldedInstanceConfig": {
          "enableSecureBoot": false,
          "enableVtpm": true,
          "enableIntegrityMonitoring": true
        },
        "@type": "type.googleapis.com/compute.instances.insert"
      },
      "response": {
        "id": "6369220010983126039",
        "name": "operation-1683540728152-5fb2bdf110052-8d9223d7-901da2c1",
        "zone": "https://www.googleapis.com/compute/v1/projects/**-****-****/zones/us-east1-c",
        "operationType": "insert",
        "targetLink": "https://www.googleapis.com/compute/v1/projects/**-****-****/zones/us-east1-c/instances/dask-8ef7b12a-scheduler",
        "targetId": "7446392160937931799",
        "status": "RUNNING",
        "user": "*******-**@**-****-****.iam.gserviceaccount.com",
        "progress": "0",
        "insertTime": "2023-05-08T03:12:08.909-07:00",
        "startTime": "2023-05-08T03:12:08.910-07:00",
        "selfLink": "https://www.googleapis.com/compute/v1/projects/**-****-****/zones/us-east1-c/operations/operation-1683540728152-5fb2bdf110052-8d9223d7-901da2c1",
        "selfLinkWithId": "https://www.googleapis.com/compute/v1/projects/**-****-****/zones/us-east1-c/operations/6369220010983126039",
        "@type": "type.googleapis.com/operation"
      },
      "resourceLocation": {
        "currentLocations": [
          "us-east1-c"
        ]
      }
    },
    "insertId": "-g2r0xuejrjyu",
    "resource": {
      "type": "gce_instance",
      "labels": {
        "zone": "us-east1-c",
        "project_id": "**-****-****",
        "instance_id": "7446392160937931799"
      }
    },
    "timestamp": "2023-05-08T10:12:08.181048Z",
    "severity": "NOTICE",
    "logName": "projects/**-****-****/logs/cloudaudit.googleapis.com%2Factivity",
    "operation": {
      "id": "operation-1683540728152-5fb2bdf110052-8d9223d7-901da2c1",
      "producer": "compute.googleapis.com",
      "first": true
    },
    "receiveTimestamp": "2023-05-08T10:12:09.510022825Z"
  },
  {
    "insertId": "0",
    "jsonPayload": {
      "@type": "type.googleapis.com/cloud_integrity.IntegrityEvent",
      "startupEvent": {},
      "bootCounter": "1"
    },
    "resource": {
      "type": "gce_instance",
      "labels": {
        "project_id": "**-****-****",
        "zone": "us-east1-c",
        "instance_id": "7446392160937931799"
      }
    },
    "timestamp": "2023-05-08T10:12:12.705225299Z",
    "severity": "NOTICE",
    "logName": "projects/**-****-****/logs/compute.googleapis.com%2Fshielded_vm_integrity",
    "receiveTimestamp": "2023-05-08T10:12:14.718245835Z"
  },
  {
    "protoPayload": {
      "@type": "type.googleapis.com/google.cloud.audit.AuditLog",
      "authenticationInfo": {
        "principalEmail": "*******-**@**-****-****.iam.gserviceaccount.com",
        "serviceAccountKeyName": "//iam.googleapis.com/projects/**-****-****/serviceAccounts/*******-**@**-****-****.iam.gserviceaccount.com/keys/*********************************",
        "principalSubject": "serviceAccount:*******-**@**-****-****.iam.gserviceaccount.com"
      },
      "requestMetadata": {
        "callerIp": "**.**.***.***",
        "callerSuppliedUserAgent": "(gzip),gzip(gfe)",
        "requestAttributes": {},
        "destinationAttributes": {}
      },
      "serviceName": "compute.googleapis.com",
      "methodName": "v1.compute.instances.insert",
      "resourceName": "projects/**-****-****/zones/us-east1-c/instances/dask-8ef7b12a-scheduler",
      "request": {
        "@type": "type.googleapis.com/compute.instances.insert"
      }
    },
    "insertId": "-crcxord8p9k",
    "resource": {
      "type": "gce_instance",
      "labels": {
        "zone": "us-east1-c",
        "project_id": "**-****-****",
        "instance_id": "7446392160937931799"
      }
    },
    "timestamp": "2023-05-08T10:12:13.459567Z",
    "severity": "NOTICE",
    "logName": "projects/**-****-****/logs/cloudaudit.googleapis.com%2Factivity",
    "operation": {
      "id": "operation-1683540728152-5fb2bdf110052-8d9223d7-901da2c1",
      "producer": "compute.googleapis.com",
      "last": true
    },
    "receiveTimestamp": "2023-05-08T10:12:13.681857331Z"
  },
  {
    "insertId": "1",
    "jsonPayload": {
      "@type": "type.googleapis.com/cloud_integrity.IntegrityEvent",
      "bootCounter": "1",
      "earlyBootReportEvent": {
        "policyMeasurements": [
          {
            "value": "0KUbR/mg3+8bwW0xjQtbxWh/1gI=",
            "hashAlgo": "SHA1",
            "pcrNum": "PCR_0"
          },
          {
            "pcrNum": "PCR_7",
            "hashAlgo": "SHA1",
            "value": "jwk4ZGvqD/g7cbCA762EALidNFw="
          }
        ],
        "policyEvaluationPassed": true,
        "actualMeasurements": [
          {
            "hashAlgo": "SHA1",
            "pcrNum": "PCR_0",
            "value": "0KUbR/mg3+8bwW0xjQtbxWh/1gI="
          },
          {
            "pcrNum": "PCR_1",
            "hashAlgo": "SHA1",
            "value": "KtcGVWNIdo9pYPgCGcLsoGrtI6Q="
          },
          {
            "hashAlgo": "SHA1",
            "pcrNum": "PCR_2",
            "value": "sqg7Dr8vg3Qpmlsr38MeqVWtcjY="
          },
          {
            "pcrNum": "PCR_3",
            "hashAlgo": "SHA1",
            "value": "sqg7Dr8vg3Qpmlsr38MeqVWtcjY="
          },
          {
            "value": "oP1c2wBIal+pGhF9lzBYFCXpPfM=",
            "hashAlgo": "SHA1",
            "pcrNum": "PCR_4"
          },
          {
            "value": "AuI+039CzYHi8qp4KyhiWOouLcs=",
            "hashAlgo": "SHA1",
            "pcrNum": "PCR_5"
          },
          {
            "hashAlgo": "SHA1",
            "value": "sqg7Dr8vg3Qpmlsr38MeqVWtcjY=",
            "pcrNum": "PCR_6"
          },
          {
            "hashAlgo": "SHA1",
            "pcrNum": "PCR_7",
            "value": "jwk4ZGvqD/g7cbCA762EALidNFw="
          }
        ]
      }
    },
    "resource": {
      "type": "gce_instance",
      "labels": {
        "instance_id": "7446392160937931799",
        "zone": "us-east1-c",
        "project_id": "**-****-****"
      }
    },
    "timestamp": "2023-05-08T10:12:14.078796466Z",
    "severity": "NOTICE",
    "logName": "projects/**-****-****/logs/compute.googleapis.com%2Fshielded_vm_integrity",
    "receiveTimestamp": "2023-05-08T10:12:14.718245835Z"
  },
  {
    "insertId": "2",
    "jsonPayload": {
      "lateBootReportEvent": {
        "policyEvaluationPassed": true,
        "policyMeasurements": [
          {
            "value": "0KUbR/mg3+8bwW0xjQtbxWh/1gI=",
            "pcrNum": "PCR_0",
            "hashAlgo": "SHA1"
          },
          {
            "pcrNum": "PCR_4",
            "value": "vWn2UuE90y1JkDVGJ0AftI2wUbY=",
            "hashAlgo": "SHA1"
          },
          {
            "value": "jwk4ZGvqD/g7cbCA762EALidNFw=",
            "hashAlgo": "SHA1",
            "pcrNum": "PCR_7"
          }
        ],
        "actualMeasurements": [
          {
            "value": "0KUbR/mg3+8bwW0xjQtbxWh/1gI=",
            "hashAlgo": "SHA1",
            "pcrNum": "PCR_0"
          },
          {
            "pcrNum": "PCR_1",
            "value": "KtcGVWNIdo9pYPgCGcLsoGrtI6Q=",
            "hashAlgo": "SHA1"
          },
          {
            "hashAlgo": "SHA1",
            "pcrNum": "PCR_2",
            "value": "sqg7Dr8vg3Qpmlsr38MeqVWtcjY="
          },
          {
            "value": "sqg7Dr8vg3Qpmlsr38MeqVWtcjY=",
            "hashAlgo": "SHA1",
            "pcrNum": "PCR_3"
          },
          {
            "pcrNum": "PCR_4",
            "value": "vWn2UuE90y1JkDVGJ0AftI2wUbY=",
            "hashAlgo": "SHA1"
          },
          {
            "hashAlgo": "SHA1",
            "pcrNum": "PCR_5",
            "value": "B7aW7ovPgQJoFsfmRrWtFUGU17I="
          },
          {
            "value": "sqg7Dr8vg3Qpmlsr38MeqVWtcjY=",
            "pcrNum": "PCR_6",
            "hashAlgo": "SHA1"
          },
          {
            "hashAlgo": "SHA1",
            "pcrNum": "PCR_7",
            "value": "jwk4ZGvqD/g7cbCA762EALidNFw="
          }
        ]
      },
      "@type": "type.googleapis.com/cloud_integrity.IntegrityEvent",
      "bootCounter": "1"
    },
    "resource": {
      "type": "gce_instance",
      "labels": {
        "zone": "us-east1-c",
        "project_id": "**-****-****",
        "instance_id": "7446392160937931799"
      }
    },
    "timestamp": "2023-05-08T10:12:29.032293358Z",
    "severity": "NOTICE",
    "logName": "projects/**-****-****/logs/compute.googleapis.com%2Fshielded_vm_integrity",
    "receiveTimestamp": "2023-05-08T10:12:31.041739950Z"
  }
]

Hi @toanlekafi, welcome here.

I found this link: Google Cloud Platform — Dask Cloud Provider 2021.6.0+48.gf1965ad documentation. But it’s true that it is not very detailed. Do you see any authorization rejections in your logs?

I’m not sure exactly how dask-cloudprovider works, but maybe it waits for the Scheduler to be fully started before creating Worker instances. So if the Scheduler is not reachable, nothing else happens.

cc @jacobtomlinson.

Yes, it waits for the scheduler to be accessible before launching the workers. This is a limitation of the SpecCluster class in distributed. The problem is usually firewall rules not allowing access to the scheduler.

Once your cluster hangs in that state, can you check whether you can access the dashboard on the scheduler VM?
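One quick way to check reachability from the client machine is to test whether the scheduler (8786) and dashboard (8787) TCP ports accept connections. A minimal sketch using only the standard library; the IP placeholder is whatever address GCPCluster printed:

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Replace with the external IP printed by GCPCluster, e.g.:
# print(port_open("xx.xxx.x.xx", 8786))  # scheduler
# print(port_open("xx.xxx.x.xx", 8787))  # dashboard
```

If both return False from the client machine, the firewall is the likely culprit.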

Thanks @jacobtomlinson and @guillaumeeb for the replies.

I added these four VPC rules and got some progress:

  • egress 0.0.0.0/0 on all ports for downloading docker images and general data access
  • ingress 10.0.0.0/8 on all ports for internal communication of workers
  • ingress 0.0.0.0/0 on 8786-8787 for external accessibility of the dashboard/scheduler
  • (optional) ingress 0.0.0.0/0 on 22 for SSH access
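For reference, rules like the ingress ones above can be created with gcloud. This is only a sketch: the rule names are placeholders, it assumes the default VPC network, and note that GCP allows all egress by default, so an explicit egress rule is usually only needed if you have added deny rules:

```shell
# Placeholder rule names; adapt source ranges to your own setup.
gcloud compute firewall-rules create dask-scheduler \
    --direction=INGRESS --action=ALLOW \
    --rules=tcp:8786-8787 --source-ranges=0.0.0.0/0

gcloud compute firewall-rules create dask-internal \
    --direction=INGRESS --action=ALLOW \
    --rules=all --source-ranges=10.0.0.0/8
```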

Worker instance is created.

But then I got an asyncio.exceptions.CancelledError.

Trace:

CancelledError                            Traceback (most recent call last)
Cell In[26], line 6
      1 with GCPCluster(n_workers=1, zone='asia-northeast1-c', machine_type='n2-standard-2') as cluster:
      2     with Client(cluster) as client:
----> 3         print(da.random.random((1000, 1000), chunks=(100, 100)).mean().compute())

File ~/miniconda3/envs/data_env/lib/python3.9/site-packages/dask/base.py:315, in DaskMethodsMixin.compute(self, **kwargs)
    291 def compute(self, **kwargs):
    292     """Compute this dask collection
    293 
    294     This turns a lazy Dask collection into its in-memory equivalent.
   (...)
    313     dask.base.compute
    314     """
--> 315     (result,) = compute(self, traverse=False, **kwargs)
    316     return result

File ~/miniconda3/envs/data_env/lib/python3.9/site-packages/dask/base.py:603, in compute(traverse, optimize_graph, scheduler, get, *args, **kwargs)
    600     keys.append(x.__dask_keys__())
    601     postcomputes.append(x.__dask_postcompute__())
--> 603 results = schedule(dsk, keys, **kwargs)
    604 return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])

File ~/miniconda3/envs/data_env/lib/python3.9/site-packages/distributed/client.py:3000, in Client.get(self, dsk, keys, workers, allow_other_workers, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, actors, **kwargs)
   2998         should_rejoin = False
   2999 try:
-> 3000     results = self.gather(packed, asynchronous=asynchronous, direct=direct)
   3001 finally:
   3002     for f in futures.values():

File ~/miniconda3/envs/data_env/lib/python3.9/site-packages/distributed/client.py:2174, in Client.gather(self, futures, errors, direct, asynchronous)
   2172 else:
   2173     local_worker = None
-> 2174 return self.sync(
   2175     self._gather,
   2176     futures,
   2177     errors=errors,
   2178     direct=direct,
   2179     local_worker=local_worker,
   2180     asynchronous=asynchronous,
   2181 )

File ~/miniconda3/envs/data_env/lib/python3.9/site-packages/distributed/utils.py:338, in SyncMethodMixin.sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    336     return future
    337 else:
--> 338     return sync(
    339         self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
    340     )

File ~/miniconda3/envs/data_env/lib/python3.9/site-packages/distributed/utils.py:405, in sync(loop, func, callback_timeout, *args, **kwargs)
    403 if error:
    404     typ, exc, tb = error
--> 405     raise exc.with_traceback(tb)
    406 else:
    407     return result

File ~/miniconda3/envs/data_env/lib/python3.9/site-packages/distributed/utils.py:378, in sync.<locals>.f()
    376         future = asyncio.wait_for(future, callback_timeout)
    377     future = asyncio.ensure_future(future)
--> 378     result = yield future
    379 except Exception:
    380     error = sys.exc_info()

File ~/miniconda3/envs/data_env/lib/python3.9/site-packages/tornado/gen.py:762, in Runner.run(self)
    759 exc_info = None
    761 try:
--> 762     value = future.result()
    763 except Exception:
    764     exc_info = sys.exc_info()

File ~/miniconda3/envs/data_env/lib/python3.9/site-packages/distributed/client.py:2038, in Client._gather(self, futures, errors, direct, local_worker)
   2036     else:
   2037         raise exception.with_traceback(traceback)
-> 2038     raise exc
   2039 if errors == "skip":
   2040     bad_keys.add(key)

CancelledError: ('mean_agg-aggregate-f86fb7189c2aca41759d4514c3ffbb30',)

The code works fine using a LocalCluster, e.g.:

from distributed import LocalCluster, Client
import dask.array as da

with LocalCluster() as cluster:
    with Client(cluster) as client:
        print(da.random.random((1000, 1000), chunks=(100, 100)).mean().compute())

Does the error come right after the Dask cluster is spawned? Can you open the Dashboard to get some insight into what might be happening?