Hello,
I have three interconnected nodes:
- Node-A hosts the scheduler.
- Node-B hosts one Dask worker, started with a custom resource "cores": 128.
- Node-C also hosts one Dask worker, with the same "cores": 128 resource (a quick check of this setup is sketched below).
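For completeness, this is roughly how I confirm that both workers and their "cores" resources are registered with the scheduler (the address matches the one used in the script below; the expected output is my assumption about what a healthy setup looks like):

from dask.distributed import Client

# Connect to the scheduler on Node-A and list the registered workers
# together with their declared resources.
client = Client("localhost:8786")
for addr, info in client.scheduler_info()["workers"].items():
    print(addr, info.get("resources"))  # I expect {'cores': 128} for both B and C
client.close()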
The following code does not distribute the hyperparameter computations between nodes B and C, and I would like to understand why. Is it due to suboptimal scheduler choices, or is there a programming error on my part?
from dask.distributed import Client
import dask
import time
import os

dask.config.set({'distributed.worker.daemon': False})

@dask.delayed
def reading():
    print(f"reading() on {os.environ.get('HOSTNAME')} ...")
    time.sleep(2)
    print("reading() done")
    return 0

@dask.delayed
def train(x, h):
    print(f"train(x,{h}) on {os.environ.get('HOSTNAME')} ...")
    time.sleep(2)
    print(f"train(x,{h}) done")
    return x + h

if __name__ == "__main__":
    client = Client("localhost:8786")

    # Single shared input task, annotated to need 1 "cores" resource.
    with dask.annotate(resources={"cores": 1}):
        reading_op = reading()

    # One train task per hyperparameter, each annotated to need 100 of the
    # 128 "cores", so each worker should only run one of them at a time.
    training_ops = []
    hyperparameters = [4, 8, 16, 32]
    for h in hyperparameters:
        with dask.annotate(resources={"cores": 100}):
            train_op = train(reading_op, h)  # the compute is well distributed if I replace reading_op with a literal 0
        training_ops.append(train_op)

    start_total_time = time.time()
    out = dask.compute(*training_ops)
    print(out)
    print("compute time:", time.time() - start_total_time)

    client.close()
The two workers do run train() in parallel as soon as I remove the dependency on the reading operation.
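To be concrete, this is roughly the variant I mean (same script as above, reusing the train() definition and the client; only the first argument to train() changes from reading_op to a literal 0):

hyperparameters = [4, 8, 16, 32]
training_ops = []
for h in hyperparameters:
    with dask.annotate(resources={"cores": 100}):
        # No shared reading_op dependency: each train task takes a literal 0.
        training_ops.append(train(0, h))

out = dask.compute(*training_ops)  # with this version, the four tasks spread over both workers
print(out)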