Scheduler not saturating workers

Hello,

I have been having serious issues with the Dask scheduler. I recently upgraded from dask/distributed 2022.6.1 (py3.8) to 2023.7.0 (py3.9) (while solving "Workers constantly dying"). I have been using Dask to call subprocesses for long-running .exe simulations, which worked quite well. After upgrading, it no longer behaves well at all.

No matter what settings I change, I cannot seem to keep all workers busy. The scheduler appears to pile jobs onto specific workers and leaves other workers completely idle until I resubmit my workload. It starts off okay, but eventually most workers sit idle. I have tried tuning the scheduler configuration to various values with no luck. Some of these probably have zero effect, but these are the values I initially started varying:

os.environ['DASK_DISTRIBUTED__SCHEDULER__ACTIVE_MEMORY_MANAGER__START'] = 'True'
os.environ['DASK_DISTRIBUTED__SCHEDULER__ACTIVE_MEMORY_MANAGER__INTERVAL'] = '1s'
os.environ['DASK_DISTRIBUTED__SCHEDULER__ACTIVE_MEMORY_MANAGER__MEASURE'] = 'optimistic'
os.environ['DASK_DISTRIBUTED__COMM__RETRY__COUNT'] = '0' # Default == 0
os.environ['DASK_DISTRIBUTED__COMM__RETRY__DELAY__MIN'] = '1s' # default == '1s'
os.environ['DASK_DISTRIBUTED__DEPLOY__LOST-WORKER-TIMEOUT'] = '15s' # default == '15s'
os.environ['DASK_DISTRIBUTED__SCHEDULER__ALLOWED-FAILURES'] = '3' # default == 3
os.environ['DASK_DISTRIBUTED__SCHEDULER__WORKER-TTL'] = '5 minutes' # default == '5 minutes'
os.environ['DASK_DISTRIBUTED__SCHEDULER__WORK_STEALING'] = "True"
os.environ['DASK_DISTRIBUTED__SCHEDULER__DEFAULT_TASK_DURATIONS__SWEEP'] = '5min'
os.environ['DASK_DISTRIBUTED__SCHEDULER__WORKER_SATURATION'] = '1.1'
os.environ['DASK_DISTRIBUTED__SCHEDULER__UNKNOWN_TASK_DURATION'] = '1s'
os.environ['DASK_DISTRIBUTED__SCHEDULER__TRANSITION_LOG_LENGTH'] = '10000'
os.environ['DASK_DISTRIBUTED__SCHEDULER__EVENTS_LOG_LENGTH'] = '10000'
os.environ['DASK_DISTRIBUTED__SCHEDULER__DASHBOARD__STATUS__TASK_STREAM_LENGTH'] = '500'
os.environ['DASK_DISTRIBUTED__SCHEDULER__DASHBOARD__TASKS__TASK_STREAM_LENGTH'] = '50000'

Currently: [dashboard screenshot]

The past: [dashboard screenshot]

I have come across a lot of issues/documentation that might be helpful:

- Similar to my problem, I think: [link]
- Came across, but unsure if related: [link]
- https://distributed.dask.org/en/latest/work-stealing.html
- https://distributed.dask.org/en/latest/scheduling-policies.html

Any tips would be greatly appreciated!

Hi @nickvazz,

The main question is: could you come up with some minimal reproducer?

One of the big changes between the two versions that might affect you is the new scheduler-side queuing of root-ish tasks, controlled by distributed.scheduler.worker-saturation (its default changed from inf to 1.1).

You could try to restore the previous behavior with:

with dask.config.set({"distributed.scheduler.worker-saturation": "inf"}):
    ...  # create your cluster/client inside this context, so the scheduler starts with the setting applied
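Since you are already setting config through environment variables, another option (assuming you can set this in the environment of the scheduler process before it starts) would be:

import os

# must be set where (and before) the scheduler process starts
os.environ['DASK_DISTRIBUTED__SCHEDULER__WORKER_SATURATION'] = 'inf'  # instead of '1.1'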

Hi @guillaumeeb,

Thanks for the quick response! I can't seem to come up with a minimal reproducer, but this looks like it roughly does the same thing: Root-ish tasks all schedule onto one worker · Issue #6573 · dask/distributed · GitHub, but with long-running tasks (roughly 3-45 minutes).

I tried

with dask.config.set({"distributed.scheduler.worker-saturation": "inf"}):
    client = Client('tcp://scheduler')

But the issue still persists.

Something that might be helpful is a little more information about how I am using the workers. I am starting N workers on each machine, each with one thread and 4GB of memory. The workers themselves never run into memory issues, but I think that is because each simulation is called as a subprocess from within the worker, so the worker does not seem to be aware of the memory used by the .exe it calls. Might this be partially to blame for why backlogged work isn't moving to idle workers?
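For reference, this is roughly how I could at least watch the .exe's memory from inside the task (just a sketch, assuming psutil is available on the workers; run_sim_with_memory_check is a hypothetical variant of my run_sim):

import subprocess
import time

import psutil  # assumed available on the workers

def run_sim_with_memory_check(sim_file):
    p = subprocess.Popen(["sim.exe", sim_file])
    child = psutil.Process(p.pid)
    peak_rss = 0
    while p.poll() is None:
        try:
            # track the .exe's resident memory, which the worker's own
            # memory monitor never sees
            peak_rss = max(peak_rss, child.memory_info().rss)
        except psutil.NoSuchProcess:
            break
        time.sleep(5)
    return p.returncode, peak_rss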

Another thought I had was that maybe I could force queuing for all tasks beyond the number of workers * threads, i.e. 3 workers with 1 thread each on 5 machines → 15 workers: start tasks 1-15 and queue tasks 16-M. Does that seem possible?
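Something like the following is what I have in mind, just a sketch with placeholder file names and a made-up concurrency limit, throttling submissions with distributed.as_completed:

import os
from distributed import Client, as_completed

client = Client('tcp://scheduler')

sim_files = ['a.sim', 'b.sim']   # placeholder file list
sim_priorities = [1, 2]          # placeholder priorities

def run_sim(sim_file):
    # stand-in for the real subprocess call
    return sim_file

max_in_flight = 15  # workers * threads
work = list(zip(sim_files, sim_priorities))

# submit only the first batch, keep the rest in a backlog
ac = as_completed([
    client.submit(run_sim, f, key=os.path.basename(f), priority=p)
    for f, p in work[:max_in_flight]
])
backlog = work[max_in_flight:]

results = []
for finished in ac:
    results.append(finished.result())
    if backlog:  # top up a new task as each slot frees
        f, p = backlog.pop(0)
        ac.add(client.submit(run_sim, f, key=os.path.basename(f), priority=p))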

This sort of thing happens: [dashboard screenshot]

The first set of jobs finishes, but then they all get thrown onto a specific worker for some reason: [dashboard screenshot]

You shouldn’t have to do this.

Could you share at least some code snippet that shows how you are submitting tasks to Dask?

Roughly, what I am doing is:

import glob
import subprocess
import os

from distributed import Client

client = Client('tcp://scheduler')

def get_priority_of_sim_file(sim_file):
    # derive a scheduling priority from the sim file (details omitted)
    return priority

def run_sim(sim_file):
    # call the long-running external simulation and capture its output
    p = subprocess.Popen(["sim.exe", sim_file],
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    result, error = p.communicate()
    return result

sim_files = glob.glob("/some/directory/*.py")
sim_priorities = list(map(get_priority_of_sim_file, sim_files))

futures = []
for sim_file, priority in zip(sim_files, sim_priorities):
    key = os.path.basename(sim_file)
    future = client.submit(run_sim, sim_file, key=key, priority=priority)
    futures.append(future)

results = client.gather(futures)

Hi @nickvazz,

Your pseudocode looks very simple. Do you think you could mock the output of get_priority_of_sim_file so that it produces a realistic distribution, and mock run_sim with a time.sleep whose durations mimic your actual tasks? Since I'm seeing <100 tasks, you could easily measure both in your real code and hardcode them in a list.
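Something along these lines would do (the priorities and durations here are random placeholders just to illustrate the shape; you would drop in your measured values instead, and probably scale the sleeps down for a quick test):

import random
import time

from distributed import Client

client = Client('tcp://scheduler')

# placeholders: replace with the measured priorities and durations (seconds)
n_tasks = 100
sim_priorities = [random.randint(0, 10) for _ in range(n_tasks)]
sim_durations = [random.uniform(180, 2700) for _ in range(n_tasks)]  # 3-45 min

def run_sim_mock(duration):
    # stand-in for the long-running external simulation
    time.sleep(duration)
    return duration

futures = [
    client.submit(run_sim_mock, d, key=f"sim-{i}", priority=p)
    for i, (d, p) in enumerate(zip(sim_durations, sim_priorities))
]
results = client.gather(futures)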

You definitely should not need to tamper with the dask config or absolutely anything else for this algorithm to work efficiently.

Hi @crusaderky,

Usually we run anywhere from 1k-30k simulations like that, each taking anywhere from 2 to 30 minutes, over about 250 cores (single threaded).

I tried just force-restarting the cluster and resubmitting the jobs in its own subprocess over and over, and got about 50% throughput.

I tried switching to a client.map rather than submitting each simulation individually (losing their key info and the ability to prioritize them), and it didn't quite work.
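(Looking at the docs again, client.map does seem to accept an explicit list of keys, though the priority apparently applies to the whole batch; something like this, if I revisit it:)

futures = client.map(
    run_sim,
    sim_files,
    key=[os.path.basename(f) for f in sim_files],  # keep the readable keys
    priority=5,  # one priority for the whole batch, unlike per-task submit
)
results = client.gather(futures)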

Our simulation leaves some files behind and, if force-killed, does not clean up after itself (and I had hard-killed the Dask workers a fair number of times). After manually cleaning up these files, things generally work as expected now, which is very confusing after a week and a half of misbehavior. So I have no idea what was wrong or why it would seemingly hang the cluster. Perhaps the dask-worker folder holds information about task duration? Anyway, thanks for checking in. Hopefully things stay smooth!
