Suppose I create the following graph:
import dask
import time


@dask.delayed
def step_1():
    print("Running Step 1")
    time.sleep(1)
    return True


@dask.delayed
def step_2(prev_step):
    print("Running Step 2")
    time.sleep(1)
    return True


@dask.delayed
def step_3a(prev_step):
    print("Running Step 3a")
    time.sleep(1)
    return True


@dask.delayed
def step_3b(prev_step):
    print("Running Step 3b")
    time.sleep(1)
    return True


# Build the graph: step_1 -> step_2 -> (step_3a, step_3b)
stp_1 = step_1()
stp_2 = step_2(stp_1)
stp_3a = step_3a(stp_2)
stp_3b = step_3b(stp_2)

from dask import visualize
visualize([stp_3a, stp_3b])
from dask.distributed import Client, LocalCluster
cluster = LocalCluster(n_workers=1, threads_per_worker=3, dashboard_address="localhost:27998")
client = Client(cluster)
client
Now, I compute step_3a, which should take about 3 seconds:
start = time.perf_counter()
stp_3a_futures = client.compute(stp_3a) # So that the future stays in memory
stp_3a_results = client.gather(stp_3a_futures)
duration = time.perf_counter() - start
print(duration)
[Out]: 3.1600782200694084
This makes sense. But now, when I execute step_3b, I expect it to finish in about one second, since step_1 and step_2 have already been computed. Unfortunately, the cluster doesn't keep those two intermediate results in memory, and the computation for step_3b also takes about 3 seconds:
start = time.perf_counter()
stp_3b_futures = client.compute(stp_3b) # So that the future stays in memory
stp_3b_results = client.gather(stp_3b_futures)
duration = time.perf_counter() - start
print(duration)
[Out]: 3.0438701044768095
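As far as I understand the distributed API, a quick way to confirm that the intermediate results really are gone is to ask the client which keys the worker still holds right after the first gather (a sketch; I haven't included its output here):

# Sketch: inspect what the worker still holds after computing step_3a.
# Client.has_what() maps each worker address to the keys it currently has in memory.
# At this point I only expect to see the key for stp_3a itself (because
# stp_3a_futures still references it); the step_1/step_2 intermediates
# should already have been released.
print(client.has_what())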
Now, my question is:

- Is there a way to keep step_2 and step_1 in the cluster's memory using ONLY the delayed object of step_3a (i.e., stp_3a)?
I know I can call client.persist() on stp_2, but that's not the answer I'm looking for. In my use case, by the time I compute step_3a, I won't have any reference to the delayed object for step_2.
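For reference, this is roughly what that persist() workaround looks like (a minimal sketch, assuming stp_2 is still in scope, which is exactly what my real use case does not allow):

# Workaround I know about but can't use: persist the intermediate result
# while I still have a reference to it. client.persist() on a delayed object
# returns an equivalent delayed backed by futures kept on the cluster.
stp_2_kept = client.persist(stp_2)

stp_3a_new = step_3a(stp_2_kept)
stp_3b_new = step_3b(stp_2_kept)

# Once the persisted intermediate has finished, each dependent computation
# should take roughly one second, because step_1/step_2 stay in cluster
# memory instead of being recomputed.
print(client.gather(client.compute(stp_3a_new)))
print(client.gather(client.compute(stp_3b_new)))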
Thank you in advance to those of you who can answer.