More overhead than expected for scheduling tasks

I am interested in running Dask on a single machine to serve ML inference code. The requests that the production code handles often have parallelizable subtasks and should generally complete in between 50ms and 1s. I am not interested in any of the scientific computing capabilities of Dask; I just appreciate the delayed and Future APIs, which let me express and parallelize a complex DAG of tasks without radically changing my code. I also like the dashboard: it would let me monitor the execution of incoming queries and profile the slow points of my pipeline. Running the dashboard, however, means that I have to run the distributed scheduler, even though everything happens on one machine.
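
For context, the shape of such a DAG looks something like this (a toy sketch; preprocess, model_a, model_b and combine are hypothetical stand-ins for my real subtasks):

import dask

# Hypothetical stand-ins for the real inference subtasks
def preprocess(request):
    return request * 2

def model_a(features):
    return features + 1

def model_b(features):
    return features + 2

def combine(a, b):
    return a + b

features = dask.delayed(preprocess)(21)
a = dask.delayed(model_a)(features)  # model_a and model_b don't depend on
b = dask.delayed(model_b)(features)  # each other, so they can run in parallel
result = dask.delayed(combine)(a, b)
print(result.compute())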

I’ve read that the overhead for scheduling a task is about 200-300µs for the non-distributed scheduler, and about 1ms for the distributed scheduler [1] [2]. I was initially confused because the dashboard seems to indicate a much larger overhead of 5-15ms between tasks (see my post below for screenshots).

I’ve run benchmarks and drawn a few conclusions:

  • Having the dashboard open in a web page adds overhead
  • Running with debug support (I’m using PyCharm) adds overhead
  • Dask needs to be “primed” with a first computation to avoid the startup overhead (this is stated somewhere in the docs too)
  • The task stream visualization is probably not very accurate. I’m assuming it has a high margin of error, maybe 10ms like the profiler has.

For reference, this is the “proper” benchmark script I ended up with:
import time

import dask
from distributed import Client

# One in-process worker with a single thread, so nothing runs in parallel
# and the measured difference is pure scheduling overhead
client = Client(processes=False, n_workers=1, threads_per_worker=1)

N = 50

def inc(x):
    # Busy loop so each task takes a few milliseconds
    for _ in range(100_000):
        x += 1
    return x

# Baseline: plain Python, no Dask
start = time.perf_counter()
inp = 1
for i in range(N):
    inp = inc(inp)
python_time = time.perf_counter() - start

# Prime the scheduler so the slower first computation is excluded from the timing
dask.delayed(inc)(1).compute()

start = time.perf_counter()
inp = 1
for i in range(N):
    inp = dask.delayed(inc)(inp)
inp = inp.compute()
dask_time = time.perf_counter() - start

print(f"Dask time: {dask_time:.3f}")
print(f"Python time: {python_time:.3f}")
dask_overhead = (dask_time - python_time) / N
print(f"Dask overhead per task: {1000 * dask_overhead:.3f}ms")

With this benchmark, I do get the announced overhead:

Dask time: 0.241
Python time: 0.186
Dask overhead per task: 1.116ms

I’d like to know if the conclusions I drew are accurate, and whether they explain the screenshots I attached. Specifically, for the single-threaded example, there is a whole 20ms between that light green task and the dark blue one, with only 3 tiny tasks in between (these are __getattribute__ calls on delayed objects). 20 milliseconds is a lot for my use case, and I’d like to know whether that overhead is real and, if so, whether Dask is the cause of it. I can send a performance report instead of screenshots if I haven’t provided enough information.
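
If it helps, I would generate the report with distributed’s performance_report context manager, something like this (inc is just a stand-in for my real pipeline):

import dask
from dask.distributed import Client, performance_report

client = Client(processes=False, n_workers=1, threads_per_worker=1)

def inc(x):
    return x + 1

# Everything computed inside the block is recorded in an HTML report
with performance_report(filename="dask-report.html"):
    dask.delayed(inc)(1).compute()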

I would also like to know what Dask developers or users think about this usage of Dask. I have personally not found another library that allows such a nice paradigm for defining task dependencies, and I have plenty of uses for the dashboard.

Dashboard screenshots (in a separate post due to forum limitations):
  • Multiple threads
  • Single thread

Hi @CorentinJ,

First, sorry it took so long for you to get a first answer.

Second, I’m sorry, I won’t personally be able to answer all your questions, but I’ll try to help somehow.

Anyway, thanks for this interesting and detailed post.

About your conclusions:

Some of those are definitely true; I’m not sure about the others. I wouldn’t be surprised if the dashboard adds overhead. Did you try disabling it completely?
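
If I remember correctly, you can disable it entirely when creating the local cluster, something like this:

from dask.distributed import Client

# dashboard_address=None should disable the dashboard server entirely
client = Client(processes=False, n_workers=1, threads_per_worker=1,
                dashboard_address=None)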

Which code are you using to get those screenshots? Is it the code posted above? I tested it in JupyterLab with an open dashboard and got between 3 and 5ms of latency between tasks. That is about what you get in the single-thread case if we take the tiny tasks into account. I’m afraid Dask is the cause of these few milliseconds of latency.

Using only Delayed or Future is a perfectly valid usage of Dask; a lot of people do this. The main limitation in your use case seems to be the short duration of each task. As you probably noticed, the Efficiency page you linked says:

If your functions run faster than 100ms or so then you might not see any speedup from using distributed computing

So clearly you’re at the limit.
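
One common mitigation, if your subtasks can be grouped, is to batch several of them into a single Delayed call so that each task comfortably exceeds the per-task overhead. A rough sketch (small_step and batch_size are made up for illustration):

import dask

def small_step(x):
    # Stand-in for one short (~1ms) subtask
    return x + 1

def batched_steps(x, batch_size=20):
    # Run batch_size subtasks inside a single Dask task,
    # paying the ~1ms scheduling overhead once per batch
    for _ in range(batch_size):
        x = small_step(x)
    return x

result = dask.delayed(batched_steps)(1).compute()
print(result)  # 21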