Seemingly more overhead than expected for scheduling tasks

I am interested in running dask on a single machine for serving ML inference code. The request that the production code handles often have parallelizable subtasks, and should generally be completed between 50ms and 1s. I am not interested in any of the scientific computing capabilities of dask, I just appreciate the delayed and Future APIs that allow me to express and parallelize a complex DAG of tasks without radically changing my code. I also like the dashboard: it would let me monitor the execution of incoming queries and profile the slow points of my pipeline. Running the dashboard however means that I have to run the distributed scheduler, even though everything happens on one machine.

I’ve read that the overhead for scheduling a task is about 200-300us for the non-distributed scheduler, and about 1ms for the distributed scheduler [1] [2]. I was initially confused because the dashboard seem to indicate a much larger overhead of 5-15ms between tasks (see my post below for screenshots).

I’ve ran benchmarks and I drew a few conclusions:

  • Having the dashboard opened in a web page will add overhead
  • Running with debug support (I’m using pycharm) will add overhead
  • Dask needs to be “primed” with a first computation to avoid the startup overhead (that is stated somewhere in the docs too)
  • The task stream visualization is probably not very accurate. I’m assuming it has a high margin of error, maybe 10ms like the profiler has.

For reference this is the “proper” benchmark script I ended up with:
import time
import dask
from distributed import Client

client = Client(processes=False, n_workers=1, threads_per_worker=1)

N = 50

def inc(x):
    for _ in range(100_000):
        x += 1
    return x

start = time.perf_counter()
inp = 1
for i in range(N):
    inp = inc(inp)
python_time = time.perf_counter() - start

# Let's prime the scheduler to avoid the first call being slower

start = time.perf_counter()
inp = 1
for i in range(N):
    inp = dask.delayed(inc)(inp)
dask_time = time.perf_counter() - start

print(f"Dask time: {dask_time:.03f}")
print(f"Python time: {python_time:.03f}")
dask_overhead = (dask_time - python_time) / N
print(f"Dask overhead per task: {1000 * dask_overhead:.3f}ms")

With this benchmark, I do get the overhead announced:

Dask time: 0.241
Python time: 0.186
Dask overhead per task: 1.116ms

I’d like to know if the conclusions I drew are accurate. I’d also like to know if they justify the screenshots I joined. Specifically, for the single threaded example, there is a whole 20ms between that light green task and the dark blue one, with only 3 tiny tasks (these are __get_attribute__ to delayed objects). 20 milliseconds is a lot for my use case, and I’d like to know if that overhead is accurate and if yes, if dask is the cause of it. I can send a performance report instead of screenshots if I haven’t provided enough information.

I would also like to know what dask developers or users think about the usage I am making of dask. I have personally not found a library that allows such a nice paradigm for defining task dependencies, and I have plenty of uses for the dashboard.

Dashboard screenshots (in a separate post due to forum limitations)
Multiple threads
Single thread