I am interested in running dask on a single machine for serving ML inference code. The requests that the production code handles often have parallelizable subtasks and should generally complete in between 50ms and 1s. I am not interested in any of the scientific computing capabilities of dask; I just appreciate the delayed and Future APIs, which let me express and parallelize a complex DAG of tasks without radically changing my code. I also like the dashboard: it would let me monitor the execution of incoming queries and profile the slow points of my pipeline. Running the dashboard, however, means that I have to run the distributed scheduler, even though everything happens on one machine.
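To give a concrete idea of the usage I have in mind, here is a minimal sketch of the pattern (preprocess, model_a, model_b and combine are made-up stand-ins for my real pipeline stages, not actual code from my service):

import dask
from distributed import Client

# In-process client: everything runs on this machine, but the distributed
# scheduler (and therefore the dashboard) is available.
client = Client(processes=False)

def preprocess(request):   # stand-in for real feature extraction
    return request

def model_a(features):     # stand-in for a first model
    return len(str(features))

def model_b(features):     # stand-in for a second, independent model
    return len(str(features)) * 2

def combine(a, b):         # stand-in for merging the two results
    return a + b

def handle_request(request):
    # The DAG is expressed with dask.delayed; model_a and model_b do not
    # depend on each other, so the scheduler can run them in parallel.
    features = dask.delayed(preprocess)(request)
    a = dask.delayed(model_a)(features)
    b = dask.delayed(model_b)(features)
    return dask.delayed(combine)(a, b).compute()

print(handle_request({"user_id": 42}))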
I’ve read that the overhead for scheduling a task is about 200-300µs for the non-distributed scheduler, and about 1ms for the distributed scheduler [1] [2]. I was initially confused because the dashboard seems to indicate a much larger overhead of 5-15ms between tasks (see my post below for screenshots).
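To make that comparison concrete, the same chain of trivial tasks can be pushed through both kinds of scheduler and timed; something along these lines (a rough sketch, not the measurements from [1] or [2]):

import time

import dask
from distributed import Client

N = 1000

def noop(x):
    return x

def per_task_seconds(**compute_kwargs):
    # Build a chain of N trivial tasks, then time only the compute() call.
    d = dask.delayed(noop)(0)
    for _ in range(N):
        d = dask.delayed(noop)(d)
    start = time.perf_counter()
    d.compute(**compute_kwargs)
    return (time.perf_counter() - start) / N

# Non-distributed schedulers
print(f"synchronous: {per_task_seconds(scheduler='synchronous') * 1e6:.0f} us/task")
print(f"threads:     {per_task_seconds(scheduler='threads') * 1e6:.0f} us/task")

# Distributed scheduler (becomes the default once the Client exists)
client = Client(processes=False, n_workers=1, threads_per_worker=1)
print(f"distributed: {per_task_seconds() * 1e6:.0f} us/task")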
I’ve run benchmarks and drawn a few conclusions:
- Having the dashboard open in a web page adds overhead (see the sketch after this list for how I rule this out)
- Running with debugger support (I’m using PyCharm) adds overhead
- Dask needs to be “primed” with a first computation to avoid the startup overhead (this is stated somewhere in the docs too)
- The task stream visualization is probably not very accurate. I’m assuming it has a high margin of error, maybe 10ms like the profiler has.
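For the dashboard and priming points, this is roughly how I isolate them (a sketch; I believe passing dashboard_address=None to the in-process client disables the dashboard entirely, and the throwaway compute() absorbs the startup cost):

import dask
from distributed import Client

# Same in-process setup as the benchmark below, but with no dashboard at all,
# to compare against a run where the dashboard page is open in a browser.
client = Client(processes=False, n_workers=1, threads_per_worker=1,
                dashboard_address=None)

# "Prime" the scheduler with a throwaway computation so the first timed
# request does not pay the startup overhead.
dask.delayed(lambda: None)().compute()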
For reference, this is the “proper” benchmark script I ended up with:
benchmark.py
import time

import dask
from distributed import Client

# In-process distributed scheduler, single worker, single thread
client = Client(processes=False, n_workers=1, threads_per_worker=1)

N = 50

def inc(x):
    # Do a bit of real work so the task is not purely overhead
    for _ in range(100_000):
        x += 1
    return x

# Plain Python baseline
start = time.perf_counter()
inp = 1
for i in range(N):
    inp = inc(inp)
python_time = time.perf_counter() - start

# Let's prime the scheduler to avoid the first call being slower
dask.delayed(inc)(1).compute()

# Same chain of N tasks, but scheduled through dask
start = time.perf_counter()
inp = 1
for i in range(N):
    inp = dask.delayed(inc)(inp)
inp.compute()
dask_time = time.perf_counter() - start

print(f"Dask time: {dask_time:.03f}")
print(f"Python time: {python_time:.03f}")
dask_overhead = (dask_time - python_time) / N
print(f"Dask overhead per task: {1000 * dask_overhead:.3f}ms")
With this benchmark, I do get roughly the advertised overhead:
Dask time: 0.241
Python time: 0.186
Dask overhead per task: 1.116ms
I’d like to know if the conclusions I drew are accurate, and whether they explain the screenshots I attached. Specifically, for the single-threaded example, there is a whole 20ms between that light green task and the dark blue one, with only 3 tiny tasks in between (these are __get_attribute__ calls on delayed objects). 20 milliseconds is a lot for my use case, and I’d like to know whether that overhead figure is accurate and, if so, whether dask is the cause of it. I can send a performance report instead of screenshots if I haven’t provided enough information.
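If it helps, this is how I would generate that report (a sketch; performance_report comes from distributed and captures everything computed inside the with block into an HTML file):

import dask
from distributed import Client, performance_report

client = Client(processes=False, n_workers=1, threads_per_worker=1)

def inc(x):
    return x + 1

# Everything computed inside this block goes into the report, including
# the task stream and profile data.
with performance_report(filename="dask-report.html"):
    d = dask.delayed(inc)(1)
    for _ in range(50):
        d = dask.delayed(inc)(d)
    d.compute()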
I would also like to know what dask developers or users think about this way of using dask. I have personally not found another library that offers such a nice paradigm for defining task dependencies, and I have plenty of uses for the dashboard.