Distributed tasks not starting at the same time

Hi, I’m using Dask to run multiple data processing pipelines in parallel to overcome the GIL.

I use dask.delayed to wrap each pipeline and a for loop to build the list of tasks to run.

I have 16 data points that I want to process, and I have created 16 workers, so theoretically, all tasks should run in parallel.

As shown in the attached image, the Task Processing section of the dashboard shows all 16 workers processing tasks.

But when everything finished and appeared in the Task Stream section, I observed that not all tasks started at the same time.
So, although each task's processing time is roughly equal, the total runtime is almost (sometimes more than) double the maximum time of a single task.
Does anyone have an idea why this happens, and if possible, how to fix it?

Hi @amalibnu,

Dask Distributed starts a process for each worker, with several services per process (such as the worker dashboard). It's not unusual to see a few seconds' delay between the creation of the Dask cluster and the moment when all workers are up and ready to process a task. If your tasks only last a few seconds, this overhead can show up exactly as you describe. I'm not sure there is a good way to avoid it. If your workflow is really simple and you don't need to go distributed, maybe the multiprocessing package is enough?