Hi, I have a question about Dask worker & thread configuration. I'm using Dask to improve the running time and CPU utilization of an existing pipeline, but I'm a bit confused about how Dask workers and threads work. This is my situation:
I have a CPU-heavy data processing pipeline, wrapped in a single function “run_pipeline”.
I need to run this pipeline on multiple pieces of data; for simplicity, assume there are 6 data groups to process.
I wrap each pipeline run in dask.delayed, so there are 6 tasks to be distributed (see the sketch after this list).
I have 2 PCs connected to each other, each with 12 (out of 16) threads reserved for running this pipeline, so in total I have 2 × 12 = 24 threads.
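For context, my setup looks roughly like this (a minimal sketch; “run_pipeline” is a stand-in for my actual code, and the scheduler address is a placeholder):

```python
import dask
from dask import delayed
from dask.distributed import Client

# Connect to the scheduler that the workers on both PCs point at
# (address is a placeholder)
client = Client("tcp://<scheduler-ip>:8786")

def run_pipeline(data_group):
    ...  # the CPU-heavy processing for one data group

# One delayed call per data group -> 6 independent tasks for the cluster
tasks = [delayed(run_pipeline)(group) for group in range(6)]
results = dask.compute(*tasks)
```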
My questions are:
Will each task be executed by a single worker? (In that case, the optimal setting per PC would be: dask worker tcp://[Local IP] --nthreads 4 --nworkers 3, where each task is processed by 4 threads.)
Or is each task still executed by a single thread, so that multiple tasks can be assigned to the same worker (since a single worker has multiple threads) instead of utilizing all resources?
If the second option is the case, what is the recommended configuration to utilize as many resources as possible?
By design, Dask attaches each task to one thread by default (originally this was for processing chunks of big datasets). It is not a full-featured scheduling system.
So if you want your tasks to be evenly distributed and not end up on only one node, the simplest solution is to launch either 3 workers with 1 thread each per node, or 1 worker with 3 threads per node.
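Concretely, that would be something like this on each PC (scheduler address is a placeholder):

```bash
# either 3 single-threaded workers per node...
dask worker tcp://<scheduler-ip>:8786 --nworkers 3 --nthreads 1

# ...or 1 worker with 3 threads per node
dask worker tcp://<scheduler-ip>:8786 --nworkers 1 --nthreads 3
```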
Be careful though: depending on your code (for example, pure-Python code that doesn't release the GIL), multithreading might not help. You can try both, but multiple workers per node is probably the safer option.
Hi, thank you for the warm welcome. Sorry if my question sounds basic; I'm pretty new to Dask and the world of distributed processing.
Regarding my question: if I use --nthreads 1 --nworkers 3, can each pipeline only utilize 1 thread?
To improve CPU utilization, would it be possible to use multiprocessing.Pool(4) inside each pipeline? Or will it be capped at 1 thread, since each pipeline is executed inside a single worker that only has a single thread?
Well, it's one Python thread as far as Dask is concerned, but that doesn't mean it will be capped to one CPU thread; it all depends on your code.
I wouldn't recommend mixing Dask and multiprocessing, though. If you can parallelize your pipeline, either do it with low-level libraries, or, if you want to use Dask, just use Dask inside your pipeline too.
That said, no, it shouldn't be capped, so using multiprocessing inside a task should be possible in theory.
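If you do try it, a sketch could look like this (untested; split_into_chunks, process_chunk, and combine are placeholders for your own code):

```python
from multiprocessing import Pool

def run_pipeline(data_group):
    chunks = split_into_chunks(data_group)  # your own partitioning logic
    # Dask sees this as one task on one worker thread, but internally
    # it fans out to 4 OS processes
    with Pool(4) as pool:
        partial_results = pool.map(process_chunk, chunks)
    return combine(partial_results)         # your own aggregation step
```

One caveat I'm aware of: workers started under a nanny run as daemon processes by default, which can prevent tasks from spawning child processes; if you hit that, look at the distributed.worker.daemon configuration setting.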
I try to avoid using Dask inside the pipeline code, because my goal is an agnostic system that can process any black-box data pipeline in a distributed manner.
So if I understand correctly, the nthreads setting on a Dask worker only books resources for that worker to pick up and execute tasks. The task execution itself may then utilize as many threads as are available on the machine (depending on the code, e.g. using multiprocessing as in my previous example), rather than being limited to the resources attached to the worker?