Dilemma: Schedule IO-Bound / CPU-Bound tasks in cascaded clients

bozden · August 24, 2024, 9:06pm

It took a while (and asking ChatGPT) for me to understand what you meant. One of my problems was this: when I created a single cluster with all my cores and used it for both the file-level tasks and the transcode sub-processes, the cores filled up with file tasks.

Now I defined this:

cluster = LocalCluster(
    ...
    resources={"io_bound": 2, "cpu_bound": 10},
    ...
)

Then I used client.submit(... resources={"io_bound": 1}) at the upper level (files) and wrapped the transcode sub-processes in with dask.annotate(resources={"cpu_bound": 1}):. That did the trick. Thank you!
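For anyone landing here later, a minimal sketch of how these pieces can fit together. The functions read_file and transcode are hypothetical stand-ins for my real steps, and the single in-process worker is an assumption to keep the example small. It is also flattened: the sub-tasks are created on the client side rather than from inside the file tasks, so it only illustrates the resource tagging, not the cascaded setup.

```python
import dask
from dask.distributed import Client, LocalCluster


def read_file(name):
    # Hypothetical stand-in for the IO-bound file read: returns fake records.
    return [f"{name}:{i}" for i in range(4)]


def transcode(record):
    # Hypothetical stand-in for the CPU-bound transcode step.
    return record.upper()


def run_pipeline(file_names):
    # Resources are declared per worker; one in-process worker keeps this light.
    cluster = LocalCluster(
        n_workers=1,
        threads_per_worker=4,
        processes=False,
        dashboard_address=None,
        resources={"io_bound": 2, "cpu_bound": 10},
    )
    client = Client(cluster)
    try:
        # Each file task claims one "io_bound" unit, so at most 2 run at once.
        file_futures = [
            client.submit(read_file, name, resources={"io_bound": 1})
            for name in file_names
        ]
        records = [r for chunk in client.gather(file_futures) for r in chunk]

        # Each transcode task is annotated to claim one "cpu_bound" unit.
        with dask.annotate(resources={"cpu_bound": 1}):
            tasks = [dask.delayed(transcode)(r) for r in records]

        # optimize_graph=False so graph optimization does not drop annotations.
        futures = client.compute(tasks, optimize_graph=False)
        return client.gather(futures)
    finally:
        client.close()
        cluster.close()
```

Calling run_pipeline(["train", "dev", "test"]) returns the transcoded records in submission order. The resource names are arbitrary labels; the scheduler only counts them, it does not measure real IO or CPU use.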

For the main problem (where to read the data), I still do not have a solution; I keep getting warnings. To create sub-processes from large chunk reads, I used dask.bag, re-chunked it into smaller pieces, and used delayed.
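The re-chunking step looks roughly like the sketch below. This is not my exact code: transcode_partition and split_and_transcode are hypothetical names, the records are plain strings, and it uses the synchronous scheduler so it runs without a cluster. It also does nothing about the warnings mentioned above.

```python
import dask
import dask.bag as db


def transcode_partition(records):
    # Hypothetical CPU-bound step applied to one small partition.
    return [r.upper() for r in records]


def split_and_transcode(records, n_chunks):
    # Start from one large chunk, as if it came from a single big read.
    bag = db.from_sequence(records, npartitions=1)
    # Re-chunk into smaller partitions.
    smaller = bag.repartition(npartitions=n_chunks)
    # One Delayed object per partition.
    parts = smaller.to_delayed()
    tasks = [dask.delayed(transcode_partition)(p) for p in parts]
    # The synchronous scheduler is only for this sketch.
    results = dask.compute(*tasks, scheduler="synchronous")
    return [r for chunk in results for r in chunk]
```

With a distributed client, the list of delayed tasks could be built inside a dask.annotate block instead, so each partition claims a "cpu_bound" unit.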

Although it is far from perfect, it works better now. The latest run was on a 3-file dataset (train, dev, test) totaling 3607 records, which took 231 secs to transform, or 13 recs/sec.

At this speed, it will take 22-23 days to pre-process my data. I need to play more to optimize…
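The estimate is just throughput arithmetic, sketched below. The helper name estimate_days is made up for illustration, and the total of 25,000,000 records is a hypothetical figure, not my actual dataset size.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def estimate_days(total_records, records_per_second):
    # Wall-clock days needed at a constant processing rate.
    if records_per_second <= 0:
        raise ValueError("records_per_second must be positive")
    return total_records / records_per_second / SECONDS_PER_DAY


# Hypothetical total; 13 recs/sec is the rate quoted above.
days = estimate_days(25_000_000, 13)
print(round(days, 1))  # 22.3
```

Doubling the rate halves the estimate, so any speed-up on the small test set carries straight over to the full run, assuming the rate stays constant.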
