Dilemma: Schedule IO-Bound / CPU-Bound tasks in cascaded clients

It took me a while (and some help from ChatGPT) to understand what you meant. One of my problems was this: when I created a cluster with all my cores and used it for both the file tasks and the transcode sub-processes, the cores filled up with file processes.

Now I defined this:

cluster = LocalCluster(
    ...
    resources={"io_bound": 2, "cpu_bound": 10}
    ...
)

and used client.submit(... resources={"io_bound": 1}) at the upper level (files), and with dask.annotate(resources={"cpu_bound": 1}): in the transcode sub-processes. That did the trick. Thank you!
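For anyone landing here later, this is roughly what the resource setup looks like end to end. It is only a minimal sketch, not my real pipeline: read_file and transcode are hypothetical stand-ins, the worker and thread counts are made up, and it is flattened (the transcode tasks are created from the main script instead of from inside the file tasks). The optimize_graph=False part is there because, as far as I understand, annotations can get dropped during graph optimization.

```python
import dask
from dask.distributed import Client, LocalCluster


def read_file(path):
    # Hypothetical stand-in for the I/O-bound step: pretend each file yields 4 chunks.
    return [f"{path}:{i}" for i in range(4)]


def transcode(chunk):
    # Hypothetical stand-in for the CPU-bound step.
    return chunk.upper()


def run(paths):
    # One in-process worker that advertises the two abstract resources.
    # n_workers / threads_per_worker / processes are assumptions for this sketch.
    cluster = LocalCluster(
        n_workers=1,
        threads_per_worker=4,
        processes=False,
        resources={"io_bound": 2, "cpu_bound": 10},
    )
    client = Client(cluster)
    try:
        # Each file task claims one "io_bound" unit, so at most 2 run at once per worker.
        reads = [client.submit(read_file, p, resources={"io_bound": 1}) for p in paths]
        chunks = [c for chunk_list in client.gather(reads) for c in chunk_list]

        # Each transcode task claims one "cpu_bound" unit.
        with dask.annotate(resources={"cpu_bound": 1}):
            tasks = [dask.delayed(transcode)(c) for c in chunks]

        # Skip graph optimization so the resource annotations are kept.
        futures = client.compute(tasks, optimize_graph=False)
        return client.gather(futures)
    finally:
        client.close()
        cluster.close()
```

Calling run(["train", "dev", "test"]) returns the transcoded chunks in submission order. Note that the resources are per worker, so with several workers each one would advertise its own 2 io_bound and 10 cpu_bound units.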

For the main problem (where to read the data), I still do not have a solution; I keep getting warnings. To create sub-processes from large chunk reads, I used dask.bag, repartitioned it into smaller chunks, and used delayed.
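The bag part can be sketched like this. It is a simplified illustration under assumptions, not my actual code: transcode and process_partition are hypothetical helpers, the partition counts are arbitrary, and I use the threaded scheduler only to keep the example self-contained (in the real setup the tasks go to the cluster).

```python
import dask
import dask.bag as db


def transcode(record):
    # Hypothetical stand-in for the per-record CPU-bound work.
    return record * 2


def process_partition(records):
    # Each delayed partition arrives as an iterable of records.
    return [transcode(r) for r in records]


def run(records, coarse=2, fine=8):
    # Start from a few large chunks, as after a big read.
    bag = db.from_sequence(records, npartitions=coarse)

    # Split into smaller chunks so more tasks can run in parallel.
    smaller = bag.repartition(npartitions=fine)

    # One Delayed object per partition.
    parts = smaller.to_delayed()
    tasks = [dask.delayed(process_partition)(p) for p in parts]

    # Threaded scheduler is an assumption made for this sketch only.
    results = dask.compute(*tasks, scheduler="threads")
    return [r for part in results for r in part]
```

For example, run(list(range(10))) processes ten records through eight small partitions and returns them in the original order.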

Although it is far from perfect, it works better now. This run is from a 3-file dataset (train, dev, test) totaling 3607 records, which took 231 seconds to transform, or 13 records/sec.

At this speed, it will take 22-23 days to pre-process my data. I need to play with it more to optimize…