Sorry, bad terminology… I think my scenario is similar to the one on the Launch Tasks from Tasks page.
Pseudo-algorithm:
- Get a list of tar.gz files and create a cluster/client, with the `resources` option as explained above, so that 2-3 processes (futures) handle the individual tar.gz files and the remaining cores run the sub-tasks (processes transcoding the audio, in my case).
- Each file process repeatedly reads a chunk (say 1000 members) from its tar.gz file, filters the irrelevant ones out, re-chunks them (say into 100-record groups), and passes them as new futures (which do the transcoding) to the same cluster. (See the sketch after this list.)
I think I tried every possible combination, but could not find a way to read the data lazily at the leaf processes (the ones doing the transcoding). I need to read/write in larger chunks to overcome the IO overhead, but that causes the large task graphs mentioned in the warnings (which I had to silence). The data size it complains about is actually 20-30 MiB per sub-chunk (since I pass the audio as bytes, 100 records add up to quite a lot).
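For illustration, one way to keep the 20-30 MiB sub-chunks out of the task graph (and so out of the warning's reach) would be to scatter them from inside the reader task and hand the resulting futures to the transcoding tasks; a rough sketch, reusing the hypothetical names from above:

```python
def fan_out_scattered(client, chunk, sub_chunk_size):
    sub_chunks = [chunk[i:i + sub_chunk_size]
                  for i in range(0, len(chunk), sub_chunk_size)]
    # scatter ships the bytes to workers directly instead of embedding
    # them in the task graph, which is what triggers the large-graph warning
    data_futures = client.scatter(sub_chunks)
    task_futures = [client.submit(transcode, d) for d in data_futures]
    return client.gather(task_futures)
```

Truly lazy reads at the leaves look hard regardless: a gzip'd tar has no random access, so each transcoding task would have to rescan the compressed stream from the start just to find its own members.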
At the end, I need to re-collect them into a list of records and write them out as parquet files of 500 MB-1 GB each (which uses RAM), which I asked about here and here…
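For that last step, a rough sketch of what I mean, assuming the records end up as (name, bytes) pairs and using pyarrow (the file naming and target size are made up):

```python
import pyarrow as pa
import pyarrow.parquet as pq

def write_parquet_parts(records, name_template="part-{:04d}.parquet",
                        target_bytes=750 * 2**20):
    # Buffer records until ~750 MB of audio, then flush one parquet file,
    # so only one file's worth of records sits in RAM at a time.
    batch, size, part = [], 0, 0
    for name, data in records:
        batch.append((name, data))
        size += len(data)
        if size >= target_bytes:
            flush(batch, name_template.format(part))
            batch, size, part = [], 0, part + 1
    if batch:
        flush(batch, name_template.format(part))

def flush(batch, path):
    names, blobs = zip(*batch)
    table = pa.table({"name": list(names), "audio": list(blobs)})
    pq.write_table(table, path)
```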
I’ll try this without sub-chunking (i.e. just reading 100 records and processing them directly), compare the results, and see if I can read them lazily.