Sorry, bad terminology… I think my scenario is similar to the one on the Launch Tasks from Tasks page.
Pseudo-algorithm:
- Get a list of tar.gz files and create a cluster/client, with the `resources` option as explained above, so that 2-3 processes (futures) handle the individual tar.gz files and the remaining cores run the sub-tasks (processes transcoding the audio, in my case).
- Each file process repeatedly reads a chunk (say 1000 members) from its tar.gz file, filters the irrelevant ones out, re-chunks them (say into 100-record groups), and passes them as new futures (which do the transcoding) to the same cluster. (See the sketch after this list.)
I think I tried every possible combination, but could not find a way to read the data lazily at the leaf processes (the ones doing the transcoding). I need to read/write in larger chunks to overcome the IO overhead, but that causes the large task graphs mentioned in the warnings (which I had to silence). The data size it complains about is actually 20-30 MiB per sub-chunk (since I pass the audio as bytes, 100 records add up to quite a lot).
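For illustration, one way to keep the 20-30 MiB sub-chunks out of the task graph (and so out of the warning's reach) would be to scatter them from inside the reader task and hand the resulting futures to the transcoding tasks; a rough sketch, reusing the hypothetical names from above:

```python
def fan_out_scattered(client, chunk, sub_chunk_size):
    sub_chunks = [chunk[i:i + sub_chunk_size]
                  for i in range(0, len(chunk), sub_chunk_size)]
    # scatter ships the bytes to workers directly instead of embedding
    # them in the task graph, which is what triggers the large-graph warning
    data_futures = client.scatter(sub_chunks)
    task_futures = [client.submit(transcode, d) for d in data_futures]
    return client.gather(task_futures)
```

Truly lazy reads at the leaves look hard regardless: a gzip'd tar has no random access, so each transcoding task would have to rescan the compressed stream from the start just to find its own members.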
At the end, I need to re-collect them into a list of records and write them out as parquet files of 500 MB-1 GB each (which uses RAM), which I asked about here and here…
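For that last step, a rough sketch of what I mean, assuming the records end up as (name, bytes) pairs and using pyarrow (the file naming and target size are made up):

```python
import pyarrow as pa
import pyarrow.parquet as pq

def write_parquet_parts(records, name_template="part-{:04d}.parquet",
                        target_bytes=750 * 2**20):
    # Buffer records until ~750 MB of audio, then flush one parquet file,
    # so only one file's worth of records sits in RAM at a time.
    batch, size, part = [], 0, 0
    for name, data in records:
        batch.append((name, data))
        size += len(data)
        if size >= target_bytes:
            flush(batch, name_template.format(part))
            batch, size, part = [], 0, part + 1
    if batch:
        flush(batch, name_template.format(part))

def flush(batch, path):
    names, blobs = zip(*batch)
    table = pa.table({"name": list(names), "audio": list(blobs)})
    pq.write_table(table, path)
```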
I’ll try this without sub-chunking (i.e. just reading 100 records and processing them directly), compare the results, and see if I can read them lazily.