Thank you for the answers and thank you for providing the code part.
I had the same impression. I was porting my pyarrow-based code to dask, so I need to rethink the naming.
It is not easy to create a minimal example from thousands of lines, though. But this is what I’ve been doing:
- I have audio datasets, which are versioned and have multiple language files in it.
- I process each version in a single CLI command, which distributes the languages to processes, so each process handles a single ds-lc-ver.
- In each process, I scan a tarfile and leave out already existing audio recordings (using a unique integer id field that I read from parquet into a set, so I only deal with the delta); the rest is transcoded and audio metadata gets extracted.
- I collect the ClipRec’s in a list. When the list reaches a pre-specified length, I write out that batch, reset the list and go on (this keeps the file size around a specified size and, obviously, frees the memory).
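
The loop described in the last two bullets can be sketched roughly as below. This is only a simplified illustration, not the actual code: the `ClipRec` fields, the `process_in_batches` helper and the `write_batch` callback are hypothetical stand-ins, and the real writer would produce a parquet file instead of appending to a list.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set


@dataclass
class ClipRec:
    # Hypothetical, reduced record: the real one carries audio metadata too.
    clip_id: int
    path: str


def process_in_batches(
    records: Iterable[ClipRec],
    existing_ids: Set[int],
    batch_size: int,
    write_batch: Callable[[int, List[ClipRec]], None],
) -> int:
    """Skip known clips, flush every `batch_size` new ones, return part count."""
    batch: List[ClipRec] = []
    parts = 0
    for rec in records:
        if rec.clip_id in existing_ids:
            continue  # already imported, only the delta is processed
        batch.append(rec)
        if len(batch) >= batch_size:
            write_batch(parts, batch)
            parts += 1
            batch = []  # new list, so the writer may keep the old one
    if batch:  # flush the final, possibly shorter, batch
        write_batch(parts, batch)
        parts += 1
    return parts


# Usage: ids 0-2 already exist, so 7 new clips are written in parts of 3.
written = []
n_parts = process_in_batches(
    (ClipRec(i, f"clip_{i}.mp3") for i in range(10)),
    existing_ids={0, 1, 2},
    batch_size=3,
    write_batch=lambda idx, b: written.append((idx, [r.clip_id for r in b])),
)
```

The part index passed to `write_batch` here is local to the run, which is the kind of counter I had in mind for the file names.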
It seems you actually want append=False, but that would imply removing the existing files.
So, I use append=True, and the files would never collide in a hive leaf directory. I add data gradually, so older files should stay. I also have controls in place in the CLI validation for when somebody tries to re-import the same ds-lc-ver (I check the hive directory).
I wanted the x for machines with large RAM, where I can collect (say) 100_000 clips and write them in “parts” of 10_000 clips each, so x would be 0-9 in this case (but it is not).
As far as I can see, in my scenario, the “x” provided by dask will not be useful, as it increments globally. I think I can leave it as is, although it would not carry semantically relevant information.
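
For reference, the run-local numbering I had in mind would look something like the sketch below. It is plain Python and does not touch dask at all; the `local_part_names` helper, the `ds-lc-ver` prefix and the file name pattern are hypothetical and only illustrate the 0-9 case from above.

```python
import math
from typing import List


def local_part_names(n_clips: int, part_size: int, prefix: str) -> List[str]:
    """Return one file name per part, numbered from 0 within this run."""
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    n_parts = math.ceil(n_clips / part_size)
    # The name pattern is an assumption made for this illustration.
    return [f"{prefix}-part.{x}.parquet" for x in range(n_parts)]


# 100_000 clips in parts of 10_000 give x = 0..9
names = local_part_names(100_000, 10_000, "ds-lc-ver")
```

With a global counter the same ten files would simply continue from wherever the previous import stopped, so the number says nothing about the batch within the run.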