The following behavior is interesting.
I collect data coming from futures, and when it reaches a certain length, I write the data out to parquet as a hive-partitioned dataset, like this:
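Roughly like this (a minimal sketch; `batch_df`, the partition columns, and the output path are placeholders):

```python
import dask.dataframe as dd

# one collected batch as a single-partition dask DataFrame
ddf = dd.from_pandas(batch_df, npartitions=1)

ddf.to_parquet(
    "out/clips",                       # hypothetical dataset root
    partition_on=["ds", "lc", "ver"],  # hive-style partition columns
    append=True,                       # keep previously written parts
    name_function=lambda x: f"part-{x:05d}.parquet",
)
```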
I believe the idea is to be extra certain that none of the partitions clash with existing ones: if you are appending, dask will find the maximum part number across the directories. It does not "invert" your name_function when this happens; actually, I don't remember exactly how it happens.
It seems you actually want append=False, but that would imply removing the existing files.
The actual code looks like this:

```python
part_i = block_index[0]
filename = (
    f"part.{part_i + self.i_offset}.parquet"
    if self.name_function is None
    else self.name_function(part_i + self.i_offset)
)
```
so you can construct your name_function to take this integer and do whatever you like with it. You could maybe have something like the sketch below.
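For example (an illustrative sketch; the per-run prefix is made up, it just keeps names unique across runs):

```python
import time

RUN_ID = int(time.time())  # hypothetical per-run tag

def name_function(i: int) -> str:
    # dask passes in the (offset) integer partition index; zero-pad it
    # so lexicographic order matches partition order
    return f"part-{RUN_ID}-{i:05d}.parquet"
```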
Thank you for the answers, and thank you for providing the code part.
I had the same impression. I was porting my pyarrow-based code to dask, so I need to rethink the naming.
It is not easy to create a minimal example from thousands of lines, though. But this is what I've been doing:
I have audio datasets, which are versioned and have multiple language files in them.
I process each version with a single CLI command, which distributes the languages across processes, so each process handles a single ds-lc-ver.
In each process, I scan a tarfile and skip already existing audio recordings (using a unique integer id field that I read from parquet into a set, so I only deal with the delta); the rest gets transcoded and has its audio metadata extracted.
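A minimal sketch of that delta check, assuming pyarrow, a hypothetical `clip_id` column, and a hypothetical `scan_tarfile()` generator:

```python
import pyarrow.parquet as pq

# read only the id column of what is already on disk into a set
existing_ids = set(
    pq.read_table("out/clips", columns=["clip_id"])
    .column("clip_id")
    .to_pylist()
)

# only the delta survives; everything already imported is skipped
new_clips = (rec for rec in scan_tarfile() if rec["clip_id"] not in existing_ids)
```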
I collect ClipRec's in a list. When the length of the list reaches a pre-specified size, I write out that batch, reset the list, and go on (this keeps the file sizes around a specified target and, obviously, clears out memory).
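A minimal sketch of that batching loop, assuming each clip record is a dict and reusing the placeholders from above:

```python
import pandas as pd
import dask.dataframe as dd

BATCH_SIZE = 10_000  # pre-specified batch length
OUT = "out/clips"    # hypothetical dataset root

def flush(records: list) -> None:
    """Append one batch as a new hive part."""
    ddf = dd.from_pandas(pd.DataFrame(records), npartitions=1)
    ddf.to_parquet(OUT, partition_on=["ds", "lc", "ver"], append=True)

batch = []
for rec in new_clips:          # the delta from the previous sketch
    batch.append(rec)
    if len(batch) >= BATCH_SIZE:
        flush(batch)
        batch = []             # reset and keep collecting

if batch:                      # don't lose the final partial batch
    flush(batch)
```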
> It seems you actually want append=False, but that would imply removing the existing files.
So I use append=True, and the parts would never collide in a hive leaf directory. I add data gradually, so older parts should stay. I also have controls in place in CLI validation for when somebody tries to re-import the same ds-lc-ver (I check the hive directory).
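A minimal sketch of that guard, assuming hive-style `key=value` directory names; the function name and root path are hypothetical:

```python
from pathlib import Path

def already_imported(root: str, ds: str, lc: str, ver: str) -> bool:
    """Refuse a re-import if the hive leaf directory already exists."""
    return (Path(root) / f"ds={ds}" / f"lc={lc}" / f"ver={ver}").is_dir()
```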
I wanted the x for machines with large RAM, where I can collect (say) 100_000 clips and write them out as ten 10_000-clip "parts", so x would be 0-9 in this case (but it is not).
As far as I can see, the "x" provided by dask will not be useful in my scenario, as it increments globally. I think I can leave it as is, although it would not carry semantically relevant information.
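If the global counter is the only input, one hypothetical workaround is to derive a within-batch index from it. This sketch assumes every flush writes exactly PARTS_PER_BATCH parts:

```python
PARTS_PER_BATCH = 10  # assumed fixed number of parts per flush

def name_function(i: int) -> str:
    # dask's counter increments globally; split it into a batch index
    # and a within-batch 0-9 index
    batch_no, part_no = divmod(i, PARTS_PER_BATCH)
    return f"batch-{batch_no:04d}.part-{part_no:02d}.parquet"
```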