How to reset x in name_function?

bozden · August 30, 2024, 5:09pm

Thank you for the answers and thank you for providing the code part.

I had the same impression. I was re-coding my pyarrow based code to dask, so I need to re-thing the naming.

It is not easy to create a minimal example from thousands of lines thou. But this is what I’ve been doing:

I have audio datasets, which are versioned and have multiple language files in it.
I process each version in a single CLI command, which distributes the languages to processes, so each process handles a single ds-lc-ver.
In each process, I scan a tarfile, leave out already existing audio recordings (using an unique integer id field that I read from parquet into a set - so I only deal with delta), the rest is transcoded and audio metadata gets extracted.
- I collect a list of ClipRec’s in a list. When the length of the list reaches a pre-specified length, I write out that batch, reset the list and go on (this is to keep the file size around a specified size and clear out the memory obviously).

It seems you actually want append=False, but that would imply removing the existing files.

So, I use append=True, and they would never collide in a hive leaf directory. I add data gradually, so older ones should stay I have controls in place in CLI validation, when somebody tries to re-import the same ds-lc-ver (I check the hive directory).

I wanted the x, for machines with large RAM, where I can collect (say) 100_000 clips and write them in 10_000 “parts”, so x will be 0-9 in this case (but it is not).

As far as I can see, in my scenario, the “x” provided by dask will not be useful, as it increments globally. I think I can leave it as is although it would not carry semantically relevant information.

Topic		Replies	Views
Customizing `to_parquet()`: split one partition into many parquet files based on certain criteria Dask DataFrame	5	358	April 26, 2022
Quick Q on dask parquet append Dask DataFrame parquet	5	35	August 16, 2024
Memory issues arising from writing partitions with to_parquet	5	725	September 18, 2023
Should `set_index` change the row count of a DataFrame? Dask DataFrame	1	240	January 15, 2022
Improving pipeline resilience when using `to_parquet` and preemptible workers Dask DataFrame distributed	5	437	August 25, 2023

How to reset x in name_function?

Related topics