Customizing `to_parquet()`: split one partition into many parquet files based on certain criteria

Hi,

Assume that I have a Dask DataFrame a, and I want to save the contents of a into Parquet files. However, instead of saving one Parquet file per partition, I want to split the rows of each partition into multiple Parquet files based on certain criteria. For example, let’s say a has a column labeled seq_len: rows whose seq_len is less than 128 should go into 001.parquet_0; rows whose seq_len is within [128, 256) should go into 001.parquet_1; and rows whose seq_len is 256 and above should go into 001.parquet_2. (The same goes for the other partitions as well: [‘010.parquet_0’, ‘010.parquet_1’, ‘010.parquet_2’], [‘100.parquet_0’, ‘100.parquet_1’, ‘100.parquet_2’], etc.)

Currently we are doing this by overriding the to_parquet function: https://github.com/NVIDIA/DeepLearningExamples/blob/master/Tools/lddl/lddl/dask/bert/binning.py#L111. However, I’m wondering whether there is a cleaner way to do this in Dask?
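
For illustration, here is a simplified sketch of the same effect using to_delayed (this is not how the linked code actually works, and the bin edges and file names are just illustrative):

import pyarrow as pa
import pyarrow.parquet as pq
from dask import compute, delayed

BINS = [(0, 128), (128, 256), (256, float("inf"))]  # illustrative bin edges

@delayed
def write_partition(df, part_id):
    # Split the rows of one pandas partition by seq_len and write one
    # parquet file per bin; the naming scheme here is illustrative.
    for bin_id, (lo, hi) in enumerate(BINS):
        rows = df[(df["seq_len"] >= lo) & (df["seq_len"] < hi)]
        pq.write_table(
            pa.Table.from_pandas(rows, preserve_index=False),
            f"part-{part_id}.parquet_{bin_id}",
        )

# One write task per partition of the dask dataframe a.
compute(*(write_partition(part, i) for i, part in enumerate(a.to_delayed())))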

Thanks!

to_parquet() has the partition_on keyword argument, which takes a list of column names and creates directory-based partitions. You can also control the names of the partition files with the name_function keyword argument. Does either of those help?
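
For instance, a minimal sketch (the column name, binning, and output path are made up for illustration):

import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(
    pd.DataFrame({"seq_len": [64, 200, 300, 90]}), npartitions=2
)

# Hypothetical bin column; partition_on writes one directory per value.
ddf["bin_id"] = (ddf["seq_len"] // 128).clip(upper=2)
ddf.to_parquet("out/", partition_on=["bin_id"])
# Roughly: out/bin_id=0/part.0.parquet, out/bin_id=1/..., out/bin_id=2/...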


Thanks a lot for your help! They seem like exactly the things we are looking for. Let me try them first and get back to you.

Hi @bryanweber,

partition_on does seem to do the trick. However, I’m not exactly sure how to use name_function to control the naming of the partition files. Say, for example, bin_id is a column whose values can only be one of {0, 1, 2, 3}.

data.to_parquet(
  path,
  engine='pyarrow',
  write_index=False,
  schema=schema,
  partition_on=['bin_id']
)

would generate the following files:

bin_id=0/
  - bin_id=0/part.1.parquet
  - bin_id=0/part.2.parquet
  ...
bin_id=1/
  ...
bin_id=2/
  ...
bin_id=3/
  ...

I’m wondering if there is a way to make the resulting structure look something like the following?

part-1.bin_id-0.parquet
part-1.bin_id-1.parquet
...
part-2.bin_id-0.parquet
part-2.bin_id-1.parquet
...

Thanks again!

Hi! The name_function is a callable that takes an integer and returns the string that should be used as the filename. I don’t think you can access the data inside that function, though, so you wouldn’t be able to do what you’re suggesting here, since the bin_id would not be known.
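
For reference, the callable’s shape is roughly this (the partition’s contents, such as bin_id, are simply not passed in):

def name_function(part_index: int) -> str:
    # Only the integer index of the output partition is available here.
    return f"part-{part_index}.parquet"

data.to_parquet(path, name_function=name_function)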

If you really want that layout, I think you’d have to re-partition your data into that structure yourself and then write it out.
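
For example, something like this untested sketch might get you there, with one to_parquet call per bin so the bin_id is known when the file name is built (reusing data, path, and schema from your snippet above):

# Untested sketch: one write per bin; files accumulate in the same path.
for b in [0, 1, 2, 3]:
    data[data["bin_id"] == b].to_parquet(
        path,
        engine="pyarrow",
        write_index=False,
        schema=schema,
        name_function=lambda i, b=b: f"part-{i}.bin_id-{b}.parquet",
        write_metadata_file=False,  # avoid clobbering metadata across calls
    )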

But that brings up the question: why do you need to encode the bin number in the filename specifically?

I don’t necessarily have to encode the bin_id in the filename. I guess I’m just trying to understand how much flexibility I have when it comes to customizing the file structure of the output Parquet files.

One practical motivation is that I currently have existing datasets that were already generated in that “bin_id-encoded-in-the-file-name” structure, and many people are already using them. If I update the code, I’ll have to notify those people about the new file structure, but I guess that’s a hassle I can manage.

Again, thanks a lot for your help! @bryanweber
