Option to batch read_parquet

Hi,

I am wondering if there is an existing API to batch read_parquet so that each distributed task reads more than one parquet file sequentially?

My use case is a parquet dataset where each partition contains only one small file (<1 MB). The dataset is partitioned on (date, identifier), so when querying the time series of a specific identifier I can end up reading more than 1000 files, which creates significant overhead on a distributed cluster. I think it would be more efficient to batch the read_parquet work into tasks that each read, say, 10 files sequentially; the cluster would then only handle ~100 tasks instead of ~1000.
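
For reference, here is roughly what I do today (the path, column name and identifier value below are just placeholders for my setup):

```python
import dask.dataframe as dd

# Current approach (placeholder path/column values): this ends up with roughly
# one tiny partition per matching file, i.e. ~1000 tasks for a single identifier.
ddf = dd.read_parquet(
    "dataset_path",
    filters=[("identifier", "==", "XYZ")],
)
timeseries = ddf.compute()
```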

Please let me know if I am thinking about this problem wrong or if there is an alternative solution.

Thanks

Hi @longshort, welcome to this forum!

In the read_parquet API, there is an aggregate_files keyword that you could use along with split_row_groups; however, there are currently discussions about removing it.
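
For completeness, a minimal sketch of that approach (the path is a placeholder, and the exact behaviour of these keywords depends on your Dask version):

```python
import dask.dataframe as dd

# Sketch only: "dataset_path" is a placeholder and keyword behaviour varies by version.
ddf = dd.read_parquet(
    "dataset_path",
    split_row_groups=10,    # pack up to 10 row-groups into one output partition
    aggregate_files=True,   # allow row-groups from different files to be combined
)
```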

Instead, the currently recommended way to achieve what you want is to use from_map: dask.dataframe.from_map — Dask documentation. See the examples in the documentation.
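
For example, something along these lines should give you one task per batch of files (the glob pattern, identifier value and batch size are placeholders for your setup):

```python
import glob
import pandas as pd
import dask.dataframe as dd
from toolz import partition_all

# Hypothetical layout: dataset partitioned as date=.../identifier=.../part.parquet
paths = sorted(glob.glob("dataset_path/date=*/identifier=XYZ/*.parquet"))

def read_batch(batch):
    # Each task reads its batch of files sequentially and concatenates them
    return pd.concat(pd.read_parquet(p) for p in batch)

# Group the ~1000 paths into batches of 10, so the scheduler only sees ~100 tasks
ddf = dd.from_map(read_batch, list(partition_all(10, paths)))
```

This gives you direct control over how many files each task reads, independently of how the dataset is partitioned on disk.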

There is a blog post on from_map under construction.