Hi,
I am wondering if there is an existing API to batch read_parquet so that each distributed task reads more than one parquet file sequentially.
My use case is a parquet dataset where each partition contains only one small file (<1MB). The dataset is partitioned on (date, identifier), and when querying the time series of a specific identifier I can end up reading >1000 files, which creates significant overhead on a distributed cluster. I think it might be more efficient to batch read_parquet into tasks that each read, say, 10 files sequentially. In that case the cluster would only handle ~100 tasks instead of ~1000.
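Here is roughly what I had in mind, as a sketch using dask.delayed. The glob pattern, partition layout, and batch size of 10 are just placeholders for my actual dataset:

```python
import glob

import dask
import dask.dataframe as dd
import pandas as pd


@dask.delayed
def read_batch(paths):
    # One task reads several small files sequentially and concatenates them.
    return pd.concat([pd.read_parquet(p) for p in paths], ignore_index=True)


# Hypothetical layout: dataset/date=*/identifier=X/*.parquet
paths = sorted(glob.glob("dataset/date=*/identifier=X/*.parquet"))
batch_size = 10
batches = [paths[i:i + batch_size] for i in range(0, len(paths), batch_size)]

# ~100 partitions (one task each) instead of ~1000.
ddf = dd.from_delayed([read_batch(b) for b in batches])
```

Is something like this reasonable, or is there a built-in option that already does this kind of file aggregation?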
Please let me know if I am thinking about this problem wrong or if there is an alternative solution.
Thanks