Option to batch read_parquet

Hi,

I am wondering if there is an existing API to batch read_parquet so that each distributed task reads more than one parquet file sequentially?

My use case is a parquet dataset where each partition contains only one small file (<1 MB). The dataset is partitioned on (date, identifier), so when querying the time series of a specific identifier I can end up reading more than 1000 files, which creates significant overhead on a distributed cluster. I think it would be more efficient to batch the read_parquet work into tasks that each read, say, 10 files sequentially; the cluster would then only handle ~100 tasks instead of ~1000.
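
For reference, here is roughly what I do today (the path, column name and identifier value below are just placeholders for my setup):

```python
import dask.dataframe as dd

# Current approach (placeholder path/column values): this ends up with roughly
# one tiny partition per matching file, i.e. ~1000 tasks for a single identifier.
ddf = dd.read_parquet(
    "dataset_path",
    filters=[("identifier", "==", "XYZ")],
)
timeseries = ddf.compute()
```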

Please let me know if I am thinking about this problem wrong or if there is an alternative solution.

Thanks

Hi @longshort, welcome to this forum!

In the read_parquet API, there is an aggregate_files keyword that you could use along with split_row_groups; however, there are currently discussions about removing it.
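
For completeness, a minimal sketch of that approach (the path is a placeholder, and the exact behaviour of these keywords depends on your Dask version):

```python
import dask.dataframe as dd

# Sketch only: "dataset_path" is a placeholder and keyword behaviour varies by version.
ddf = dd.read_parquet(
    "dataset_path",
    split_row_groups=10,    # pack up to 10 row-groups into one output partition
    aggregate_files=True,   # allow row-groups from different files to be combined
)
```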

Instead, the currently recommended way to achieve what you want is to use from_map: dask.dataframe.from_map — Dask documentation. See the examples in the documentation.
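
For example, something along these lines should give you one task per batch of files (the glob pattern, identifier value and batch size are placeholders for your setup):

```python
import glob
import pandas as pd
import dask.dataframe as dd
from toolz import partition_all

# Hypothetical layout: dataset partitioned as date=.../identifier=.../part.parquet
paths = sorted(glob.glob("dataset_path/date=*/identifier=XYZ/*.parquet"))

def read_batch(batch):
    # Each task reads its batch of files sequentially and concatenates them
    return pd.concat(pd.read_parquet(p) for p in batch)

# Group the ~1000 paths into batches of 10, so the scheduler only sees ~100 tasks
ddf = dd.from_map(read_batch, list(partition_all(10, paths)))
```

This gives you direct control over how many files each task reads, independently of how the dataset is partitioned on disk.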

There is a blog post on from_map under construction.