I have 20,000 parquet files partitioned by name. I tested Spark on this dataset and it handled spark.read.parquet(folder_path).count() without any issue. However, when I call read_parquet in Dask, it takes forever, and neither CPU nor memory usage spikes.
I also tried passing in the list of parquet files explicitly and noticed performance deteriorating as the number of files increased (from 0 to 20,000).
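The scaling test looked roughly like the sketch below, where all_files is a hypothetical name for the full list of parquet paths:

```python
# Rough sketch of the scaling test: call read_parquet on growing subsets
# of files and time only the graph construction, not any compute().
# `all_files` is a hypothetical name for the full list of parquet paths.
import time
import dask.dataframe as dd

for n in (100, 1000, 5000, 20000):
    start = time.perf_counter()
    dd.read_parquet(all_files[:n],
                    columns=["Name", "PhoneNumber", "CallRecords"],
                    engine="pyarrow", ignore_metadata_file=True)
    elapsed = time.perf_counter() - start
    print(f"{n} files -> {elapsed:.1f}s just to call read_parquet")
```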
My guess is that Dask (via pyarrow) is reading each parquet file to find the column information. How can I improve this? This is the call I am using:
df = dd.read_parquet(folder_path, columns=["Name", "PhoneNumber", "CallRecords"],
                     engine="pyarrow", ignore_metadata_file=True)
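To check whether per-file schema/metadata discovery is really where the time goes, a sketch like the following times pyarrow's own dataset discovery on the same folder. It assumes folder_path is the dataset root and that the files sit in hive-style name=... partition directories:

```python
# Sketch: time pyarrow's own dataset/schema discovery to see whether
# per-file metadata reads explain the slowdown, independent of Dask.
# Assumes `folder_path` is the dataset root with hive-style partitions.
import time
import pyarrow.dataset as ds

start = time.perf_counter()
dataset = ds.dataset(folder_path, format="parquet", partitioning="hive")
elapsed = time.perf_counter() - start
print(f"discovered {len(dataset.files)} files in {elapsed:.1f}s")
print(dataset.schema)
```

If this step is already slow on its own, the time is going into listing and opening the files rather than anything Dask-specific.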