I have 20,000 parquet files partitioned by name. I tested Spark on this dataset and it handled spark.read.parquet(folder_path).count() without any issue. However, when I call read_parquet in Dask, it takes forever, and neither CPU nor memory usage spikes.
I also tried passing in the list of parquet files explicitly and noticed performance deteriorating as the number of files increased (from 0 to 20,000).
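The scaling test looked roughly like the sketch below, where all_files is a hypothetical name for the full list of parquet paths:

```python
# Rough sketch of the scaling test: call read_parquet on growing subsets
# of files and time only the graph construction, not any compute().
# `all_files` is a hypothetical name for the full list of parquet paths.
import time
import dask.dataframe as dd

for n in (100, 1000, 5000, 20000):
    start = time.perf_counter()
    dd.read_parquet(all_files[:n],
                    columns=["Name", "PhoneNumber", "CallRecords"],
                    engine="pyarrow", ignore_metadata_file=True)
    elapsed = time.perf_counter() - start
    print(f"{n} files -> {elapsed:.1f}s just to call read_parquet")
```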
My guess is that Dask (via pyarrow) is reading each parquet file to find the column information. How can I improve this? This is the call I am using:
df = dd.read_parquet(folder_path, columns=["Name", "PhoneNumber", "CallRecords"],
                     engine="pyarrow", ignore_metadata_file=True)
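To check whether per-file schema/metadata discovery is really where the time goes, a sketch like the following times pyarrow's own dataset discovery on the same folder. It assumes folder_path is the dataset root and that the files sit in hive-style name=... partition directories:

```python
# Sketch: time pyarrow's own dataset/schema discovery to see whether
# per-file metadata reads explain the slowdown, independent of Dask.
# Assumes `folder_path` is the dataset root with hive-style partitions.
import time
import pyarrow.dataset as ds

start = time.perf_counter()
dataset = ds.dataset(folder_path, format="parquet", partitioning="hive")
elapsed = time.perf_counter() - start
print(f"discovered {len(dataset.files)} files in {elapsed:.1f}s")
print(dataset.schema)
```

If this step is already slow on its own, the time is going into listing and opening the files rather than anything Dask-specific.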