Reading Parquet directory from HDFS

nhrnjez · February 7, 2024, 7:54pm

I’m trying to read a Parquet directory from HDFS. I’ve followed the setup for dask-yarn and can successfully read any single file from HDFS using, for example, dask.dataframe.read_parquet or dask.dataframe.read_csv.

The issue is that when I point to a folder using df = dd.read_parquet('hdfs:///path/to/parquet_folder/'), I get an OSError: Prior attempt to load libhdfs failed. This doesn’t occur with any single file in the same location. I’ve tried various settings, such as different engines, storage_options, and ignore_metadata_file, as well as different paths, but I always get the same error.

It seems to read the metadata fine, but any operation on the result, even len(df), fails with that OSError.

Any help/direction would be much appreciated.

Thanks.

guillaumeeb · February 7, 2024, 8:06pm

Hi @nhrnjez, welcome to the Dask community!

As explained here, Dask uses fsspec to interact with remote file systems.

I would recommend using fsspec directly for simple operations such as listing a directory. I suspect a wrong or outdated version of libhdfs or PyArrow.
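A minimal sketch of that check, assuming fsspec is installed and that its HDFS backend (which relies on PyArrow and libhdfs) can be loaded. The helper name list_directory is hypothetical, and the path in the usage note is the placeholder from the original post:

```python
def list_directory(url, **storage_options):
    """Return the entries under `url`, using the same fsspec layer Dask relies on."""
    # Imported lazily so the helper can be defined even where fsspec is missing.
    from fsspec.core import url_to_fs

    # url_to_fs picks the filesystem implementation from the URL protocol
    # (hdfs://, file://, s3://, ...) and returns it with the stripped path.
    fs, path = url_to_fs(url, **storage_options)

    # detail=False returns plain path strings instead of info dictionaries.
    return fs.ls(path, detail=False)
```

Calling list_directory('hdfs:///path/to/parquet_folder/') on the client should show whether libhdfs loads there. Running the same function on the workers, for example with client.run(list_directory, 'hdfs:///path/to/parquet_folder/'), should show whether it also loads inside the worker processes.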

nhrnjez · February 7, 2024, 8:13pm

Hi @guillaumeeb , thank you!

Is there a reason to expect a version of libhdfs/Pyarrow to work on a single large parquet file but not on a parquet directory containing multiple files?

guillaumeeb · February 7, 2024, 8:17pm

Not off the top of my head; my suggestion is mainly based on the error you are getting.

nhrnjez · February 12, 2024, 7:04pm

I was able to figure this out. The issue stemmed from not specifying “ARROW_LIBHDFS_DIR” in the worker_env argument for YarnCluster. Once I specified it, the problem was solved.

This solution may be specific to our environment/system.
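For anyone hitting the same error, the fix can be sketched roughly as follows. This is an illustration under assumptions, not the exact code used: the helper names are hypothetical, and both the packaged environment and the libhdfs directory must be replaced with values that match your cluster.

```python
def make_worker_env(libhdfs_dir):
    """Build the environment variables to set in each worker container."""
    # ARROW_LIBHDFS_DIR tells PyArrow which directory contains libhdfs.so.
    return {"ARROW_LIBHDFS_DIR": libhdfs_dir}


def start_cluster(environment, libhdfs_dir, n_workers=2):
    """Start a YarnCluster whose workers can locate libhdfs (hypothetical helper)."""
    # Imported lazily so the helpers can be defined without dask-yarn installed.
    from dask.distributed import Client
    from dask_yarn import YarnCluster

    cluster = YarnCluster(
        environment=environment,  # packaged Python environment for the workers
        worker_env=make_worker_env(libhdfs_dir),
        n_workers=n_workers,
    )
    return cluster, Client(cluster)
```

After the cluster is up, dd.read_parquet('hdfs:///path/to/parquet_folder/') followed by len(df) is the operation that previously failed with the OSError.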
