Reading Parquet directory from HDFS

I'm looking to read a Parquet directory from HDFS. I've set up dask-yarn and am able to successfully read any single file from HDFS using dask.dataframe.read_parquet or dask.dataframe.read_csv, for example.

The issue is that when I point to a folder using df = dd.read_parquet('hdfs:///path/to/parquet_folder/'), I get an OSError: Prior attempt to load libhdfs failed. This doesn't occur with any single file in the same location. I've tried various settings, such as different engines, storage_options, and ignore_metadata_file, as well as different paths, but I always get the same error.

It seems to read the metadata fine, but any attempt at an operation, even a simple len(df), fails with that OSError.
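
For reference, here is a minimal sketch of what I'm running (the path is a placeholder for our actual HDFS location):

```python
import dask.dataframe as dd

# Reading the whole Parquet directory (placeholder path):
df = dd.read_parquet("hdfs:///path/to/parquet_folder/")

# The metadata loads fine, but any computation raises the error:
len(df)  # OSError: Prior attempt to load libhdfs failed
```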

Any help/direction would be much appreciated.

Thanks.

Hi @nhrnjez, welcome to Dask community!

As explained here, Dask uses fsspec to interact with remote file systems.

I would recommend trying to use it directly for simple operations like listing a directory. I suspect a wrong or out-of-date version of libhdfs or PyArrow.
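
For example, something along these lines (the host and port are placeholders, adjust them for your cluster):

```python
import fsspec

# Open the HDFS filesystem through fsspec; this goes through the same
# PyArrow/libhdfs machinery that Dask uses under the hood.
fs = fsspec.filesystem("hdfs", host="namenode", port=8020)

# A simple directory listing should succeed if libhdfs loads correctly.
print(fs.ls("/path/to/parquet_folder/"))
```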

Hi @guillaumeeb , thank you!

Is there a reason to expect a version of libhdfs/PyArrow to work on a single large Parquet file but not on a Parquet directory containing multiple files?

Not that I can think of; it's mainly based on the error you are getting.

I was able to figure this out: the issue stemmed from not specifying "ARROW_LIBHDFS_DIR" within the worker_env argument for YarnCluster. Once this was specified, my issue was solved.

This solution may be specific to our environment/system.
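
For anyone hitting the same error, here is a sketch of what worked for us. The libhdfs path and the packaged environment name are specific to our setup and are placeholders here:

```python
from dask_yarn import YarnCluster
from dask.distributed import Client

# Point PyArrow at the directory containing libhdfs.so on the workers.
# Both paths below are placeholders; adjust them for your environment.
cluster = YarnCluster(
    environment="environment.tar.gz",  # packaged Python environment
    worker_env={"ARROW_LIBHDFS_DIR": "/path/to/hadoop/lib/native"},
)
client = Client(cluster)
```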
