I'm looking to read a Parquet directory from HDFS. I've set up dask-yarn and can successfully read any single file from HDFS using, for example, dask.dataframe.read_parquet or dask.dataframe.read_csv.
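For reference, the single-file reads that work look roughly like this (the paths below are hypothetical placeholders, not my real ones):

```python
import dask.dataframe as dd

# Reading any single file from HDFS works fine
df_parquet = dd.read_parquet("hdfs:///path/to/parquet_folder/part.0.parquet")
df_csv = dd.read_csv("hdfs:///path/to/data.csv")
```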
The issue arises when I point to a folder: df = dd.read_parquet('hdfs:///path/to/parquet_folder/') fails with OSError: Prior attempt to load libhdfs failed. This doesn't happen with any single file in the same location. I've tried various settings, such as different engines, storage_options, and ignore_metadata_file, as well as different paths, but I always hit the same error.
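These are roughly the variations I've tried (the storage_options values here are illustrative placeholders, not my actual cluster settings):

```python
import dask.dataframe as dd

path = "hdfs:///path/to/parquet_folder/"

# Each of these variations ends in the same OSError
# once the data is actually touched:
df = dd.read_parquet(path)
df = dd.read_parquet(path, engine="pyarrow")
df = dd.read_parquet(path, ignore_metadata_file=True)
df = dd.read_parquet(path, storage_options={"host": "namenode", "port": 8020})
```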
It seems to read the metadata fine, but any attempt to perform an operation, even a len(df), fails with that OSError.
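A minimal reproduction of what I'm seeing (again with a placeholder path):

```python
import dask.dataframe as dd

# The schema/metadata appears to load fine:
df = dd.read_parquet("hdfs:///path/to/parquet_folder/")
print(df.dtypes)  # works, so the metadata was read

# ...but anything that touches the actual data fails:
len(df)  # OSError: Prior attempt to load libhdfs failed
```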
Any help/direction would be much appreciated.
Thanks.