Reading Parquet from Company HDFS

I’m looking to read in parquet files from HDFS. I’ve used the general setup

`df = dd.read_parquet('hdfs:///hdfs/file/path/your_file.parquet')`, but I am getting an `OSError: Prior attempt to load libhdfs failed`. After doing some research, I suspect the problem may be that the file path is directing Dask to look at a local HDFS rather than my company's distributed one.

Is it possible I would need to specify the file system using `HDFSFileSystem(host='', port=xxxx)`?

Any help/direction would be much appreciated.


Hi @mcostantino77, welcome to Dask Discourse forum.

Did you find the documentation about using HDFS with Dask?

That said, the error you are getting suggests that a library is missing from your environment: `libhdfs` is the native Hadoop client library that pyarrow needs in order to talk to HDFS.
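For reference, once the environment issue is sorted out, connection details can also be passed explicitly instead of relying on the default Hadoop configuration. A minimal sketch (the host name `namenode` and port `8020` below are placeholders, not values from this thread):

```python
# Two equivalent ways to point dd.read_parquet at a specific HDFS namenode.

# 1. Put the namenode host and port directly in the URL:
#    df = dd.read_parquet("hdfs://namenode:8020/hdfs/file/path/your_file.parquet")

# 2. Or leave the URL bare and pass connection details via storage_options,
#    which Dask forwards to the underlying fsspec/pyarrow HDFS filesystem:
storage_options = {"host": "namenode", "port": 8020}
#    df = dd.read_parquet("hdfs:///hdfs/file/path/your_file.parquet",
#                         storage_options=storage_options)
```

Either form should resolve to the same filesystem; the `storage_options` route is handy when the host and port come from configuration rather than being hard-coded in the path.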

I was able to figure it out. It was a version compatibility issue between fsspec and pyarrow. Given the version of Dask we had, I needed to use fsspec 12.1.0 to be compatible and read from HDFS.
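For anyone hitting the same error, a quick way to see which versions are installed before pinning anything (this only inspects package metadata; it does not touch HDFS):

```python
# Print the installed versions of the packages involved in Dask's HDFS support.
from importlib.metadata import version, PackageNotFoundError

def installed_version(pkg: str) -> str:
    """Return the installed version of pkg, or 'not installed'."""
    try:
        return version(pkg)
    except PackageNotFoundError:
        return "not installed"

for pkg in ("dask", "fsspec", "pyarrow"):
    print(pkg, installed_version(pkg))
```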
