Dask DataFrame unhashable type: 'numpy.ndarray'

Hi everyone. I'm reading a very large file with dask.dataframe.read_parquet(), and I've run into a problem with the dataset. When I execute the code below, I get the error unhashable type: 'numpy.ndarray'. Snippet:

import dask.dataframe as dd

train_path = ["/somfile.train.snappy.parquet"]
# Read a single column as a Dask Series; compression is detected
# from the file automatically, so no compression argument is needed
data = dd.read_parquet(train_path, engine="pyarrow", columns="X_jets",
                       split_row_groups=10)
data = data.value_counts()
data = data.compute()  # raises TypeError: unhashable type: 'numpy.ndarray'

I understand that my data contains NumPy arrays. The question then is, how do I load it? Should I avoid calling compute()? Here is a link to the file that I am using: QCDToGGQQ_IMGjet_RH1all_jet0_run1_n47540.test.snappy.parquet - Google Drive

Any help would be highly appreciated. Thanks!

Hi @rsohaljr_14,

I think you should have waited a bit before opening a new thread, and preferably completed the other one.

This code tries to count the occurrences of each unique value in your Series, which requires hashing the values. NumPy arrays are not hashable, hence this error.
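
For illustration, here is a minimal sketch reproducing the same failure with plain pandas (the small arrays just stand in for the contents of your X_jets column):

import numpy as np
import pandas as pd

# A Series whose values are NumPy arrays, like the X_jets column
s = pd.Series([np.zeros((2, 2)), np.ones((2, 2))])

# value_counts() hashes each value in order to count duplicates;
# NumPy arrays are unhashable, so this raises
# TypeError: unhashable type: 'numpy.ndarray'
s.value_counts()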

I am not sure I understood your response. Are you saying that I cannot use .value_counts()? Should I use .to_delayed() instead?

value_counts and to_delayed have completely different goals and results: value_counts counts the occurrences of each value, while to_delayed just breaks the collection into a list of Delayed objects, one per partition. value_counts is simply not compatible with 'numpy.ndarray' values.
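
If what you actually want is to count how many identical arrays occur, one possible workaround (just a sketch, assuming arrays of a fixed dtype where comparing raw bytes is meaningful; it reuses the train_path from your snippet) is to map each array to a hashable key before counting:

import dask.dataframe as dd

data = dd.read_parquet(train_path, engine="pyarrow", columns="X_jets",
                       split_row_groups=10)

# .tobytes() turns each array into a hashable bytes key; note that it
# ignores shape, so arrays are distinguished only by their raw contents
keys = data.map(lambda a: a.tobytes(), meta=("X_jets", object))
print(keys.value_counts().compute())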

Again, the question is: what are you trying to achieve with this data?