Dask DataFrame unhashable type: 'numpy.ndarray'

Hi everyone. I'm reading a very large file with dask.dataframe.read_parquet(), and I've run into a problem with the dataset. When I execute the code below, I get the error unhashable type: 'numpy.ndarray'. Snippet:

import dask.dataframe as dd

train_path = ["/somfile.train.snappy.parquet"]
# Read a single column as a Dask Series; compression is detected
# from the file automatically, so no compression argument is needed
data = dd.read_parquet(train_path, engine="pyarrow", columns="X_jets",
                       split_row_groups=10)
data = data.value_counts()
data = data.compute()  # raises TypeError: unhashable type: 'numpy.ndarray'

I understand that my data contains NumPy arrays. The question then is, how do I load it? Should I avoid calling compute()? Here is a link to the file that I am using: QCDToGGQQ_IMGjet_RH1all_jet0_run1_n47540.test.snappy.parquet - Google Drive

Any help would be highly appreciated. Thanks!

Hi @rsohaljr_14,

I think you should have waited a bit before opening a new thread, and preferably completed the other one.

This code tries to count the occurrences of each unique value in your Series, which requires hashing the values. NumPy arrays are not hashable, hence this error.
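
For illustration, here is a minimal sketch reproducing the same failure with plain pandas (the small arrays just stand in for the contents of your X_jets column):

import numpy as np
import pandas as pd

# A Series whose values are NumPy arrays, like the X_jets column
s = pd.Series([np.zeros((2, 2)), np.ones((2, 2))])

# value_counts() hashes each value in order to count duplicates;
# NumPy arrays are unhashable, so this raises
# TypeError: unhashable type: 'numpy.ndarray'
s.value_counts()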

I am not sure I understood your response. Are you saying that I cannot use .value_counts()? Should I use .to_delayed() instead?

value_counts and to_delayed have completely different goals and results: value_counts counts the occurrences of each value, while to_delayed just breaks the collection into a list of Delayed objects, one per partition. value_counts is simply not compatible with 'numpy.ndarray' values.
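
If what you actually want is to count how many identical arrays occur, one possible workaround (just a sketch, assuming arrays of a fixed dtype where comparing raw bytes is meaningful; it reuses the train_path from your snippet) is to map each array to a hashable key before counting:

import dask.dataframe as dd

data = dd.read_parquet(train_path, engine="pyarrow", columns="X_jets",
                       split_row_groups=10)

# .tobytes() turns each array into a hashable bytes key; note that it
# ignores shape, so arrays are distinguished only by their raw contents
keys = data.map(lambda a: a.tobytes(), meta=("X_jets", object))
print(keys.value_counts().compute())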

Again, the question is: what are you trying to achieve with this data?