How to write and read DataFrame with vector column (e.g. list(float64))?

Hello team, I am trying to use parquet to store DataFrame with vector column. My code looks like:

import pandas as pd
import numpy as np
import dask.dataframe as dd
import dask.array as da
import pyarrow as pa

vectors = np.array([
    np.array([1.0, 2.0, 2.0, 23.4, 3.4, 3.2, 54.3, 3464.0, 6.3, 3.12]), 
    np.array([4.0, 5.0, 2.0, 23.4, 3.4, 3.2, 54.3, 3464.0, 6.3, 6.12]), 
    np.array([7.0, 8.0, 2.0, 23.4, 3.4, 3.2, 54.3, 3464.0, 6.3, 9.12])
])

df = dd.from_dask_array(da.from_array(vectors))
columns_to_drop = df.columns.tolist()
df["vector"] = df.apply(lambda x: tuple(x), axis=1, meta=(None, 'float32'))
df = df.drop(columns=columns_to_drop)

output_path = "vectors-parquet-small"

df.to_parquet(output_path, overwrite=True, schema={
    "vector": pa.list_(pa.float32(), 10)
})

df2 = dd.read_parquet(output_path)
df2.dtypes

ds is vector float32, dtype: object, so I guess this is correct. I also use schema in to_parquet to make sure.

However read_parquet returns vector string[pyarrow] dtype: object, for some reasons replacing my vectors with strings.

What is the correct way to preserve columns with vector?

It looks like Dask incorrectly assumes list(float) to be a string, and converts it automatically. The fix for the issue seems to be to turn off “dataframe.convert-string”.

dask.config.set({"dataframe.convert-string": False})

I am not sure if this is desired behaviour, or bug.

Hi,

The dtype of df looks correct, but this is misleading. It just returns what you specified as meta previously. You can see that the actual dtype is object through df.compute().dtypes. You would have to use an astype to actually get the correct dtype, df.astype(pd.ArrowDtype(pa.list_(pa.float32())))

The default in pandas is to read your column in as object dtype as well, which is then cast to string[pyarrow] by Dask. You’ll get an object dtype column if you turn this off. The best way to keep the list dtype would be to set dtype_backend="pyarrow"

df2 = dd.read_parquet(output_path, dtype_backend="pyarrow")

Unfortunately, there is a bug in pandas for fixed size lists right now. I can get the fix into pandas 2.1.1 that will be released in around 2 week (see BUG: ArrowDtype raising for fixed size list by phofl · Pull Request #55000 · pandas-dev/pandas · GitHub). You can use variable sized lists or object dtype till then. Does this help?

1 Like