Hello team, I am trying to use Parquet to store a DataFrame with a vector column. My code looks like this:
```python
import numpy as np
import dask.dataframe as dd
import dask.array as da
import pyarrow as pa

vectors = np.array([
    np.array([1.0, 2.0, 2.0, 23.4, 3.4, 3.2, 54.3, 3464.0, 6.3, 3.12]),
    np.array([4.0, 5.0, 2.0, 23.4, 3.4, 3.2, 54.3, 3464.0, 6.3, 6.12]),
    np.array([7.0, 8.0, 2.0, 23.4, 3.4, 3.2, 54.3, 3464.0, 6.3, 9.12]),
])

# One dask dataframe column per vector component (columns 0..9).
df = dd.from_dask_array(da.from_array(vectors))

# Collapse the ten scalar columns into a single tuple-valued "vector"
# column, then drop the originals.
columns_to_drop = df.columns.tolist()
df["vector"] = df.apply(lambda x: tuple(x), axis=1, meta=(None, "float32"))
df = df.drop(columns=columns_to_drop)

# Write with an explicit Arrow schema: a fixed-size list of 10 float32 values.
output_path = "vectors-parquet-small"
df.to_parquet(output_path, overwrite=True, schema={
    "vector": pa.list_(pa.float32(), 10)
})

# Read it back and inspect the dtypes.
df2 = dd.read_parquet(output_path)
df2.dtypes
```
Before writing, `df.dtypes` shows `vector float32, dtype: object`, so I guess this part is correct. I also pass `schema` to `to_parquet` to make sure.
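For reference, this is how I check what actually lands on disk, using plain pyarrow and bypassing dask entirely. A minimal sketch; the `part.0.parquet` file name is just dask's default naming, which I am assuming here:

```python
import pyarrow.parquet as pq

# Read only the Parquet footer of one part file to see the stored schema
# (assumes dask's default "part.0.parquet" naming inside the output dir).
print(pq.read_schema("vectors-parquet-small/part.0.parquet"))
# If the explicit schema was honored, the column should show up as
# something like: vector: fixed_size_list<item: float>[10]
```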
However, `df2.dtypes` after `read_parquet` shows `vector string[pyarrow], dtype: object`, so for some reason my vectors are replaced with strings. What is the correct way to preserve a column of vectors?
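In case it helps with diagnosis, two things I would check are (a) whether the vectors themselves survive the round trip when dask is bypassed, and (b) whether dask's automatic object-to-`string[pyarrow]` conversion is involved. A sketch; the `dataframe.convert-string` config option is an assumption on my part (I believe newer dask releases have it):

```python
import dask
import dask.dataframe as dd
import pyarrow.parquet as pq

# (a) Read the files with plain pyarrow, bypassing dask's dtype handling,
#     to see whether the vector data itself survived the round trip.
table = pq.read_table("vectors-parquet-small")
print(table.schema)               # stored Arrow schema
print(table.column("vector")[0])  # first row's 10 floats, if intact

# (b) Re-read with dask's pyarrow-string conversion turned off
#     (assumes the "dataframe.convert-string" option exists in this
#     dask version).
with dask.config.set({"dataframe.convert-string": False}):
    df3 = dd.read_parquet("vectors-parquet-small")
    print(df3.dtypes)
```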