Hello team, I am trying to use Parquet to store a DataFrame with a vector column. My code looks like this:
```python
import numpy as np
import dask.dataframe as dd
import dask.array as da
import pyarrow as pa

vectors = np.array([
    np.array([1.0, 2.0, 2.0, 23.4, 3.4, 3.2, 54.3, 3464.0, 6.3, 3.12]),
    np.array([4.0, 5.0, 2.0, 23.4, 3.4, 3.2, 54.3, 3464.0, 6.3, 6.12]),
    np.array([7.0, 8.0, 2.0, 23.4, 3.4, 3.2, 54.3, 3464.0, 6.3, 9.12]),
])

# One dask dataframe column per vector component (columns 0..9).
df = dd.from_dask_array(da.from_array(vectors))

# Collapse the ten scalar columns into a single tuple-valued "vector"
# column, then drop the originals.
columns_to_drop = df.columns.tolist()
df["vector"] = df.apply(lambda x: tuple(x), axis=1, meta=(None, "float32"))
df = df.drop(columns=columns_to_drop)

# Write with an explicit Arrow schema: a fixed-size list of 10 float32 values.
output_path = "vectors-parquet-small"
df.to_parquet(output_path, overwrite=True, schema={
    "vector": pa.list_(pa.float32(), 10)
})

# Read it back and inspect the dtypes.
df2 = dd.read_parquet(output_path)
df2.dtypes
```
Before writing, `df.dtypes` shows `vector float32, dtype: object`, so I guess this part is correct. I also pass `schema` to `to_parquet` to make sure.
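For reference, this is how I check what actually lands on disk, using plain pyarrow and bypassing dask entirely. A minimal sketch; the `part.0.parquet` file name is just dask's default naming, which I am assuming here:

```python
import pyarrow.parquet as pq

# Read only the Parquet footer of one part file to see the stored schema
# (assumes dask's default "part.0.parquet" naming inside the output dir).
print(pq.read_schema("vectors-parquet-small/part.0.parquet"))
# If the explicit schema was honored, the column should show up as
# something like: vector: fixed_size_list<item: float>[10]
```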
However, `df2.dtypes` after `read_parquet` shows `vector string[pyarrow], dtype: object`, so for some reason my vectors are replaced with strings. What is the correct way to preserve a column of vectors?
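In case it helps with diagnosis, two things I would check are (a) whether the vectors themselves survive the round trip when dask is bypassed, and (b) whether dask's automatic object-to-`string[pyarrow]` conversion is involved. A sketch; the `dataframe.convert-string` config option is an assumption on my part (I believe newer dask releases have it):

```python
import dask
import dask.dataframe as dd
import pyarrow.parquet as pq

# (a) Read the files with plain pyarrow, bypassing dask's dtype handling,
#     to see whether the vector data itself survived the round trip.
table = pq.read_table("vectors-parquet-small")
print(table.schema)               # stored Arrow schema
print(table.column("vector")[0])  # first row's 10 floats, if intact

# (b) Re-read with dask's pyarrow-string conversion turned off
#     (assumes the "dataframe.convert-string" option exists in this
#     dask version).
with dask.config.set({"dataframe.convert-string": False}):
    df3 = dd.read_parquet("vectors-parquet-small")
    print(df3.dtypes)
```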