Dask DataFrame: how to keep a column with array values

Hi, I’m trying to read some data from BigQuery. The table contains some classic scalar columns, but also some array ones.

I’m seeing inconsistent behavior with dask-bigquery: depending on the query, the array columns sometimes end up in the Dask DataFrame as np.array values, and sometimes they get stringified.

To debug, I also tried swapping to some delayed functions and building the DataFrame with dd.from_delayed.
Each of the delayed functions returns a DataFrame whose array columns contain the correct array types.
However, after dd.from_delayed, when I look at the values, they are now stringified arrays:

delayed_dfs = [
    bigquery_read(
        stream_name=stream.name,
        make_create_read_session_request=make_create_read_session_request,
        project_id=project_id,
        read_kwargs=read_kwargs,
        arrow_options=arrow_options,
    )
    for stream in session.streams
]
return dd.from_delayed(dfs=delayed_dfs, meta=meta)

bigquery_read returns correct pandas DataFrames. Inner df["my_arr_col"].array values:
<PandasArray> [array(['a', 'b', 'c'], dtype=object), array(['d', 'e', 'f'], dtype=object)]
But when looking at dd.from_delayed(…)["my_arr_col"].array, the values are now strings:
<ArrowStringArray> ["['a' 'b' 'c']", "['d' 'e' 'f']"]

I’m also passing the meta arg to dd.from_delayed with the object dtype for my array columns, but because strings and arrays share the same 'object' dtype, this doesn’t seem to be the right way.
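
For reference, the meta I’m passing looks roughly like this (just a sketch; the column names are placeholders for my real schema):

import pandas as pd

# object dtype is the most specific thing I can declare for array-valued columns
meta = pd.DataFrame({
    "scalar_col": pd.Series(dtype="int64"),
    "my_arr_col": pd.Series(dtype=object),  # should hold np.ndarray values
})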

A last-resort option would be using ast.literal_eval, but the DataFrame is huge and I would like to avoid the unnecessary stringify/parse round-trip.
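
If it came to that, the fallback would look something like this (a sketch, and it assumes the stringified values are valid Python literals, which isn’t even true for the space-separated numpy form):

import ast

# re-parse the stringified arrays in every partition -- the round-trip I'd like to avoid
def unstringify(df):
    df["my_arr_col"] = df["my_arr_col"].map(ast.literal_eval)
    return df

ddf = ddf.map_partitions(unstringify)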

Hi @TheTrope, welcome to the Dask community!

Do you think you could come up with some reproducer not depending on BigQuery? Maybe by creating a Pandas DataFrame with arrays, wrapping it in delayed calls, and trying to build a Dask DataFrame from this?
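
For instance, something along these lines (just a sketch, all names made up):

import dask
import dask.dataframe as dd
import pandas as pd

def make_part():
    # one partition with an array-valued column
    return pd.DataFrame({"x": [1, 2], "arr": [[1, 2, 3], [4, 5, 6]]})

parts = [dask.delayed(make_part)() for _ in range(4)]
ddf = dd.from_delayed(parts, meta=make_part().iloc[:0])
print(type(ddf.head()["arr"].iloc[0]))  # list, or str?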

Hi @guillaumeeb

The shortest sample I can write that showcases my problem is:


import dask.dataframe as dd
import pandas as pd
data = pd.DataFrame({"other_column": [1] * 100000, "test": [[1, 2, 3]] * 100000})
ddf = dd.from_pandas(data, npartitions=20)
print(ddf.head()["test"].values)
print(type(ddf.head()["test"].iloc[0]))

Result:

['[1, 2, 3]', '[1, 2, 3]', '[1, 2, 3]', '[1, 2, 3]', '[1, 2, 3]']
Length: 5, dtype: string
<class 'str'>

If I try with np.array instead of list, I get a similar stringified result (without commas between the numbers, which is what I was seeing with BigQuery).
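
Side note: the dtype: string / ArrowStringArray in the output makes me wonder whether this comes from Dask’s automatic conversion of object columns to pyarrow strings. Something I could try is disabling that conversion (a sketch, assuming the dataframe.convert-string config option is what triggers it):

import dask
import dask.dataframe as dd
import pandas as pd

# hypothesis: turning off the object -> Arrow string conversion keeps the lists intact
dask.config.set({"dataframe.convert-string": False})

data = pd.DataFrame({"other_column": [1] * 100000, "test": [[1, 2, 3]] * 100000})
ddf = dd.from_pandas(data, npartitions=20)
print(type(ddf.head()["test"].iloc[0]))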

Thanks