Hi, I’m trying to read some data from BigQuery. The table contains some classic scalar columns, but also some ARRAY ones.
I’m seeing inconsistent behavior with dask-bigquery: depending on the query, the array columns sometimes end up in the Dask dataframe as np.array values, and sometimes the array values get stringified.
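For context, this is roughly how I’m reading the table (project/dataset/table/column names below are placeholders):

import dask_bigquery

ddf = dask_bigquery.read_gbq(
    project_id="my-project",
    dataset_id="my_dataset",
    table_id="my_table_with_array_cols",
)
print(ddf.dtypes)                # the ARRAY column shows up as object...
print(ddf["my_arr_col"].head())  # ...sometimes np.array values, sometimes stringified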
To debug, I also tried swapping to some delayed funcs and building the dataframe with dd.from_delayed.
Each of the delayed funcs returns a df whose array columns contain the correct array types.
However, after dd.from_delayed, when I look at the values, they are now stringified arrays:
delayed_dfs = [
    bigquery_read(
        stream_name=stream.name,
        make_create_read_session_request=make_create_read_session_request,
        project_id=project_id,
        read_kwargs=read_kwargs,
        arrow_options=arrow_options,
    )
    for stream in session.streams
]
return dd.from_delayed(dfs=delayed_dfs, meta=meta)
bigquery_read returns correct pandas dfs (inner df["my_arr_col"].array values):
<PandasArray> [array(['a', 'b', 'c'], dtype=object), array(['d', 'e', 'f'], dtype=object)]
But when looking at dd.from_delayed(…)["my_arr_col"].array, the values are now strings:
<ArrowStringArray> ["['a' 'b' 'c']", "['d' 'e' 'f']"]
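In case it helps, this is roughly how I’m comparing the two (a sketch of my debugging, not the exact code):

# Compute one stream's delayed result directly...
inner_df = delayed_dfs[0].compute()
print(inner_df["my_arr_col"].array)    # numpy object arrays, as expected

# ...then look at the same column after assembling the dask dataframe.
ddf = dd.from_delayed(dfs=delayed_dfs, meta=meta)
print(ddf["my_arr_col"].head().array)  # ArrowStringArray of stringified arrays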
I’m also passing the meta arg to dd.from_delayed with the object dtype for my array cols, but because strings and arrays share the same ‘object’ dtype, this doesn’t seem to be the right way.
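This is roughly how I’m building the meta I pass in (column names are placeholders):

import pandas as pd

meta = pd.DataFrame(
    {
        "my_scalar_col": pd.Series([], dtype="int64"),  # scalar columns are fine
        "my_arr_col": pd.Series([], dtype="object"),    # same 'object' dtype a string column would get
    }
)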
A last-resort option would be using ast.literal_eval, but the DF is huge and I would like to avoid the unnecessary stringify/parse round trip.
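For completeness, the parse-back I’d like to avoid would look something like this naive sketch (using a plain string split rather than ast.literal_eval, since I’m not sure literal_eval would handle the space-separated numpy repr; it also assumes the element values contain no spaces or quotes):

def unstringify(part):
    # Parse the numpy-style string repr (e.g. "['a' 'b' 'c']") back into lists.
    part["my_arr_col"] = (
        part["my_arr_col"]
        .str.strip("[]")
        .str.replace("'", "", regex=False)
        .str.split()
    )
    return part

ddf = ddf.map_partitions(unstringify)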