Dask Dataframe, how to keep column with array values

TheTrope · August 9, 2023, 9:42am

Hi, I’m trying to read some data from BigQuery. The table contains some classic scalar columns, but also some array ones.

I’m seeing inconsistent behavior with dask-bigquery: depending on the query, the array columns sometimes end up in the Dask DataFrame as np.array values, and sometimes the array values get stringified.

I also tried to debug this by swapping in some delayed functions and building the DataFrame with dd.from_delayed.
Each of the delayed functions returns a DataFrame whose array columns hold values of the correct array type.
However, after dd.from_delayed, the values are stringified arrays:

        delayed_dfs = [
            bigquery_read(
                stream_name=stream.name,
                make_create_read_session_request=make_create_read_session_request,
                project_id=project_id,
                read_kwargs=read_kwargs,
                arrow_options=arrow_options,
            )
            for stream in session.streams
        ]
        return dd.from_delayed(dfs=delayed_dfs, meta=meta)

bigquery_read returns correct pandas DataFrames. The inner df["my_arr_col"].array values are:
<PandasArray> [array(['a', 'b', 'c'], dtype=object), array(['d', 'e', 'f'], dtype=object)]
But the dd.from_delayed(…)["my_arr_col"].array values are now strings:
<ArrowStringArray> ["['a' 'b' 'c']", "['d' 'e' 'f']"]

I’m also passing the meta argument to dd.from_delayed with the object dtype for my array columns, but because strings and arrays share the same ‘object’ dtype, this doesn’t seem to be the right way.

The last-resort option would be to use ast.literal_eval, but the DataFrame is huge and I would like to avoid the unnecessary stringify/parse round trip.
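Beyond the cost, the ast.literal_eval fallback has a correctness limit worth checking. The sketch below uses only the standard library and the stringified shapes reported in this thread. The numeric "[1 2 3]" string is an illustrative assumption based on the comma-less NumPy formatting described here. Strings produced from Python lists parse back correctly, but the comma-less strings produced from NumPy arrays do not.

```python
import ast

# Stringified forms of the kind reported in this thread.
list_style = "[1, 2, 3]"         # came from a Python list
numpy_style = "['a' 'b' 'c']"    # came from an np.array of strings (no commas)
numeric_numpy_style = "[1 2 3]"  # assumed shape for an np.array of numbers

# A list repr round-trips cleanly.
parsed_list = ast.literal_eval(list_style)

# Adjacent string literals are concatenated by the Python parser,
# so the three elements silently collapse into one.
parsed_numpy = ast.literal_eval(numpy_style)

# Space-separated numbers are not valid Python syntax at all.
try:
    ast.literal_eval(numeric_numpy_style)
    numeric_parse_failed = False
except SyntaxError:
    numeric_parse_failed = True

print(parsed_list)           # [1, 2, 3]
print(parsed_numpy)          # ['abc']
print(numeric_parse_failed)  # True
```

So even as a last resort, parsing is only safe for the list-style strings. The NumPy-style strings would need a custom parser.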

guillaumeeb · August 10, 2023, 1:27pm

Hi @TheTrope, welcome to the Dask community!

Do you think you could come up with a reproducer that doesn’t depend on BigQuery? Maybe by creating a pandas DataFrame with arrays, wrapping it in delayed calls, and trying to build a Dask DataFrame from this?

TheTrope · August 16, 2023, 9:23pm

Hi @guillaumeeb

The shortest sample I can come up with that showcases my problem is:


import dask.dataframe as dd
import pandas as pd
data = pd.DataFrame({"other_column": [1] * 100000, "test": [[1, 2, 3]] * 100000})
ddf = dd.from_pandas(data, npartitions=20)
print(ddf.head()["test"].values)
print(type(ddf.head()["test"].iloc[0]))

Result:

['[1, 2, 3]', '[1, 2, 3]', '[1, 2, 3]', '[1, 2, 3]', '[1, 2, 3]']
Length: 5, dtype: string
<class 'str'>

If I try with np.array instead of list, I get a similar stringified result (without commas between the numbers, so that’s what I had with BigQuery).

Thanks
