Serialization error when converting a Dask DataFrame to a Dask Array

I’m running into an odd serialization error when I attempt to convert a Dask DataFrame into a Dask Array. The Dask client is connected to a remote cluster; the error does not occur when I run the same code against a local cluster.

2022-05-06 05:34:26,403 - distributed.protocol.pickle - INFO - Failed to serialize (subgraph_callable-2cc3f985-f7cb-4a79-8c96-d4d2257e9b5e, 'from-delayed-ec8e7b03b6f74af03a3c2307fbdebc40', (<function apply at 0x7fd6cdbf5550>, <function _read_sql_chunk at 0x7fd5e96289d0>, [<sqlalchemy.sql.selectable.Select object at 0x7fd6f7d13b50>, 'remote_cluster_address:8123/default', Empty DataFrame
Columns: [feature_a, feature_b, feature_c, label]
Index: []], (<class 'dict'>, [['engine_kwargs', (<class 'dict'>, [])], ['index_col', 'index']]))). Exception: cannot pickle 'sqlalchemy.cprocessors.UnicodeResultProcessor' object
2022-05-06 05:34:26,404 - distributed.protocol.core - CRITICAL - Failed to Serialize

Here is the code that generates the error:

import dask.dataframe
import dask.distributed
import sqlalchemy
from sqlalchemy import sql

# dask_scheduler_address, engine, conn_str, target_table and feature_names
# are defined earlier in my script.
dask_client = dask.distributed.Client(
    dask_scheduler_address,
    serializers=['dask'],
    deserializers=['dask']
)

sa_metadata = sqlalchemy.MetaData(bind=engine)
sa_table = sqlalchemy.Table(target_table, sa_metadata, autoload_with=engine)

# Select only the feature columns from the target table
query = sql.select([sql.column(name) for name in feature_names]).select_from(sa_table)

data_df = dask.dataframe.read_sql_query(query, con=conn_str, index_col="index", npartitions=8)
data_df = data_df.fillna(method="ffill", axis=1)  # fillna is not in-place, so assign the result back
data_array = data_df.to_dask_array(lengths=True)  # error occurs on this line

I’m not entirely sure how to proceed from here. Any advice would be greatly appreciated.

For read_sql_query, you should not pass a connection object, but rather the connection string used to create one; each worker must make its own connection (see the short sketch after the docstring excerpt below).

con : str
    Full sqlalchemy URI for the database connection
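For example (a minimal sketch; the URI, table, and column names here are placeholders, not taken from the post above):

import dask.dataframe as dd
import sqlalchemy as sa

# A plain URI string: every worker builds its own engine/connection from it,
# so no live connection object ever has to be serialized.
conn_str = "postgresql://user:password@db-host:5432/mydb"  # placeholder URI

query = sa.select(
    [sa.column("index"), sa.column("feature_a"), sa.column("feature_b")]
).select_from(sa.table("my_table"))

ddf = dd.read_sql_query(
    query,
    con=conn_str,       # a string, never an Engine or Connection object
    index_col="index",
    npartitions=8,
)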

I am not passing a connection object to the read_sql_query() call; I was already passing the URI string.
It seems to me that the serialization problem comes from the SQLAlchemy Select object used for the query, not from a SQLAlchemy connection object (a quick local check is sketched below).
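One way to narrow this down locally (an assumption on my part that the query object is the unpicklable piece; cloudpickle is what distributed falls back to when plain pickle fails):

import cloudpickle

# If the Select, or anything attached to it such as reflected column types,
# cannot be pickled, this reproduces the error outside the cluster.
try:
    cloudpickle.dumps(query)
    print("query object serializes fine")
except Exception as exc:
    print(f"query object is not serializable: {exc!r}")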

I managed to find a workaround by splitting up the query, using pandas.read_sql(), and building up the Dask DataFrame partition by partition with dask.dataframe.from_delayed(), roughly as sketched below.
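A minimal sketch of that workaround (the table name, the integer chunk boundaries on the index column, and the dtypes in meta are illustrative placeholders, not my actual values; conn_str is the same database URI string as above):

import dask
import dask.dataframe as dd
import pandas as pd
import sqlalchemy


@dask.delayed
def load_chunk(conn_str, query_str):
    # Each task opens its own engine and returns a plain pandas DataFrame,
    # so nothing SQLAlchemy-specific ever needs to be pickled.
    engine = sqlalchemy.create_engine(conn_str)
    try:
        return pd.read_sql(query_str, engine, index_col="index")
    finally:
        engine.dispose()


# Illustrative integer ranges on the index column, one per partition.
chunk_bounds = [(0, 100_000), (100_000, 200_000), (200_000, 300_000)]

parts = [
    load_chunk(conn_str, f"SELECT * FROM my_table WHERE index >= {lo} AND index < {hi}")
    for lo, hi in chunk_bounds
]

# meta tells Dask the schema of each partition; an empty pandas DataFrame
# with the right columns and dtypes is enough.
meta = pd.DataFrame({
    "feature_a": pd.Series(dtype="float64"),
    "feature_b": pd.Series(dtype="float64"),
    "feature_c": pd.Series(dtype="float64"),
    "label": pd.Series(dtype="float64"),
})
meta.index.name = "index"

data_df = dd.from_delayed(parts, meta=meta)
data_array = data_df.to_dask_array(lengths=True)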