It seems that p2p shuffling is not supported for a dask dataframe that contains columns of a user-defined type. This is not a limitation for tasks shuffling, though. Here’s an example:
Here, ddf contains a column B of type Foo, and as a result I can’t run a groupby-apply on ddf with p2p shuffling. I tried setting the meta for B to Foo, but it still doesn’t work. Is there currently a way to do this?
Yes, I ran it with several recent dask versions and got the same error saying the p2p shuffle failed. The root cause seems to be that p2p relies on pyarrow, which complains about user-defined types. What I’m not sure about is whether there’s a way to work around that. If not, this is probably worth an issue or feature request.
Yes, you are right, serialization only happens when using multiple processes.
You correctly identified the cause: PyArrow doesn’t know how to serialize your user-defined type. There might be a way around it, but I haven’t found one yet; maybe someone knows better than me!
@fjetter Thanks for the reply. I think the example in the OP runs fine locally; it only fails when run on a cluster. I tried running with 2023.12.0 and it still fails. Sorry for the confusion.
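Since the tasks shuffle doesn’t have this limitation, one possible workaround is to force it and keep p2p (and its pyarrow serialization) from being selected on the cluster. A sketch, assuming a recent dask version where the `dataframe.shuffle.method` config key is recognized:

```python
# Force the tasks shuffle globally so p2p is never chosen as the
# default on a distributed cluster. The config key name is an
# assumption for recent dask versions.
import dask

dask.config.set({"dataframe.shuffle.method": "tasks"})
```

Depending on the version, the same choice can also be made per call with a keyword on the groupby-apply (the keyword has been renamed across releases, e.g. `shuffle=` vs. `shuffle_method=`), so check the signature for the version you’re on.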