I've been using the category dtype in pandas for a long time to reduce memory consumption. But as I convert my code to Dask, I've been running into a couple of issues related to this dtype.
My question is: is category worth using in Dask? Does it improve performance?
Could you provide a minimal reproducible example?
My question is more of a conceptual discussion. The category dtype in pandas is incredibly important because of how much it reduces memory consumption, but I wonder if it offers any real improvement in Dask.
I’m guessing the memory usage reductions in pandas you’re referring to come from comparing the “category” dtype to the “object” dtype. In Dask version >= 2023.7.1, object data types are converted to PyArrow strings, which use far less memory. This blog post has more details on the performance improvements you can expect.
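To see the comparison on your own data, here is a minimal sketch in plain pandas that measures the same column stored as “object”, as “category”, and (if PyArrow is installed) as a PyArrow-backed string. The sample values and variable names are made up for illustration, and the actual numbers will depend on your data's cardinality and string lengths.

```python
import pandas as pd

# Hypothetical sample data: a low-cardinality string column.
values = ["red", "green", "blue"] * 100_000

as_object = pd.Series(values, dtype="object")
as_category = as_object.astype("category")

# deep=True counts the Python string objects, not just the pointers.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(f"object:   {object_bytes:>12,} bytes")
print(f"category: {category_bytes:>12,} bytes")

# PyArrow-backed strings are optional; skip the comparison if pyarrow is missing.
try:
    as_arrow = as_object.astype("string[pyarrow]")
    arrow_bytes = as_arrow.memory_usage(deep=True)
    print(f"pyarrow:  {arrow_bytes:>12,} bytes")
except ImportError:
    arrow_bytes = None
    print("pyarrow is not installed; skipping the string[pyarrow] comparison")
```

For a column with only a few distinct values like this one, category will usually still be the smallest of the three, so the PyArrow conversion narrows the gap with “object” rather than making category pointless.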