Description
I cannot find a way to check whether the data underlying a column actually consists of strings.
Consider the following snippet:
import dask.dataframe as dd
class Foo:
    pass
data = {
    "a": list("xyz"),
    "b": [[1], [2, 3], [3, 2, 1]],
    "c": [Foo(), Foo, Foo()]
}
df = dd.from_dict(data, npartitions=1)
df.dtypes
a    string[pyarrow]
b    string[pyarrow]
c    string[pyarrow]
dtype: object
All three columns are reported as string[pyarrow] and (maybe even worse), if I try running any .str namespace method, I get no error or warning:
df["c"].str.split().compute()
Results in
0    [<__main__.Foo, object, at, 0x7fd9a53b75e0>]
1                       [<class, '__main__.Foo'>]
2    [<__main__.Foo, object, at, 0x7fd9a53b75b0>]
Name: c, dtype: object
Pandas counterpart
In pandas land, some tricks can be applied, depending on how certain we want to be that a column is backed by actual strings.
Example:
import pandas as pd
df = pd.DataFrame(data)
isinstance(df["c"].loc[df["c"].first_valid_index()], str)
False
More generally, I can sample a handful of random index labels and check whether the corresponding values are strings.
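For illustration, the random-sampling idea could look roughly like the sketch below. The helper name `looks_like_strings` and its `n` and `seed` parameters are made up for this example and are not part of pandas or dask. It is only a heuristic: a column that mixes strings with other objects can still pass if the sample happens to hit only strings.

```python
import pandas as pd


def looks_like_strings(series: pd.Series, n: int = 5, seed: int = 0) -> bool:
    """Heuristic check: sample up to `n` non-null values and test each with isinstance.

    Hypothetical helper, not a pandas or dask API.
    """
    non_null = series.dropna()
    if non_null.empty:
        # Nothing to inspect, so we cannot confirm the column holds strings.
        return False
    sample = non_null.sample(n=min(n, len(non_null)), random_state=seed)
    return all(isinstance(value, str) for value in sample)


class Foo:
    pass


df = pd.DataFrame({
    "a": list("xyz"),
    "b": [[1], [2, 3], [3, 2, 1]],
    "c": [Foo(), Foo, Foo()],
})

# Apply the check to every column of the pandas frame.
result = {col: looks_like_strings(df[col]) for col in df.columns}
print(result)
```

On the pandas side this tells column "a" apart from "b" and "c". I have not found an equivalent for the dask frame, since by then the values have already been converted.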
Additional info
dask==2024.6.2
pandas==2.2.2
pyarrow==17.0.0