How to detect actual string type

Description

I cannot find a way to assess if the underlying data is an actual string or not.

Consider the following snippet:

import dask.dataframe as dd

class Foo:
    pass
    
data = {
    "a": list("xyz"),
    "b": [[1], [2, 3], [3, 2, 1]],
    "c": [Foo(), Foo, Foo()]
}

df = dd.from_dict(data, npartitions=1)
df.dtypes

a string[pyarrow]
b string[pyarrow]
c string[pyarrow]
dtype: object

And (maybe even worse), if I run any .str namespace method, no warning or error is raised:

df["c"].str.split().compute()

Results in

0 [<__main__.Foo, object, at, 0x7fd9a53b75e0>]
1 [<class, '__main__.Foo'>]
2 [<__main__.Foo, object, at, 0x7fd9a53b75b0>]
Name: c, dtype: object

Pandas counterpart

In pandas land, some tricks can be applied, depending on how certain we want to be that a column is backed by actual strings.

Example:

import pandas as pd

df = pd.DataFrame(data)
isinstance(df["c"].loc[df["c"].first_valid_index()], str)

False

More generally, I can sample a handful of random indexes and check whether the values at those positions are strings.
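To make the sampling idea concrete, here is a minimal sketch in pandas. The helper name `looks_like_str_column` and its `n` and `seed` parameters are made up for illustration; they are not part of pandas. It only inspects a sample, so a `True` result is a heuristic rather than a guarantee.

```python
import pandas as pd


class Foo:
    pass


data = {
    "a": list("xyz"),
    "b": [[1], [2, 3], [3, 2, 1]],
    "c": [Foo(), Foo, Foo()],
}


def looks_like_str_column(series: pd.Series, n: int = 5, seed: int = 0) -> bool:
    """Hypothetical helper: check a random sample of non-null values for str."""
    valid = series.dropna()
    if valid.empty:
        # Nothing to inspect, so we cannot claim the column holds strings.
        return False
    # Sample at most n values; random_state keeps the check reproducible.
    sample = valid.sample(n=min(n, len(valid)), random_state=seed)
    return all(isinstance(value, str) for value in sample)


df = pd.DataFrame(data)
checks = {name: looks_like_str_column(df[name]) for name in df.columns}
print(checks)
```

Raising `n` makes the check stricter at the cost of touching more elements.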

Additional info

dask==2024.6.2
pandas==2.2.2
pyarrow==17.0.0

Hi @FBruzzesi, welcome to the Dask community!

For several months now, Dask has converted all pandas object dtypes to string[pyarrow] by default. If you don't want that to happen, you can modify the Dask configuration:

import dask
dask.config.set({"dataframe.convert-string": False})

Then you’ll probably be able to use your Pandas trick?

Hey @guillaumeeb, thanks for the reply.
I moved the discussion to an issue in the dask repo because, for some reason, the server blocked me when I tried to publish the question here, and then I forgot to remove it.

Regarding the pandas trick: I think that's eager-only, as I need to access the element(s).

For more context: I am trying to implement a casting function in Narwhals. I think we will need to assume that the data types passed by the user are whatever Dask says they are. I will sleep on this and see if something more can be done.

If you have ideas, I am all ears :slight_smile: