How to detect actual string type

FBruzzesi · August 12, 2024, 3:10pm

Description

I cannot find a way to assess if the underlying data is an actual string or not.

Consider the following snippet:

import dask.dataframe as dd

class Foo:
    pass
    
data = {
    "a": list("xyz"),
    "b": [[1], [2, 3], [3, 2, 1]],
    "c": [Foo(), Foo, Foo()]
}

df = dd.from_dict(data, npartitions=1)
df.dtypes

a    string[pyarrow]
b    string[pyarrow]
c    string[pyarrow]
dtype: object

And (maybe even worse), running any .str namespace method raises no error or warning:

df["c"].str.split().compute()

Results in

0    [<__main__.Foo, object, at, 0x7fd9a53b75e0>]
1    [<class, '__main__.Foo'>]
2    [<__main__.Foo, object, at, 0x7fd9a53b75b0>]
Name: c, dtype: object

Pandas counterpart

In pandas land, some tricks can be applied, depending on how certain we want to be that a column is backed by actual strings.

Example:

import pandas as pd

df = pd.DataFrame(data)
isinstance(df["c"].loc[df["c"].first_valid_index()], str)

False

In general, I can sample a handful of random indices and check whether the values at those positions are strings.
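A minimal sketch of that sampling idea in pandas. The helper name `looks_like_strings`, the sample size of 5, and the fixed seed are assumptions made for illustration, not part of pandas. Because it only inspects a sample, a column that is mostly strings with a few stray objects can still pass.

```python
import pandas as pd


class Foo:
    pass


def looks_like_strings(series: pd.Series, n: int = 5, seed: int = 0) -> bool:
    """Hypothetical helper: sample up to n non-null values and check they are all str."""
    non_null = series.dropna()
    if non_null.empty:
        # Nothing to inspect, so we cannot claim the column holds strings.
        return False
    sample = non_null.sample(n=min(n, len(non_null)), random_state=seed)
    return all(isinstance(value, str) for value in sample)


df = pd.DataFrame({
    "a": list("xyz"),
    "b": [[1], [2, 3], [3, 2, 1]],
    "c": [Foo(), Foo, Foo()],
})

checks = {name: looks_like_strings(df[name]) for name in df.columns}
print(checks)
```

Raising `n` makes the check stricter at the cost of touching more values.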

Additional info

dask==2024.6.2
pandas==2.2.2
pyarrow==17.0.0

guillaumeeb · August 15, 2024, 4:27pm

Hi @FBruzzesi, welcome to the Dask community!

For several months now, Dask has converted all pandas object dtypes to string[pyarrow] by default. If you don’t want that to happen, you can modify the Dask configuration:

import dask
dask.config.set({"dataframe.convert-string": False})

Then you’ll probably be able to use your Pandas trick?

FBruzzesi · August 17, 2024, 4:37pm

Hey @guillaumeeb, thanks for the reply.
I moved the discussion to an issue in the dask repo because, for some reason, the server blocked me when I tried to publish the question here, and then I forgot to remove this post.

Regarding the pandas trick: I think that’s eager-only, as I need to access the element(s).

For more context: I am trying to implement a casting function in Narwhals. I think we will need to assume that the datatype passed by the user is whatever Dask says it is. I will sleep on this and see if something more can be done.

If you have ideas, I am all ears.
