After upgrading Dask, dataframe.str.match raises an error for the same regex

I upgraded Dask from 2023.5.0 to 2024.1.0.

The same regex does not raise an error in pandas. I don’t know what changed in Dask after the upgrade.

Here is the pandas code.

>> import pandas as pd

>> ds = pd.Series(["asdfa","afdsewr"],dtype=str)
>> reg = '\\b((((?!(000|666))([0-8][0-9][0-9]))((?!00)([0-9]{2}))((?!0000)([0-9]{4})))|(98765432[0-9]))\\b","\\b((((?!(000|666))([0-8][0-9][0-9]))[\\ ]((?!00)([0-9]{2}))[\\ ]((?!0000)([0-9]{4})))|((987)-(65)-(432)[0-9]))\\b","\\b((((?!(000|666))([0-8][0-9][0-9]))-((?!00)([0-9]{2}))-((?!0000)([0-9]{4})))|((987)-(65)-(432)[0-9]))\\b'
>> ds.str.match(reg).mean()
0.0

Here is the Dask code.

>> import pandas as pd
>> from dask import dataframe

>> ds = pd.Series(["asdfa","afdsewr"],dtype=str)
>> df = dataframe.from_pandas(ds, npartitions=1)
>> reg = '\\b((((?!(000|666))([0-8][0-9][0-9]))((?!00)([0-9]{2}))((?!0000)([0-9]{4})))|(98765432[0-9]))\\b","\\b((((?!(000|666))([0-8][0-9][0-9]))[\\ ]((?!00)([0-9]{2}))[\\ ]((?!0000)([0-9]{4})))|((987)-(65)-(432)[0-9]))\\b","\\b((((?!(000|666))([0-8][0-9][0-9]))-((?!00)([0-9]{2}))-((?!0000)([0-9]{4})))|((987)-(65)-(432)[0-9]))\\b'
>> df.str.match(reg).mean()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.11/site-packages/dask/dataframe/accessor.py", line 15, in func
    return self._function_map(attr, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/dask_expr/_accessor.py", line 66, in _function_map
    return new_collection(
           ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/dask_expr/_collection.py", line 4764, in new_collection
    meta = expr._meta
           ^^^^^^^^^^
  File "/usr/lib64/python3.11/functools.py", line 1001, in __get__
    val = self.func(instance)
          ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/dask_expr/_accessor.py", line 108, in _meta
    return make_meta(self.operation(*args, **self._kwargs))
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/dask_expr/_accessor.py", line 112, in operation
    out = getattr(getattr(obj, accessor, obj), attr)(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib64/python3.11/site-packages/pandas/core/strings/accessor.py", line 137, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib64/python3.11/site-packages/pandas/core/strings/accessor.py", line 1376, in match
    result = self._data.array._str_match(pat, case=case, flags=flags, na=na)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib64/python3.11/site-packages/pandas/core/arrays/string_arrow.py", line 431, in _str_match
    return self._str_contains(pat, case, flags, na, regex=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib64/python3.11/site-packages/pandas/core/arrays/string_arrow.py", line 357, in _str_contains
    result = pc.match_substring_regex(self._pa_array, pat, ignore_case=not case)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib64/python3.11/site-packages/pyarrow/compute.py", line 263, in wrapper
    return func.call(args, options, memory_pool)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_compute.pyx", line 385, in pyarrow._compute.Function.call
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Invalid regular expression: invalid perl operator: (?!

pip3 list for dask==2023.5.0

pandas == 2.2.2
pyarrow == 16.1.0

pip3 list for dask==2024.1.0

pandas == 2.1.4
pyarrow == 14.0.2
pyarrow-hotfix == 0.6

I also tried upgrading pandas and pyarrow to match the versions installed alongside dask==2023.5.0, but I still get the same error.

python version 3.11.7

Can someone help me with what I am doing wrong?

Hi @rvarunrathod,

In 2023, Dask started using PyArrow strings by default in the DataFrame API. See PyArrow Strings in Dask DataFrames.

This means that your strings are not object dtype as in pandas, but pyarrow strings, which can lead to different behavior.

I was able to reproduce your problem. A simple workaround is to disable this conversion:

import dask
dask.config.set({"dataframe.convert-string": False})

This is strange though, and I would encourage you to open an issue, but I’m not sure where your problem comes from:

  • Is your regex not standard?
  • Your environment with Dask 2024.1.0 is strange, with lower Pandas and pyarrow versions.
  • Is this a pyarrow or a Dask bug?

I tested the solution below, and it works like before.

import dask
dask.config.set({"dataframe.convert-string": False})

Is your regex not standard?

I think PyArrow supports a different regex format; with the re module, all of my regexes work.

Your environment with Dask 2024.1.0 is strange, with lower Pandas and pyarrow versions.

I also noticed this, but I just ran pip3 install "dask[complete]"==version, and these dependencies were installed by pip3.

Is this a pyarrow or a Dask bug?

I think this is not a bug. PyArrow might be using a different regex format.
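As far as I know, Arrow's regex kernels are built on Google's RE2 library, which does not support lookahead or lookbehind groups. That would be consistent with the "invalid perl operator: (?!" message in the traceback. Here is a rough sketch using only the standard library to flag such patterns before handing them to a PyArrow-backed column. The helper name uses_lookaround is made up for this example, and the check is a plain text heuristic that can give false positives, for example on an escaped parenthesis or inside a character class.

```python
import re

# Group openers that Python's re accepts but RE2-style engines reject:
# (?=...), (?!...), (?<=...), (?<!...)
_LOOKAROUND = re.compile(r"\(\?(?:=|!|<=|<!)")


def uses_lookaround(pattern: str) -> bool:
    """Heuristically report whether a pattern contains a lookaround group."""
    return _LOOKAROUND.search(pattern) is not None


# Fragment of the pattern from the question, with its negative lookahead.
with_lookahead = r"\b(?!(000|666))[0-8][0-9][0-9]"
# A pattern without any lookaround, for comparison.
without_lookahead = r"\b[0-8][0-9]{2}\b"

# Python's re module compiles the lookahead without complaint.
re.compile(with_lookahead)

print(uses_lookaround(with_lookahead))     # True
print(uses_lookaround(without_lookahead))  # False
```

A pattern flagged this way either needs object-dtype strings, as with the config workaround, or has to be rewritten without lookarounds.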
