I upgraded dask from 2023.5.0 to 2024.1.0.
After the upgrade, a regex that runs without error in pandas raises an error in dask. I don't know what changed in dask.
Here is the pandas code:
>> import pandas as pd
>> ds = pd.Series(["asdfa","afdsewr"],dtype=str)
>> reg = '\\b((((?!(000|666))([0-8][0-9][0-9]))((?!00)([0-9]{2}))((?!0000)([0-9]{4})))|(98765432[0-9]))\\b","\\b((((?!(000|666))([0-8][0-9][0-9]))[\\ ]((?!00)([0-9]{2}))[\\ ]((?!0000)([0-9]{4})))|((987)-(65)-(432)[0-9]))\\b","\\b((((?!(000|666))([0-8][0-9][0-9]))-((?!00)([0-9]{2}))-((?!0000)([0-9]{4})))|((987)-(65)-(432)[0-9]))\\b'
>> ds.str.match(reg).mean()
0.0
Here is the dask code:
>> import pandas as pd
>> from dask import dataframe
>> ds = pd.Series(["asdfa","afdsewr"],dtype=str)
>> df = dataframe.from_pandas(ds, npartitions=1)
>> reg = '\\b((((?!(000|666))([0-8][0-9][0-9]))((?!00)([0-9]{2}))((?!0000)([0-9]{4})))|(98765432[0-9]))\\b","\\b((((?!(000|666))([0-8][0-9][0-9]))[\\ ]((?!00)([0-9]{2}))[\\ ]((?!0000)([0-9]{4})))|((987)-(65)-(432)[0-9]))\\b","\\b((((?!(000|666))([0-8][0-9][0-9]))-((?!00)([0-9]{2}))-((?!0000)([0-9]{4})))|((987)-(65)-(432)[0-9]))\\b'
>> df.str.match(reg).mean()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.11/site-packages/dask/dataframe/accessor.py", line 15, in func
return self._function_map(attr, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/dask_expr/_accessor.py", line 66, in _function_map
return new_collection(
^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/dask_expr/_collection.py", line 4764, in new_collection
meta = expr._meta
^^^^^^^^^^
File "/usr/lib64/python3.11/functools.py", line 1001, in __get__
val = self.func(instance)
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/dask_expr/_accessor.py", line 108, in _meta
return make_meta(self.operation(*args, **self._kwargs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/dask_expr/_accessor.py", line 112, in operation
out = getattr(getattr(obj, accessor, obj), attr)(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib64/python3.11/site-packages/pandas/core/strings/accessor.py", line 137, in wrapper
return func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib64/python3.11/site-packages/pandas/core/strings/accessor.py", line 1376, in match
result = self._data.array._str_match(pat, case=case, flags=flags, na=na)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib64/python3.11/site-packages/pandas/core/arrays/string_arrow.py", line 431, in _str_match
return self._str_contains(pat, case, flags, na, regex=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib64/python3.11/site-packages/pandas/core/arrays/string_arrow.py", line 357, in _str_contains
result = pc.match_substring_regex(self._pa_array, pat, ignore_case=not case)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib64/python3.11/site-packages/pyarrow/compute.py", line 263, in wrapper
return func.call(args, options, memory_pool)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/_compute.pyx", line 385, in pyarrow._compute.Function.call
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Invalid regular expression: invalid perl operator: (?!
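To rule out the pattern itself, here is a minimal sketch that checks it with Python's built-in `re` module only, without pandas, dask, or pyarrow. The traceback above ends in pyarrow's `match_substring_regex`, and the message complains about `(?!`, so my guess is that pyarrow's regex engine does not accept negative lookahead while `re` does. The helper name `match_mean` and the sample value `"123456789"` are made up for illustration, and only the first alternative of my pattern is used to keep it short.

```python
import re

# First alternative of the pattern from the post, written as a raw string
# so the backslashes do not need doubling.
PATTERN = (
    r"\b((((?!(000|666))([0-8][0-9][0-9]))"
    r"((?!00)([0-9]{2}))"
    r"((?!0000)([0-9]{4})))"
    r"|(98765432[0-9]))\b"
)


def match_mean(values, pattern=PATTERN):
    """Fraction of values that match at the start of the string.

    Hypothetical helper that mimics Series.str.match(pattern).mean()
    using only the standard library.
    """
    compiled = re.compile(pattern)  # negative lookahead compiles fine in `re`
    hits = [compiled.match(value) is not None for value in values]
    return sum(hits) / len(hits)


# Same two strings as in the pandas and dask examples.
print(match_mean(["asdfa", "afdsewr"]))  # 0.0

# Illustrative value (not from my data) that the pattern should accept.
print(match_mean(["123456789"]))  # 1.0
```

If this runs cleanly, the pattern is valid for Python's engine, which would suggest the difference is in which engine evaluates it and not in the regex.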
pip3 list for dask==2023.5.0
pandas == 2.2.2
pyarrow == 16.1.0
pip3 list for dask==2024.1.0
pandas == 2.1.4
pyarrow == 14.0.2
pyarrow-hotfix == 0.6
I also tried upgrading pandas and pyarrow to match the versions from the dask==2023.5.0 environment, but I still get the same error.
Python version: 3.11.7
Can someone help me understand what I am doing wrong?