Ideal way to sweep search regex lists


We are trying to evaluate a large list of regexes and write the matches into a new field, over a dataset of approximately 100-150 GB.

After each regex match, there are progressively fewer regexes left to go through. Do you have any recommendations on which operations to use to do this efficiently?

We tried a few, such as str.match and dd.map_partitions.

We are running this on Dask 2023.8.0 on a Kubernetes cluster using the operator.

We have also seen the task stream get stuck or stay largely empty. Looking at the call stack, the workers seem to be in the regex compilation/search stage.

Thanks in advance.

Hi @jerrygb,

Could you provide a reproducible example? It would help us understand what you are trying to achieve and where the limitation is.

Do you have some working code using Pandas? Is your dataset in tabular format?
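
For instance, something as small as the sketch below would already give us a starting point. It is plain Python rather than your actual code, and it assumes a "first match wins" reading of your description: each value gets the label of the first regex that matches, and values that have already matched are skipped for the remaining regexes. The patterns, labels, and the function name `label_values` are all made up for illustration.

```python
import re

# Hypothetical patterns and labels, for illustration only.
PATTERNS = [
    ("error", r"\berror\b"),
    ("timeout", r"timed? ?out"),
    ("http_5xx", r"\b5\d{2}\b"),
]

# Compile once, up front, so compilation cost is not paid per value.
COMPILED = [(label, re.compile(pattern, re.IGNORECASE)) for label, pattern in PATTERNS]


def label_values(values, compiled=COMPILED, default=None):
    """Return one label per value: the label of the first pattern that matches."""
    values = list(values)
    labels = [default] * len(values)
    # Indices of values that have not matched any pattern so far.
    remaining = list(range(len(values)))
    for label, regex in compiled:
        if not remaining:
            break  # everything is labelled, no need to try further patterns
        still_unmatched = []
        for i in remaining:
            if regex.search(values[i]):
                labels[i] = label
            else:
                still_unmatched.append(i)
        remaining = still_unmatched
    return labels


sample = [
    "Connection timed out after 30s",
    "ERROR: disk full",
    "GET /index.html 503",
    "all good",
]
result = label_values(sample)
print(result)
```

If that is roughly what you are doing, a function like this could in principle be applied to each partition with `map_partitions`, but it would be good to see your real patterns and a few sample rows first.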