Hi,
We are trying to evaluate a large list of regexes against a dataset of roughly 100-150 GB and write the match results into a new field. Once a row matches a regex, there are progressively fewer regexes left to try for it. Do you have any recommendations on the type of operations to use to get this done efficiently?
We tried a few approaches, such as `Series.str.match` and `DataFrame.map_partitions`.
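For context, here is a simplified sketch of the per-partition approach we are using (the pattern list and column names are illustrative, not our real ones):

```python
import re

# Hypothetical pattern list; our real list is much larger.
PATTERNS = [re.compile(p) for p in [r"foo\d+", r"bar.*baz", r"qux"]]

def first_match(text):
    """Return the index of the first pattern that matches, else None."""
    for i, pat in enumerate(PATTERNS):
        if pat.search(text):
            return i
    return None

# Applied per partition so the compiled patterns are reused, e.g.:
# df["pattern_id"] = df.map_partitions(
#     lambda part: part["text"].map(first_match),
#     meta=("pattern_id", "object"),
# )
```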
We are running Dask 2023.8.0 on a Kubernetes cluster via the Dask Operator.
We have also seen the task stream get stuck or remain largely empty. Looking at the Call Stack, the workers appear to be stuck in the regex compilation/search stage.
Thanks in advance.