We are trying to evaluate a large list of regexes against a dataset of approximately 100-150 GB, writing each match into a new field.
After each regex matches, there are progressively fewer regexes left to try against the remaining rows. Do you have any recommendations on the type of operations to use to do this efficiently?
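For context, our matching logic is roughly the following stdlib-only sketch (the pattern names and the `first_match` helper are simplified placeholders, not our real code):

```python
import re

# Placeholder label -> pattern mapping; the real list has many more entries.
PATTERNS = {
    "email": r"[\w.]+@[\w.]+",
    "phone": r"\d{3}-\d{4}",
    "word": r"[a-z]+",
}

# Compile once, up front, so the per-row work is only .search().
COMPILED = [(label, re.compile(p)) for label, p in PATTERNS.items()]

def first_match(text):
    """Return the label of the first pattern that matches, else None."""
    for label, rx in COMPILED:
        if rx.search(text):
            return label
    return None

# With Dask we apply this per partition, roughly:
#   ddf["label"] = ddf["text"].map_partitions(
#       lambda s: s.map(first_match), meta=("label", "object"))
```

Each row stops at its first match, so rows that hit an early pattern never touch the rest of the list; the worry is the cost for rows that fall through most of it.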
We are running this on Dask 2023.8.0, deployed on a Kubernetes cluster via the Dask Operator.
We have also seen the task stream get stuck or stay largely empty. Looking at the Call Stack page of the dashboard, the workers appear to be spending their time in the regex compilation/search stage.
Thanks in advance.