Ideal way to sweep search regex lists

jerrygb · November 1, 2023, 11:07am

Hi,

We are trying to evaluate a large list of regexes against a dataset of approximately 100-150 GB and write the matches into a new field.

After each regex match, there are progressively fewer regexes left to go through. Do you have any recommendations on which operations to use to get this done efficiently?

We have tried a few, such as str.match and dd.map_partitions.

We are running this on Dask 2023.8.0 using a Kubernetes cluster/operator.

We have also seen the task stream being stuck or largely empty. Looking at the call stack, the workers seem to be in the regex compilation/search stage.

Thanks in advance.

guillaumeeb · November 3, 2023, 3:28pm

Hi @jerrygb,

Could you provide a reproducible example? This would help us understand what you are trying to achieve and where the limitation is.

Do you have some working code using Pandas? Is your dataset in tabular format?
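
To make the question concrete, here is a minimal plain-Python sketch of one possible reading of the sweep described above. It is not the original poster's code. It assumes that the first matching pattern wins and that a row, once matched, is skipped for all later patterns. The function name `sweep_labels`, the sample rows, and the pattern names are hypothetical.

```python
import re


def sweep_labels(values, patterns):
    """Label each string with the name of the first pattern that matches it.

    values   -- list of strings to classify
    patterns -- list of (name, regex_source) pairs, tried in order
    Rows that match no pattern keep the label None.
    """
    # Compile every pattern once, up front, rather than once per row.
    compiled = [(name, re.compile(source)) for name, source in patterns]

    labels = [None] * len(values)
    # Indices of rows that have not matched any pattern yet.
    remaining = list(range(len(values)))

    for name, regex in compiled:
        if not remaining:
            break  # every row is already labelled
        still_unmatched = []
        for i in remaining:
            if regex.search(values[i]):
                labels[i] = name
            else:
                still_unmatched.append(i)
        # Later patterns only see the rows that are still unlabelled.
        remaining = still_unmatched

    return labels


# Hypothetical sample data, for illustration only.
rows = ["error: disk full", "warn: retry 3", "GET /index.html", "error: timeout"]
patterns = [
    ("error", r"^error:"),
    ("warning", r"^warn:"),
    ("http", r"^(GET|POST) "),
]
print(sweep_labels(rows, patterns))
# -> ['error', 'warning', 'http', 'error']
```

With Dask, a function of this shape could be applied to one column per partition through `map_partitions`, so that each worker compiles the pattern list once per partition. Whether that helps with the stalled task stream depends on the actual patterns and data, which is why a reproducible example would be useful.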
