Help using dask replace on rows meeting a regex condition

kbenny · October 19, 2022, 12:57pm

Hello all, I’m having issues doing a replace with my data. The closest thing I’ve found to help is the dask.dataframe.Series.replace function. I want to replace a string in one column when another column in the same row meets a condition. Here’s a sample CSV to give you an idea of what I’m dealing with:

Type,Indicator,Attribution
Hash,abcdef0123456789abcdef0123456789,adversary1
Hash,abcdef0123456789abcdef0123456789,adversary2
Hash,abcdef0123456789abcdef0123456789,adversary3
Hash,abcdef0123456789abcdef0123456789abcdef01,adversary4
Hash,abcdef0123456789abcdef0123456789abcdef01,adversary5
Hash,abcdef0123456789abcdef0123456789abcdef01,adversary6
Hash,abcdef0123456789abcdef0123456789abcdef0123456789abcdef0123456789,adversary7
Hash,abcdef0123456789abcdef0123456789abcdef0123456789abcdef0123456789,adversary8
Hash,abcdef0123456789abcdef0123456789abcdef0123456789abcdef0123456789,adversary9

I’d like to get to this state:

Type,Indicator,Attribution
hash_md5,abcdef0123456789abcdef0123456789,adversary1
hash_md5,abcdef0123456789abcdef0123456789,adversary2
hash_md5,abcdef0123456789abcdef0123456789,adversary3
hash_sha1,abcdef0123456789abcdef0123456789abcdef01,adversary4
hash_sha1,abcdef0123456789abcdef0123456789abcdef01,adversary5
hash_sha1,abcdef0123456789abcdef0123456789abcdef01,adversary6
hash_sha256,abcdef0123456789abcdef0123456789abcdef0123456789abcdef0123456789,adversary7
hash_sha256,abcdef0123456789abcdef0123456789abcdef0123456789abcdef0123456789,adversary8
hash_sha256,abcdef0123456789abcdef0123456789abcdef0123456789abcdef0123456789,adversary9

I have a regex to look for MD5 hashes in the Indicator column, and I want to change the Type of each matching row from Hash to hash_md5, then perform the same process for the SHA1 and SHA256 hashes. What I’ve run into is that I can make the replacement, but it replaces every instance of Hash in the series instead of only in the rows where Indicator meets my regex condition. I hope this doesn’t sound too confusing.

This is what has worked, but it replaces all instances of Hash with hash_md5 instead of only in the rows where there is a regex match:
ddf.replace({'Indicator': r'^[a-fA-F0-9]{32}$', 'Hash': 'hash_md5'}, regex=True)
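
One possible way to limit the change to the matching rows is to build a boolean condition from the Indicator column and apply it to the Type column with Series.mask, rather than calling replace on the whole frame. The sketch below is only an illustration under a few assumptions: the names HASH_PATTERNS and relabel_hashes are hypothetical, the column names come from the sample CSV above, and the digest lengths (32, 40 and 64 hex characters) are taken to identify MD5, SHA1 and SHA256. It relies only on Series.str.match, Series.mask and DataFrame.assign, which both pandas and Dask DataFrames provide.

```python
# Hypothetical mapping: new Type label -> regex for the Indicator column.
# 32, 40 and 64 hex characters are the digest lengths of MD5, SHA1, SHA256.
HASH_PATTERNS = {
    "hash_md5": r"^[a-fA-F0-9]{32}$",
    "hash_sha1": r"^[a-fA-F0-9]{40}$",
    "hash_sha256": r"^[a-fA-F0-9]{64}$",
}


def relabel_hashes(df):
    """Return a new frame whose Type is rewritten only on matching rows.

    Works on any frame exposing the pandas-style API used here
    (a pandas DataFrame or a Dask DataFrame).
    """
    new_type = df["Type"]
    # Only rows currently labelled "Hash" are candidates for relabelling.
    is_hash = new_type == "Hash"
    for label, pattern in HASH_PATTERNS.items():
        # True only where the row is a Hash AND its Indicator fits the regex.
        matches = is_hash & df["Indicator"].str.match(pattern)
        # mask() swaps in `label` where `matches` is True, keeps the rest.
        new_type = new_type.mask(matches, label)
    # assign() returns a new frame; the input frame is left unchanged.
    return df.assign(Type=new_type)
```

With a Dask DataFrame (for example one loaded with dd.read_csv), relabel_hashes(ddf) should stay lazy like any other Dask expression, so nothing is evaluated until .compute() or a write such as to_csv is called.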
