Method 'acquire' of '_thread.lock' taking 90% of time

I am not using the distributed client. First I load a Pandas DataFrame into a Dask DataFrame, then I run a regex check on every column. I have multiple regexes, so I do this operation several times and store the resulting lazy values in a dict so that everything can be computed at the end.
Here is the code:

import dask
from typing import Dict

futures_dict: Dict[str, Dict[str, float]] = {}
for matcher in regex_list:
    column_name_and_mean: Dict[str, float] = {}
    for column_name in table_df.columns:
        # lazy Dask scalar: fraction of values in this column matching the regex
        column_name_and_mean[column_name] = table_df[column_name].str.match(matcher.regex).mean()
    futures_dict[matcher.name] = column_name_and_mean

# dask.compute returns a tuple, so unpack the single result
(results,) = dask.compute(futures_dict)

Here is the flame graph from cProfile:

So, most of my time is used up in ‘acquire’ of ‘_thread.lock’. Can someone help me with this problem?
Thank you!


Hi @rvarunrathod, welcome to Dask community!

It would really help if you could provide a complete reproducer of your workflow. Could you do that?

What is the performance if you just loop and do the computation with plain Pandas? Is it faster?
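As a rough sketch of what I mean (assuming pandas_df is your original Pandas DataFrame and regex_list is the same list of matcher objects), the pure-Pandas baseline would be something like:

pandas_results = {}
for matcher in regex_list:
    # eager computation: each .mean() returns a plain float immediately
    pandas_results[matcher.name] = {
        column_name: pandas_df[column_name].str.match(matcher.regex).mean()
        for column_name in pandas_df.columns
    }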

What happens if you try to use a distributed setup, with a LocalCluster?
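For example, something like this (a minimal sketch, assuming dask[distributed] is installed; the worker and thread counts are just placeholders):

import dask
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=4, threads_per_worker=2)
client = Client(cluster)

# dask.compute now runs on the local cluster; the dashboard
# (URL in client.dashboard_link) shows per-task timings
(results,) = dask.compute(futures_dict)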

I’m not entirely sure what the profile you are getting means; maybe you are not getting information from all the threads?

This is entirely expected: snakeviz/cProfile only look at what your main thread is doing, but all of the compute work is happening in other threads, so the main thread just shows up as waiting (hence the time in _thread.lock.acquire). This is one of the reasons to use distributed even just locally (LocalCluster): it gives much better diagnostic information.

There are some tools available for diagnostics without distributed, but you need to opt in to using them - see the docs.
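For example, the local-scheduler profilers in dask.diagnostics can be used as context managers around the compute call (a minimal sketch; the dt sampling interval is just an illustrative value):

import dask
from dask.diagnostics import Profiler, ResourceProfiler, visualize

with Profiler() as prof, ResourceProfiler(dt=0.25) as rprof:
    (results,) = dask.compute(futures_dict)

# renders an interactive Bokeh plot of per-task timings and CPU/memory usage
visualize([prof, rprof])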
