Hi, I am not sure whether the following setup for using Dask is correct, or whether I should be using futures at all.
I build my Dask cluster as follows and load my data:
import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

n_workers = 8
print(f"Creating cluster with {n_workers} workers")
cluster = LocalCluster(n_workers=n_workers, threads_per_worker=1)
client = Client(cluster)

d = dld.csv_load(filename)  # I have to use this custom csv_load function, which returns a pandas DataFrame
ddf = dd.from_pandas(d, npartitions=1)  # single partition, to test whether order is preserved when writing to parquet
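As a quick sanity check on this step, something like the line below is what I have in mind to rule out from_pandas itself reordering anything (it says nothing about the parquet write later on):

pd.testing.assert_frame_equal(ddf.compute(), d)  # should pass if from_pandas keeps rows and index as-is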
I then have a custom function (which calls a number of other custom modules) that does some computation row by row:
def process_row(row):
    new_df = pd.DataFrame(row).transpose().reset_index(drop=True)
    ...  # does some computations and creates a DataFrame for that row, which is probably ridiculous in this setup
    results_df = pd.DataFrame({
        **summary,
        'R': [q_dict['qz']],
        'Ch': [q_dict['chisq']]
    })
    return results_df
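For context, the call pattern on a single pandas row is just the following (row 0 picked arbitrarily, only to show how the function is meant to be used):

single_result = process_row(d.iloc[0])  # one-row DataFrame built from the summary plus 'R' and 'Ch'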
I then use futures like this:
futures = []
for _, row in ddf.iterrows():
    future = client.submit(process_row, row)
    futures.append(future)

ddfg = dd.from_delayed(futures)
ddfg.to_parquet(outname, engine='pyarrow', write_index=False, compute=True)
The computation is incredibly slow, and I am not sure whether row order is preserved when writing to parquet.
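The check I have in mind for the ordering is roughly the sketch below; it assumes the results keep some identifying column (called 'id' here purely as a placeholder) that also exists in the input, so the written parquet can be lined up against the original pandas frame:

check = pd.read_parquet(outname, engine='pyarrow')  # read back what Dask wrote
# 'id' is a stand-in for whatever key column the real summary carries through
print(check['id'].tolist() == d['id'].tolist())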
Any thoughts/help would be greatly appreciated, thanks.