Dask.compute() not fetching all the results

10PriyaA · December 20, 2022, 11:28am

Hello everyone,

I recently came across dask.delayed to parallelize my code. I have six functions which can execute independently. The response from each function is a pandas dataframe, and once all the six dataframes are returned, I simply concatenate them to further transform it to get my final desired output. To parallelize my code, I annotated each function with @dask.delayed. Another delayed function takes all six dataframes as input and concats them before sending the final output.

When I execute the compute statement on the final function, the expected behavior would be to retrieve a combined dataframe. Once the compute is complete, then the rest of the lines of code would resume to execute and transform this combined dataframe. However, many a times, one or the other of the six dataframes (random) is not present in the final df which is causing my subsequent flow to break.

If I am understanding it right, would it be due to the fact that final_df is getting created before all the six dataframes were retrieved? And if so, how can I ensure that the final function only executes once all six functions have run successfully and have returned respective dataframes?

crusaderky · January 2, 2023, 4:54pm

What you describe should work in theory, there must be something wrong in your code. Please post an example.

However, the way you’re approaching the problem is suboptimal - after you concatenate the six dataframes, you will have to pay for their full size in RAM and you won’t be able to parallelize anything.
You should use dask.dataframe.from_delayed to generate a dask.dataframe instead and enjoy full parallelism afterwards.

Topic		Replies	Views
Question: if I am mixing dask.delayed functions and using dask dataframes, are there any caveats to be aware of? Dask DataFrame delayed	5	741	August 21, 2023
Help to check my delayed methods with Dask dataframe Dask DataFrame delayed	2	125	April 1, 2024
Running DataFrame Partition Simulations in Parallel using dask.delayed() Dask DataFrame delayed	2	287	September 27, 2023
Dask Tutorial dask_delayed what's are they asking here? Dask DataFrame	4	222	May 31, 2023
How to handle a Dask DF in multiple modules? Dask DataFrame	6	576	February 8, 2023

Dask.compute() not fetching all the results

Related topics