Any recommendation on dealing with ragged data return in dask.delayed

baltun · September 14, 2022, 6:30pm

Do you have any experience in dealing with ragged data return in dask.delayed?

I use dask.distributed and dask.delayed to distribute the workload (see the script below). The function (func) returns ragged data: multiple arrays with different dimensions. Right now I get all the arrays back in a single tuple (results). Is there a nicer way to get the corresponding arrays stacked together, as one would with a Dask DataFrame?

from dask import compute, delayed
from dask.distributed import Client

with Client(cluster) as client:
    delayed_results = [delayed(func)(x_train, y_train, idx) for idx in contexts]
    results = compute(*delayed_results, scheduler="processes")

agoose77 · September 16, 2022, 2:05pm

Hi @baltun,

I’m a developer on the Awkward Array project, and a user of Dask for Physics analyses.

Awkward Array is designed to work with ragged arrays. There’s currently work on a Dask extension that allows the high-level Awkward Array API to run over dask. It’s very early days though, so it might not be easy to use for now.

Could you describe the shapes of your arrays and what shape you want the results to be? Is the result also ragged?
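
In the meantime, here is a minimal sketch of one way to regroup the tuple that compute returns, so that each output field is collected across all tasks. It is plain Python and does not use Awkward Array. The func body, the field names (weights, labels), and the contexts values are hypothetical stand-ins, since the real function was not posted.

```python
# Hypothetical stand-in for func: each call returns two lists whose
# lengths depend on idx, so the combined result is ragged.
def func(idx):
    weights = [0.1 * idx] * (idx + 1)
    labels = list(range(idx))
    return weights, labels

contexts = [1, 2, 3]  # assumed example values

# Same layout as compute(*delayed_results): one (weights, labels) tuple per task.
results = tuple(func(idx) for idx in contexts)

# Transpose the results: one tuple per output field instead of one per task.
all_weights, all_labels = zip(*results)

# Optionally key each field by its context so ragged entries stay identifiable.
weights_by_context = dict(zip(contexts, all_weights))
```

The fields stay as Python sequences of differently sized entries. Whether they can be stacked further depends on the shapes asked about above.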
