I have a computation that operates on slices from a dask array and creates a data frame per slice. I’m currently using delayed and dd.from_delayed (which might be deprecated in the future) to create a data frame. Is there a better approach, either using dd.from_map or another method? Thanks.
import dask.array as da
import dask.dataframe as dd
import numpy as np
import pandas as pd
from dask import delayed
def test_func(x):
    # create a dataframe from slice. In actuality, function is much more complicated
    return pd.DataFrame(data=dict(mean=[np.mean(x)]))

array = da.random.random((100, 100), chunks=(10, 10))
slices = [(slice(0, 2), slice(10, 12)), (slice(10, 20), slice(1, 12))]

delayed_results = []
for sl in slices:
    delayed_results.append(delayed(test_func)(array[sl]))
df = dd.from_delayed(delayed_results).compute()
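Something like this is what I have in mind for from_map (a minimal sketch; slice_to_df and the explicit meta are just illustrative, and it assumes computing a small slice inside the mapped function is acceptable, as it is with the default threaded scheduler):

def slice_to_df(sl):
    # materialize the small slice to NumPy inside the task, then build one partition
    return test_func(array[sl].compute())

# meta describes the output schema so from_map does not have to infer it
meta = pd.DataFrame(dict(mean=pd.Series(dtype="float64")))
df = dd.from_map(slice_to_df, slices, meta=meta).compute()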
Hi @joshua-gould,
I guess the real question is: why might from_delayed be deprecated in the future? Maybe @fjetter or @crusaderky have an answer for this?
I find your approach nice for this problem…
This code creates a large dask graph. Is there a way to accomplish this without creating a large graph? Thanks.
Which code? Your example above is not creating a large graph, is it?
This pattern does create a large graph if you make the dask array larger and extract more slices:
import dask.array as da
import dask.dataframe as dd
import numpy as np
import pandas as pd
from dask import delayed
from distributed import Client

def test_func(x):
    # create a dataframe from slice. In actuality, function is much more complicated
    return pd.DataFrame(data=dict(mean=[np.mean(x)]))

with Client() as client:
    delayed_results = []
    array = da.random.random((1000, 1000), chunks=(10, 10))
    for sl in da.core.slices_from_chunks(array.chunks):
        delayed_results.append(delayed(test_func)(array[sl]))
    df = dd.from_delayed(delayed_results).compute()

# UserWarning: Sending large graph of size 33.94 MiB.
# This may cause some slowdown.
# Consider loading the data with Dask directly
# or using futures or delayed objects to embed the data into the graph without repetition.
OK, I tried your code and got the warning. I don’t think there is a real problem in the graph (no data inside). I would probably have iterated on the array partitions or used to_delayed instead of slicing, but that is probably equivalent in number of tasks.
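For instance, the to_delayed variant I have in mind would be something like this sketch, which still yields one task per chunk:

delayed_results = [delayed(test_func)(block) for block in array.to_delayed().ravel()]
df = dd.from_delayed(delayed_results).compute()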
In the end, it just results in a graph with 50,000 tasks, which is probably the cause of the warning. This is not that bad if you’re sure of what you’re doing. The only advice, and it is always correct advice, is to try to reduce the number of tasks, and thus here probably the number of chunks.
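For example, a quick sketch of that advice: making the chunks 10× larger in each dimension reduces the number of slices, and thus tasks, by a factor of 100.

array = da.random.random((1000, 1000), chunks=(100, 100))
# 100 slices instead of 10,000, so the graph shrinks accordingly
print(len(da.core.slices_from_chunks(array.chunks)))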