I have a computation that operates on slices from a dask array and creates a data frame per slice. I’m currently using delayed and dd.from_delayed (which might be deprecated in the future) to create a data frame. Is there a better approach, either using dd.from_map or another method? Thanks.
import dask.array as da
import dask.dataframe as dd
import numpy as np
import pandas as pd
from dask import delayed
def test_func(x):
    # create a dataframe from slice. In actuality, function is much more complicated
    return pd.DataFrame(data=dict(mean=[np.mean(x)]))

array = da.random.random((100, 100), chunks=(10, 10))
slices = [(slice(0, 2), slice(10, 12)), (slice(10, 20), slice(1, 12))]

delayed_results = []
for sl in slices:
    delayed_results.append(delayed(test_func)(array[sl]))
df = dd.from_delayed(delayed_results).compute()
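Something like this is what I have in mind for from_map (a minimal sketch; slice_to_df and the explicit meta are just illustrative, and it assumes computing a small slice inside the mapped function is acceptable, as it is with the default threaded scheduler):

def slice_to_df(sl):
    # materialize the small slice to NumPy inside the task, then build one partition
    return test_func(array[sl].compute())

# meta describes the output schema so from_map does not have to infer it
meta = pd.DataFrame(dict(mean=pd.Series(dtype="float64")))
df = dd.from_map(slice_to_df, slices, meta=meta).compute()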
Hi @joshua-gould,
I guess the real question is: why might from_delayed be deprecated in the future? Maybe @fjetter or @crusaderky have an answer for this?
I find your approach nice for this problem…
This code creates a large dask graph. Is there a way to accomplish this without creating a large graph? Thanks.
Which code? Your example above is not creating a large graph, is it?
This pattern does create a large graph if you make the dask array larger and extract more slices:
import dask.array as da
import dask.dataframe as dd
import numpy as np
import pandas as pd
from dask import delayed
from distributed import Client

def test_func(x):
    # create a dataframe from slice. In actuality, function is much more complicated
    return pd.DataFrame(data=dict(mean=[np.mean(x)]))

with Client() as client:
    delayed_results = []
    array = da.random.random((1000, 1000), chunks=(10, 10))
    for sl in da.core.slices_from_chunks(array.chunks):
        delayed_results.append(delayed(test_func)(array[sl]))
    df = dd.from_delayed(delayed_results).compute()

# UserWarning: Sending large graph of size 33.94 MiB.
# This may cause some slowdown.
# Consider loading the data with Dask directly
# or using futures or delayed objects to embed the data into the graph without repetition.
OK, I tried your code and got the warning. I don’t think there is a real problem in the graph (no data inside). I would probably have iterated on the array partitions or used to_delayed instead of slicing, but that is probably equivalent in number of tasks.
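For instance, the to_delayed variant I have in mind would be something like this sketch, which still yields one task per chunk:

delayed_results = [delayed(test_func)(block) for block in array.to_delayed().ravel()]
df = dd.from_delayed(delayed_results).compute()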
In the end, it just results in a graph with 50,000 tasks, which is probably the cause of the warning. This is not that bad if you’re sure of what you’re doing. The only advice, and it is always correct advice, is to try to reduce the number of tasks, and thus here probably the number of chunks.
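For example, a quick sketch of that advice: making the chunks 10× larger in each dimension reduces the number of slices, and thus tasks, by a factor of 100.

array = da.random.random((1000, 1000), chunks=(100, 100))
# 100 slices instead of 10,000, so the graph shrinks accordingly
print(len(da.core.slices_from_chunks(array.chunks)))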