Predicting memory use with the Dask distributed scheduler

I have a Dask custom graph like {'A': dataframe, 'B': (func, 'A'), 'C': (func, 'B'), …}.
The graph has a lot of tasks, so to run it faster I use Dask on Ray.
Ray is only used for memory and object storage; Dask does the scheduling.
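
Roughly, the setup looks like this; the function and data below are placeholders (the real objects are ~1000 MB each), and in my case the graph is executed with Ray's Dask scheduler (ray.util.dask.ray_dask_get) instead of dask.get:

```python
import pandas as pd
import dask

def func(df):
    # placeholder for the real transformation
    return df * 2

dataframe = pd.DataFrame({"x": range(10)})  # stands in for a ~1000 MB object

# Custom graph of the shape described above: each key depends on the previous one.
dsk = {
    "A": dataframe,
    "B": (func, "A"),
    "C": (func, "B"),
}

# Locally this can be run with the synchronous scheduler;
# with Dask on Ray the same graph is handed to ray.util.dask.ray_dask_get.
result = dask.get(dsk, "C")
```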
Here is my question. I have read Scheduling in Depth in the Dask docs, so I understand how scheduling works in the static case.
But when the graph runs on the distributed scheduler (or on Ray) with many CPUs, how can I predict the memory use?
(Assume we know the size of every object: A, B, … are each 1000 MB.)
Why do I need to predict memory? Because I have to start Ray with enough room to store the objects; if I start Ray with too little memory, it fails.
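
For context, this is roughly what I mean by starting Ray with a given amount of memory; the 8 GB figure is just a placeholder, not the value I actually use:

```python
import ray

# Size the object store up front; if it is too small for the intermediates,
# tasks fail with object-store out-of-memory errors. 8 GB is an arbitrary placeholder.
ray.init(object_store_memory=8 * 1024**3)
```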

How can we predict the total memory needed if we have a Dask custom graph and know every task's memory footprint?
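
The best I have come up with so far is a crude worst-case estimate that just sums every key's size, sketched below; it clearly overestimates, because the scheduler releases intermediates and the real peak depends on execution order and parallelism:

```python
# Known (or estimated) output size of every key, in MB.
sizes_mb = {"A": 1000, "B": 1000, "C": 1000}

# Worst case: every intermediate held in memory at the same time.
upper_bound_mb = sum(sizes_mb.values())
print(f"worst-case peak ~= {upper_bound_mb} MB")
```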

@Asuka Thanks for the question! I don't think there's a way to predict memory use from the task graph alone. Without knowing much about your workflow or system, I'd suggest increasing the number of partitions in your DataFrame and seeing whether that fits your memory capacity better.
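
For example, something along these lines (the sizes and partition counts are hypothetical):

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"x": range(1_000_000)})
ddf = dd.from_pandas(pdf, npartitions=4)

# More partitions means smaller chunks, so each task holds less data in memory at once.
ddf = ddf.repartition(npartitions=16)
```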

Again, I think this question may be better suited for the Ray community.

As an aside, we usually recommend the Worker Memory plots that come with Dask Distributed's Dashboard to diagnose how memory is being managed by the scheduler, so if you switch to the distributed scheduler, you can use those.
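
For instance, with a local cluster the dashboard address is available from the client (a minimal sketch):

```python
from dask.distributed import Client

# Start a local distributed cluster; the dashboard (including the Worker Memory
# plots) is served at client.dashboard_link, typically http://localhost:8787/status.
client = Client()
print(client.dashboard_link)
```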
