Why does my memory blow up even before any task starts to run?

Sorry for framing this problem rather vaguely; I have not found a way to reproduce it in a concise snippet. My problem is: I am using Dask Array together with numba-optimized code. The current implementation works well on relatively small-scale data, but when I increase the problem scale, my code cannot even run.

Looking at the Dask dashboard, I can see that memory blows up even before any task starts to run. I work intensively with NumPy on a single machine with 500 GB of RAM, so I set `Client(processes=False)` when configuring my cluster. I assumed there would be no overhead from copying data around between processes, since I am only launching a single process here.

What are the possible reasons for this behavior? I don't think it is my chunk size, which I have already tried setting large enough. Thanks in advance :slight_smile: Any suggestion is appreciated!

It's really hard to tell without at least a little code snippet: just how you read the data, which API you are using, the first few lines of your Dask code…

It might be that you are reading the data client side and sending it to the workers through the graph, or maybe the workers are reading too much data at once, but you should be able to see that through the dashboard…
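
To make the distinction concrete, here is a hypothetical sketch of the two patterns (the file names, shapes, and the `np.load` reader are just placeholders, since we haven't seen your code):

```python
import dask
import dask.array as da
import numpy as np

# Hypothetical anti-pattern: the data is loaded on the client first, so the
# resulting large NumPy array is referenced by the task graph and has to be
# held in client memory before any compute task runs.
big = np.load("part_000.npy")                # placeholder file name
x_eager = da.from_array(big, chunks="auto")

# Lazier alternative: wrap the reading itself in the graph, so each piece is
# only loaded when the task that needs it actually runs.
lazy_parts = [
    da.from_delayed(
        dask.delayed(np.load)(f"part_{i:03d}.npy"),  # placeholder file names
        shape=(10_000, 50_000),                      # assumed per-file shape
        dtype="float64",
    )
    for i in range(10)
]
x_lazy = da.concatenate(lazy_parts, axis=0)
```

If your real code looks more like the first pattern, the memory you see before any task runs is probably the data being materialized on the client as part of the graph; switching to the second pattern, or to a native lazy reader such as `da.from_zarr` or `da.from_npy_stack`, usually helps.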