We are currently seeing very high memory usage from the scheduler when trying to compute things on a large dataframe with around 5000 partitions and 5000 columns.
I tried to build the example below to replicate the behavior:
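The original snippet is not included here; below is a minimal sketch of the kind of reproduction described, assuming the `nb_paths` partitions come from delayed loads of frames that all share the same schema. The `load_partition` and `run` helpers and the local `Client()` are illustrative, not from the original report.

```python
import dask
import dask.dataframe as dd
import numpy as np
import pandas as pd
from dask.distributed import Client


def load_partition(nb_rows: int, nb_columns: int) -> pd.DataFrame:
    # Stand-in for reading one file/path.
    return pd.DataFrame(np.random.random((nb_rows, nb_columns)))


def run(nb_paths: int, nb_columns: int, nb_rows: int = 100) -> None:
    client = Client()  # local cluster, for illustration only
    parts = [dask.delayed(load_partition)(nb_rows, nb_columns) for _ in range(nb_paths)]
    meta = load_partition(0, nb_columns)  # every partition has the same schema
    df = dd.from_delayed(parts, meta=meta)
    df.sum().compute()  # scheduler memory is observed externally while this runs
    client.close()


if __name__ == "__main__":
    run(nb_paths=1000, nb_columns=1000)
```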
Trying a few values for nb_paths and nb_columns, we observed the following memory usage:
| Number of partitions | Number of columns | Scheduler memory usage (MiB) |
|---|---|---|
| 10 | 10 | 212 |
| 100 | 10 | 214 |
| 1000 | 10 | 232 |
| 10000 | 10 | 398 |
| 10 | 100 | 214 |
| 10 | 1000 | 215 |
| 10 | 10000 | 245 |
| 100 | 100 | 214 |
| 1000 | 1000 | 232 |
| 10000 | 10000 | 3500 |
It looks like memory usage scales multiplicatively with n_partitions * n_columns. Is it possible that the scheduler keeps one meta in memory for each partition that needs to be computed, even though they are all identical?
I haven’t tried the code or looked into it in detail yet.
But yes, the scheduler tracks the tasks that need to be computed or are being computed, and their number is usually proportional to the number of partitions.
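For reference, here is one quick way to see how many tasks the scheduler is tracking at a given moment, assuming a `dask.distributed` `Client` is connected to the cluster in question (the `Client()` setup below is illustrative):

```python
from dask.distributed import Client

client = Client()  # for illustration; connect to your own cluster instead

# Scheduler.tasks maps task keys to TaskState objects; its size usually
# grows with the number of partitions in the collections being computed.
n_tasks = client.run_on_scheduler(lambda dask_scheduler: len(dask_scheduler.tasks))
print(f"Tasks currently tracked by the scheduler: {n_tasks}")
```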
I understand that the scheduler needs to keep the tasks it knows about in memory.
But if all these tasks have results with the same meta, do they all use the same object in memory to store this information, or does each of them have its own copy of the meta object?
In the first case, we would need only one object in memory for all tasks, while in the second, we would need n_partitions objects.
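To make the distinction concrete, here is a small standalone illustration (plain Python, not dask internals) of the two situations: n_partitions references to one shared meta object versus n_partitions independent copies.

```python
import pandas as pd

# One empty frame describing the schema (the "meta").
meta = pd.DataFrame({f"col_{i}": pd.Series(dtype="float64") for i in range(1000)})

n_partitions = 1000

# Case 1: every task references the same object -> one meta in memory.
shared = [meta] * n_partitions
print(len({id(m) for m in shared}))   # 1 distinct object

# Case 2: every task carries its own copy -> n_partitions metas in memory.
copied = [meta.copy() for _ in range(n_partitions)]
print(len({id(m) for m in copied}))   # 1000 distinct objects
```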