Rechunking with balance=True does not result in balanced chunks

I’m trying to figure out why the rechunk() method, with balance=True, does not lead to balanced chunks when it’s pretty obvious how to balance them.

import numpy as np
import dask.array as da

# 100 million random integers, arranged as a 10000 x 10000 array
numbers = np.random.randint(
    low=2, high=200, size=int(1e8), dtype=np.int16).reshape((10000, 10000))

d_numbers = da.from_array(numbers).rechunk(balance=True)
>>> d_numbers.chunks
((8192, 1808), (8192, 1808))

In some cases I even get a UserWarning: "chunk size balancing not possible with given chunks". This happens even though the chunks can very clearly be balanced with:

>>> d_numbers.rechunk((5000,5000)).chunks
((5000, 5000), (5000, 5000))

I’m teaching dask in a Python course, so I’m looking not just for a solution to get balanced chunks (I have that, above) but a way to explain why balance=True is not working here. Thanks!

Hi @arthur-e, welcome to the Dask Discourse forum,

Well, I think this all comes down to this line: dask/dask/array/rechunk.py at main · dask/dask · GitHub.

You can try to follow the code; it is not that hard. I'm not sure why, but I think you are ending up in an edge case caused by your initially unbalanced chunk sizes and the small number of chunks.
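
To make that kind of edge case concrete, here is a simplified, pure-Python sketch of a balancing heuristic. It is an illustration written for this thread, not a copy of the Dask source: the names `split_by_length` and `balance_sketch`, and the exact rule for picking the target number of chunks, are assumptions. The idea is that the heuristic lowers the target chunk count by one when the smallest chunk is at most half the largest, then searches chunk lengths around the median for a layout with exactly that many chunks.

```python
import statistics
import warnings


def split_by_length(total, chunk_len):
    """Split `total` into chunks of `chunk_len`, plus a remainder chunk if needed."""
    full, leftover = divmod(total, chunk_len)
    return (chunk_len,) * full + ((leftover,) if leftover else ())


def balance_sketch(chunks):
    """Hypothetical balancing heuristic for one dimension (an assumption, not Dask's API)."""
    median_len = int(statistics.median(chunks))
    eps = median_len // 2

    # Assumed rule: a very small trailing chunk is treated as mergeable,
    # so the heuristic aims for one chunk fewer.
    target = len(chunks)
    if min(chunks) <= 0.5 * max(chunks):
        target -= 1

    # Try every chunk length within +/- eps of the median.
    lengths = range(max(1, median_len - eps), median_len + eps + 1)
    candidates = [split_by_length(sum(chunks), n) for n in lengths]
    matching = [c for c in candidates if len(c) == target]

    if not matching:
        warnings.warn("balancing not possible with given chunks")
        return tuple(chunks)

    # Prefer the layout with the smallest spread between chunk sizes.
    return min(matching, key=lambda c: max(c) - min(c))
```

In this sketch, `balance_sketch((8192, 1808))` aims for a single chunk, because 1808 is less than half of 8192. No chunk length between 2500 and 7500 can cover 10000 elements in one chunk, so the function warns and returns the input unchanged. The obvious `(5000, 5000)` layout is generated as a candidate but rejected because it has two chunks rather than one.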

This edge case might just be a bug, and maybe the auto-rebalance logic needs some improvement!
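
In the meantime, one workaround is to compute evenly sized chunks yourself and pass them to rechunk() explicitly, which generalizes the (5000, 5000) call from the question. A minimal sketch follows; `even_chunks` is a hypothetical helper, not part of Dask, and the 8192 limit is simply the largest chunk length seen in the output above.

```python
import math


def even_chunks(length, max_chunk):
    """Split `length` into the fewest chunks no longer than `max_chunk`,
    with chunk sizes differing by at most one element."""
    n_chunks = math.ceil(length / max_chunk)
    base, extra = divmod(length, n_chunks)
    # The first `extra` chunks absorb the remainder, one element each.
    return (base + 1,) * extra + (base,) * (n_chunks - extra)


shape = (10000, 10000)
balanced = tuple(even_chunks(n, 8192) for n in shape)
print(balanced)  # ((5000, 5000), (5000, 5000))
```

The resulting tuple of per-dimension chunk sizes can be passed straight to `d_numbers.rechunk(balanced)`.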