Optimal Chunking for Random Access

I'm planning to create a 12 TB Dask array (to be saved through xarray and Zarr) and I'm thinking about the optimal chunking. The main operation I'll be conducting, and the one I need to run as close to I/O speed as possible, is random sampling over the chunked axis (of length ~1 billion). Will Dask load an entire chunk from disk into memory to access a single element? If so, doesn't that imply I should have one chunk for every element along that axis (and thus 1 billion chunks)? That is obviously inefficient for other compute operations, but I care about random access speed above everything else.

P.S.
This is a bit of an oversimplification: the actual array will be of shape roughly (3E6, 256, 3072), which I'll reshape to (1E9, 3072) and then randomly sample over the first dimension. The one-chunk-per-element chunking is (3E6, 256, -1).
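
For concreteness, the access pattern I have in mind looks roughly like this (a scaled-down sketch; the sizes, dtype, and chunking here are placeholders):

```python
# Scaled-down sketch of the access pattern; sizes, dtype, and chunking here
# are placeholders (the real array is roughly (3E6, 256, 3072)).
import dask.array as da
import numpy as np

arr = da.random.random((3_000, 64, 32), chunks=(100, 64, -1)).astype("float32")

# Reshape to (samples, features), then pull a random batch of rows.
flat = arr.reshape((-1, arr.shape[-1]))
idx = np.random.default_rng(0).integers(0, flat.shape[0], size=8)
batch = flat[idx].compute()  # each selected row drags in the chunk(s) holding it
```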

Hi @muchanem, welcome to the Dask community!

Yes, it will: accessing a single element loads the whole chunk that contains it.

No, it doesn't. 3072 elements is a really small chunk, only a few kilobytes; you won't gain anything that way no matter your storage backend, and Dask won't be able to handle 1 billion chunks.

Even if you only do random samples, it can help to have several rows of the array loaded into memory at once. In any case, chunks should be at least a few MiB, up to about one hundred MiB in most cases. I would also advise against having more than 100,000 chunks. So you can run some trials, but I would aim for at least one MiB per chunk.
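
For example, you can sanity check a candidate chunking on paper before writing anything; a quick sketch, assuming a float32 dtype:

```python
# Back-of-the-envelope check of a candidate chunking for a (3E6, 256, 3072)
# array; float32 is assumed here, adjust the itemsize to your actual dtype.
import math
import numpy as np

shape = (3_000_000, 256, 3072)
chunk = (30, 256, 3072)
itemsize = np.dtype("float32").itemsize

mib_per_chunk = math.prod(chunk) * itemsize / 2**20
n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunk))

print(f"{mib_per_chunk:.0f} MiB per chunk, {n_chunks:,} chunks")
# -> 90 MiB per chunk, 100,000 chunks: right at both limits above,
#    which is why some trials around these numbers are worth running.
```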


I’ve been doing more reading about zarr and dask and I’ve seen the suggestion of using many zarr chunks for each dask chunk. My plan now is >100,000 zarr chunks but only ~1,500 dask chunks over the first axis. Are there any obvious pitfalls in this approach? (although maybe a zarr forum is a more appropriate place for this question)
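
Concretely, the layout I have in mind would look something like this (just a sketch with placeholder names and much smaller sizes; the Zarr chunks come from the encoding argument and the Dask chunks from the chunks argument when reopening):

```python
# Sketch of the "many Zarr chunks per Dask chunk" layout, scaled down so it
# actually runs; the path and dimension names are placeholders.
import dask.array as da
import xarray as xr

arr = da.random.random((6_000, 16, 64), chunks=(2_000, 16, -1)).astype("float32")
ds = xr.Dataset({"data": (("sample", "channel", "feature"), arr)})

# Zarr chunks are set via encoding and are smaller than the Dask chunks:
# each Dask chunk of 2000 samples maps onto 100 aligned Zarr chunks of 20.
ds.to_zarr(
    "layout_test.zarr",
    mode="w",
    encoding={"data": {"chunks": (20, 16, 64)}},
)

# Reopen with Dask chunks that are whole multiples of the Zarr chunks, so each
# Dask task reads a contiguous block of Zarr chunks.
reopened = xr.open_zarr(
    "layout_test.zarr",
    chunks={"sample": 2_000, "channel": 16, "feature": 64},
)
print(reopened["data"].data.chunksize)  # (2000, 16, 64)
```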

There shouldn't be with Dask; I'm not sure how it works on the Zarr side, though.