I am planning on creating a 12TB dask array (to be saved through xarray and zarr) and I’m thinking about the optimal chunking. The main operation I’ll be performing, and which I need to be as close to I/O speed as possible, is taking random samples over the chunked axis (of length ~1 billion). Will dask load an entire chunk from disk into memory to access a single element? If so, doesn’t that imply I should have one chunk for every element along that axis (and thus ~1 billion chunks)? This is obviously inefficient for other compute operations, but I care about random access speed above everything else.
P.S.
This is a bit of an oversimplification: the actual array will have shape roughly (3E6, 256, 3072), and I’ll reshape it to (1E9, 3072) and randomly sample over that first dimension. The one-chunk-per-element chunking would be (3E6, 256, -1).
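Concretely, the access pattern looks something like this (toy sizes and placeholder names, and I’m assuming float32 here):

```python
import dask.array as da
import numpy as np

# Toy stand-in for the real array: the real shape is roughly (3E6, 256, 3072).
x = da.zeros((3_000, 256, 3072), dtype="float32", chunks=(1, 256, 3072))

# Flatten the first two axes and sample random rows over the resulting axis.
flat = x.reshape(-1, 3072)
rng = np.random.default_rng(0)
idx = rng.integers(0, flat.shape[0], size=1_024)

# With a real on-disk array, each sampled row pulls in the chunk that contains it.
batch = flat[idx].compute()
```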
Hi @muchanem, welcome to the Dask community!
Yes, it will: Dask loads the entire chunk to access a single element.
No, it doesn’t. A chunk of 3072 elements is really small, only a matter of kilobytes; you won’t gain anything by doing that no matter your storage backend, and Dask won’t be able to handle 1 billion chunks.
Even if you do random samples, it might help to have several rows of the array loaded in memory at once. In any case, chunks should be at least a few MiB, and up to about one hundred MiB in most cases. I would also advise against having more than 100,000 chunks. So you can run some trials, but I would aim for at least one MiB per chunk.
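To make that concrete, here is a quick back-of-the-envelope calculation (assuming float32, so 4 bytes per element; adjust for your dtype):

```python
# Chunk size and chunk count for a few candidate chunkings of a (3E6, 256, 3072) array.
itemsize = 4  # float32
shape = (3_000_000, 256, 3072)

for rows_per_chunk in (1, 8, 32, 128):
    chunk_bytes = rows_per_chunk * shape[1] * shape[2] * itemsize
    n_chunks = shape[0] // rows_per_chunk
    print(f"chunks=({rows_per_chunk}, 256, -1): "
          f"{chunk_bytes / 2**20:>6,.0f} MiB per chunk, {n_chunks:>9,} chunks")
```

Something around 32 rows per chunk, for example, gives roughly 100 MiB chunks while keeping the total count under 100,000.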
I’ve been doing more reading about zarr and dask and I’ve seen the suggestion of using many zarr chunks for each dask chunk. My plan now is >100,000 zarr chunks but only ~1,500 dask chunks over the first axis. Are there any obvious pitfalls in this approach? (although maybe a zarr forum is a more appropriate place for this question)
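For reference, what I have in mind when loading is roughly this (store path, dimension name, and variable name are placeholders; the on-disk zarr chunks would be written separately with a much smaller size along the first axis):

```python
import xarray as xr

# Ask for Dask chunks that each span many of the small on-disk Zarr chunks,
# so random reads stay cheap while the Dask task graph stays small
# (~3E6 / 2_000 = ~1,500 chunks along the first axis).
ds = xr.open_zarr("embeddings.zarr", chunks={"sample": 2_000})
arr = ds["data"].data  # the underlying dask array
```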
There shouldn’t be with Dask; I’m not sure how it works on the Zarr side, though.