Hello Dask community,
I would like to handle a “larger than memory” dataset to compute an svd_compressed (or a plain svd).
My input dataset is mesh vertex positions extracted per frame from some animations.
A typical dataset shape would be something like (10_000, 1_000_000).
I have tried a lot of things, including using “delayed”, but nothing seems to solve the issue.
My input dataset is preprocessed and written to Zarr format.
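The writing step looks roughly like the sketch below (simplified; `load_frames()` is just a placeholder for my real extraction code, and the sizes/filename are example values):

```python
# Simplified sketch of the preprocessing step. load_frames() is a placeholder
# for the real extraction code; sizes and filename are examples only.
import zarr

frame_count = 10_000
component_count = 1_000_000   # flattened vertex coordinates per frame

store = zarr.open(
    "vertex_positions.zarr",
    mode="w",
    shape=(frame_count, component_count),
    chunks=(250, component_count),
    dtype="float64",
)
for i, frame in enumerate(load_frames()):  # placeholder generator
    store[i, :] = frame.ravel()
```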
Then I call:

```python
import dask.array as da

# one chunk = 250 frames x all components
dask_array = da.from_zarr(zarrFile, chunks=(250, component_count))
U, S, VT = da.linalg.svd_compressed(dask_array, k=1000)
SComputed = S.compute()
UComputed = U.compute()
```
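One variant I am unsure about: computing the factors in a single call, in case the two separate `.compute()` calls rebuild the factorization twice. A minimal sketch of that variant (`zarrFile` and `component_count` are the same placeholders as above):

```python
import dask
import dask.array as da

dask_array = da.from_zarr(zarrFile, chunks=(250, component_count))
U, S, VT = da.linalg.svd_compressed(dask_array, k=1000)

# Compute both factors in one pass instead of S.compute() then U.compute(),
# so the shared part of the graph is only executed once.
UComputed, SComputed = dask.compute(U, S)
```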
Sadly, whatever chunk size I choose (or even with no explicit chunking), the memory usage goes well beyond what I expect.
In basic tests on a smaller dataset (8000 x 75262), the process memory usage still peaks around 16 GB during the compute, while the U output is only 602 MB and a single chunk is 150 MB (in float64 precision).
With the expected dataset (10k x 1M), trying to compute S only:
- a single chunk is 2 GB
- the memory usage skyrockets far above my laptop’s 64 GB right after calling compute. This happens with both svd() and svd_compressed()
My two goals are:
- to be able to compute the SVD
- to ensure that the memory usage stays within a given margin (see the sketch just after this list)
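For the second goal, I am wondering whether I should run on an explicit local cluster with a per-worker memory limit, so that Dask spills to disk instead of growing unbounded. A sketch of what I have in mind (worker count and limit are example values only, not something I have validated):

```python
from dask.distributed import Client, LocalCluster

# Example values only: 4 workers x 12 GB keeps the total budget below 64 GB,
# and intermediate results above the limit should spill to disk.
cluster = LocalCluster(n_workers=4, threads_per_worker=1, memory_limit="12GB")
client = Client(cluster)
```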
After looking into how an out-of-core PCA is usually computed, I modified the chunk sizes so they are spread along both dimensions. It improved the memory usage a little, but not dramatically.
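Concretely, the two-dimensional chunking I tried looks roughly like the sketch below (the block sizes here are examples, not the exact values from my runs):

```python
import dask.array as da

dask_array = da.from_zarr(zarrFile)
# Split along both axes so no block spans a full row of ~1M components;
# 500 x 100_000 float64 values is about 400 MB per block (example sizes).
dask_array = dask_array.rechunk((500, 100_000))
U, S, VT = da.linalg.svd_compressed(dask_array, k=1000)
```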
During the SVD computation I also end up with warnings such as:
OpenBLAS warning: precompiled NUM_THREADS exceeded (even with a maximum set to 32 in my environment variables)
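One thing I have not tried systematically is forcing single-threaded BLAS before anything is imported, rather than just raising the limit. A sketch of that idea (I am not sure I am setting the variables early enough in my current script):

```python
# Sketch: force single-threaded BLAS before numpy/dask are imported, since
# each Dask worker thread can otherwise spawn its own BLAS thread pool.
import os
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import dask.array as da  # only imported after the limits are set
```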
What am I doing wrong and what can I do to reach these two goals?
Thank you!