SVD on a large dataset (doesn't fit in memory), M and N of the same order

Hello Dask community,

I would like to handle a “larger than memory” dataset to compute a svd_compressed or a svd.
My input dataset consists of mesh vertex positions extracted “per frame” from some animations.

A typical dataset shape could be something like (10_000, 1_000_000).

I have tried a lot of things, including using “delayed”, but nothing seems to solve the issue.
My input dataset is preprocessed and stored in Zarr format.
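For context, the preprocessing step looks roughly like this (the array below is only a stand-in for the real per-frame vertex data, and the file name is illustrative):

import dask.array as da

# Stand-in for the real per-frame vertex positions:
# one row per frame, one column per vertex coordinate
positions = da.random.random((10_000, 1_000_000), chunks=(250, 1_000_000))
da.to_zarr(positions, "positions.zarr", overwrite=True)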

Then I call:

import dask.array as da

# zarrFile: path to the Zarr store; component_count: number of columns in the dataset
dask_array = da.from_zarr(zarrFile, chunks=(250, component_count))
U, S, VT = da.linalg.svd_compressed(dask_array, 1000)
SComputed = S.compute()
UComputed = U.compute()

Sadly, whatever chunk size I choose (or even with no explicit chunking), the memory used goes well beyond what I expect.

In some basic tests on a smaller dataset (8000 x 75262), the process memory usage still goes as high as 16 GB during the compute, while the U output is 602 MB and a chunk is 150 MB (in float64 precision).

With the expected dataset (10k, 1M), trying to compute S only:

  • a chunk is 2 GB
  • the memory usage skyrockets way above my laptop's 64 GB right after calling compute; this happens with both svd() and svd_compressed()

My two goals are:

  • to be able to compute the SVD
  • to ensure that memory usage stays within a given margin

After investigating how a PCA is computed out of core, I modified the chunk sizes to be spread across both dimensions, as sketched below. It improved the memory usage a little, but not dramatically.
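Concretely, something like this (the chunk sizes here are only illustrative, not the exact values I used):

import dask.array as da

# Chunk across both rows and columns instead of keeping full rows in each chunk
# (1_000 x 10_000 float64 values is about 80 MB per chunk)
dask_array = da.from_zarr(zarrFile, chunks=(1_000, 10_000))
U, S, VT = da.linalg.svd_compressed(dask_array, 1000)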

During the SVD computation I also end up with errors such as:
OpenBLAS warning: precompiled NUM_THREADS exceeded (even with a maximum of 32 set in my environment variables)
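For reference, the cap is set through environment variables; a minimal sketch of the Python equivalent (the variables must be set before numpy/dask is first imported, otherwise OpenBLAS keeps its compiled-in default):

import os

# Must be set before numpy (and therefore dask) is imported,
# otherwise OpenBLAS ignores the new thread limit
os.environ["OPENBLAS_NUM_THREADS"] = "32"
os.environ["OMP_NUM_THREADS"] = "32"

import dask.array as da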

What am I doing wrong and what can I do to reach these two goals?
Thank you !

Hi @Sebastien_Maraux, welcome to the Dask community!

Do you think you could try to build a reproducer with randomly generated data?

Also, did you try to use a LocalCluster and look at its dashboard to get more insight into what is happening?
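For example, something along these lines (the worker count and memory limit are just placeholders to adapt to your machine):

from dask.distributed import Client, LocalCluster

# Placeholder sizes: adapt the number of workers and memory limit to your machine
cluster = LocalCluster(n_workers=4, threads_per_worker=2, memory_limit="8GiB")
client = Client(cluster)
print(client.dashboard_link)  # open this URL in a browser to watch memory and task progress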

Hello Guillaume, and thank you for taking the time to help!

Here is a really minimal case. With these dimensions, I would expect that:

  • the dataset weighs 80 GB
  • a chunk is 8 MB
  • the algorithm probably needs at least one row and one column of chunks, i.e. 200 chunks or 1.6 GB, to compute some of the matrix elements
  • the S requested as output is quite small (1k elements) and should not hurt memory

The little code below ends up using more than 20 GB of memory (I stop it before it overflows), when I would expect it to stay within much more reasonable bounds.

How could I get this to work within a constrained amount of memory (ideally one that I can choose)?

import dask.array as da

x = da.random.random(size=(10_000, 1_000_000), chunks=(100, 10_000))
# Note: the standard SVD would fail because the array is neither tall-and-skinny nor short-and-fat
u, s, v = da.linalg.svd_compressed(x, k=1000, compute=False)
print(s.compute())
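As a sanity check on the numbers above, the array metadata can be inspected without computing anything:

print(x.nbytes / 1e9)   # ~80 GB in total
print(x.chunksize)      # (100, 10000): 100 * 10_000 * 8 bytes = 8 MB per chunk
print(x.npartitions)    # 100 * 100 = 10_000 chunks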

I just tried running this code with a LocalCluster, and yes, it loads a lot of data into memory (which gets spilled to disk thanks to the LocalCluster). As I don't know how the distributed SVD is implemented under the hood, I'm not sure how to optimize it.

Anyway, adding a LocalCluster will prevent memory from being overloaded, but at the cost of disk space and efficiency:

from dask.distributed import Client, LocalCluster

client = Client(
    LocalCluster(n_workers=2, threads_per_worker=2, memory_limit="4GiB"),
    set_as_default=True,
)

I tried it, and tried again from a freshly launched VSCode; it complains about a port already in use and fails (I get this log 4 times for a single launch). I have no other Python process running.

2025-02-07 17:41:12,361 - distributed.nanny - ERROR - Failed to start process
Traceback (most recent call last):
  File "c:\Python311\Lib\site-packages\distributed\nanny.py", line 452, in instantiate
    result = await self.process.start()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Python311\Lib\site-packages\distributed\nanny.py", line 750, in start
    await self.process.start()
  File "c:\Python311\Lib\site-packages\distributed\process.py", line 55, in _call_and_set_future
    res = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "c:\Python311\Lib\site-packages\distributed\process.py", line 215, in _start
    process.start()
  File "c:\Python311\Lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
                  ^^^^^^^^^^^^^^^^^
  File "c:\Python311\Lib\multiprocessing\context.py", line 336, in _Popen
    return Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^
  File "c:\Python311\Lib\multiprocessing\popen_spawn_win32.py", line 46, in __init__
    prep_data = spawn.get_preparation_data(process_obj._name)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Python311\Lib\multiprocessing\spawn.py", line 164, in get_preparation_data
    _check_not_importing_main()
  File "c:\Python311\Lib\multiprocessing\spawn.py", line 140, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError:
    An attempt has been made to start a new process before the
    current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

    To fix this issue, refer to the "Safe importing of main module"
    section in https://docs.python.org/3/library/multiprocessing.html

Probably due to the way VSCode handles processes. You should try in a plain Jupyter notebook.

I get these errors even when launching directly from the command line with python. I will need to launch the code that way, so this is a blocker for me. Any chance the Python version I use (3.11) could be the issue?

In case it is worth mentioning :slight_smile: here are the modules I could install with the current pip 25.0 wheels:
Dask 2025.1.0
Dask-ml 2024.4.4

All this is on Windows 11.

Well, this is not due to the Python or Dask version then, but probably more to your OS configuration. You can probably use the no_nanny option of LocalCluster, but I guess there should be a better solution.
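One more note: the traceback above is the standard Windows multiprocessing “spawn” error, and the error message itself points at the usual fix: when running as a script, the cluster creation has to sit under an if __name__ == '__main__': guard. A minimal sketch (worker counts and limits are just placeholders):

from dask.distributed import Client, LocalCluster
import dask.array as da

def main():
    # Memory-limited local cluster; the sizes here are placeholders
    cluster = LocalCluster(n_workers=2, threads_per_worker=2, memory_limit="4GiB")
    client = Client(cluster, set_as_default=True)

    x = da.random.random(size=(10_000, 1_000_000), chunks=(100, 10_000))
    u, s, v = da.linalg.svd_compressed(x, k=1000, compute=False)
    print(s.compute())

    client.close()
    cluster.close()

# On Windows, worker processes are started with "spawn", which re-imports this
# module; the guard keeps the cluster from being created again in every child.
if __name__ == "__main__":
    main()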