SVD on a large dataset (doesn't fit in memory), M and N of the same order

Hello Dask community,

I would like to handle a “larger than memory” dataset to compute a svd_compressed or a svd.
My input dataset consists of mesh vertex positions extracted “per frame” from some animations.

A typical dataset shape could be something like (10_000, 1_000_000).

I have tried a lot of things, including using “delayed”, but nothing seems to solve the issue.
My input dataset is preprocessed and stored in Zarr format.
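For context, the preprocessing step looks roughly like this (the array below is only a stand-in for the real per-frame vertex data, and the file name is illustrative):

import dask.array as da

# Stand-in for the real per-frame vertex positions:
# one row per frame, one column per vertex coordinate
positions = da.random.random((10_000, 1_000_000), chunks=(250, 1_000_000))
da.to_zarr(positions, "positions.zarr", overwrite=True)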

Then I call:

import dask.array as da

# zarrFile: path to the Zarr store; component_count: number of columns in the dataset
dask_array = da.from_zarr(zarrFile, chunks=(250, component_count))
U, S, VT = da.linalg.svd_compressed(dask_array, 1000)
SComputed = S.compute()
UComputed = U.compute()

Sadly, whatever chunk size I choose (or even with no explicit chunking), the memory used goes well beyond what I expect.

In some basic tests on a smaller dataset (8000 x 75262), the process memory usage still goes as high as 16 GB during the compute, while the U output is 602 MB and a chunk is 150 MB (in float64 precision).

With the expected dataset (10k, 1M), trying to compute S only:

  • a chunk is 2 GB
  • the memory usage skyrockets way above my laptop's 64 GB right after calling compute; this happens with both svd() and svd_compressed()

My two goals are:

  • to be able to compute the SVD
  • to ensure that memory usage stays within a given margin

After investigating how a PCA is computed out of core, I modified the chunk sizes to be spread across both dimensions, as sketched below. It improved the memory usage a little, but not dramatically.
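Concretely, something like this (the chunk sizes here are only illustrative, not the exact values I used):

import dask.array as da

# Chunk across both rows and columns instead of keeping full rows in each chunk
# (1_000 x 10_000 float64 values is about 80 MB per chunk)
dask_array = da.from_zarr(zarrFile, chunks=(1_000, 10_000))
U, S, VT = da.linalg.svd_compressed(dask_array, 1000)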

During the SVD computation I also end up with errors such as:
OpenBLAS warning: precompiled NUM_THREADS exceeded (even with a maximum of 32 set in my environment variables)
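For reference, the cap is set through environment variables; a minimal sketch of the Python equivalent (the variables must be set before numpy/dask is first imported, otherwise OpenBLAS keeps its compiled-in default):

import os

# Must be set before numpy (and therefore dask) is imported,
# otherwise OpenBLAS ignores the new thread limit
os.environ["OPENBLAS_NUM_THREADS"] = "32"
os.environ["OMP_NUM_THREADS"] = "32"

import dask.array as da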

What am I doing wrong and what can I do to reach these two goals?
Thank you !

Hi @Sebastien_Maraux, welcome to the Dask community!

Do you think you could try to build a reproducer with randomly generated data?

Also, did you try to use a LocalCluster and look at its dashboard to get more insight into what is happening?
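For example, something along these lines (the worker count and memory limit are just placeholders to adapt to your machine):

from dask.distributed import Client, LocalCluster

# Placeholder sizes: adapt the number of workers and memory limit to your machine
cluster = LocalCluster(n_workers=4, threads_per_worker=2, memory_limit="8GiB")
client = Client(cluster)
print(client.dashboard_link)  # open this URL in a browser to watch memory and task progress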

Hello Guillaume, and thank you for taking the time to help!

Here is a really minimal case. With these dimensions, I would expect that:

  • the dataset weighs 80 GB
  • a chunk is 8 MB
  • the algorithm probably needs at least one row and one column of chunks, i.e. 200 chunks or 1.6 GB, to compute some of the matrix elements
  • the S requested as output is quite small (1k elements) and should not hurt memory

The little code below ends up using more than 20 GB of memory (I stop it before it overflows), when I would expect it to stay within much more reasonable bounds.

How could I get this to work within a constrained amount of memory (ideally one that I can choose)?

import dask.array as da

x = da.random.random(size=(10_000, 1_000_000), chunks=(100, 10_000))
# Note: the standard SVD would fail because the array is neither tall-and-skinny nor short-and-fat
u, s, v = da.linalg.svd_compressed(x, k=1000, compute=False)
print(s.compute())
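As a sanity check on the numbers above, the array metadata can be inspected without computing anything:

print(x.nbytes / 1e9)   # ~80 GB in total
print(x.chunksize)      # (100, 10000): 100 * 10_000 * 8 bytes = 8 MB per chunk
print(x.npartitions)    # 100 * 100 = 10_000 chunks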

I just tried running this code with a LocalCluster, and yes, it loads a lot of data into memory (which gets spilled to disk thanks to the LocalCluster). As I don't know how the distributed SVD is implemented under the hood, I'm not sure how to optimize it.

Anyway, adding a LocalCluster will prevent memory from being overloaded, but at the cost of disk space and efficiency:

from dask.distributed import Client, LocalCluster

client = Client(
    LocalCluster(n_workers=2, threads_per_worker=2, memory_limit="4GiB"),
    set_as_default=True,
)

I tried it, and tried again from a freshly launched VSCode; it complains about a port already in use and fails (I get this log 4 times for a single launch). I have no other Python process running.

2025-02-07 17:41:12,361 - distributed.nanny - ERROR - Failed to start process
Traceback (most recent call last):
  File "c:\Python311\Lib\site-packages\distributed\nanny.py", line 452, in instantiate
    result = await self.process.start()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Python311\Lib\site-packages\distributed\nanny.py", line 750, in start
    await self.process.start()
  File "c:\Python311\Lib\site-packages\distributed\process.py", line 55, in _call_and_set_future
    res = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "c:\Python311\Lib\site-packages\distributed\process.py", line 215, in _start
    process.start()
  File "c:\Python311\Lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
                  ^^^^^^^^^^^^^^^^^
  File "c:\Python311\Lib\multiprocessing\context.py", line 336, in _Popen
    return Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^
  File "c:\Python311\Lib\multiprocessing\popen_spawn_win32.py", line 46, in __init__
    prep_data = spawn.get_preparation_data(process_obj._name)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Python311\Lib\multiprocessing\spawn.py", line 164, in get_preparation_data
    _check_not_importing_main()
  File "c:\Python311\Lib\multiprocessing\spawn.py", line 140, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError:
    An attempt has been made to start a new process before the
    current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

    To fix this issue, refer to the "Safe importing of main module"
    section in https://docs.python.org/3/library/multiprocessing.html

Probably due to the way VSCode handles processes. You should try in a plain Jupyter notebook.

I get these errors even when launching directly from the command line with python. I will need to launch the code that way, so this is a blocker for me. Any chance the Python version I use (3.11) could be the issue?

In case it is worth mentioning :slight_smile: here are the modules I could install with the current pip 25.0 wheels:
Dask 2025.1.0
Dask-ml 2024.4.4

All this is on Windows 11.

Well, this is not due to the Python or Dask version then, but probably more to your OS configuration. You can probably use the no_nanny option of LocalCluster, but I guess there should be a better solution.
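One more note: the traceback above is the standard Windows multiprocessing “spawn” error, and the error message itself points at the usual fix: when running as a script, the cluster creation has to sit under an if __name__ == '__main__': guard. A minimal sketch (worker counts and limits are just placeholders):

from dask.distributed import Client, LocalCluster
import dask.array as da

def main():
    # Memory-limited local cluster; the sizes here are placeholders
    cluster = LocalCluster(n_workers=2, threads_per_worker=2, memory_limit="4GiB")
    client = Client(cluster, set_as_default=True)

    x = da.random.random(size=(10_000, 1_000_000), chunks=(100, 10_000))
    u, s, v = da.linalg.svd_compressed(x, k=1000, compute=False)
    print(s.compute())

    client.close()
    cluster.close()

# On Windows, worker processes are started with "spawn", which re-imports this
# module; the guard keeps the cluster from being created again in every child.
if __name__ == "__main__":
    main()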