Best practices for distributing the work

This depends on your computing context. Are you on your personal laptop, on a single server, or on a compute cluster? How many resources (CPU, RAM) do you have? How much memory does each chunk need? Can this be lowered by reducing the chunk size?

In any case, you should configure your workers or your multiprocessing cluster accordingly.

I see you’re just using

from dask.distributed import Client
client = Client()

which defaults to using all of the machine's capacity, both for the number of processes/threads and for memory. For example, say you have a laptop with 8 cores and 16 GB of memory.

With this setup, Dask will be able to launch 8 tasks at the same time (one per thread, so one per core), so in your case one column per core, with 2 GB of memory available for each computation.
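
If you want to double-check what those defaults actually are on your machine, you can inspect the client created above (a minimal sketch; the exact numbers depend on your hardware):

# Print the thread count and memory limit of each worker started by Client()
info = client.scheduler_info()
for address, worker in info["workers"].items():
    print(address, worker["nthreads"], worker["memory_limit"])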

So the easiest way is just to configure your Dask cluster (so your workers) according to the problem you're trying to solve. If you need 4 GB of memory for the operation on each chunk, then you should use something like:

from dask.distributed import Client, LocalCluster

# 4 workers with 1 thread each: on a 16 GB machine the default memory
# limit is split evenly, so each worker gets roughly 4 GB.
cluster = LocalCluster(n_workers=4, threads_per_worker=1)
client = Client(cluster)

You can also do more complex things using the Dask Resources mechanism, but I would advise against it and suggest keeping things simple for now.
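
For reference only, the Resources mechanism roughly works like this (a minimal sketch, not taken from your code: process_chunk and chunk are hypothetical placeholders, and the worker must have been started with a matching declaration such as dask worker <scheduler-address> --resources "MEMORY=4e9"):

# Ask the scheduler for 4e9 "MEMORY" units for this task, so it only runs
# on a worker that declared at least that much.
future = client.submit(process_chunk, chunk, resources={"MEMORY": 4e9})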