Dask parallelism vs C++ parallelism: Shared memory?


I am new to Dask and I would like to understand something before committing to it.

Context: I am the developer of a Python-based software package called GridCal that deals with power systems simulation and optimization. Some of those topics may benefit greatly from parallelism. However, in pure Python, whenever you create a new worker process you are essentially cloning the Python interpreter.

Example: Say that I have 10 GB of power system data in memory that I want to simulate in chunks. I know that each simulation “chunk” takes 1 GB.

In C++, thanks to shared-memory parallelism, if I run 5 concurrent chunks the total memory usage is 10 GB + 5 x 1 GB = 15 GB. This is very nice for same-machine parallelism.

In Python, for the same process, due to the inability to use shared-memory parallelism (thanks to the GIL), the total memory usage for the same example would be 5 x (10 + 1) GB = 55 GB.

Is parallelism with Dask subject to the same limitation as plain Python?

This is OK if I run the tasks on separate machines.


Hi @SanPen, welcome to Dask Discourse forum!

A first disclaimer: I’m not a true expert in Python memory handling or the GIL.

But I’m not sure I understand what you are saying:

Here, for example, what are these 5 x 1 GB of memory? If the data is already in memory, why does each thread need another GB? Is this output data?

So I guess what you mean is that, in order to avoid the GIL, you need to use multiprocessing, and in that case you have to duplicate some things in memory.

Depending on your code, you might be able to bypass the GIL and use multithreading or concurrency efficiently in Python.
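To illustrate the point about bypassing the GIL, here is a minimal sketch (the chunking scheme and sizes are illustrative, not from GridCal): threads in a single Python process share the interpreter's memory, so each worker reads a slice of the same input without copying it, and GIL-releasing numeric kernels such as NumPy's can genuinely run in parallel.

```python
# Sketch: threads in one Python process share the interpreter's memory,
# so each worker reads a slice of the same input without copying it.
# Pure-Python bytecode still serializes on the GIL, but libraries such
# as NumPy release the GIL inside their numeric kernels, so these sums
# can run in parallel. (Chunk sizes here are illustrative.)
from concurrent.futures import ThreadPoolExecutor

import numpy as np

data = np.arange(1_000_000, dtype=np.float64)  # stand-in for the big input

def simulate_chunk(bounds):
    start, stop = bounds
    # data is read directly from the shared address space: no per-thread copy
    return data[start:stop].sum()

chunks = [(i, i + 250_000) for i in range(0, len(data), 250_000)]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(simulate_chunk, chunks))

total = sum(results)
```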

I’m not sure why you need to duplicate all the input data, though: can’t you just read a part of it in each process? Or do you need the full 10 GB to produce each 1 GB result?

Well yes, Dask is a pure Python library, so Python limitations apply.

Hi @guillaumeeb, I assume that the confusion comes from the following:

A lot of common uses of parallelization have to do with dense matrix operations, neural networks and the like. In those cases the parallel function is usually very simple, think matrix multiplication. These operations align well with the CPU cache and benefit from on-chip operations. Usually you can get by without allocating auxiliary structures for the computation.

In my case, the parallel function is an optimization process, and as such it is not practical to allocate all the memory it is going to use in advance, so allocation happens dynamically. This is how the OS runs MS Office and Chrome at the same time: two different complex processes running in parallel.

How I do things in C++ is to store the input data in memory, split it into batches, and run the batches in different threads, then collect the results and align them in the same order as the inputs. When I tried this in Python years ago using multiprocessing, each worker process allocated not only the simulation memory but also a copy of the whole input data. That is unacceptable, because most PCs then cannot run the simulations due to memory restrictions, and that is why I went the C++ route: in C++ you can share the input memory and in Python you cannot.
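For what it's worth, more recent Python versions can get closer to the C++ pattern with the standard library's `multiprocessing.shared_memory` (Python 3.8+): the input lives once in a named shared block and each worker attaches to it by name instead of receiving a pickled copy. A hedged sketch, with illustrative names and a trivial placeholder "simulation" (here the worker function is called in-process for brevity; in the real setup each call would run in its own process, receiving only the small `(name, bounds)` tuple):

```python
# Sketch: share the input across processes instead of copying it.
from multiprocessing import shared_memory

import numpy as np

def create_input(n):
    """Allocate the input once in a named shared-memory block."""
    shm = shared_memory.SharedMemory(create=True, size=n * 8)
    view = np.ndarray((n,), dtype=np.float64, buffer=shm.buf)
    view[:] = np.arange(n)  # fill with the "power system data"
    return shm, view

def simulate_batch(shm_name, n, start, stop):
    """What each worker process would run: attach, compute, detach."""
    shm = shared_memory.SharedMemory(name=shm_name)  # attach by name, no copy
    data = np.ndarray((n,), dtype=np.float64, buffer=shm.buf)
    result = data[start:stop].sum()  # placeholder for the real simulation
    del data   # drop the view first, or close() raises BufferError
    shm.close()  # detach; the creator owns the block's lifetime
    return result

shm, view = create_input(1_000)
partials = [simulate_batch(shm.name, 1_000, i, i + 250)
            for i in range(0, 1_000, 250)]
total = sum(partials)
del view
shm.close()
shm.unlink()  # free the shared block once all workers are done
```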

I hope this clarifies my question. My hope is that Dask can share the input memory among the threads without copying it completely.

So, just a bonus question: couldn’t you load into memory only the input data batch that is needed by the ongoing simulation?

Yes, you can achieve that, keeping in mind to merge the results back in the correct order.
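On the ordering concern: in plain Python, the standard library's `Executor.map` already yields results in input order, even when tasks finish out of order, so no manual re-alignment is needed. A small demonstration (the sleeps are contrived so that earlier batches finish later):

```python
# Sketch: Executor.map returns results aligned with the inputs,
# even when the tasks complete out of order.
import time
from concurrent.futures import ThreadPoolExecutor

def simulate(batch_id):
    # make earlier batches finish *later* to prove ordering is preserved
    time.sleep(0.05 * (4 - batch_id))
    return batch_id * 10

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(simulate, range(4)))
# results come back in input order: [0, 10, 20, 30]
```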

A commercial program in the field splits the inputs to simulate them in parallel, but does not merge the results back. I guess this is due to the difficulty of keeping an orchestrator alive (or maybe just bad design…).

Well, that’s exactly what Dask can help you with.

All the scientific libraries in Python (numpy, scipy, pandas, etc.) work very hard to release the GIL as much as they can, so that you can have multiple threads per process. You will still have some GIL contention, which makes it unwise to have processes with e.g. 32 threads. Dask offers tools to monitor how much GIL contention you have.
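A sketch of that setup with Dask's threaded scheduler, assuming the input fits in one machine's memory (the delayed function and chunk sizes are illustrative): all tasks run as threads in one process, so they read the same in-memory input without copying it, and `dask.compute` returns the results aligned with the task list.

```python
# Sketch: Dask's threaded scheduler runs tasks as threads in one process,
# so every task reads the same in-memory input with no per-task copy.
# GIL contention only bites in pure-Python sections; GIL-releasing
# NumPy kernels run in parallel.
import dask
import numpy as np

grid_data = np.arange(1_000_000, dtype=np.float64)  # stand-in for the 10 GB model

@dask.delayed
def simulate(start, stop):
    return grid_data[start:stop].sum()  # shared, read-only access

chunks = [(i, i + 250_000) for i in range(0, len(grid_data), 250_000)]
# compute() returns results in the same order as the task list
results = dask.compute(*(simulate(a, b) for a, b in chunks),
                       scheduler="threads")
total = sum(results)
```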