Dask parallelism vs C++ parallelism: Shared memory?


I am new to Dask and I would like to understand something before committing to it.

Context: I am the developer of a Python-based software package called GridCal that deals with power systems simulation and optimization. Some of those topics may benefit greatly from parallelism. However, in pure Python, whenever you create a new worker process you are essentially cloning the Python interpreter.

Example: Say that I have 10 GB of power system data in memory that I want to simulate in chunks. I know that each simulation “chunk” takes 1 GB.

In C++, thanks to shared-memory parallelism, if I run 5 concurrent chunks the total memory usage is 10 GB + 5 x 1 GB = 15 GB. This is very nice for same-machine parallelism.

In Python, for the same process, due to the inability to use shared-memory parallelism (thanks to the GIL), the total memory usage for the same example would be 5 x (10 + 1) GB = 55 GB.

Is parallelism with Dask subject to the same limitation as plain Python?

This is OK if I run the tasks on separate machines.


Hi @SanPen, welcome to Dask Discourse forum!

A first disclaimer: I’m not a true expert in Python memory handling or the GIL.

But I’m not sure I understand what you are saying:

Here, for example, what are these 5 x 1 GB of memory? If the data is already in memory, why does each thread need another GB? Is this output data?

So I guess what you mean is that, in order to avoid the GIL, you need to use multiprocessing, and in that case you have to duplicate some things in memory.

Depending on your code, you might be able to bypass the GIL and use multithreading or concurrency efficiently in Python.
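To illustrate the point about bypassing the GIL, here is a minimal sketch (the chunking scheme and sizes are illustrative, not from GridCal): threads in a single Python process share the interpreter's memory, so each worker reads a slice of the same input without copying it, and GIL-releasing numeric kernels such as NumPy's can genuinely run in parallel.

```python
# Sketch: threads in one Python process share the interpreter's memory,
# so each worker reads a slice of the same input without copying it.
# Pure-Python bytecode still serializes on the GIL, but libraries such
# as NumPy release the GIL inside their numeric kernels, so these sums
# can run in parallel. (Chunk sizes here are illustrative.)
from concurrent.futures import ThreadPoolExecutor

import numpy as np

data = np.arange(1_000_000, dtype=np.float64)  # stand-in for the big input

def simulate_chunk(bounds):
    start, stop = bounds
    # data is read directly from the shared address space: no per-thread copy
    return data[start:stop].sum()

chunks = [(i, i + 250_000) for i in range(0, len(data), 250_000)]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(simulate_chunk, chunks))

total = sum(results)
```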

I’m not sure why you need to duplicate all the input data, though: can’t you just read a part of it in each process? Or do you need the full 10 GB to produce each 1 GB result?

Well yes, Dask is a pure Python library, so Python limitations apply.

Hi @guillaumeeb, I assume that the confusion comes from the following:

A lot of common uses of parallelization have to do with dense matrix operations, neural networks and the like. In those cases the parallel function is usually very simple, think matrix multiplication. These operations align well with the CPU cache and benefit from on-chip operations. Usually you can get by without allocating auxiliary structures for the computation.

In my case, the parallel function is an optimization process, and as such it is not practical to allocate all the memory it is going to use in advance, so allocation happens dynamically. This is how the OS runs MS Office and Chrome at the same time: two different complex processes running in parallel.

How I do things in C++ is to store the input data in memory, split it into batches, and run the batches in different threads, then collect the results and align them in the same order as the inputs. When I tried this in Python years ago using multiprocessing, each worker process allocated not only the simulation memory but also a copy of the whole input data. That is unacceptable, because most PCs then cannot run the simulations due to memory restrictions, and that is why I went the C++ route: in C++ you can share the input memory and in Python you cannot.
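For what it's worth, more recent Python versions can get closer to the C++ pattern with the standard library's `multiprocessing.shared_memory` (Python 3.8+): the input lives once in a named shared block and each worker attaches to it by name instead of receiving a pickled copy. A hedged sketch, with illustrative names and a trivial placeholder "simulation" (here the worker function is called in-process for brevity; in the real setup each call would run in its own process, receiving only the small `(name, bounds)` tuple):

```python
# Sketch: share the input across processes instead of copying it.
from multiprocessing import shared_memory

import numpy as np

def create_input(n):
    """Allocate the input once in a named shared-memory block."""
    shm = shared_memory.SharedMemory(create=True, size=n * 8)
    view = np.ndarray((n,), dtype=np.float64, buffer=shm.buf)
    view[:] = np.arange(n)  # fill with the "power system data"
    return shm, view

def simulate_batch(shm_name, n, start, stop):
    """What each worker process would run: attach, compute, detach."""
    shm = shared_memory.SharedMemory(name=shm_name)  # attach by name, no copy
    data = np.ndarray((n,), dtype=np.float64, buffer=shm.buf)
    result = data[start:stop].sum()  # placeholder for the real simulation
    del data   # drop the view first, or close() raises BufferError
    shm.close()  # detach; the creator owns the block's lifetime
    return result

shm, view = create_input(1_000)
partials = [simulate_batch(shm.name, 1_000, i, i + 250)
            for i in range(0, 1_000, 250)]
total = sum(partials)
del view
shm.close()
shm.unlink()  # free the shared block once all workers are done
```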

I hope this clarifies my question. My hope is that Dask can share the input memory among the threads without copying it completely.

So, just a bonus question: couldn’t you load into memory only the input data batch that is needed by the ongoing simulation?

Yes, you can achieve that, keeping in mind to merge the results back in the correct order.
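On the ordering concern: in plain Python, the standard library's `Executor.map` already yields results in input order, even when tasks finish out of order, so no manual re-alignment is needed. A small demonstration (the sleeps are contrived so that earlier batches finish later):

```python
# Sketch: Executor.map returns results aligned with the inputs,
# even when the tasks complete out of order.
import time
from concurrent.futures import ThreadPoolExecutor

def simulate(batch_id):
    # make earlier batches finish *later* to prove ordering is preserved
    time.sleep(0.05 * (4 - batch_id))
    return batch_id * 10

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(simulate, range(4)))
# results come back in input order: [0, 10, 20, 30]
```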

A commercial program in the field splits the inputs to simulate them in parallel, but does not merge the results back. I guess this is due to the difficulty of keeping an orchestrator alive (or maybe just bad design…).

Well, that’s exactly what Dask can help you with.

All the scientific libraries in Python (numpy, scipy, pandas, etc.) work very hard to release the GIL as much as they can, so that you can have multiple threads per process. You will still have some GIL contention, which makes it unwise to have processes with e.g. 32 threads. Dask offers tools to monitor how much GIL contention you have.
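A sketch of that setup with Dask's threaded scheduler, assuming the input fits in one machine's memory (the delayed function and chunk sizes are illustrative): all tasks run as threads in one process, so they read the same in-memory input without copying it, and `dask.compute` returns the results aligned with the task list.

```python
# Sketch: Dask's threaded scheduler runs tasks as threads in one process,
# so every task reads the same in-memory input with no per-task copy.
# GIL contention only bites in pure-Python sections; GIL-releasing
# NumPy kernels run in parallel.
import dask
import numpy as np

grid_data = np.arange(1_000_000, dtype=np.float64)  # stand-in for the 10 GB model

@dask.delayed
def simulate(start, stop):
    return grid_data[start:stop].sum()  # shared, read-only access

chunks = [(i, i + 250_000) for i in range(0, len(grid_data), 250_000)]
# compute() returns results in the same order as the task list
results = dask.compute(*(simulate(a, b) for a, b in chunks),
                       scheduler="threads")
total = sum(results)
```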