Understanding How Dask is Executing Processes vs Threads

Hello everyone! I was curious about something I noticed while staring at htop on my machine while working on this project.

I noticed that performing that work was taking a fair bit longer than I had anticipated on longer videos and, until I get a better understanding of how task-graphs are being executed and profiling things with Dask’s packages, I thought I would just look at htop while things were going.

Here’s a screenshot of what I see on the computer I’m using:

An HPC expert in the lab told me to check out the “S” column which represents what the processes are up to. He taught me that “D” means it’s Dead, “S” means Sleeping, and “R” means running. He also told me that ideally, you would see the processors all doing basically 100% or close to 0% and that the fact there’s so many things running at just 20-30% means there’s probably something going wrong in the code somewhere. There sure are an awful lot of S’s in the que it seems!

So I wanted to ask if this is expected behavior and, if not, what can I do to interrogate things better? Is it because Dask schedules the workers and sleeps them until it’s their turn to be processed? Or something else?

@jmdelahanty

See this post.

1 Like

To summarize that post, it’s basically that:

  • Python GIL is not released for the threaded operations of reading video files.

One question that I think would be more appropriate in this thread is this:

Is there a way to tell generally when something can benefit from threading vs processing? It’s been confusing to me generally and I’m realizing the understanding I thought I had isn’t really that great.

1 Like

Since numpy often releases the GIL, threading should accordingly lead to performance gains. But in your case, there is also a python for-loop (looping the function hog() over the frames in each chunk) which probably interferes with the full release of the GIL, and which is probably why the multi-process scheduler might gain an edge over multi-threading here.

This is just a guess. So someone might need to prove its veracity.

1 Like

Got it, so that would be why altering how hog() in that case works so it can operate on entire chunks at once instead of on a frame by frame basis would maybe allow threading to be used.

Is it a general rule that threads are faster than processes overall?

1 Like

Exactly. At least, that’s how I see it. :smile:

As far as I understand it, multi-processing generally incurs an overhead when processes communicate with each other in order to share data. This overhead is typically absent in multi-threading, as different threads are able to share/access the same memory locations. This may be why multi-threading, when unobstructed by the GIL, is often faster than multi-processing.

Your HOG application, however, is embarrassingly parallel, implying that each chunk is processed almost completely independently of the other chunks. (You’ll most likely see this by visualizing the task-graph of hog_images or hog_descriptors.) So there is hardly any inter-process communication, which is why multi-processing is particularly performant here.

See the following links:
Start at 2:39:24 in YouTube Video: Parallel and Distributed Computing in Python with Dask | SciPy 2020 | Bourbeau, McCarty, Pothina
which is based on the following sample jupyter notebook: follow this link.

1 Like