Weird holes in Dask Distributed status - how to investigate?


I’ve been using the status dashboard to investigate slowness in our workload (a synthetic benchmark for Modin), and I’m having trouble finding tools to dig into what’s going on.

I’m attaching a sample picture where you can see a lot of holes in the activity graph. I don’t think my workload actually had that much idle time, or was running on the main process - everything should be running in the workers.

My current assumption is that my code spends too much time waiting for some objects to become available when it should just move forward, but I don’t have any measurements to back this hypothesis.

So how do I investigate these performance issues - are there any tools besides the status page?

I’m running this on a single node, so it’s a LocalCluster setup. I can use tools on Windows or Linux, so any recommendations would be awesome.
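For context, a minimal sketch of the kind of single-node setup being described (worker counts and the `main` function are hypothetical; the real benchmark code is not shown here):

```python
# Hypothetical sketch of a single-node LocalCluster setup; the worker
# counts are illustrative, not taken from the actual benchmark.
from dask.distributed import Client, LocalCluster

def main():
    # Process-based workers, matching a CPU-bound pure-Python workload.
    cluster = LocalCluster(n_workers=4, threads_per_worker=1, processes=True)
    client = Client(cluster)
    print(client.dashboard_link)  # URL of the status page discussed here
    client.close()
    cluster.close()

if __name__ == "__main__":
    main()
```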

Thanks in advance!

Hi @vnlitvinov,

Could you produce an MWE, or at least show with some code what your workflow looks like?

Seeing this graph, I’d say there is too much pressure on the scheduler (too many small tasks), so it cannot keep up sending tasks to workers. It might also mean that you are transferring serialized Python objects and that this takes too much time compared to the actual computation. Hard to tell without more input, but I’d definitely look at the Scheduler side (enabling more logging, or with some profiling?).
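A common fix for this kind of scheduler pressure is batching many tiny tasks into fewer, larger ones. A pure-Python sketch of the idea (in Dask you would then submit each batch as a single task rather than one task per element):

```python
# Hedged sketch: batch tiny work items so the scheduler sees fewer tasks.
def make_batches(items, batch_size):
    """Group `items` into lists of at most `batch_size` elements."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

# 10,000 one-element tasks become 100 tasks of 100 elements each,
# cutting per-task scheduler overhead by roughly the batching factor.
batches = make_batches(list(range(10_000)), 100)
```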

You could also try to see if things behave differently when using threads instead of processes (if you currently use processes).

Hi @guillaumeeb, thanks for the response!

I will compose a repro soon.

I wonder if there is some tooling that would help me see (and ideally measure) this scheduler or serialization pressure. I’m fine with using even proof-of-concept-quality tools, as I have some background in building them :slight_smile: otherwise I’d have to actually re-invent some wheels to get things moving.
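As a crude first measurement of serialization pressure, you can time a pickle round trip of the objects you pass around; a sketch (the helper name and the sample payload are made up):

```python
# Hedged sketch: roughly measure how expensive a payload is to
# (de)serialize, since heavy pickling can dominate scheduler time.
import pickle
import time

def pickle_cost(obj, repeats=10):
    """Return (serialized size in bytes, mean round-trip seconds)."""
    payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    start = time.perf_counter()
    for _ in range(repeats):
        pickle.loads(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))
    return len(payload), (time.perf_counter() - start) / repeats

size, secs = pickle_cost(list(range(100_000)))  # sample payload
```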

Unfortunately, using threads is not an option for me: the code I’m parallelizing is mostly CPU-bound (and implemented in Python), so the GIL would kill any benefit from parallelization.
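For readers unfamiliar with the GIL point: on standard CPython builds, threads do not speed up CPU-bound pure-Python loops, which is easy to demonstrate (the `burn` workload below is an arbitrary stand-in):

```python
# Hedged illustration: the GIL serializes bytecode execution, so four
# threads running pure-Python work take about as long as running it
# sequentially on standard CPython builds.
import time
from concurrent.futures import ThreadPoolExecutor

def burn(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

N = 200_000
t0 = time.perf_counter()
serial_results = [burn(N) for _ in range(4)]
serial = time.perf_counter() - t0

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded_results = list(pool.map(burn, [N] * 4))
threaded = time.perf_counter() - t0
# `threaded` is typically close to `serial` for pure-Python work.
```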

I’ve never deeply analyzed Scheduler activity, but here are a few things.

You can get a bit of information with performance reports:

It gives a Scheduler activity profile.
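A sketch of how that looks in code; `performance_report` writes scheduler and worker profiles into a standalone HTML file you can open in a browser (the array workload and the helper name are placeholders, not the actual benchmark):

```python
# Sketch of the performance-report approach: everything computed inside
# the context manager is recorded into an HTML report.
from dask.distributed import Client, performance_report
import dask.array as da

def profile_to_html(filename="dask-report.html"):
    client = Client(processes=False)  # small in-process demo cluster
    try:
        with performance_report(filename=filename):
            # Placeholder workload; substitute your real computation here.
            x = da.random.random((1000, 1000), chunks=(250, 250))
            x.T.dot(x).mean().compute()
    finally:
        client.close()
```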

You’ve also got a lot of pointers here:

How to turn on logging, how to use the single-threaded scheduler, etc.
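The single-threaded scheduler in particular is handy for debugging, since every task runs in the main process where `pdb` and ordinary profilers can see it; a minimal sketch (the `double` task is illustrative):

```python
# Sketch: with the "synchronous" scheduler, tasks run in the main
# process, one at a time, so breakpoints and profilers work normally.
import dask

@dask.delayed
def double(x):
    return 2 * x

with dask.config.set(scheduler="synchronous"):
    result = dask.compute(double(1), double(2))
# result == (2, 4), computed without any worker threads or processes
```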

One other tool that might be useful:
