Weird holes in Dask Distributed status - how to investigate?


I’ve been using the status dashboard to investigate slowness in our workload (a synthetic benchmark for Modin), and I’m having trouble finding tools to dig into what’s going on.

I’m attaching a sample picture where you can see a lot of holes in the activity graph. I don’t think my workload actually had that much idle time, or was running on the main process - everything should be running in the workers.

My current assumption is that my code spends too much time waiting for some objects to become available when it should just move forward, but I don’t have any measurements to back this hypothesis.

So how do I investigate these performance issues - are there any tools besides the status page?

I’m running this on a single node, so it’s a LocalCluster setup. I can use tools on Windows or Linux, so any recommendations would be awesome.
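For context, a minimal sketch of the kind of single-node setup being described (worker counts and the `main` function are hypothetical; the real benchmark code is not shown here):

```python
# Hypothetical sketch of a single-node LocalCluster setup; the worker
# counts are illustrative, not taken from the actual benchmark.
from dask.distributed import Client, LocalCluster

def main():
    # Process-based workers, matching a CPU-bound pure-Python workload.
    cluster = LocalCluster(n_workers=4, threads_per_worker=1, processes=True)
    client = Client(cluster)
    print(client.dashboard_link)  # URL of the status page discussed here
    client.close()
    cluster.close()

if __name__ == "__main__":
    main()
```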

Thanks in advance!

Hi @vnlitvinov,

Could you produce an MWE, or at least show with some code what your workflow looks like?

Seeing this graph, I’d say there is too much pressure on the scheduler (too many small tasks), so it cannot keep up sending tasks to workers. It might also mean that you are transferring serialized Python objects and that this takes too much time compared to the actual computation. Hard to tell without more input, but I’d definitely look at the Scheduler side (enabling more logging, or with some profiling?).
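A common fix for this kind of scheduler pressure is batching many tiny tasks into fewer, larger ones. A pure-Python sketch of the idea (in Dask you would then submit each batch as a single task rather than one task per element):

```python
# Hedged sketch: batch tiny work items so the scheduler sees fewer tasks.
def make_batches(items, batch_size):
    """Group `items` into lists of at most `batch_size` elements."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

# 10,000 one-element tasks become 100 tasks of 100 elements each,
# cutting per-task scheduler overhead by roughly the batching factor.
batches = make_batches(list(range(10_000)), 100)
```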

You could also try to see if things behave differently when using threads instead of processes (if you currently use processes).

Hi @guillaumeeb, thanks for the response!

I will compose a repro soon.

I wonder if there is some tooling that would help me see (and ideally measure) this scheduler or serialization pressure. I’m fine with using even proof-of-concept-quality tools, as I have some background in building them :slight_smile: otherwise I’d have to actually re-invent some wheels to get things moving.
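As a crude first measurement of serialization pressure, you can time a pickle round trip of the objects you pass around; a sketch (the helper name and the sample payload are made up):

```python
# Hedged sketch: roughly measure how expensive a payload is to
# (de)serialize, since heavy pickling can dominate scheduler time.
import pickle
import time

def pickle_cost(obj, repeats=10):
    """Return (serialized size in bytes, mean round-trip seconds)."""
    payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    start = time.perf_counter()
    for _ in range(repeats):
        pickle.loads(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))
    return len(payload), (time.perf_counter() - start) / repeats

size, secs = pickle_cost(list(range(100_000)))  # sample payload
```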

Unfortunately, using threads is not an option for me: the code I’m parallelizing is mostly CPU-bound (and implemented in Python), so the GIL would kill any benefit from parallelization.
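For readers unfamiliar with the GIL point: on standard CPython builds, threads do not speed up CPU-bound pure-Python loops, which is easy to demonstrate (the `burn` workload below is an arbitrary stand-in):

```python
# Hedged illustration: the GIL serializes bytecode execution, so four
# threads running pure-Python work take about as long as running it
# sequentially on standard CPython builds.
import time
from concurrent.futures import ThreadPoolExecutor

def burn(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

N = 200_000
t0 = time.perf_counter()
serial_results = [burn(N) for _ in range(4)]
serial = time.perf_counter() - t0

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded_results = list(pool.map(burn, [N] * 4))
threaded = time.perf_counter() - t0
# `threaded` is typically close to `serial` for pure-Python work.
```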

I’ve never deeply analyzed Scheduler activity, but here are a few things.

You can get a bit of information with performance reports:

It gives a Scheduler activity profile.
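A sketch of how that looks in code; `performance_report` writes scheduler and worker profiles into a standalone HTML file you can open in a browser (the array workload and the helper name are placeholders, not the actual benchmark):

```python
# Sketch of the performance-report approach: everything computed inside
# the context manager is recorded into an HTML report.
from dask.distributed import Client, performance_report
import dask.array as da

def profile_to_html(filename="dask-report.html"):
    client = Client(processes=False)  # small in-process demo cluster
    try:
        with performance_report(filename=filename):
            # Placeholder workload; substitute your real computation here.
            x = da.random.random((1000, 1000), chunks=(250, 250))
            x.T.dot(x).mean().compute()
    finally:
        client.close()
```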

You’ve also got a lot of pointers here:

How to turn on logging, how to use the single-threaded scheduler, etc.
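The single-threaded scheduler in particular is handy for debugging, since every task runs in the main process where `pdb` and ordinary profilers can see it; a minimal sketch (the `double` task is illustrative):

```python
# Sketch: with the "synchronous" scheduler, tasks run in the main
# process, one at a time, so breakpoints and profilers work normally.
import dask

@dask.delayed
def double(x):
    return 2 * x

with dask.config.set(scheduler="synchronous"):
    result = dask.compute(double(1), double(2))
# result == (2, 4), computed without any worker threads or processes
```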

One other tool that might be useful:
