Debugging Dask - Futures API

Hi,

We are switching our codebase to Dask (from an in-house framework) to parallelize some heavy computations.

We parallelize at various levels of the algorithm, including starting tasks from other tasks - the submitted task eventually calls worker_client() and submits additional tasks.

As part of our migration effort, I want to provide instructions on how to debug the code.

I have read a bit in the Dask documentation and on Stack Overflow (e.g., this), but I could not manage to debug with a synchronous scheduler so that I could easily step through the code.

There is a mention in the docs that the “single-threaded” / “synchronous” scheduler is not directly available for dask.distributed. I don’t mind not using distributed while debugging, if it is possible to still use the same API - with client.submit and worker_client(). Alternatively, is there a way to get (something like) a synchronous scheduler to work with the distributed client somehow?

As a side note - I tried dask.config.set(scheduler='synchronous') and a LocalCluster with a single worker, but tasks still run in parallel.

Thanks

Using:

```python
cluster = LocalCluster(n_workers=1, threads_per_worker=1)
client = Client(cluster)
```

does give apparently synchronous behavior…
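For anyone finding this later, here is a minimal sketch of that workaround combined with tasks-from-tasks (the `task`/`subtask` functions are made up for illustration):

```python
from dask.distributed import Client, LocalCluster, worker_client

def subtask(x):
    # Hypothetical stand-in for a heavier computation
    return x * 2

def task(x):
    # A task that launches another task, as described above
    with worker_client() as client:
        future = client.submit(subtask, x)
        return future.result() + 1

if __name__ == "__main__":
    # One worker with one thread makes execution effectively serial,
    # which is much easier to follow with a debugger
    cluster = LocalCluster(n_workers=1, threads_per_worker=1)
    client = Client(cluster)
    print(client.submit(task, 10).result())  # 21
    client.close()
    cluster.close()
```

Note that worker_client() secedes from the worker's thread pool while waiting, so the sub-task can run even though the worker has a single thread.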


@tomercagan Thanks for the question!

> including starting a task from other tasks

Please note that launching tasks from tasks isn’t reliably supported right now; see also this discussion about a redesign: Redesign discussion: Launch Tasks from Tasks · Issue #5671 · dask/distributed · GitHub

> Alternatively, is there a way to get (something like) a synchronous scheduler to work with the distributed client somehow?

Not really - the synchronous scheduler is a “single machine” scheduler, separate from the distributed scheduler. Your workaround of using 1 worker with 1 thread is the only way I can think of too. :smile:

@pavithraes - thanks for the reply.

About tasks-from-tasks - I read the discussion and was a bit worried, but after playing with it for a while, it seemed to work well enough. I hope I am not going to regret it.

I have a case similar to the example in the thread - a task (of which I have a few, and I want to run them in parallel) has sub-tasks. In addition, it has some internal recursion (breaking the original into smaller tasks and retrying). I could potentially do the outer parallelism with my orchestrator (Jenkins), but I would prefer not to unless it is a must.
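To illustrate the pattern I mean (a toy sketch, not my real code - the splitting rule is invented for the example):

```python
from dask.distributed import worker_client

def process(values):
    # Toy stand-in for the real computation: a "chunk" that is too big
    # is split in two and retried as sub-tasks (the internal recursion)
    if len(values) <= 2:
        return sum(values)
    mid = len(values) // 2
    with worker_client() as client:
        left = client.submit(process, values[:mid])
        right = client.submit(process, values[mid:])
        return left.result() + right.result()
```

The outermost level would then be something like `client.submit(process, list(range(8)))`; each recursion level secedes while waiting, so this also runs on a single-threaded worker.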

I don’t mind using a single-machine scheduler when I want to do low-level debugging (of my algorithm, not Dask). Does this scheduler support submit/map and futures, or is it just the scheduler behind the scenes for the higher-level APIs?

> About tasks-from-tasks - I read the discussion and was a bit worried, but after playing with it for a while, it seemed to work well enough. I hope I am not going to regret it.

Sounds good, and feel free to reach out if you do run into any issues. :sunflower:

> Does this scheduler support submit/map and futures, or is it just the scheduler behind the scenes for the higher-level APIs?

submit/map and the Futures API are tied to the distributed scheduler, so you won’t be able to use these with the single-machine schedulers. Also, tasks-from-tasks is itself a distributed-scheduler concept, so I think sticking with distributed while debugging makes sense in your case.
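That said, for the parts of the algorithm that don't need tasks-from-tasks, the higher-level APIs do work with the synchronous scheduler - a minimal sketch with dask.delayed (the `inc`/`add` functions are just placeholders):

```python
from dask import delayed

@delayed
def inc(x):
    return x + 1

@delayed
def add(a, b):
    return a + b

total = add(inc(1), inc(2))
# Everything runs in the calling thread, so breakpoints and pdb
# behave exactly as in ordinary single-threaded code
result = total.compute(scheduler="synchronous")
print(result)  # 5
```

So one option is to step through the inner algorithm with delayed + the synchronous scheduler, and keep the Futures API (with the 1-worker/1-thread LocalCluster workaround) for the tasks-from-tasks layer.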