Hi,
We are switching our codebase to Dask (from an in-house framework) to parallelize some heavy computations.
We parallelize at several levels of the algorithm, including starting tasks from other tasks: a submitted task eventually calls worker_client()
and submits additional tasks.
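For context, here is a minimal runnable sketch of that tasks-from-tasks pattern; the double/parent function names are just illustrative:

```python
from dask.distributed import Client, worker_client

def double(x):
    return x * 2

def parent(data):
    # Inside a running task, worker_client() secedes from the worker's
    # thread pool so the task can submit and wait on further tasks safely.
    with worker_client() as client:
        futures = client.map(double, data)
        return sum(client.gather(futures))

client = Client(processes=False)  # in-process cluster, fine for a demo
result = client.submit(parent, [1, 2, 3]).result()
print(result)  # 12
client.close()
```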
As part of our migration effort, I want to provide instructions on how to debug the code.
I have read a bit of the Dask documentation and Stack Overflow (e.g., this), but I could not manage to debug with a synchronous scheduler so that I can easily step through the code.
The docs mention that the “single-threaded” / “synchronous” scheduler is not directly available for dask.distributed. I don’t mind forgoing distributed while debugging, as long as I can still use the same API, with client.submit
and worker_client()
. Alternatively, is there a way to get (something like) a synchronous scheduler to work with the distributed client somehow?
As a side note, I tried dask.config.set(scheduler='synchronous')
and a LocalCluster
with a single worker, but tasks still run in parallel.
Thanks
Using:

cluster = LocalCluster(n_workers=1, threads_per_worker=1)
client = Client(cluster)

does give apparently synchronous behavior…
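For completeness, a runnable version of this workaround. The processes=False argument is my addition, not from the thread: it keeps the worker in the main process, which makes attaching a debugger to task code more practical.

```python
from dask.distributed import Client, LocalCluster

def square(x):
    return x * x

# One worker with a single thread, so tasks execute one at a time.
# processes=False (my addition) keeps the worker in this process,
# which makes pdb / breakpoint() inside a task usable.
cluster = LocalCluster(n_workers=1, threads_per_worker=1, processes=False)
client = Client(cluster)

results = client.gather([client.submit(square, i) for i in range(4)])
print(results)  # [0, 1, 4, 9]
client.close()
cluster.close()
```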
@tomercagan Thanks for the question!
including starting a task from other tasks
Please note that launching tasks from tasks isn’t reliably supported right now; see also this discussion about a redesign: Redesign discussion: Launch Tasks from Tasks · Issue #5671 · dask/distributed · GitHub
Alternatively, is there a way to get (something like) a synchronous scheduler to work with the distributed client somehow?
Not really: the synchronous scheduler is a “single-machine” scheduler, separate from the distributed scheduler. Your workaround of using one worker with one thread is the only way I can think of, too.
@pavithraes - thanks for the reply.
About task-from-task: I read the discussion and was a bit worried, but after playing with it for a while, it seems to work well enough. I hope I am not going to regret it.
I have a case similar to the example in that thread: a task (of which I have a few, and which I want to run in parallel) has sub-tasks. In addition, it has some internal recursion (breaking the original into smaller tasks and retrying). I could potentially do the outer parallelism with my orchestrator (Jenkins), but I’d prefer not to unless it is a must.
I don’t mind using a single-machine scheduler when I want to do low-level debugging (of my algorithm, not Dask). Does this scheduler support submit/map and futures, or is it just the scheduler behind the scenes for the higher-level APIs?
About task-from-task: I read the discussion and was a bit worried, but after playing with it for a while, it seems to work well enough. I hope I am not going to regret it.
Sounds good, and feel free to reach out if you do run into any issues.
Does this scheduler support submit/map and futures, or is it just the scheduler behind the scenes for the higher-level APIs?
submit/map and the Futures API are tied to the distributed scheduler, so you won’t be able to use them with the single-machine schedulers. Also, tasks-from-tasks is itself a distributed-scheduler concept, so I think sticking with distributed while debugging makes sense in your case.
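To illustrate what the single-machine schedulers do support, here is a small sketch using dask.delayed with the synchronous scheduler; the inc function is just for illustration:

```python
import dask

@dask.delayed
def inc(x):
    return x + 1

# The single-machine synchronous scheduler runs every task in the main
# thread, so an ordinary pdb / breakpoint() inside inc() just works.
total = dask.delayed(sum)([inc(i) for i in range(3)])
result = total.compute(scheduler="synchronous")
print(result)  # 6
```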