First post here. Looking for some advice on basic architecture to use separate containerized dask schedulers as nodes in a DAG task graph. I think these containers will have to be persistent and stateful (ie not functional), or occasionally be recreated upon dynamically created tasks from other containers.
Currently each node in this DAG contains its own DAG of tasks, on which dask distributed operates. There may be a future need to link together tasks that must run in separate containers, either for security, or because some tasks use incompatible environments, and, as I understand it, dask schedulers require identical environments.
I would like to outline a general path forward for using separate containerized dasks in this manner. I think I would have a single process – call it main – sending tasks to these containers, which would be submitted by dask, and the individual containers would send the completed tasks to the main process, which would send these results as dependencies to other containers. Though attempts would be made to define the dependencies statically, the containers must be able to occasionally send dynamic requests for other containers to main. This makes me think that the main process has to send / receive asynchronously. Otherwise I might traverse the DAG in post order once and be done.
The communication would be intentionally be kept short and easily serialized. The actual data dependencies would, I guess, be stored on a shared filesystem.
Most dependencies would / should be created within a container, where that dask scheduler could handle them.
The actual physical architecture could vary – what one typically finds in research labs. It could be a single workstation with multiple processors (what I am most familiar with), or dask distributed could be running on a cluster, so each container running a cluster?, or each container as a node on a cluster, or using cloud?
I apologize if this an unreasonable question. I don’t have the knowledge to understand what might be an appropriate general solution that could be adapted to these various scenarios.
I thought I might write a simple network server – main process and the containers communicate through this server, along with some simple heuristic to schedule the usually static but possibly dynamic requests between containers. I would likely do an initial descent of the DAG (and within each DAG, a DAG) to estimate running times of each path – the main process would then have access to that entire structure – the DAG of DAGs.
I’ve never used celery or airflow, but I wondered if either of these or some other scheduler might provide a general, adaptable solution.
I used the kubernetes tag, but probably would default toward docker at first. I don’t have time to get into the innards of either right now, but would like to have some lighweight structure that would allow this larger linking of containers of separate dask schedulers that could be refined or adapted later on.