I am a newbee to dask and I am evaluating whether dask will be the best framework to fit my use case. Let me describe what I am trying to achieve:
-
I have an existing python application where I am executing a DAG of nodes, currently the execution is done within a single process but I am finding this quite slow and I want to execute this over a cluster of machines.
-
Each node in the DAG is a python function that takes some inputs and creates an output. inputs/outputs are python data class objects.
-
Also in a few of the python functions (node) there is a need to further parallelise certain operations. As a concrete example, I want to replace a pandas dataframe with the dask equivalent so that I can speed up the operation even further.
I would like to understand whether I can use dask to convert my DAG nodes to dask.delayed tasks to execute nodes in a distributed manner and parallelise their execution while also using dask containers like dataframes to further parallelize certain operations in some of the DAG nodes. From what I read online this does not seem to be supported. If not, could you suggest any alternatives? Thank you!