Hello!

I am experimenting with using Dask to parallelize a data-download → data prep → feature extraction → train workflow.

Let’s say our cluster has 6 nodes. I’d like to submit two types of tasks to this cluster,

- Task 1: data-download → data prep → feature extraction for 4 different model types. This task is fast and is repeated periodically as new data becomes available.
- Task 2: ML pipeline (normalization, outlier detection/removal, training, model saving) for a single model type. This task is slow and is repeated as often as possible.

What is the most efficient way to store the features created by Task 1 to be partially consumed by Task 2? After studying the dask documentation, I have found that there are two obvious solutions:

*Slowest and simplest*:

- Solution 1: write the features to disk at the end of Task 1, and then load the features from disk in Task 2.

*Closer to storing features in RAM locally*:

- Solution 2: publish the dataset to cluster memory with `client.publish_dataset()` at the end of Task 1, then use `client.get_dataset()` inside Task 2 to pull the features from cluster memory.

Meanwhile, there seems to be a third solution that I am hesitant to build due to the complexity of tracking where the data lives:

*Most complex but likely most efficient*:

- Solution 3: Minimize data movement as much as possible according to Data Locality. Try to “prepare” nodes with Task 1, then send Task 2 to the prepared node which still holds the prepared features.
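If I understand Data Locality correctly, Solution 3 would look something like the sketch below: pin Task 1 to a worker with the `workers=` argument to `client.submit()`, then either pin Task 2 to the same worker or let the scheduler place it there because the future's data already lives on that worker (`prepare` and `train` are toy stand-ins for my real tasks):

```python
from distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=2, processes=False)
client = Client(cluster)
worker = list(client.scheduler_info()["workers"])[0]  # pick one worker address

def prepare():           # stand-in for Task 1: feature extraction
    return list(range(1000))

def train(features):     # stand-in for Task 2: the slow ML pipeline
    return sum(features)

# "Prepare" one node, then send Task 2 to the node that holds the features.
fut = client.submit(prepare, workers=[worker])
result = client.submit(train, fut, workers=[worker])
print(result.result())
```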

This brings up a variety of questions about `client.publish_dataset()`. Is it meant to be used as described in Solution 2? Is the overhead of distributing the data across cluster memory and converting to/from a Dask DataFrame lower than that of simply writing to and reading from disk (Solution 1)?

Thanks for the advice,

Robert