Efficient compute for new data?

Hi folks, I’ve historically been an R user, but I’m dipping my toes into the python-for-data-science world for my work, and I’m wondering if Dask has built-in features to accomplish a particular workflow that’s critical for me. Specifically, I have data that accumulates “chunks” over time, and I have an R script that I run occasionally that enumerates the chunks, skips over those it’s already seen, runs some processing on only the new ones, and saves the results in a manner that makes it easy to collect them all for subsequent interactive processing. A rough Python sketch of what I mean is below.
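
To make the idea concrete, here’s a minimal sketch of the “process only the new chunks” pattern using `dask.delayed` (the directory names and file formats are just placeholders, not my actual setup):

```python
from pathlib import Path

import dask
import pandas as pd

RAW_DIR = Path("data/raw")          # placeholder: where chunks accumulate
OUT_DIR = Path("data/processed")    # placeholder: where results are saved

@dask.delayed
def process_chunk(path: Path) -> Path:
    """Process one chunk and write its result next to the others."""
    out = OUT_DIR / f"{path.stem}.parquet"
    df = pd.read_csv(path)
    # ... per-chunk processing would go here ...
    df.to_parquet(out)
    return out

# A chunk counts as "new" if its processed output doesn't exist yet.
new_chunks = [
    p for p in sorted(RAW_DIR.glob("*.csv"))
    if not (OUT_DIR / f"{p.stem}.parquet").exists()
]

# Only the new chunks are computed; previously processed ones are skipped.
results = dask.compute(*[process_chunk(p) for p in new_chunks])
```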

One additional thing my R-based system achieves: the processing of a given chunk actually involves multiple steps, and I have a mechanism for keeping a history of the inputs/functions/outputs at each step, so that if the overall processing of a chunk fails at any point, I can see precisely where and work out solutions to those corner cases. (Note: there’s an R package, targets, that achieves a lot of this off the shelf, but I ended up finding its strict reproducibility philosophy an impediment to efficiency for quick corner-case handling.) A stripped-down sketch of the step-history idea follows.
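
For illustration only, a bare-bones Python version of that step-history mechanism might look like this (function and directory names are placeholders, not my actual code):

```python
import pickle
import traceback
from pathlib import Path

def run_steps(chunk_id: str, data, steps, log_dir: Path = Path("step_logs")):
    """Run `steps` (a list of functions) in order on `data`, saving the
    input and output (or error) of each step so a failure can be
    inspected afterwards."""
    chunk_dir = log_dir / chunk_id
    chunk_dir.mkdir(parents=True, exist_ok=True)
    for i, step in enumerate(steps):
        record = {"step": step.__name__, "input": data}
        try:
            data = step(data)
            record["output"] = data
        except Exception:
            record["error"] = traceback.format_exc()
            raise
        finally:
            # The record is written whether the step succeeded or failed.
            with open(chunk_dir / f"{i:02d}_{step.__name__}.pkl", "wb") as f:
                pickle.dump(record, f)
    return data
```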

It’d be fairly straightforward for me to simply translate my R code to python directly, but it seems like a common data science workflow so I thought I should check if Dask has tools for it already.

Hi @mike-lawrence and welcome to discourse!

I am also an R user, so I might be able to help. Would you mind sharing some of the R code you’re currently using? Seeing your flow for “skips over those it’s already seen” would be particularly helpful.

I took a quick look at the targets R package; it seems it broadly helps with data pipeline orchestration. If you haven’t already, you might take a look at Prefect, which also works well with Dask.
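
As a rough sketch of what that combination looks like (assuming Prefect 2 with the prefect-dask extension installed; the exact API depends on your Prefect version, and the task bodies here are placeholders):

```python
from prefect import flow, task
from prefect_dask import DaskTaskRunner  # pip install prefect-dask

@task
def process_chunk(path: str) -> str:
    # per-chunk processing would go here
    return f"processed {path}"

@flow(task_runner=DaskTaskRunner())
def pipeline(paths: list[str]):
    # each task run is submitted to a Dask cluster
    return [process_chunk.submit(p) for p in paths]

if __name__ == "__main__":
    pipeline(["chunk_001.csv", "chunk_002.csv"])
```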
