Hi folks, I’ve historically been an R user, but I’m dipping my toes into the Python-for-data-science world for my work, and I’m wondering if Dask has built-in features for a particular workflow that’s critical to me. Specifically, I have data that accumulates “chunks” over time, and I have an R script that I run occasionally: it enumerates the chunks, skips those it’s already seen, runs some processing on only the new ones, and saves the results in a way that makes it easy to collect everything for subsequent interactive analysis.
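To make that concrete, here’s roughly the shape of what I’d write if I translated it by hand — a minimal sketch assuming one parquet file per chunk and a JSON manifest of chunks already seen (`DATA_DIR`, `MANIFEST`, and the body of `process_chunk` are all placeholders, not my real code):

```python
import json
from pathlib import Path

import dask
import pandas as pd

DATA_DIR = Path("data/chunks")          # placeholder: one parquet file per chunk
RESULTS_DIR = Path("data/results")
MANIFEST = Path("data/processed.json")  # placeholder: record of chunks already seen


def load_manifest() -> set:
    """Return the set of chunk names already processed."""
    if MANIFEST.exists():
        return set(json.loads(MANIFEST.read_text()))
    return set()


@dask.delayed
def process_chunk(chunk_path: Path) -> Path:
    """Stand-in for the real per-chunk work: read, transform, write result."""
    df = pd.read_parquet(chunk_path)
    result = df.describe()
    out_path = RESULTS_DIR / chunk_path.name
    result.to_parquet(out_path)
    return out_path


RESULTS_DIR.mkdir(parents=True, exist_ok=True)
seen = load_manifest()
new_chunks = [p for p in sorted(DATA_DIR.glob("*.parquet")) if p.name not in seen]

# Only the unseen chunks get submitted; their results land next to the old ones.
dask.compute(*[process_chunk(p) for p in new_chunks])

MANIFEST.write_text(json.dumps(sorted(seen | {p.name for p in new_chunks})))
```

The hand-rolled manifest is the bookkeeping I’m hoping Dask (or something in its ecosystem) already handles for me.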
One additional thing my R-based system provides: processing a given chunk actually involves multiple steps, and I have a mechanism for keeping a history of the inputs, functions, and outputs at each step, so if the overall processing of a chunk fails at any point I can see precisely where and work out solutions to those corner cases. (Note: there’s an R package, targets, that provides much of this off the shelf, but I found its strict reproducibility philosophy an impediment to quickly handling corner cases.)
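Here’s a toy Python version of that per-step history mechanism, just to illustrate what I mean — assuming pickle checkpoints and made-up names (`HISTORY_DIR`, `run_steps`, and the step functions are hypothetical):

```python
import pickle
from pathlib import Path

HISTORY_DIR = Path("data/history")  # placeholder: one subdirectory per chunk


def run_steps(chunk_name: str, data, steps):
    """Apply each step in turn, checkpointing the input to every step so a
    failure leaves behind exactly what the failing step was handed."""
    chunk_dir = HISTORY_DIR / chunk_name
    chunk_dir.mkdir(parents=True, exist_ok=True)
    for i, step in enumerate(steps):
        # Save the step's input before running it; if the step raises,
        # this file is what I reload to debug the corner case.
        with open(chunk_dir / f"step{i}_{step.__name__}_input.pkl", "wb") as f:
            pickle.dump(data, f)
        data = step(data)
    with open(chunk_dir / "final_output.pkl", "wb") as f:
        pickle.dump(data, f)
    return data


# Usage: steps are plain functions applied in order, e.g.
# run_steps("chunk_0001", raw_df, [clean, enrich, summarize])
```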
It’d be fairly straightforward to translate my R code to Python directly, but this seems like a common data-science workflow, so I thought I’d check whether Dask already has tools for it.