Hi folks, I’ve historically been an R user, but I’m dipping my toes into the Python-for-data-science world for my work, and I’m wondering if Dask has built-in features for a particular workflow that’s critical to me. Specifically, I have data that accumulates “chunks” over time, and I have an R script that I run occasionally: it enumerates the chunks, skips those it’s already seen, runs some processing on only the new ones, and saves the results in a way that makes it easy to collect everything for subsequent interactive analysis.
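To make that concrete, here’s roughly the shape of what I’d write if I translated it by hand — a minimal sketch assuming one parquet file per chunk and a JSON manifest of chunks already seen (`DATA_DIR`, `MANIFEST`, and the body of `process_chunk` are all placeholders, not my real code):

```python
import json
from pathlib import Path

import dask
import pandas as pd

DATA_DIR = Path("data/chunks")          # placeholder: one parquet file per chunk
RESULTS_DIR = Path("data/results")
MANIFEST = Path("data/processed.json")  # placeholder: record of chunks already seen


def load_manifest() -> set:
    """Return the set of chunk names already processed."""
    if MANIFEST.exists():
        return set(json.loads(MANIFEST.read_text()))
    return set()


@dask.delayed
def process_chunk(chunk_path: Path) -> Path:
    """Stand-in for the real per-chunk work: read, transform, write result."""
    df = pd.read_parquet(chunk_path)
    result = df.describe()
    out_path = RESULTS_DIR / chunk_path.name
    result.to_parquet(out_path)
    return out_path


RESULTS_DIR.mkdir(parents=True, exist_ok=True)
seen = load_manifest()
new_chunks = [p for p in sorted(DATA_DIR.glob("*.parquet")) if p.name not in seen]

# Only the unseen chunks get submitted; their results land next to the old ones.
dask.compute(*[process_chunk(p) for p in new_chunks])

MANIFEST.write_text(json.dumps(sorted(seen | {p.name for p in new_chunks})))
```

The hand-rolled manifest is the bookkeeping I’m hoping Dask (or something in its ecosystem) already handles for me.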
One additional thing my R-based system provides: processing a given chunk actually involves multiple steps, and I have a mechanism for keeping a history of the inputs, functions, and outputs at each step, so if the overall processing of a chunk fails at any point I can see precisely where and work out solutions to those corner cases. (Note: there’s an R package, targets, that provides much of this off the shelf, but I found its strict reproducibility philosophy an impediment to quickly handling corner cases.)
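Here’s a toy Python version of that per-step history mechanism, just to illustrate what I mean — assuming pickle checkpoints and made-up names (`HISTORY_DIR`, `run_steps`, and the step functions are hypothetical):

```python
import pickle
from pathlib import Path

HISTORY_DIR = Path("data/history")  # placeholder: one subdirectory per chunk


def run_steps(chunk_name: str, data, steps):
    """Apply each step in turn, checkpointing the input to every step so a
    failure leaves behind exactly what the failing step was handed."""
    chunk_dir = HISTORY_DIR / chunk_name
    chunk_dir.mkdir(parents=True, exist_ok=True)
    for i, step in enumerate(steps):
        # Save the step's input before running it; if the step raises,
        # this file is what I reload to debug the corner case.
        with open(chunk_dir / f"step{i}_{step.__name__}_input.pkl", "wb") as f:
            pickle.dump(data, f)
        data = step(data)
    with open(chunk_dir / "final_output.pkl", "wb") as f:
        pickle.dump(data, f)
    return data


# Usage: steps are plain functions applied in order, e.g.
# run_steps("chunk_0001", raw_df, [clean, enrich, summarize])
```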
It’d be fairly straightforward to translate my R code to Python directly, but this seems like a common data-science workflow, so I thought I’d check whether Dask already has tools for it.